Monitoring External Systems with Platform Agents
Licensing
Process-based Contracts
All monitoring is included in the license.
License-based Contracts
A UNIX or Microsoft Windows agent can be configured for monitoring only by not assigning any job definition types that run on agents and not assigning any file events. Such process servers do not consume any licenses. On OpenVMS you must assign the DCL job definition type, so there is no free monitoring.
Prerequisites
- Monitoring must be enabled
- The process server must be configured for monitoring
Configuration
Process server and queue monitoring is disabled by default for performance reasons. To enable monitoring, set the /configuration/jcs/monitoring/enabled configuration entry to true; then, for each process server you wish to monitor, set the MonitorInterval process server parameter.
Default Monitor Nodes
The platform agent will report CPU busy, IO page rate and disk capacity by default. You can tune how often it does this by changing the MonitorInterval process server parameter. The data is stored in the monitor tree in the following paths:
- System.ProcessServer.${PSName}.Performance.Load - By default the number of processes the process server is currently processing, or a representation of the load factors as configured.
- System.ProcessServer.${PSName}.Performance.LoadThreshold - By default the maximum number of processes allowed to run simultaneously, or the maximum load specified on the load factor tab.
- System.ProcessServer.${PSName}.Performance.CPUCount - The number of CPUs the system has.
- System.ProcessServer.${PSName}.Performance.CPUBusy - The CPU usage on the server.
- System.ProcessServer.${PSName}.Performance.PageRate - The amount of memory paging that is taking place.
- System.ProcessServer.${PSName}.Performance.NetworkResponseAverage - Average communication overhead with platform agent per transfer, in seconds.
- System.ProcessServer.${PSName}.Performance.NetworkResponseMaximum - Maximum communication overhead with platform agent per transfer, in seconds.
- System.ProcessServer.${PSName}.Performance.NetworkResponseMinimum - Minimum communication overhead with platform agent per transfer, in seconds.
- System.ProcessServer.${PSName}.Performance.NetworkTransferCount - Number of transfers exchanged with platform agent.
- System.ProcessServer.${PSName}.Performance.NetworkTransferRate - Volume of network traffic sent and received by platform agent, in bytes per second.
- System.ProcessServer.${PSName}.Performance.NetworkUptime - Time since last network error or startup, in seconds.
- System.ProcessServer.${PSName}.FileSystem.${FileSystemPath}.Free - The free space on the specific file system.
- System.ProcessServer.${PSName}.FileSystem.${FileSystemPath}.Used - The used space on the specific file system.
- System.ProcessServer.${PSName}.FileSystem.${FileSystemPath}.Total - The total size of the file system.
- System.ProcessServer.${PSName}.FileSystem.${FileSystemPath}.UsedPercentage - Percentage of used space on the file system.
- System.ProcessServer.${PSName}.Checks.${Check_Name}.${Monitored value} - Custom checks.

- ${PSName} - process server name, for example System.
- ${FileSystemPath} - the path to the local file system, for example C:\\ or /home (SAN file systems may be considered local if, for example, they are mounted via iSCSI).
- ${Check_Name} - the name of the check or its description, if the latter is set.
- ${Monitored value} - the name of the check that is performed; depends on the type of check.
The Load and LoadThreshold are calculated for all process servers, not just for process servers that include a PlatformAgentService. The LoadFactors for a process server point to a MonitorCheck such as CPUBusy or PageRate. All load factors are added up into a particular load. If the summed load is higher than the maximum allowed by the process server's LoadThreshold attribute the process server will be overloaded. Besides showing this status you can also create programmatic actions by defining a condition that checks the summed load and raises the appropriate events.
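The overload rule described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name, the plain unweighted summation and the sample readings are assumptions.

```python
from typing import Mapping


def is_overloaded(load_factors: Mapping[str, float], load_threshold: float) -> bool:
    """Return True when the summed load is higher than the threshold."""
    # Assumption: every load factor contributes its raw value to the sum.
    total_load = sum(load_factors.values())
    return total_load > load_threshold


# Hypothetical readings of the monitor checks that the load factors point to.
readings = {"CPUBusy": 55.0, "PageRate": 30.0}

print(is_overloaded(readings, load_threshold=80.0))  # summed load 85 is above 80
print(is_overloaded(readings, load_threshold=90.0))  # summed load 85 is below 90
```

A condition in the monitor tree that raises events would apply the same comparison to the summed load.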
note
The file system statistics are reported for all local disks; network shares are not taken into account.
Network Statistics Logging
The logging is done at least every 24 hours, but usually every hour if there is anything to report, and takes the following form in the platform agent log files:
INFO 2023-10-02 16:34:48,663 CES common.statistics - The agent started 0 job processors in the last 359 minutes, with at most 0 in parallel
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 1 HTTP requests in the last 359 minutes, average 0.124s, max 0.124s, min 0.124s
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 1087 HTTP requests (scheduler) in the last 359 minutes, average 0.052s, max 0.204s, min 0.030s
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 19 file reads in the last 359 minutes, total 25024 bytes
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 173947 file writes in the last 359 minutes, total 24063781 bytes
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 8 network connections in the last 359 minutes, average 0.010s, max 0.029s, min 0.001s
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 12 network name lookups in the last 359 minutes, average 0.013s, max 0.126s, min 0.000s
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 7565 network reads in the last 359 minutes, total 890417 bytes
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 2948 network writes in the last 359 minutes, total 475673 bytes
The "network connections" statistics (average, max, min) are usually well below one second. In the above, the average response is 10 ms with a worst case of 29 ms. Note that this includes both the pure network latency and the time the network takes to do data transfers. The latter factor is usually negligible, but be careful in cases where large files are sent over the network.
The "network name lookup" statistics show how the customer DNS service is performing. In the above, the spread (0.000s to 0.126s) is a little wider than that of the network connections themselves.
HTTP requests not marked as HTTP requests (scheduler) were sent to an HTTP service other than the one used for pure agent-to-server communication. Note that no HTTP request failures happened in the above log, so they are not reported. Such failures would show up like this:
INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 1 HTTP requests (failed) in the last 359 minutes, average 30.03s, max 30.03s, min 30.03s
Note that only failed HTTP requests are logged separately, not failed DNS requests.
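If you want to track these figures over time, the statistics lines can be parsed with a regular expression. The following Python sketch is based only on the sample lines shown above; the function name and the returned dictionary layout are assumptions, and other agent versions may log in a different format.

```python
import re

# Matches the "Performed N <kind> in the last M minutes, ..." lines shown above.
STAT_RE = re.compile(
    r"common\.statistics - Performed (?P<count>\d+) (?P<kind>.+?) "
    r"in the last (?P<minutes>\d+) minutes, "
    r"(?:average (?P<avg>[\d.]+)s, max (?P<max>[\d.]+)s, min (?P<min>[\d.]+)s"
    r"|total (?P<bytes>\d+) bytes)"
)


def parse_statistics(line):
    """Return a dict for a statistics line, or None for any other line."""
    m = STAT_RE.search(line)
    if m is None:
        return None
    stats = {"kind": m["kind"], "count": int(m["count"]), "minutes": int(m["minutes"])}
    if m["bytes"] is not None:
        # Volume-style line: file and network reads or writes.
        stats["total_bytes"] = int(m["bytes"])
    else:
        # Timing-style line: HTTP requests, connections, name lookups.
        stats.update(average=float(m["avg"]), max=float(m["max"]), min=float(m["min"]))
    return stats


sample = (
    "INFO 2023-10-02 16:34:48,663 CES common.statistics - Performed 12 network "
    "name lookups in the last 359 minutes, average 0.013s, max 0.126s, min 0.000s"
)
print(parse_statistics(sample))
```

Lines such as "The agent started 0 job processors" do not match the pattern and yield None.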
Check Styles and Platforms
- Eventlog - Windows only
- Logfile - UNIX, OpenVMS & Windows
- Process - UNIX & OpenVMS
- Service - Windows only
- Socket - UNIX, OpenVMS & Windows
Process Server Checks
A process server with an attached platform agent service can monitor system operation when it is of the UNIX, Microsoft Windows or OpenVMS family type.
You can add the checks on the Checks tab of the Process Server edit dialog. See the Creating Monitor Checks topic for more information.
The monitoring system has three general severity grades (green, yellow and red), and levels from -1 to 100. -1 means disabled, 0 usually means everything is as it should be, whereas 100 usually means there is a critical problem (red). Values from 50 up to and including 74 translate to yellow, which is meant to be a "warning".
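The mapping from level to grade can be pictured with the following Python sketch. Only the -1 (disabled) and 50 to 74 (yellow) ranges are stated in the text; treating 0 to 49 as green and 75 to 100 as red is an assumption made for this illustration, and the function name is hypothetical.

```python
def grade(severity: int) -> str:
    """Translate a severity level (-1..100) into a grade name."""
    if not -1 <= severity <= 100:
        raise ValueError("severity must be between -1 and 100")
    if severity == -1:
        return "disabled"
    if severity < 50:
        return "green"   # assumption: 0..49 is green
    if severity <= 74:
        return "yellow"  # stated in the text: 50..74 is yellow
    return "red"         # assumption: 75..100 is red


for level in (-1, 0, 50, 74, 75, 100):
    print(level, grade(level))
```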
When you implement a check, set levels and grades so that operators can immediately analyse the situation and react to it. You should create at least two checks for everything you want to monitor, one to match green and one to match red grades. You do this with the Severity Expression.
The fields you can add per check are:
Field | Description |
---|---|
Name | Name for the check. |
Description | A description for the check. |
Documentation | A comment for the check. |
Enabled | When ticked, the check is enabled. |
Style | The type of check. |
Object Name | The first attribute of the check (compulsory). |
Attribute 2 | The second attribute (compulsory for Logfile and EventLog). |
Poll interval | The interval at which to check. |
Severity | The severity of the condition expression. |
Condition Expression | An expression that describes a state, for example =Count > 0 . |
Delay Amount | Number of Delay Units to wait before firing the ad hoc alert or submitting the Reaction Process Type process. |
Delay Units | The delay units. |
Ad Hoc Alert Source | Ad hoc alert source to fire. |
Process Definition | Process definition to submit. |
Address | Address to be used for the ad hoc alert source or parameter. |
Message | Message to be used for the ad hoc alert source or parameter. |
Data | Data to be used for the ad hoc alert source or parameter. |
Example
You want to make sure that the Oracle database is running.
Name | Value |
---|---|
Description | Check Oracle running. |
Documentation | Check that Oracle is running. |
Style | Process |
Object Name | *ora*_orcl |
Attribute 2 | |
Poll interval | 3 |
Severity | 0 |
Condition Expression | =Count > 10 |
Add another check, so that the severity is set to high when fewer than 2 processes are running for Oracle.
Name | Value |
---|---|
Description | Check Oracle Not running. |
Documentation | Check that Oracle is not running. |
Style | Process |
Object Name | *ora*_orcl |
Attribute 2 | |
Poll interval | 3 |
Severity | 75 |
Condition Expression | =Count < 2 |
Check if the Oracle Listener is working:
Name | Value |
---|---|
Description | Check Oracle Listener is running. |
Documentation | Check that Oracle Listener is running. |
Style | Socket |
Service | 1521 |
Poll interval | 5 |
Severity | 75 |
Condition Expression | |
The Name is used as an identifier to distinguish checks of the same process server in the log files. It also determines the path of the check in the monitor tree. Depending on the Style the path will be:
System.ProcessServer.${PSName}.Check.${CheckName}.Count
System.ProcessServer.${PSName}.Check.${CheckName}.Message
The Style can be selected from the drop-down box, and is one of Process, Socket, Logfile, Service, Eventlog.
Object Name is always required; what it determines depends on the style.
- For the Process (UNIX, OpenVMS) and the Service (Microsoft Windows) styles it contains a pattern using GLOB matching that selects the name of the objects. Matching objects are counted. For OpenVMS the matching record is the process name. For UNIX the matching record is a line of the output of ps -ef or its equivalent. For Microsoft Windows services the matching record is Displayname (Servicename), which means that you can check on both names of the service, if desired.
- For Logfile it contains the filename of the logfile that is to be checked.
- For Eventlog (Microsoft Windows) it contains the name of the log. Typical values are System and Application, but other Microsoft Windows logs are allowed.
- For Socket it contains the service port to be checked. You can specify a port number in decimal or a reference that will be resolved by the agent on the target system.
Attribute 2 is only used for some styles.
- It is not used for the Process and Service styles.
- For the Logfile and Eventlog styles this contains a pattern using GLOB matching that selects records. The Logfile records are the lines in the file. The Microsoft Windows Eventlog records are the complete message expanded using the locale defined for the agent.
- For the Socket style this contains the network address that the socket should be bound to. The default is 0.0.0.0 (all IP addresses of the server).
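To verify by hand what a Socket check is expected to find, you can test whether something is listening on the port. The Python sketch below is one way to do that from the monitored host; it is not how the agent performs the check, and the function name, the timeout and the use of 127.0.0.1 as the address to connect to are assumptions.

```python
import socket


def port_is_listening(port: int, address: str = "127.0.0.1", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to address:port can be established."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or address unreachable.
        return False


# 1521 is the port used in the Oracle Listener example above.
print(port_is_listening(1521))
```

A service bound to 0.0.0.0 accepts connections on every address of the server, so connecting to the loopback address is sufficient in that case; for a service bound to a specific address, pass that address instead.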
note
GLOB matching means that you can use * to search for any number of characters and ? for a single character, just as you do on the Microsoft Windows command prompt or in MS-DOS, for example. Use * at the beginning and end of the pattern if you want your pattern to match a particular string somewhere in the record instead of the whole record.
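The effect of the leading * can be tried out with Python's fnmatch module, which implements comparable * and ? wildcards (it also treats [ and ] specially, which may differ from the agent's matcher). The sample ps -ef lines and the helper name below are made up for this illustration.

```python
from fnmatch import fnmatchcase


def count_matches(records, pattern):
    """Count the records that match the pattern as a whole."""
    return sum(1 for record in records if fnmatchcase(record, pattern))


# Invented sample records in the style of ps -ef output.
ps_lines = [
    "oracle    2101     1  0 08:00 ?        00:00:01 ora_pmon_orcl",
    "oracle    2105     1  0 08:00 ?        00:00:03 ora_dbwr_orcl",
    "root       812     1  0 07:58 ?        00:00:00 /usr/sbin/sshd",
]

print(count_matches(ps_lines, "*ora_*_orcl"))  # 2: the leading * absorbs the other ps columns
print(count_matches(ps_lines, "ora_*_orcl"))   # 0: the pattern must match the whole record
```

The resulting count is the kind of value that a Condition Expression such as =Count < 2 is evaluated against.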
The Poll Interval is an upper bound on the time between two runs of the check. It is not a pure interval because the agent can often evaluate multiple checks of the same style in a single pass over whatever it checks. In such cases the check may be performed more often than set here.
The Severity and Condition Expression are used to create a default condition in the monitor tree. Normally, a condition named Default will be created on the monitor check that is created as a result of the process server check. This condition will set severity 50 (Yellow) and Condition Expression = Count < 1 unless you set other values in the process server check. You should not edit the Default condition as the values in there will then be overwritten with those from the process server check.
If you want to use more complicated conditions than the simple single condition allowed by the Severity and Condition Expression fields, you can do so by adding your own Conditions on the MonitorCheck with a name other than Default. As soon as you create such a condition the Default condition will not be updated or recreated.
Examples of valid ProcessServerChecks:
OS Family | Style | Object Name | Attribute 2 | Explanation |
---|---|---|---|---|
UNIX | Process | *ora_dbwr_* | | Process matching on UNIX is on the output of ps -ef, so wildcards are needed |
VMS | Process | NETACP | | Process matching on VMS is purely on the process name, so no wildcards needed |
UNIX | Logfile | /var/log/system.log | dhcp: | Log messages written by the DHCP service |
UNIX | Socket | 21 | | Check that the FTP service is running |
Windows | Service | W32Time | | Check that the Windows Time Service is running (by its service name) |
Windows | Service | Windows Time | | Check that the Windows Time Service is running (by its display name) |