The Data Store monitors and records the state of many hundreds of variables. All of these are potentially useful, but only a small proportion are worth monitoring constantly, and many are not meaningful without additional context. This article describes the basic information that should be monitored, and how that information is made available.
First let's take a look at some of the ways the UnboundID platform provides monitoring information.
Accessing Monitoring Information
In the UnboundID platform, monitoring information falls into two main categories:
Monitoring of event-related information
This type of information relates to events such as server startup and shutdown, a full disk, or JVM garbage collection, rather than to performance. These events are usually critical to the operation of the server and could cause it to shut down or become degraded, which would then impact performance.
Monitoring of performance-related information
This type of information concerns how well the server is performing across the types of operations it handles. Typically you will monitor response times, throughput, and other variables that reflect how the Data Store is handling each operation.
Using Gauges, you can set up thresholds to track particular aspects of the server's performance.
The UnboundID Platform provides a very robust and flexible monitoring framework that exposes monitoring information in a number of different ways.
cn=monitor: An in-memory backend on each server instance that tracks performance and other server-related information. This backend can be queried via LDAP commands.
SNMP: Supports real-time monitoring using SNMP. The Data Store provides an embedded SNMPv3 subagent plugin that, when enabled, sets up the server as a managed device and exchanges monitoring information with a master agent using the AgentX protocol.
JMX: Supports monitoring the JVM™ through a Java Management Extensions (JMX™) management agent, which can be accessed using JConsole or any other JMX client.
Stats Logger: A built-in logger that is useful for profiling server performance for a given configuration. At a specified interval, the Stats Logger writes server statistics to a log file in comma-separated (.csv) format, which can be read by spreadsheet applications or by log-analysis tools such as Splunk.
Metrics Engine: An invaluable tool for collecting, aggregating, and exposing historical and instantaneous data from the various UnboundID servers in a deployment.
UnboundID LDAP SDK: You can use the commercial edition of the LDAP SDK to interact with the monitoring backend and embed monitoring capabilities into your own applications.
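The cn=monitor backend described above returns ordinary LDAP entries, so any LDAP client can read it (for example, an ldapsearch with base DN "cn=monitor"). As a minimal sketch of consuming such output, the following parses LDIF-style text into a dictionary; the sample entry and its attribute names (`currentConnections`, `maxConnections`) are illustrative, not actual server output:

```python
# Sketch: parse LDIF text (as returned by a search against cn=monitor)
# into {dn: {attribute: [values]}}. The sample entry below is invented
# for illustration; real cn=monitor entries vary by server version.

def parse_ldif(text):
    entries = {}
    current_dn, attrs = None, {}
    for line in text.splitlines():
        if not line.strip():
            # Blank line ends the current entry.
            if current_dn:
                entries[current_dn] = attrs
            current_dn, attrs = None, {}
            continue
        key, _, value = line.partition(": ")
        if key == "dn":
            current_dn = value
        else:
            attrs.setdefault(key, []).append(value)
    if current_dn:
        entries[current_dn] = attrs
    return entries

sample = """dn: cn=monitor
cn: monitor
currentConnections: 42
maxConnections: 128
"""

entries = parse_ldif(sample)
print(entries["cn=monitor"]["currentConnections"][0])  # prints 42
```

A real monitoring script would obtain the LDIF from a live search rather than a literal, but the post-processing pattern is the same.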
In addition to monitoring information, the UnboundID Data Store also provides delivery mechanisms for account status notifications and administrative alerts using SMTP, JMX, or SNMP, in addition to standard error logging. This article does not go into detail on these items; they are mentioned for completeness.
Alerts and events reflect state changes within the server that may be of interest to a user or monitoring service.
Notifications are typically the delivery of an alert or event to a user or monitoring service.
Account status notifications are delivered only to the account owner, notifying them of a change in the account's state.
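Returning to the Stats Logger described earlier: because its output is plain CSV, post-processing it is straightforward. The sketch below averages a throughput column; the column names ("Ops/Sec", "Avg Etime (ms)") and values are hypothetical, so check the header row of your own stats log for the actual names:

```python
import csv
import io

# Hypothetical Stats Logger output; real column names differ.
sample_csv = """Timestamp,Ops/Sec,Avg Etime (ms)
2024-01-01 12:00:00,1500,2.1
2024-01-01 12:00:10,1750,2.4
2024-01-01 12:00:20,1250,1.9
"""

def average_column(csv_text, column):
    # Average a numeric column from Stats Logger CSV text.
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in reader]
    return sum(values) / len(values)

print(average_column(sample_csv, "Ops/Sec"))  # prints 1500.0
```

In practice you would open the log file itself instead of a string, or simply load the file into a spreadsheet or analysis tool.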
Alerts and Alarms
The server has a set of pre-configured alerts for essential conditions. These are delivered in a variety of ways, as described above; the most basic is that they are written to the server logs (the error log in particular).
Note that in the out-of-the-box configuration, these are delivered only to log files.
Delivery other than to the server logs requires configuration of both the Data Store and the monitoring application. Delivery via SMTP, for example, requires configuring the SMTP server, the email addresses to deliver to, and so on. Other alert delivery mechanisms can also be created.
In addition to Alerts, the server also has Alarms. The underlying mechanism for these is a "gauge" mechanism based upon the International Telecommunication Union CCITT Recommendation X.733 (1992). Gauges monitor continuous variables, such as disk space, CPU usage, and work-queue length, and when a gauge reaches a pre-determined level an alarm is generated. A corresponding "alarm cleared" event may be generated when the value falls back below the trigger threshold. For additional information on alarms and gauges, see the "Working with Alarms, Alerts, and Gauges" section of the UnboundID Data Store admin guide.
Several of these gauges are pre-configured in the out-of-the-box server.
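The gauge-and-alarm behaviour described above can be sketched in a few lines. This is a conceptual illustration of the pattern, not the server's actual implementation; the gauge name and threshold are invented:

```python
# Sketch of an X.733-style gauge: sample a continuous value, raise an
# alarm when it crosses the threshold, and emit a "cleared" event when
# it falls back below. The cpu-usage gauge and 90% threshold are
# illustrative only.

class Gauge:
    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold
        self.alarm_active = False
        self.events = []

    def sample(self, value):
        if value >= self.threshold and not self.alarm_active:
            self.alarm_active = True
            self.events.append((self.name, "ALARM", value))
        elif value < self.threshold and self.alarm_active:
            self.alarm_active = False
            self.events.append((self.name, "CLEARED", value))

gauge = Gauge("cpu-usage", threshold=90)
for v in [50, 85, 95, 97, 70]:
    gauge.sample(v)
print(gauge.events)
# prints [('cpu-usage', 'ALARM', 95), ('cpu-usage', 'CLEARED', 70)]
```

Note that the second sample above the threshold (97) does not raise a second alarm; the alarm stays active until the value drops back, which is what keeps alarm traffic manageable.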
Virtually all critical aspects of server operation are covered by the server's own Alert/Alarm system. All that is required is to monitor for these.
The chapter entitled "Managing Notifications and Alerts" in the Administration Guide contains details on configuring delivery via SNMP, JMX, and SMTP. How the monitoring applications themselves should be configured depends on the particular system in use.
The server Alert/Alarm mechanism does require that the server and its supporting infrastructure actually be working to generate/deliver these notifications. It therefore makes sense to provide some level of external monitoring to ensure that the server and its environment are healthy.
This monitoring will typically be implemented by existing data center monitoring systems. These systems may well be the same ones configured to receive and respond to internally generated alerts and alarms; only the mechanism of monitoring would be different.
System monitoring:
Is the OS running?
Is disk-space available?
Is CPU load reasonable?
Is there network connectivity to the system?
Data Store monitoring:
Is the LDAP interface responding?
Is the HTTP(S) interface responding (if in use)?
Is replication running?
Are there any significant replication backlogs?
Are request e-times below threshold (e.g. 1 second)?
How much Java heap is being used?
How much cache is being used?
Some of these duplicate the monitoring that the Data Store itself performs, but they are fundamental enough that the duplication is worthwhile.
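The checklist above can be driven by a simple external check loop. The sketch below is a stand-in: each probe here is just a threshold comparison, whereas a real monitor would perform an actual LDAP search, HTTP request, or replication-status query. The 1-second e-time threshold matches the example in the checklist; the backlog limit is invented:

```python
# Sketch of an external health check aggregator. Probe functions return
# True when healthy; run_checks reports which probes failed. The input
# values here would come from real probes in practice.

ETIME_THRESHOLD_MS = 1000  # 1 second, as in the checklist example

def check_etime(latest_etime_ms):
    return latest_etime_ms < ETIME_THRESHOLD_MS

def check_replication_backlog(backlog, max_backlog=1000):
    # max_backlog is an illustrative limit, not a recommended value.
    return backlog <= max_backlog

def run_checks(checks):
    # Returns the names of failed checks; an empty list means healthy.
    return [name for name, ok in checks if not ok]

failed = run_checks([
    ("ldap-etime", check_etime(250)),
    ("replication-backlog", check_replication_backlog(5000)),
])
print(failed)  # prints ['replication-backlog']
```

The aggregation step matters: most data center monitoring systems want a single pass/fail per host plus a list of failing probes, which is exactly what this shape produces.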
Configuring external monitoring thresholds may need to take Data Store configuration parameters into account to be effective. For example, when monitoring disk space, three levels of alerting are configured, each with associated actions. The first level generates a low-disk warning, with no associated action. The second level generates a serious low-disk alert, with an associated action that prevents the Data Store from accepting any further requests except from a directory administrator account. If disk space continues to decrease, a third-level trigger generates an alert indicating that disk space is critical, and shuts down the Data Store to prevent damage to the database.
There is little point in having the external monitoring system generate an alert only after the second or third stage of internally generated alerts has passed, so coordinating the disk-space levels at which these trigger is almost essential.
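One way to keep the two systems coordinated is to derive the external threshold from the server's configured levels rather than hard-coding it. The sketch below assumes hypothetical percentage-free thresholds; the actual values must be read from your own Data Store configuration:

```python
# Sketch: derive the external monitor's disk-space alert threshold from
# the Data Store's own three alert levels, so the external alert always
# fires before the server's first (warning) level. All percentages are
# illustrative, not defaults.

SERVER_THRESHOLDS = {"warning": 15, "serious": 10, "critical": 5}  # % free

def external_alert_threshold(server_thresholds, margin=2):
    # Alert when free space drops to `margin` points above the server's
    # warning level, giving operators time to act before the server does.
    return server_thresholds["warning"] + margin

print(external_alert_threshold(SERVER_THRESHOLDS))  # prints 17
```

With this arrangement, changing the server's warning level automatically moves the external alert with it, so the two can never drift out of order.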
In addition to basic monitoring, running the Metrics Engine can provide valuable data to an operations team in at least a couple of contexts:
Predictive Analysis (capacity planning)
The Metrics Engine stores monitored parameter values over a long period (up to 20 years) and can display this data graphically very easily. Without much analysis, it is quickly evident whether various parameters are increasing over time, and easy to project at what point in the future they are liable to become problems if nothing changes or additional resources are not allocated. This sort of data-driven analysis is often invaluable in convincing management of the need to allocate more resources.
Root Cause Analysis
When problems occur, especially those impacting performance or availability in critical systems, understanding the root cause is essential to taking corrective action that ensures the problem doesn't happen again. However, root cause determination often takes a back seat to restoring service, and the act of restarting servers often destroys the information needed for a full RCA.
Data accumulated by the Metrics Engine can often fill in those gaps.
A related case in which the Metrics Engine is invaluable: an application owner complains of slow response, but when you look, things are totally normal; the application owner then concedes that this is true now, but it wasn't half an hour ago.
Rather than trawling back through potentially gigabytes of logs, only to see the effects in the logs while being no closer to understanding why, the Metrics Engine graphs will clearly show any spike in response (etime) times, and the cursor functionality (placing a cursor on one graph places a cursor at the corresponding point on all graphs) quickly lets you see what was happening internally: perhaps the filestore was slow, perhaps there was a burst of heavy traffic from a bulk upload, and so on.
Questions that were previously very hard and time-consuming to answer become much easier to tackle with the Metrics Engine in place.