8.3. Alerting rules
An alert is created when an alert script calls the function nw2functions.alert(). To put this into context, here is how all parts of this setup look when deployed to NetSpyGlass SaaS.
First, we create a simple script cpu_load_alert.py and place it in the directory /opt/netspyglass/home/scripts/alerts/ on the management server. All alert scripts go into this directory and that is where the server looks for them. These scripts, together with __init__.py, form the module alerts that is built dynamically.
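With the example script in place, the contents of this directory might look like this:

    /opt/netspyglass/home/scripts/alerts/
    ├── __init__.py
    └── cpu_load_alert.py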
Here is the script cpu_load_alert.py:
from nw2functions import *

def alert_busy_cpu(log):
    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )
See Data Processing Rules for more information on rule processing scripts and how to configure them. Alert scripts can use functions from the same module nw2functions.
The alert is activated immediately after you upload it to the server. There is usually a small delay because the server scans the filesystem every 10 seconds or so, but you do not need to restart anything or issue any command to force it to load your new script.
The function alert() takes its input data from the monitoring variable referred to by the parameter input. In this example this is cpuUtil and, just like everywhere else in rule processing scripts, the variable must be "imported" by calling a simple NsgQL query: nw2functions.query("FROM cpuUtil") (see Data Processing Rules and NetSpyGlass Server Query Language). In this example the query is quite simple and gets all instances of the variable cpuUtil; in other words, we want the alert to analyze CPU utilization of all cpu-like components of all devices. If necessary, you can apply matching criteria using WHERE (see NetSpyGlass Server Query Language).
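For example, the input argument of the alert() call could be narrowed with a sketch like the one below; the WHERE clause and the device name pattern are illustrative, and the exact syntax is described in NetSpyGlass Server Query Language:

    # Hypothetical example: analyze CPU utilization only on devices whose
    # name matches a pattern; adjust the WHERE clause to your environment.
    input=query("FROM cpuUtil WHERE device LIKE 'core-%'"),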
See nw2functions.alert() for the description of the parameters of the function that implements the alert.
NetSpyGlass calls the function nw2functions.alert() every time it processes collected monitoring data, which happens periodically with the interval defined by the configuration parameter monitor.pollingIntervalSec. Every time the script runs and calls the function alert(), this function is given the latest monitoring variable instances. On the very first call, the function alert() creates alert objects corresponding to instances of the input monitoring variable and assigns their state depending on the result returned by the condition function. Existing alert objects are updated on subsequent calls to alert() and their state always reflects the result of the latest call to the condition function.
Let's look at our example again:
alert(
    name='busyCpu',
    input=query('FROM cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
    duration=300,
    percent_duration=50,
    notification_time=300,
    streams=['log', 'slack_happygears'],
    fan_out=True
)
The alert created by this rule has the name busyCpu and becomes active if the value of the input variable cpuUtil is above 75. The alert has a description, which can appear in notification messages, as well as an additional parameter details that can also be passed to the notification stream.
Important
An alert name cannot contain spaces or special characters. The name can contain letters (upper and lowercase), numbers and underscores, and must start with a letter. If you want to pass additional information, e.g. priority, use the field description, which does not have these restrictions.
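For illustration, the naming rule above corresponds to the following pattern; this regular expression is our paraphrase of the rule, not product code:

    import re

    # Letters, digits and underscores only, starting with a letter.
    ALERT_NAME_RE = re.compile(r'^[A-Za-z][A-Za-z0-9_]*$')

    assert ALERT_NAME_RE.match('busyCpu')        # valid
    assert not ALERT_NAME_RE.match('busy cpu')   # space is not allowed
    assert not ALERT_NAME_RE.match('1busyCpu')   # must start with a letter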
You can access alert objects using NsgQL (table alerts, see NetSpyGlass Server Query Language).
In addition to the alert object, the function alert() also creates a new monitoring variable with the name matching the name of the alert (in this case, busyCpu). This variable appears in Graphing Workbench under the category Alerts and is stored in TSDB. The value of the observations in this variable is an opaque large number when the alert is active and zero when it is cleared. This variable has a set of tags that is a copy of that of the input variable, merged with tags supplied via the argument tags in the call to the function alert().
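A sketch of how extra tags might be attached this way; the tag string 'Priority.High' is hypothetical, and the exact format of the tags argument is described in nw2functions.alert():

    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        # hypothetical tag; tags copied from the input variable are
        # merged with the ones supplied here
        tags=['Priority.High'],
        streams=['log'],
        fan_out=True
    )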
There are many uses for this variable; here are just a few ideas:
- The alert monitoring variable provides a historical view of the alert: when it was active and when it was cleared. You can graph the alert monitoring variable in combination with other monitoring variables for visual comparison and to search for correlations.
- You can configure NetSpyGlass to show a variable like this in maps and configure color thresholds for it. This way, you can select it in the map legend and immediately see devices with active alerts colored red.
- This variable can be used to pass information about alerts to Nagios (see Using alerts with Nagios). Nagios works by polling NetSpyGlass using the provided plugin script. Nagios can be configured to poll NetSpyGlass to read any monitoring variable and then use matching rules configured in Nagios to trigger alerts. An alternative setup is to implement condition matching in NetSpyGlass using calls to alert() and just let Nagios poll alert monitoring variables and check if the value is greater than zero.
8.3.1. Alert state
Depending on the value returned by the condition function, the created alert can be in one of the following states:
- cleared - the alert is in this state when the Python function referred to by the parameter condition returns False. This means the input monitoring variable instance did not satisfy the condition. Note that the alert is created anyway and will appear in the output of the JSON API call json_api_list_alerts.
- active - the alert is in this state when the function referred to by the parameter condition returns True.
Active alerts send notifications to the configured notification streams; in our example these are log and slack_happygears. This means that whenever the alert enters the state active, a log entry is made in the log file /opt/netspyglass/var/logs/alerts.log and a notification is posted to the configured Slack channel. Streams such as PagerDuty work similarly: a trigger event is sent using their web API. See more on notifications below: Alert Notifications.
Cleared alerts, on the other hand, can resolve corresponding open incidents in PagerDuty and ServiceNow. This is not the default behavior and, if required, must be enabled per alert. See more on notifications below: Alert Notifications.
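The condition does not have to be a lambda; a named function with the same (mvar, value) signature works too. A minimal sketch, assuming only the parameters shown in the examples above:

    # Returning True puts the alert in the 'active' state,
    # returning False puts it in the 'cleared' state.
    def cpu_over_threshold(mvar, value):
        return value > 75

    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=cpu_over_threshold,
        streams=['log'],
        fan_out=True
    )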
8.3.2. Fan Out
The parameter fan_out allows the administrator to create an alert that "fans out", that is, the system creates a separate alert object for each instance of the input variable. Since separate instances of monitoring variables describe metrics collected for a single component of a device, this way we create a separate alert object for each component of each device. Notification messages can include the macros $alert.device_name and $alert.component_name to expose the device and component name in the log, email or other notification. See more on macros in nw2functions.alert().
If the parameter fan_out has the value False, the system creates one alert object regardless of the number of MonitoringVariable instances in the input variable. Every time the script calls the function nw2functions.alert(), the alert object is updated with a list of devices and components that matched the condition. This information is placed in the field details of the alert object, can appear in notification messages (macro $alert.details) and can be accessed via the JSON API.
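Here is a minimal sketch of such a call; the alert name busyCpuSummary is illustrative:

    alert(
        name='busyCpuSummary',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='At least one device has CPU utilization over 75%',
        streams=['log'],
        # one alert object for all instances; matching devices and
        # components are listed in the 'details' field of the alert
        fan_out=False
    )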
8.3.3. Conditions with timing
As mentioned previously, the condition function is called on every monitoring cycle. However, the logic that switches the alert state to active can take timing into account in addition to the value of the input variable. In the simplest case you can call nw2functions.alert() as follows:
alert(
    name='busyCpu',
    input=query('FROM cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
    notification_time=300,
    streams=['log', 'slack_happygears'],
    fan_out=True
)
Here we have omitted the parameters duration and percent_duration, which means we instructed the function alert() to analyze only the latest value in the time series of the input variable. The condition function will be called only once, with the value of the latest observation in the time series, and the alert will become active if this function returns True. In this example, this means the alert activates whenever the CPU load of any device exceeds 75%. This alert does not take into account how long the CPU load was that high and will trigger even in the case of a short spike that was measured on just one monitoring cycle.
It is often more useful to build an alert that skips short spikes like that but activates when the value of the input variable goes over the threshold several times during a specified interval of time. For example, we may want to trigger the alert if the CPU load is over 75% for more than half of the measurement cycles in the last five. Suppose the CPU utilization measured on monitoring cycles 1, 4 and 5 was over the threshold, but the value measured on cycles 2 and 3 was below it. We can write an alert that would activate in this situation using the parameters duration and percent_duration, which are expressed in seconds (assuming the monitoring interval is 1 min in this example):
alert(
    name='busyCpu',
    input=query('FROM cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
    duration=300,
    percent_duration=50,
    notification_time=300,
    streams=['log', 'slack_happygears'],
    fan_out=True
)
The parameter duration tells the system how many monitoring cycles to use to analyze the value of the variable. At a 1 min interval, 300 sec is 5 cycles, so the function alert() is going to call the condition function five times, feeding it the values of the five latest observations one after another. The parameter percent_duration tells the system how many of the observations must match the criteria (i.e. the condition function must return True) for the alert to activate. In this case the value is 50, which means over 50% of the observations must match the condition for the alert to activate. In our example 3 out of 5 observations match, therefore the alert is going to be activated.
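The arithmetic for this example, written out (assuming a 60-second polling interval):

    polling_interval = 60                         # seconds (monitor.pollingIntervalSec)
    duration = 300                                # seconds
    cycles = duration // polling_interval         # 5 observations are analyzed
    matching = 3                                  # cycles 1, 4 and 5 were over 75%
    percent_matching = 100.0 * matching / cycles  # 60.0
    alert_active = percent_matching > 50          # True: over percent_duration=50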
Note that the order in which observations match the condition does not matter; only the percentage of the number of matching observations matters. This means a call to alert() with the parameters duration and percent_duration configured this way can identify "flapping" variables, that is, those that quickly change their value between below and above the threshold. Setting percent_duration to a value greater than the equivalent of one polling cycle will make the alert skip single spikes in the input variable but activate when it starts flapping.
You can differentiate between the situation when the variable is "flapping" and the situation when it stays over the threshold by setting up two alerts, one with percent_duration equal to 100 (all observations within the duration interval are over the threshold) and the other with percent_duration equal to 50 to catch the variable in the "flapping" state.
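A sketch of this pair of alerts; the names cpuPegged and cpuFlapping are illustrative:

    # Fires only when every observation in the 5-minute window is over 75%.
    alert(
        name='cpuPegged',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        duration=300,
        percent_duration=100,
        streams=['log'],
        fan_out=True
    )

    # Fires when more than half of the observations are over 75%,
    # which also catches a "flapping" variable.
    alert(
        name='cpuFlapping',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        duration=300,
        percent_duration=50,
        streams=['log'],
        fan_out=True
    )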
Matching only the latest value in the time series is easy: just call alert() with the parameter duration equal to 0 (the default) or to the polling interval in seconds. If duration is 0 or equal to the polling interval, the parameter percent_duration is ignored because we are analyzing just one observation. This is the same as calling alert() with the parameters duration and percent_duration omitted.