10.2. Alerting rules

An alert is created when the rule processing script calls the function nw2functions.alert(). To put this into context, here is what a call to nw2functions.alert() may look like (see Data Processing Rules for more information on the rule processing script and how to configure it):

import nw2rules
from nw2functions import *


class UserRules(nw2rules.Nw2Rules):
    def __init__(self, log):
        super(UserRules, self).__init__(log)

    def execute(self):
        super(UserRules, self).execute()

        alert(
            name='cpu_load_high',
            input=import_var('cpuUtil'),
            condition=lambda mvar, value: value > 75,
            description='CPU utilization is over 75% for 50% of time for the last 5 min',
            details={},
            duration=300,
            percent_duration=50,
            notification_time=300,    # send notification once in 5 min
            streams=['pagerduty', 'log'],
            fan_out=True
        )

This function call takes its input data from the monitoring variable referred to by the parameter input. In this example this is cpuUtil and, just like everywhere else in the rule processing script, the variable must be "imported" by calling nw2functions.import_var() (see Data Processing Rules). In this example we use the input monitoring variable without any transformations, but if necessary, you can perform some calculations and feed a temporary monitoring variable to the call to nw2functions.alert() (see bgpSessionDown: alert with dependencies).

Function nw2functions.alert() applies the function referred to by the parameter condition to every net.happygears.nw2.py.MonitoringVariable object it takes from the input and creates a separate net.happygears.nw2.alerts.Alert object for each. The condition function is given two arguments: a net.happygears.nw2.py.MonitoringVariable object and the value to be analyzed (a number). The function in our example ignores the first argument and returns True if the value (second argument) is greater than 75. The reference to the net.happygears.nw2.py.MonitoringVariable object can be useful if you want to compare the value (second argument) to the mean or average value of all observations in the time series of the monitoring variable instance (first argument), or to make the comparison conditional on information about the device, component or tags (see the sketch below). See more on the alerting rules below: Alerting rules.
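
For example, instead of a lambda you can pass a named function and use the first argument to vary the threshold based on the variable's tags. The sketch below is only an illustration: it assumes the MonitoringVariable object exposes its tags as the attribute tags and uses a hypothetical tag Role.Core; consult the class documentation for the actual API.

# Sketch of a condition function that uses the MonitoringVariable object
# (first argument). The 'tags' attribute and the 'Role.Core' tag are
# assumptions made for this illustration.
def cpu_condition(mvar, value):
    threshold = 90 if 'Role.Core' in getattr(mvar, 'tags', []) else 75
    return value > threshold

alert(
    name='cpu_load_high',
    input=import_var('cpuUtil'),
    condition=cpu_condition,
    description='CPU utilization is over the per-role threshold',
    streams=['log'],
    fan_out=True
)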

NetSpyGlass calls function nw2functions.alert() every time it processes collected monitoring data, which happens periodically with the interval defined by the configuration parameter monitor.pollingIntervalSec. Every time the script runs and calls function alert(), this function is given the latest monitoring variable instances. On the very first call, function alert() creates alert objects corresponding to instances of the input monitoring variable and assigns their state depending on the result returned by the condition function. Existing alert objects are updated on subsequent calls to alert(), and their state always reflects the result of the latest call to the condition function.

Let's look at our example again:

alert(
    name='cpu_load_high',
    input=import_var('cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization is over 75% for 50% of time for the last 5 min',
    details={},
    duration=300,
    percent_duration=50,
    notification_time=300,    # send notification once in 5 min
    streams=['pagerduty', 'log'],
    fan_out=True
)

The alert created by this rule has the name cpu_load_high and becomes active if the value of the input variable cpuUtil is above 75. The alert has a description, which can appear in notification messages, as well as the additional parameter details that can also be passed to the notification stream.

You can access alert objects using the JSON API call GET /v2/alerts/net/:netid/alerts[?active=true|false][rule_spec]
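
For example, the following sketch fetches active alerts using this call with the Python requests library. The host name, port and network id are placeholders; adjust them for your installation and add authentication if your server requires it.

import requests

# Placeholders: replace the host, port and network id (':netid' in the URL
# template above) with values appropriate for your installation.
url = 'http://nsg-server:9100/v2/alerts/net/1/alerts'
response = requests.get(url, params={'active': 'true'})
response.raise_for_status()
# Print the raw JSON document; its exact structure is not reproduced here.
print(response.json())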

In addition to the alert object, function alert() also creates a new monitoring variable with a name matching the name of the alert (in this case, cpu_load_high). This variable appears in Graphing Workbench under the category Alerts and is stored in TSDB. The value of the observations in this variable is an opaque large number when the alert is active and zero when it is cleared. This variable has a set of tags that is a copy of that of the input variable, merged with tags supplied via the argument tags in the call to the function alert().

There are many uses for this variable; here are just a few ideas:

  1. The alert monitoring variable provides a historical view of the alert: when it was active and when it was cleared. You can graph the alert monitoring variable in combination with other monitoring variables for visual comparison and to search for correlations.
  2. You can configure NetSpyGlass to show a variable like this in maps and configure color thresholds for it. This way, you can select it in the map legend and immediately see devices with active alerts colored red.
  3. This variable can be used to pass information about alerts to Nagios (see Using alerts with Nagios). Nagios works by polling NetSpyGlass using the provided plugin script. Nagios can be configured to poll NetSpyGlass to read any monitoring variable and then use matching rules configured in Nagios to trigger alerts. An alternative setup is to implement condition matching in NetSpyGlass using calls to alert() and let Nagios simply poll alert monitoring variables and check if the value is greater than zero.

10.2.1. Alert state

Depending on the value returned by the condition function, the created alert can be in one of the following states:

  1. cleared - the alert is in this state when the Python function referred to by the parameter condition returns False. This means the input monitoring variable instance did not satisfy the condition. Note that the alert is created anyway and will appear in the output of the JSON API call GET /v2/alerts/net/:netid/alerts[?active=true|false][rule_spec]
  2. active - the alert is in this state when the function referred to by the parameter condition returns True.

Active alerts send notifications to the configured notification streams; in our example these are PagerDuty and log. This means that whenever the alert enters the state active, a log entry is made in the log file /opt/netspyglass/home/logs/alerts.log and a trigger event is sent to PagerDuty using their web API. See more on notifications below: Alert Notifications.

10.2.2. Fan Out

The parameter fan_out allows the administrator to create an alert that "fans out", that is, the system creates a separate alert object for each instance of the input variable. Since separate instances of monitoring variables describe metrics collected for a single component of a device, this creates a separate alert object for each component of each device. Notification messages can include the macros $alert.device_name and $alert.component_name to expose the device and component name in the log, email or other notification. See more on macros in nw2functions.alert()

If the parameter fan_out has the value False, the system creates one alert object regardless of the number of MonitoringVariable instances in the input variable. Every time the script calls function nw2functions.alert(), the alert object is updated with the list of devices and components that matched the condition. This information is placed in the field details of the alert object, can appear in notification messages (macro $alert.details) and can be accessed via the JSON API.
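
For example, the following variation of the rule above (the alert name is chosen just for this illustration) creates a single alert object that aggregates all devices with high CPU load:

alert(
    name='cpu_load_high_anywhere',
    input=import_var('cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization is over 75% on at least one device',
    duration=300,
    percent_duration=50,
    notification_time=300,
    streams=['log'],
    fan_out=False    # one alert object; matching devices go into its 'details' field
)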

10.2.3. Conditions with timing

As mentioned previously, the condition function is called on every monitoring cycle. However, the logic that switches the alert state to active can take timing into account in addition to the value of the input variable. In the simplest case you can call nw2functions.alert() as follows:

alert(
    name='cpu_load_high',
    input=import_var('cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization is over 75%',
    notification_time=300,    # send notification once in 5 min
    streams=['pagerduty', 'log'],
    fan_out=True
)

Here we have omitted the parameters duration and percent_duration, which means we instructed function alert() to analyze only the latest value in the time series of the input variable. The condition function will be called only once, with the value of the latest observation in the time series, and the alert will become active if this function returns True. In this example, this means the alert activates whenever the CPU load of any device exceeds 75%. This alert does not take into account how long the CPU load was that high and will trigger even in the case of a short spike measured on just one monitoring cycle.

It is often more useful to build an alert that skips short spikes like that but activates when the value of the input variable goes over the threshold several times during a specified interval of time. For example, we may want to trigger an alert if CPU load is over 75% for 4 measurement cycles out of 5. Here is how this might look:

[Figure: CPU utilization over five monitoring cycles; the values measured on cycles 1, 4 and 5 are above the threshold, while cycles 2 and 3 are below it]

In this example, CPU utilization measured on monitoring cycles 1, 4 and 5 was over the threshold, but the value measured on cycles 2 and 3 was below the threshold. We can write an alert that activates in this situation using the parameter duration (expressed in seconds) together with percent_duration (expressed as a percentage), assuming the monitoring interval is 1 min in this example:

alert(
    name='cpu_load_high',
    input=import_var('cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization is over 75% for 50% of time for the last 5 min',
    details={},
    duration=300,
    percent_duration=50,
    notification_time=300,    # send notification once in 5 min
    streams=['pagerduty', 'log'],
    fan_out=True
)

The parameter duration tells the system how many monitoring cycles to use to analyze the value of the variable. At a 1 min interval, 300 sec is 5 cycles, so the function alert() is going to call the condition function five times, feeding it the values of the five latest observations one after another. The parameter percent_duration tells the system how many of the observations must match the criteria (i.e. the condition function must return True) for the alert to activate. In this case the value is 50, which means over 50% of the observations must match the condition for the alert to activate. In our example 3 observations out of 5 match (60%), therefore the alert is going to be activated.
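
The following plain Python sketch illustrates this logic. It is only an illustration of the semantics described above, not the actual NetSpyGlass implementation:

# Illustration of the duration/percent_duration semantics, not the actual
# NetSpyGlass code. 'observations' holds the values collected during the
# 'duration' window, oldest first.
def should_activate(observations, condition, percent_duration):
    matches = sum(1 for value in observations if condition(None, value))
    return matches * 100.0 / len(observations) > percent_duration

# Values from the example above: cycles 1, 4 and 5 are over the threshold.
observations = [80, 70, 72, 78, 81]
print(should_activate(observations, lambda mvar, v: v > 75, 50))   # True: 3 of 5 = 60%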

Note that the order in which observations match the condition does not matter, only the percentage of the number of matching observations matters. This means a call to alert() with the parameters duration and percent_duration configured this way can identify "flapping" variables, that is, those that quickly change their value between below and above the threshold. Setting percent_duration to a value greater than the percentage that a single polling cycle contributes to the duration window (20% in the five-cycle example above) makes the alert skip single spikes in the input variable but activate when it starts flapping.

You can differentiate between the situation when a variable is "flapping" and the situation when it stays above the threshold by setting up two alerts: one with percent_duration equal to 100 (all observations within the duration interval are over the threshold) and the other with percent_duration equal to 50 to catch the variable in the "flapping" state.
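
Here is what this pair of alerts might look like (the alert names are chosen for this illustration):

# Activates only when CPU load stays above the threshold for the whole interval
alert(
    name='cpu_load_sustained',
    input=import_var('cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization is over 75% for 100% of time for the last 5 min',
    duration=300,
    percent_duration=100,
    streams=['log'],
    fan_out=True
)

# Activates when CPU load is over the threshold more than half of the time
alert(
    name='cpu_load_flapping',
    input=import_var('cpuUtil'),
    condition=lambda mvar, value: value > 75,
    description='CPU utilization is over 75% for 50% of time for the last 5 min',
    duration=300,
    percent_duration=50,
    streams=['log'],
    fan_out=True
)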

Matching only the latest value in the time series is easy: just call alert() with the parameter duration equal to 0 (the default) or to the polling interval in seconds. If duration is 0 or equal to the polling interval, the parameter percent_duration is ignored because we are analyzing just one observation. This is the same as calling alert() with the parameters duration and percent_duration omitted.