10.8. Examples

10.8.1. busyCpuAlert

A simple alert that activates when cpuUtil is over the 75% threshold for at least 50% of a 10 min interval:

alert(
    name='busyCpuAlert',
    input=import_var('cpuUtil'),
    condition=lambda _, value: value > 75,
    description='CPU utilization is over 75% for 50% of time for the last 10 min',
    details={'slack_channel': '#netspyglass'},
    duration=600,
    percent_duration=50,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)

This alert catches episodes when the CPU load of a device goes over 75% but ignores short spikes when the high load does not last long. Parameter duration makes the alert analyse the value of the variable over the past 10 min, and parameter percent_duration says the value should be over the threshold at least 50% of that time. In other words, the alert activates only if at least half of the samples collected during the 10 min interval are above the threshold. The alert “fans out”, that is, it generates a separate notification message for each device that matches its condition. Notifications are sent to two outgoing streams, ‘log’ and ‘slack’, no more often than every 5 min (parameter notification_time has a value of 300 sec).
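The interplay of duration and percent_duration can be pictured with a small stand-alone sketch. It does not use the NetSpyGlass API: the helper name is_active, the one-minute polling interval and the sample values are all assumptions made for illustration, and the server's own evaluation may differ in details.

```python
def is_active(samples, condition, percent_duration):
    """Return True if `condition` holds for at least `percent_duration`
    percent of `samples` (hypothetical helper, not part of nw2functions)."""
    if not samples:
        return False
    matching = sum(1 for value in samples if condition(None, value))
    return 100.0 * matching / len(samples) >= percent_duration


over_75 = lambda _, value: value > 75

# Ten samples covering a 10 min window (duration=600), assuming 1 min polling.
short_spike = [40, 42, 95, 41, 39, 40, 43, 38, 41, 40]
sustained_load = [80, 90, 60, 85, 88, 70, 92, 65, 81, 50]

print(is_active(short_spike, over_75, 50))     # False: 1 of 10 samples match
print(is_active(sustained_load, over_75, 50))  # True: 6 of 10 samples match
```

With percent_duration=100 the second series would not activate the alert either, because four of its samples are at or below the threshold.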

10.8.2. deviceDown

The next alert watches packet loss to devices measured with ping and activates when the loss goes over the threshold (this is a simple way to alert on a “device down” condition):

alert(
    name='deviceDown',
    input=import_var('icmpLoss'),
    condition=lambda _, value: value > 75,
    description='Packet loss to the device measured with ping is over 75% for the last 5 min',
    details={'slack_channel': '#devices_down'},
    duration=300,
    percent_duration=100,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)

The alert activates when the value of the variable icmpLoss stays over 75% for 5 min and sends notifications to the streams log and slack. Note that the Slack channel is passed via the alert field details; this way you can override the default channel configured in the alerts.streams.slack configuration parameter.

10.8.3. bigChangeInVariables

The following alerting rule watches the monitoring variable numVars and activates an alert when it notices a big change in its value:

def compare_to_mean(mvar, value):
    """
    compare `value` to the mean value of the variable and return True if the difference is over 30% of the mean

    :param mvar:    MonitoringVariable instance
    :param value:   value
    :return:        True if abs(mean - value) > 0.3*mean
    """
    assert isinstance(mvar, MonitoringVariable)
    mean = mvar.statistics.mean()
    return abs(value - mean) > 0.3*mean


class UserRules(nw2rules.Nw2Rules):
    def execute(self):

        alert(
            name='bigChangeInVariables',
            input=import_var('numVars'),
            condition=lambda mvar, x: compare_to_mean(mvar, x),
            description='Big change in the number of monitoring variables; last value=$alert.value',
            duration=300,
            percent_duration=100,
            notification_time=300,
            streams=['pagerduty', 'log'],
            fan_out=True
        )

This alert is well suited to catching big changes in the value of a variable that last a long time.

To detect a change in the value of this variable, the rule calls function compare_to_mean(), which calculates the mean value of the variable, compares the provided value with it and returns True if the value deviates from the mean by more than 30%. The alert activates when the value of variable numVars changes by over 30% and stays that way for 5 minutes (because parameter duration has a value of 300 sec and percent_duration is 100). It ignores changes that are smaller or shorter than that. The alert sends notifications to two streams, log and pagerduty, which should be configured in the section alerts.streams of the main configuration file nw2.conf.

Because this alert compares the current value of the variable with its mean, it stays active for as long as the new value satisfies the condition. It can clear in two cases: either the value returns to the mean or, if the value does not change much around its new level, the new level becomes the new mean, and that also clears the alert.
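The second way of clearing, where the mean catches up with the new level, can be seen in a short simulation. This is only a sketch: the window of 5 samples, the series of values and the assumption that the mean includes the current sample are invented for illustration and do not describe how mvar.statistics is actually computed.

```python
from collections import deque


def deviates_from_mean(history, value, fraction=0.3):
    """True if `value` differs from the mean of `history` by more than
    `fraction` of that mean (the same test compare_to_mean() performs)."""
    mean = sum(history) / float(len(history))
    return abs(value - mean) > fraction * mean


# Hypothetical series: numVars sits at 1000, then jumps to 2000 and stays there.
series = [1000] * 5 + [2000] * 10
window = deque(maxlen=5)   # assumed statistics window of 5 samples
flags = []
for value in series:
    window.append(value)
    flags.append(deviates_from_mean(window, value))

print(flags)
```

The condition is true only for the first two samples after the jump; by the third sample the mean has risen to 1600 and the value 2000 is again within 30% of it.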

Note how this alert uses macro $alert.value in its description. See nw2functions.alert() for the list of supported macros.

10.8.4. lagPartiallyDegraded

Next example is an alert that watches combined bandwidth of LAG interface groups. Monitoring variable portAggregatorBandwidth computed by the default rule processing script has value of 100% if all configured LAG members are online and are passing traffic, that is, LACP bits collecting and distributing are set. The value drops below 100% if a member is down or misconfigured, in which case one or both bits become cleared. This alert is a good way to catch LAG groups that become degraded because of misconfiguration.

The alert uses a condition function that checks whether the value of the variable is below a certain limit rather than above it, as in the previous examples:

alert(
    name='lagPartiallyDegraded',
    input=import_var('portAggregatorBandwidth'),
    condition=lambda _, value: value < 100,
    description='$alert.deviceName:$alert.componentName :: One or more LAG members have failed, combined bundle bandwidth is below 100%',
    details={},
    duration=300,
    percent_duration=100,
    notification_time=300,
    fan_out=True
)

The only requirement for the function passed as parameter condition to nw2functions.alert() is that it should accept two arguments and return a boolean value. The first argument is a net.happygears.nw2.py.MonitoringVariable object and the second argument is the observation value to be examined. If parameter duration specifies an interval of time longer than the polling interval, this function will be called multiple times with the same object as its first argument and different values as its second argument.
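Since any two-argument callable will do, a condition does not have to be a lambda. The sketch below builds one with a small factory; the name below and the use of a plain object() in place of a MonitoringVariable are assumptions made so that the example runs outside NetSpyGlass.

```python
def below(limit):
    """Build a condition function that is true when the value is under `limit`.

    `below` is a hypothetical helper; the function it returns has the
    two-argument signature described above for the `condition` parameter.
    """
    def condition(mvar, value):
        # mvar is unused here, but a condition may inspect it
        # (for example its statistics, as compare_to_mean() does).
        return value < limit
    return condition


lag_degraded = below(100)

# When duration is longer than the polling interval, the condition is evaluated
# once per observation in the window, always with the same variable object.
mvar = object()                      # stand-in for a MonitoringVariable
observations = [100, 100, 50, 50, 100]
results = [lag_degraded(mvar, value) for value in observations]
print(results)   # [False, False, True, True, False]
```

A function built this way could be passed as condition=below(100) in place of the lambda.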

10.8.5. interfaceDown

This alert watches the variable ifOperStatus, which has value 1 when the interface is up and 2 when it is down. There are other values too, such as 3 (“testing”) or 7 (“lower layer down”), but they are all greater than 1:

alert(
    name='interfaceDown',
    input=import_var('ifOperStatus'),
    condition=lambda _, value: value > 1,
    description='$alert.deviceName:$alert.componentName :: Interface is down',
    details={'slack_channel': '#netspyglass'},
    duration=300,
    percent_duration=100,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)

Similar to the previous examples, this alert ignores cases when interface goes down for a short period of time and activates only if the interface stays down for at least 5 min.

An alert configured this way may not be optimal, though, because it will not activate if the interface is “flapping”, that is, if it keeps going up and down quickly. A flapping interface may never stay in the state “down” for a full 5 min without going up briefly and so will never satisfy the condition of this alert. To catch flapping interfaces we can modify the alert:

alert(
    name='interfaceDown',
    input=import_var('ifOperStatus'),
    condition=lambda _, value: value > 1,
    description='$alert.deviceName:$alert.componentName :: Interface is down',
    details={'slack_channel': '#netspyglass'},
    duration=300,
    percent_duration=50,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)

Now the alert activates if the interface is found to be down at least half of the time during 5 min interval but it still ignores episodes when interface goes down briefly and then remains in the state “up” for a long time.

Note

It does not matter exactly which half of the time the interface was down. The alert activates if half of the observations collected during the 5 min interval show the interface in the state “down”. It could have stayed down for 3 min and then come up, or bounced up and down several times. Either way, if at least half of the observations in any combination show the state to be “down”, the alert will activate.
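The point that only the share of “down” observations matters, not their order, can be checked with a few hand-made windows. The observation lists and the one-minute polling interval are assumptions for illustration; percent_down is a hypothetical helper, not a NetSpyGlass function.

```python
def percent_down(observations):
    """Percentage of ifOperStatus observations that are not 1 ("up")."""
    down = sum(1 for status in observations if status > 1)
    return 100.0 * down / len(observations)


# Five observations per 5 min window; 1 = up, 2 = down.
down_then_up = [2, 2, 2, 1, 1]    # stayed down for 3 min, then recovered
flapping = [2, 1, 2, 1, 2]        # bounced up and down
brief_blip = [1, 2, 1, 1, 1]      # went down once, briefly

for name, window in [('down_then_up', down_then_up),
                     ('flapping', flapping),
                     ('brief_blip', brief_blip)]:
    pct = percent_down(window)
    # Compare against percent_duration=50 and percent_duration=100.
    print('%-13s %5.1f%% down  50%%: %-5s  100%%: %s'
          % (name, pct, pct >= 50, pct >= 100))
```

The first two windows both reach 60% and would activate the alert with percent_duration=50, while none of the three would activate it with percent_duration=100.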

10.8.6. bgpSessionDown: alert with dependencies

The following example illustrates an alert that takes into account a dependency between monitoring variables. The alert activates when it notices that the BGP session state becomes anything but established while the corresponding peering interface is in operational state Up. We assume that the “interface down” condition is tracked by another alert somewhere (see above for the example), and we don’t want to get two alerts when a peering interface goes down: first “interfaceDown” and then “bgpSessionDown”. We want to get only the “interfaceDown” alert when the interface goes down, and the “bgpSessionDown” alert when the interface stays up but the BGP session disconnects. We use the tag BGP4PeerAddress to associate the BGP state and interface state variables:

def bgp_state_for_intf_up(self):
    '''
    this function returns instances of the variable `bgpPeerState` that correspond to
    peering interfaces in op state "up".
    '''
    if_oper_up = filter(lambda x: x == 1, import_var('ifOperStatus'))
    bgp_peer_state = import_var('bgpPeerState')
    for pair in join_by_tags(if_oper_up, bgp_peer_state, ['BGP4PeerAddress']):
        op_status, bgp_state = pair
        yield bgp_state

def execute(self):

    # take filtered instances of variable `bgpPeerState` (only those that correspond to
    # peering interfaces in state "up") and trigger alert if the value is not 6 ("established").
    # It is assumed that "interface down" condition is tracked by another alert somewhere.
    alert(
        name='bgpSessionDown',
        input=self.bgp_state_for_intf_up(),
        condition=lambda mvar, value: value < 6,
        description='BGP Session is down but interface is up',
        details={},
        notification_time=300,
        streams=['log', 'slack'],
        fan_out=True
    )

First, we define function bgp_state_for_intf_up() that filters instances of the ifOperStatus variable to get only those that correspond to interfaces in the state “Up”, then matches them to instances of the variable bgpPeerState that have the same tag BGP4PeerAddress (it calls the standard function nw2functions.join_by_tags() to do that). This function is a generator that yields instances of the variable bgpPeerState. All this filtering and matching means that the generator yields only those instances of bgpPeerState that correspond to peering interfaces in the state “Up”.
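The filter-then-join logic can be pictured in plain Python. The sketch below is not the nw2functions implementation: the record layout (dicts with 'value' and 'tags'), the helper join_on_tag and the sample addresses are all invented to show the idea of pairing variables through a shared tag.

```python
def join_on_tag(left, right, tag):
    """Yield (l, r) pairs of records whose value for `tag` is equal.

    Plain-Python illustration of the idea behind join_by_tags(); the record
    layout used here is an assumption made for this sketch.
    """
    by_tag = {}
    for rec in right:
        key = rec['tags'].get(tag)
        if key is not None:
            by_tag.setdefault(key, []).append(rec)
    for rec in left:
        for match in by_tag.get(rec['tags'].get(tag), []):
            yield rec, match


interfaces = [
    {'name': 'ge-0/0/1', 'value': 1, 'tags': {'BGP4PeerAddress': '10.0.0.1'}},
    {'name': 'ge-0/0/2', 'value': 2, 'tags': {'BGP4PeerAddress': '10.0.0.2'}},
]
sessions = [
    {'peer': '10.0.0.1', 'value': 3, 'tags': {'BGP4PeerAddress': '10.0.0.1'}},
    {'peer': '10.0.0.2', 'value': 1, 'tags': {'BGP4PeerAddress': '10.0.0.2'}},
]

# Keep interfaces that are up, then pick the BGP sessions that share their tag.
up_interfaces = [rec for rec in interfaces if rec['value'] == 1]
candidates = [bgp for _, bgp in join_on_tag(up_interfaces, sessions, 'BGP4PeerAddress')]
alerts = [bgp['peer'] for bgp in candidates if bgp['value'] != 6]
print(alerts)   # ['10.0.0.1']
```

The session to 10.0.0.2 is also not established, but its interface is down, so it never reaches the alert condition.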

We then feed the variables returned by bgp_state_for_intf_up() to an alert that checks the value and activates if the value is not 6 (“established”). Because no duration is set, this alert activates as soon as it finds a match and does not ignore short-lived events when a BGP session “bounces”. The alert can easily be modified to skip these events by adding parameters duration and percent_duration, just as in the other examples above.

10.8.7. Alert Modules

(Available beginning with NetSpyGlass v1.2.0)

The examples above describe various ways to build an alert, but they all assume that the call to the function nw2functions.alert() happens in the function execute() of the class declared in your rules Python hook script. With time, this script can grow quite big as you accumulate different monitoring variables, code that performs calculations on them and code that implements alerts. It would be good to add structure to this script and move its different parts to their own modules and classes. This becomes especially useful if NetSpyGlass is used in a large organization where different groups may want to manage their own sets of rules and alerts. It is desirable to structure the code in such a way that they can edit and submit their parts independently and, especially, to ensure that when one group breaks things, they don’t break rules and alerts for other groups. In this section, we are going to look at an example that demonstrates how this can be done for the alerts. You can use the same approach to refactor your data processing rules as well.

Here is the structure of directories and files:

.
├── __init__.py
├── alerts
│   ├── __init__.py
│   ├── alert_busy_cpu.py
│   ├── alert_device_down.py
│   ├── alert_lag_partially_degraded.py
│   └── alert_self_monitoring.py
├── big_rules.py
├── nw2.conf
└── rules.py

First, configuration file nw2.conf refers to the rule script:

# rule runner script, it imports module big_rules.py and package alerts
network.monitor.rules = "rules.Rules"

Script rules.py is simple:

import traceback
import nw2rules
import big_rules
import alerts


class Rules(nw2rules.Nw2Rules):
    def __init__(self, log):
        super(Rules, self).__init__(log)
        self.rules = big_rules.UserRules(log)
        self.alerts = alerts.AlertRules(log)

    def execute(self):
        try:
            self.rules.execute()
        except Exception:
            # format_exc() formats the exception currently being handled
            self.log.error(traceback.format_exc())

        try:
            self.alerts.execute()
        except Exception:
            self.log.error(traceback.format_exc())

The purpose of this top-level module rules.py is to separate data processing rules and alerts. Exceptions raised in either part won’t affect the other because each call is “fenced” with a try-except clause. You do not have to put rules and alerts into separate modules if you don’t want to, but this keeps the code tidy.
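The fencing pattern is independent of NetSpyGlass and can be tried in isolation. In this sketch the helper run_fenced and the two step functions are hypothetical, and the standard logging module stands in for the log object the server passes to the rules class.

```python
import logging
import traceback


def run_fenced(steps, log):
    """Run each callable in `steps`; log and swallow exceptions so that one
    failing step does not prevent the following steps from running."""
    completed = []
    for step in steps:
        try:
            step()
            completed.append(step.__name__)
        except Exception:
            log.error(traceback.format_exc())
    return completed


def process_rules():
    raise ValueError('bug in data processing rules')


def process_alerts():
    pass


logging.basicConfig(level=logging.ERROR)
done = run_fenced([process_rules, process_alerts], logging.getLogger('rules'))
print(done)   # ['process_alerts']
```

The failure in process_rules is written to the log with its traceback, and process_alerts still runs.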

Module big_rules.py is our usual rule processing script. It defines class based on nw2rules.Nw2Rules with function execute():

import datetime
import json
import math
import md5
import os
import time
import sys

import nw2rules
from nw2functions import *

class UserRules(nw2rules.Nw2Rules):
    def __init__(self, log):
        super(UserRules, self).__init__(log)

    def execute(self):
        super(UserRules, self).execute()

Add your own code to process monitoring data here as explained in Data Processing Rules.

Module alerts is very similar and in the end calls function nw2functions.alert() to declare alerts; however, it “discovers” alert definitions in modules placed in the directory alerts. Here is the file alerts/__init__.py that does this:

import os
import pkgutil
import sys
import traceback

import nw2rules

__all__ = []
alert_functions = set()

for loader, module_name, is_pkg in pkgutil.walk_packages(__path__):
    __all__.append(module_name)
    # only modules named "alert_*" are treated as alert modules
    if module_name.startswith('alert_'):
        module = loader.find_module(module_name).load_module(module_name)
        # expose the loaded module as an attribute of this package
        globals()[module_name] = module
        print('Alert module ' + module.__name__)
        # find functions with name that starts with "alert_"
        for name in dir(module):
            obj = getattr(module, name)
            if callable(obj) and name.startswith('alert_'):
                print('    ' + name)
                alert_functions.add(obj)


class AlertRules(nw2rules.Nw2Rules):
    def __init__(self, log):
        super(AlertRules, self).__init__(log)

    def execute(self):
        for afunc in alert_functions:
            try:
                self.log.info('    Calling ' + afunc.__name__)
                afunc(self.log)
            except Exception:
                # log the failure and continue with the remaining alert functions
                self.log.error(traceback.format_exc())

When NetSpyGlass imports module alerts, it loads the file alerts/__init__.py and runs the code inside. This code scans modules in the same directory alerts, looking for files with names that match the simple pattern “alert_*”, and imports those. Inside each module it looks for functions with names that start with “alert_” and saves references to the functions it finds in the set alert_functions. This process happens only once, when NetSpyGlass imports alerts/__init__.py. This file also declares class AlertRules. Module rules.py (above) creates an instance of it and calls its function execute() after it is done with the code that processes monitoring data (the call to UserRules.execute()). By this time, module alerts has already discovered all “alert_*” modules and the “alert_” functions declared within, and simply calls them one at a time. The call to each of these functions is also protected with a try-except clause to make sure exceptions raised by one module do not affect others.
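The second half of the discovery, picking “alert_” functions out of a module, can be tried without any files on disk. The sketch below builds a throwaway module in memory; find_alert_functions and the contents of the demo module are hypothetical and only mirror the name-and-callable check described above.

```python
import types


def find_alert_functions(module, prefix='alert_'):
    """Return callables in `module` whose names start with `prefix`,
    ordered by name."""
    found = []
    for name in sorted(dir(module)):
        obj = getattr(module, name)
        if callable(obj) and name.startswith(prefix):
            found.append(obj)
    return found


# Build a throwaway module in memory instead of importing one from disk.
demo = types.ModuleType('alert_demo')
demo.alert_busy_cpu = lambda log: log.append('busyCpuAlert declared')
demo.alert_device_down = lambda log: log.append('deviceDown declared')
demo.helper = lambda: None          # ignored: name lacks the prefix
demo.alert_threshold = 75           # ignored: not callable

log = []
for func in find_alert_functions(demo):
    func(log)
print(log)   # ['busyCpuAlert declared', 'deviceDown declared']
```

One consequence of this naming convention is that any callable whose name starts with “alert_” is picked up, including one imported into the module from elsewhere, so helper functions are best given other names.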

A module in directory alerts might look like this (this is a file “alerts/alert_busy_cpu.py”):

import nw2rules
from nw2functions import *


def alert_busy_cpu(log):
    alert(
        name='busyCpuAlert',
        input=import_var('cpuUtil'),
        condition=lambda _, value: value > 75,
        description='CPU utilization is over 75% for 20% of time for the last 10 min',
        duration=600,
        percent_duration=20,
        notification_time=600,
        streams=['log'],
        fan_out=True
    )

NetSpyGlass monitors all python modules that get imported when it loads the script identified by the configuration file parameter network.monitor.rules. In this example this means it monitors files rules.py, big_rules.py, alerts/__init__.py and all alerts/alert_*.py. The server will reload all these modules if you modify any one of them. You do not need to restart the server for this. As usual, watch the log file for errors when you modify and save one of these script files.

Note

You can use the standard Python print statement instead of calling self.log.info(). The output goes to the server log, and the log level depends on the output stream you print to. If you print to stdout, the log level is INFO. Output sent to stderr appears in the log under the log level ERROR.