Examples
========

.. seealso:: :ref:`examples_of_tests`

busyCpuAlert
------------

A simple alert that activates when `cpuUtil` stays over the threshold of 75%
for at least 50% of a 10 min interval::

    alert(
        name='busyCpuAlert',
        input=import_var('cpuUtil'),
        condition=lambda _, value: value > 75,
        description='CPU utilization is over 75% for 50% of time for the last 10 min',
        details={'slack_channel': '#netspyglass'},
        duration=600,
        percent_duration=50,
        notification_time=300,
        streams=['slack', 'log'],
        fan_out=True
    )

This alert catches episodes when the CPU load of a device goes over 75% but
ignores short spikes when the high load does not last long. Parameter
`duration` makes the alert analyze the value of the variable for the past
10 min, while parameter `percent_duration` requires the value to be over the
threshold at least 50% of the time. In other words, the alert activates only
if at least half of the samples collected during the 10 min interval are
above the threshold.

The alert "fans out", that is, it generates a separate notification message
for each device that matches its condition. Notifications are sent to two
outgoing streams, `log` and `slack`, and no more often than every 5 min
(parameter `notification_time` has the value 300 sec).

deviceDown
----------

The next alert watches packet loss to devices measured with ping and
activates when it goes over a threshold (this is a simple way to alert on a
"device down" condition)::

    alert(
        name='deviceDown',
        input=import_var('icmpLoss'),
        condition=lambda _, value: value > 75,
        description='Packet loss to the device measured with ping is over 75% for the last 5 min',
        details={'slack_channel': '#devices_down'},
        duration=300,
        percent_duration=100,
        notification_time=300,
        streams=['slack', 'log'],
        fan_out=True
    )

The alert activates when the value of the variable `icmpLoss` is over 75%
for 5 min and sends notifications to the streams `log` and `slack`.
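The interplay of `duration` and `percent_duration` can be sketched in plain
Python. This is a hypothetical helper for illustration only, not part of the
NetSpyGlass API; `samples` stands in for the values collected over the
`duration` window:

```python
def should_activate(samples, threshold, percent_duration):
    """Return True if at least `percent_duration` percent of the samples
    exceed `threshold`. A sketch of the duration / percent_duration logic,
    not the actual NetSpyGlass implementation."""
    if not samples:
        return False
    above = sum(1 for v in samples if v > threshold)
    return above * 100.0 / len(samples) >= percent_duration

# ten samples from a 10 min window polled once a minute; 7 of them exceed 75
samples = [80, 90, 70, 85, 60, 78, 92, 50, 76, 81]
print(should_activate(samples, 75, 50))   # True: 70% of the samples are above 75
print(should_activate(samples, 75, 100))  # False: some samples are below 75
```

With `percent_duration=100` (as in `deviceDown` below), every sample in the
window must be over the threshold for the alert to activate.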
Note that the Slack channel is passed via the alert field `details`; this
way you can override the default channel configured in the
`alerts.streams.slack` configuration parameter.

bigChangeInVariables
--------------------

The following alerting rule watches monitoring variable `numVars` and
activates the alert when it notices a big change in the value::

    def compare_to_mean(mvar, value):
        """
        compare `value` to the mean value of the variable and return True
        if the difference is over 30% of the mean

        :param mvar: MonitoringVariable instance
        :param value: value
        :return: True if abs(mean - value) > 0.3*mean
        """
        assert isinstance(mvar, MonitoringVariable)
        mean = mvar.statistics.mean()
        return abs(value - mean) > 0.3*mean


    class UserRules(nw2rules.Nw2Rules):

        def execute(self):
            alert(
                name='bigChangeInVariables',
                input=import_var('numVars'),
                condition=lambda mvar, x: compare_to_mean(mvar, x),
                description='Big change in the number of monitoring variables; last value=$alert.value',
                duration=300,
                percent_duration=100,
                notification_time=300,
                streams=['pagerduty', 'log'],
                fan_out=True
            )

This alert is well suited to catch a big change in the value of a variable
that lasts a long time:

.. aafig::
    :aspect: 60
    :scale: 100
    :proportional:
    :textual:

    ---------+                 <---- old mean
             |
             |
             +-------------    <--- this value becomes the new mean after some time

    ----------------------------------------------->  time

To detect a change in the value of this variable, the rule calls function
:py:func:`compare_to_mean()`, which calculates the mean value of the
variable, compares the provided value with it, and returns True if the value
deviates from the mean by over 30%. This alert activates when the value of
variable `numVars` changes by over 30% and stays like that for 5 minutes
(because parameter `duration` has the value 300 sec). The alert ignores
changes in the value that are smaller or last shorter than that.
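The deviation check itself is easy to verify with plain numbers. The helper
below is a hypothetical restatement of the arithmetic in
:py:func:`compare_to_mean()`, decoupled from the MonitoringVariable API for
illustration:

```python
def deviates_from_mean(value, mean, fraction=0.3):
    """True if `value` deviates from `mean` by more than `fraction` of the
    mean. Restates the check abs(value - mean) > 0.3*mean from
    compare_to_mean() above, with the mean passed in directly."""
    return abs(value - mean) > fraction * mean

print(deviates_from_mean(135, 100))  # True: |135 - 100| = 35 > 30
print(deviates_from_mean(125, 100))  # False: 25 <= 30
print(deviates_from_mean(60, 100))   # True: drops below the mean count too
```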
The alert sends notifications to two streams, `log` and `pagerduty`, which
should be configured in the configuration section `alerts.streams` in the
main configuration file `nw2.conf`.

Because this alert compares the current value of the variable with the mean,
it will continue to be active for a while, as long as the new value
satisfies the condition. It can clear in two cases: either the value
returns back to the mean, or, if the value does not change much around its
new level, the new level becomes the new mean and that also clears the
alert.

Note how this alert uses the macro `$alert.value` in its description. See
:py:func:`nw2functions.alert()` for the list of supported macros.

lagPartiallyDegraded
--------------------

The next example is an alert that watches the combined bandwidth of LAG
interface groups. Monitoring variable `portAggregatorBandwidth`, computed by
the default rule processing script, has the value 100% if all configured LAG
members are online and passing traffic, that is, the LACP bits `collecting`
and `distributing` are set. The value drops below 100% if a member is down
or misconfigured, in which case one or both bits become cleared. This alert
is a good way to catch LAG groups that become degraded because of
misconfiguration.

The alert uses a condition function that checks whether the value of the
variable is below a certain limit, rather than above it as in the previous
examples::

    alert(
        name='lagPartiallyDegraded',
        input=import_var('portAggregatorBandwidth'),
        condition=lambda _, value: value < 100,
        description='$alert.deviceName:$alert.componentName :: One or more LAG has members failed, combined bundle bandwidth is below 100%',
        details={},
        duration=300,
        percent_duration=100,
        notification_time=300,
        fan_out=True
    )

The only requirement for the function passed as parameter `condition` to
:py:func:`nw2functions.alert()` is that it should accept two arguments and
return a boolean value.
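A condition does not have to be a lambda; any two-argument function that
returns a boolean works. The name `lag_degraded` below is hypothetical, used
only to illustrate the contract:

```python
def lag_degraded(mvar, value, limit=100):
    """Example condition function. The first argument is the
    MonitoringVariable object (unused here, which is why the lambdas above
    name it `_`); the second is the observed value. Returns a boolean."""
    return value < limit

# equivalent to condition=lambda _, value: value < 100 in the alert above
print(lag_degraded(None, 66.6))   # True: a LAG member is down
print(lag_degraded(None, 100))    # False: full bundle bandwidth
```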
The first argument is a :py:class:`net.happygears.nw2.py.MonitoringVariable`
object and the second argument is the observation value to be examined. If
parameter `duration` specifies an interval of time longer than the polling
interval, this function will be called multiple times with the same object
as its first argument and different values as its second argument.

interfaceDown
-------------

This alert watches variable `ifOperStatus`, which has the value 1 when the
interface is up and 2 when it is down. There are other values too, such as
3 ("testing") or 7 ("lower layer down"), but they are all greater than 1::

    alert(
        name='interfaceDown',
        input=import_var('ifOperStatus'),
        condition=lambda _, value: value > 1,
        description='$alert.deviceName:$alert.componentName :: Interface is down',
        details={'slack_channel': '#netspyglass'},
        duration=300,
        percent_duration=100,
        notification_time=300,
        streams=['slack', 'log'],
        fan_out=True
    )

Similar to the previous examples, this alert ignores cases when the
interface goes down for a short period of time and activates only if the
interface stays down for at least 5 min. An alert configured this way may
not be optimal, though, because it will not activate if the interface is
"flapping", that is, quickly going up and down all the time. A flapping
interface may never stay in the state "down" for a full 5 min without
briefly going up, and so will never satisfy the condition of this alert. To
catch flapping interfaces we can modify the alert::

    alert(
        name='interfaceDown',
        input=import_var('ifOperStatus'),
        condition=lambda _, value: value > 1,
        description='$alert.deviceName:$alert.componentName :: Interface is down',
        details={'slack_channel': '#netspyglass'},
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['slack', 'log'],
        fan_out=True
    )

Now the alert activates if the interface is found to be down at least half
of the time during the 5 min interval, but it still ignores episodes when
the interface goes down briefly and then remains in the state "up" for a
long time.
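The difference between the two settings is easy to see on a synthetic
sample series. This is a hypothetical sketch; NetSpyGlass evaluates
`duration` / `percent_duration` internally:

```python
def down_fraction(samples):
    """Fraction of observations showing the interface down
    (ifOperStatus > 1). Illustration only, not the NetSpyGlass API."""
    return sum(1 for v in samples if v > 1) / float(len(samples))

# a flapping interface alternates between up (1) and down (2) every poll
flapping = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
print(down_fraction(flapping) >= 1.0)   # False: percent_duration=100 never fires
print(down_fraction(flapping) >= 0.5)   # True: percent_duration=50 catches it
```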
.. note:: It does not matter exactly which half of the time the interface
   was down. The alert activates if half of the observations collected
   during the 5 min interval show the interface in the state "down". It
   could have stayed down for 3 min and then come up, or bounced up and
   down several times - either way, if at least half of the observations in
   any combination show the state to be "down", the alert will activate.

.. _alert_with_dependencies:

bgpSessionDown: alert with dependencies
---------------------------------------

The following example illustrates an alert that takes into account a
dependency between monitoring variables. The alert activates when it
notices that the BGP session state becomes anything but `established` while
the corresponding peering interface is in operational state `Up`. We assume
that the "interface down" condition is tracked by another alert somewhere
(see above for the example) and we don't want to get two alerts when the
peering interface goes down - first "interfaceDown" and then
"bgpSessionDown". We want to get only the "interfaceDown" alert when the
interface goes down, and the "bgpSessionDown" alert when the interface
stays up but the BGP session disconnects. We use tag `BGP4PeerAddress` to
associate BGP state and interface state variables::

    def bgp_state_for_intf_up(self):
        '''
        this function returns instances of the variable `bgpPeerState` that
        correspond to peering interfaces in op state "up".
        '''
        if_oper_up = filter(lambda x: x == 1, import_var('ifOperStatus'))
        bgp_peer_state = import_var('bgpPeerState')
        for pair in join_by_tags(if_oper_up, bgp_peer_state, ['BGP4PeerAddress']):
            op_status, bgp_state = pair
            yield bgp_state

    def execute(self):
        # take filtered instances of variable `bgpPeerState` (only those that correspond to
        # peering interfaces in state "up") and trigger alert if the value is not 6 ("established").
        # It is assumed that "interface down" condition is tracked by another alert somewhere.
        alert(
            name='bgpSessionDown',
            input=self.bgp_state_for_intf_up(),
            condition=lambda mvar, value: value < 6,
            description='BGP Session is down but interface is up',
            details={},
            notification_time=300,
            streams=['log', 'slack'],
            fan_out=True
        )

First, we define function :py:func:`bgp_state_for_intf_up()` that filters
instances of the `ifOperStatus` variable to get only those that correspond
to interfaces in the state "Up", then matches them to instances of the
variable `bgpPeerState` that have the same tag `BGP4PeerAddress` (it calls
the standard function :py:func:`nw2functions.join_by_tags()` to do that).
This function is a generator that yields instances of the variable
`bgpPeerState`. All this filtering and matching means that the generator
yields only those instances of `bgpPeerState` that correspond to peering
interfaces in the state "Up".

We then feed the variables returned by :py:func:`bgp_state_for_intf_up()`
to an alert that checks the value and activates if the value is not 6
("established"). This alert activates immediately when it finds a match and
does not ignore short-lived events when the BGP session "bounces". The
alert can easily be modified to skip these events, just like we have done
in the other examples above.

Alert Modules
-------------

(Available beginning with NetSpyGlass v1.2.0)

The examples above describe various ways to build an alert, but they all
assume the call to the function :func:`nw2functions.alert()` happens in the
function :func:`execute()` of the class declared in your rules Python hook
script. With time, this script can grow quite big as you accumulate
different monitoring variables, code that performs calculations on them,
and code that implements alerts. It would be good to add structure to this
script and move its different parts into their own modules and classes.
This becomes especially useful if NetSpyGlass is used in a large
organization where different groups may want to manage their own sets of
rules and alerts.
It is desirable to structure the code in such a way that the groups can
edit and submit their parts independently, and especially to ensure that
when one group breaks things, they don't break rules and alerts for other
groups.

In this section, we are going to look at an example that demonstrates how
this can be done for alerts. You can use the same approach to refactor your
data processing rules as well.

Here is the structure of directories and files:

.. code-block:: none

    .
    ├── __init__.py
    ├── alerts
    │   ├── __init__.py
    │   ├── alert_busy_cpu.py
    │   ├── alert_device_down.py
    │   ├── alert_lag_partially_degraded.py
    │   ├── alert_self_monitoring.py
    ├── big_rules.py
    ├── nw2.conf
    ├── rules.py

First, the configuration file `nw2.conf` refers to the rule script::

    # rule runner script, it imports two modules: big_rules.py and alerts
    network.monitor.rules = "rules.Rules"

Script `rules.py` is simple::

    import traceback

    import nw2rules
    import big_rules
    import alerts


    class Rules(nw2rules.Nw2Rules):

        def __init__(self, log):
            super(Rules, self).__init__(log)
            self.rules = big_rules.UserRules(log)
            self.alerts = alerts.AlertRules(log)

        def execute(self):
            try:
                self.rules.execute()
            except Exception, e:
                self.log.error(traceback.format_exc(e))
            try:
                self.alerts.execute()
            except Exception, e:
                self.log.error(traceback.format_exc(e))

The purpose of this top-level module `rules.py` is to separate data
processing rules and alerts. Exceptions raised in either part won't affect
the other because they are "fenced" with try-except clauses. You do not
have to put rules and alerts into separate modules if you don't want to,
but this keeps the code tidy.

Module `big_rules.py` is our usual rule processing script.
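The "fencing" pattern can be shown in isolation. This is a hypothetical
restatement of the try/except structure used in `rules.py`, written as
standalone code (the names `run_fenced`, `broken_rules`, and `alerts_step`
are invented for this sketch):

```python
import traceback

def run_fenced(steps, log_error):
    """Run each callable in `steps`; an exception raised by one step does
    not prevent the remaining steps from running. The failure is recorded
    via `log_error`, mirroring self.log.error() in rules.py."""
    for step in steps:
        try:
            step()
        except Exception:
            log_error(traceback.format_exc())

results, errors = [], []

def broken_rules():
    raise ValueError("bug in one group's rules")

def alerts_step():
    results.append('alerts ran')

run_fenced([broken_rules, alerts_step], errors.append)
print(results)      # ['alerts ran'] - the failure did not block this step
print(len(errors))  # 1
```

This is exactly why one group's broken rules cannot take down another
group's alerts.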
It defines a class based on :class:`nw2rules.Nw2Rules` with function
:func:`execute()`::

    import datetime
    import json
    import math
    import md5
    import os
    import time
    import sys

    import nw2rules
    from nw2functions import *


    class UserRules(nw2rules.Nw2Rules):

        def __init__(self, log):
            super(UserRules, self).__init__(log)

        def execute(self):
            super(UserRules, self).execute()

Add your own code to process monitoring data here, as explained in
:ref:`rules`.

The package `alerts` is very similar and in the end calls function
:func:`nw2functions.alert()` to declare alerts; however, it "discovers"
alert definitions in modules placed in the directory `alerts`. Here is the
file `alerts/__init__.py` that does this::

    import os
    import pkgutil
    import sys
    import traceback

    import nw2rules

    __all__ = []
    alert_functions = set()

    for loader, module_name, is_pkg in pkgutil.walk_packages(__path__):
        __all__.append(module_name)
        if 'alert_' in module_name:
            module = loader.find_module(module_name).load_module(module_name)
            exec('%s = module' % module_name)
            print 'Alert module ' + module.__name__
            # find functions with name that starts with "alert_"
            for name in dir(module):
                obj = getattr(module, name)
                if hasattr(obj, '__call__') and name.startswith('alert_'):
                    print ' ' + name
                    alert_functions.add(obj)


    class AlertRules(nw2rules.Nw2Rules):

        def __init__(self, log):
            super(AlertRules, self).__init__(log)

        def execute(self):
            for afunc in alert_functions:
                try:
                    self.log.info(' Calling ' + afunc.__name__)
                    afunc.__call__(self.log)
                except Exception, e:
                    print traceback.format_exc(e)

When NetSpyGlass imports the package `alerts`, it loads the file
`alerts/__init__.py` and runs the code inside. This code scans packages in
the same directory `alerts`, looking for modules whose names match the
simple pattern "alert_*", and imports them. Inside each module it looks for
functions with names that start with "alert_" and saves references to the
functions it finds in the set `alert_functions`.
This process happens only once, when NetSpyGlass imports
`alerts/__init__.py`.

This file also declares class :class:`AlertRules`. Module `rules.py`
(above) creates an instance of it and calls its function :func:`execute()`
after it is done with the code that processes monitoring data (the call to
:func:`UserRules.execute()`). By this time, the package `alerts` has
already discovered all "alert_*" modules and the "alert_" functions
declared within, and :func:`execute()` simply calls them one at a time. The
call to these functions is also protected with a try-except clause to make
sure exceptions raised by one module do not affect others.

A module in the directory `alerts` might look like this (this is the file
`alerts/alert_busy_cpu.py`)::

    import nw2rules
    from nw2functions import *


    def alert_busy_cpu(log):
        alert(
            name='busyCpuAlert',
            input=import_var('cpuUtil'),
            condition=lambda _, value: value > 75,
            description='CPU utilization is over 75% for 20% of time for the last 10 min',
            duration=600,
            percent_duration=20,
            notification_time=600,
            streams=['log'],
            fan_out=True
        )

NetSpyGlass monitors all Python modules that get imported when it loads the
script identified by the configuration parameter `network.monitor.rules`.
In this example this means it monitors the files `rules.py`,
`big_rules.py`, `alerts/__init__.py` and all `alerts/alert_*.py`. The
server reloads all these modules if you modify any one of them; you do not
need to restart the server for this. As usual, watch the log file for
errors when you modify and save one of these script files.

.. note:: You can use the standard Python operator `print` instead of
   calling `self.log.info()`. The output goes to the server log and the
   log level depends on the output stream you print to. If you print to
   stdout, the log level is `INFO`. Output sent to stderr appears in the
   log under the log level `ERROR`.