.. _alerting_rules:

Alerting rules
==============

An alert is created when an alert script calls the function :func:`nw2functions.alert()`.
To put this into context, here is how all parts of this setup look when deployed to
NetSpyGlass SaaS.

First, we create a simple script `cpu_load_alert.py` and place it in the directory
`/opt/netspyglass/home/scripts/alerts/` on the management server. All alert scripts go
into this directory and that is where the server looks for them. These scripts, together
with `__init__.py`, form the module `alerts` that is built dynamically. Here is the
script `cpu_load_alert.py`::

    from nw2functions import *

    def alert_busy_cpu(log):
        alert(
            name='busyCpu',
            input=query('FROM cpuUtil'),
            condition=lambda mvar, value: value > 75,
            description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
            duration=300,
            percent_duration=50,
            notification_time=300,
            streams=['log', 'slack_happygears'],
            fan_out=True
        )

See :ref:`rules` for more information on rule processing scripts and how to configure
them. Alert scripts can use functions from the same module :ref:`nw2functions`.

The alert is activated immediately after you upload the script to the server. There is
usually a small delay because the server scans the file system every 10 seconds or so,
but you do not need to restart anything or issue any command to force it to load your
new script.

The function `alert()` takes its input data from the monitoring variable referred to by
the parameter `input`. In this example this is `cpuUtil`, and just like everywhere else
in rule processing scripts, the variable must be "imported" by running a simple NsgQL
query with :func:`nw2functions.query`, in this case ``query('FROM cpuUtil')`` (see
:ref:`rules` and :ref:`nsgql`). In this example the query is quite simple and gets all
instances of the variable `cpuUtil`; in other words, we want the alert to analyze CPU
utilization of all CPU-like components of all devices. If necessary, you can apply
matching criteria using `WHERE` (see :ref:`nsgql`). See :func:`nw2functions.alert()` for
the description of the parameters of the function that implements the alert.

NetSpyGlass calls the function :func:`nw2functions.alert()` every time it processes
collected monitoring data, which happens periodically with the interval defined by the
configuration parameter `monitor.pollingIntervalSec`. Every time the script runs and
calls `alert()`, the function is given the latest monitoring variable instances. On the
very first call, `alert()` creates alert objects corresponding to the instances of the
input monitoring variable and assigns their state depending on the result returned by
the condition function. Existing alert objects are updated on subsequent calls to
`alert()`, and their state always reflects the result of the latest call to the
condition function.

Let's look at our example again::

    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )

The alert created by this rule has the name `busyCpu` and becomes active if the value of
the input variable `cpuUtil` is above 75. The alert has a description, which can appear
in notification messages, as well as an additional field `details` that can also be
passed to the notification stream.

.. important::
    Alert names can not contain spaces or special characters. The name can contain
    letters (upper and lowercase), numbers and underscores, and must start with a
    letter. If you want to pass additional information, e.g. priority, use the field
    `description`, which does not have these restrictions.
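Putting these pieces together, a possible variant of the same rule narrows the input
with a `WHERE` clause and carries the priority in the description. The column name
`device`, the `LIKE` operator and the name pattern below are assumptions made for
illustration only; see :ref:`nsgql` for the columns and operators actually available in
your system::

    from nw2functions import *

    def alert_busy_cpu_core_routers(log):
        alert(
            # the name follows the restrictions above: letters, digits, underscores
            name='busyCpuCoreRouters',
            # hypothetical filter; 'device' is an assumed NsgQL column name
            input=query("FROM cpuUtil WHERE device LIKE '%core%'"),
            condition=lambda mvar, value: value > 75,
            # extra information such as priority goes into the description
            description='P2: CPU utilization on a core router is over 75% for 50% of time for the last 5 min',
            duration=300,
            percent_duration=50,
            notification_time=300,
            streams=['log', 'slack_happygears'],
            fan_out=True
        )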
You can access alert objects using NsgQL (see :ref:`nsgql`); they appear in the table
`alerts`.

In addition to the alert object, the function `alert()` also creates a new monitoring
variable with a name matching the name of the alert (in this case, `busyCpu`). This
variable appears in the Graphing Workbench under the category `Alerts` and is stored in
TSDB. The value of the observations in this variable is an opaque large number when the
alert is active and zero when it is cleared. The variable has a set of tags that is a
copy of the tags of the input variable, merged with the tags supplied via the argument
`tags` in the call to the function :func:`alert()`.

There are many uses for this variable; here are just a few ideas:

#. The alert monitoring variable provides a historical view of the alert - when it was
   active and when it was cleared. You can graph the alert monitoring variable in
   combination with other monitoring variables for visual comparison and to search for
   correlations.

#. You can configure NetSpyGlass to show a variable like this in maps and configure
   color thresholds for it. This way, you can select it in the map legend and
   immediately see devices with active alerts colored red.

#. This variable can be used to pass information about alerts to Nagios (see
   :ref:`alerts_nagios`). Nagios works by polling NetSpyGlass using the provided plugin
   script. Nagios can be configured to poll NetSpyGlass to read any monitoring variable
   and then use matching rules configured in Nagios to trigger alerts. An alternative
   setup is to implement condition matching in NetSpyGlass using calls to `alert()` and
   just let Nagios poll the alert monitoring variables and check whether the value is
   greater than zero.

Alert state
-----------

Depending on the value returned by the condition function, the created alert can be in
one of the following states:

#. **cleared** - the alert is in this state when the Python function referred to by the
   parameter `condition` returns False. This means the input monitoring variable
   instance did not satisfy the condition. Note that the alert is created anyway and
   will appear in the output of the JSON API call :ref:`json_api_list_alerts`.

#. **active** - the alert is in this state when the function referred to by the
   parameter `condition` returns True. Active alerts send notifications to the
   configured notification streams; in our example these are the log and the Slack
   stream `slack_happygears`. This means that whenever the alert enters the state
   *active*, a log entry is made in the log file `/opt/netspyglass/var/logs/alerts.log`
   and a message is posted to the Slack channel. A PagerDuty stream, if configured,
   receives a *trigger* event via the PagerDuty web API. See more on notifications
   below: :ref:`alert_notifications`.

Cleared alerts, on the other hand, can resolve corresponding open incidents in PagerDuty
and ServiceNow. This is not the default behavior and, if required, must be enabled per
alert. See more on notifications below: :ref:`alert_notifications`.
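The condition does not have to be a lambda. To make the mapping between the return value
and the alert state explicit, you can pass a regular named Python function with the same
signature. A minimal sketch, equivalent to the lambda used in the example above::

    from nw2functions import *

    def cpu_over_threshold(mvar, value):
        # alert() calls this for each observation it analyzes: returning True
        # drives the alert towards the *active* state, returning False towards
        # *cleared*
        return value > 75

    def alert_busy_cpu(log):
        alert(
            name='busyCpu',
            input=query('FROM cpuUtil'),
            condition=cpu_over_threshold,
            description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
            duration=300,
            percent_duration=50,
            notification_time=300,
            streams=['log', 'slack_happygears'],
            fan_out=True
        )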
Fan Out
-------

The parameter `fan_out` allows the administrator to create an alert that "fans out",
that is, the system creates a separate alert object for each instance of the input
variable. Since each instance of a monitoring variable describes a metric collected for
a single component of a device, this way we create a separate alert object for each
component of each device.

Notification messages can include the macros `$alert.device_name` and
`$alert.component_name` to expose the device and component name in the log, email or
other notification. See more on macros in :func:`nw2functions.alert()`.

If the parameter `fan_out` has the value False, the system creates one alert object
regardless of the number of MonitoringVariable instances in the input variable. Every
time the script calls the function :func:`nw2functions.alert()`, the alert object is
updated with the list of devices and components that matched the condition. This
information is placed in the field `details` of the alert object, can appear in
notification messages (macro `$alert.details`) and can be accessed via the JSON API.
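To make the difference concrete, here is a sketch of the same rule with `fan_out` set to
False. Only one alert object is created, and the devices and components that matched the
condition are listed in its `details` field (available in notifications via the macro
`$alert.details`). The alert name used here is made up for illustration::

    alert(
        # a single aggregated alert object instead of one per device component
        name='busyCpuSummary',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% on one or more devices',
        duration=300,
        percent_duration=50,
        notification_time=300,
        # $alert.details in the notification lists the matching devices/components
        streams=['log', 'slack_happygears'],
        fan_out=False
    )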
.. _conditions_with_timing:

Conditions with timing
----------------------

As mentioned previously, the condition function is called on every monitoring cycle.
However, the logic that switches the alert state to *active* can take timing into
account in addition to the value of the input variable.

In the simplest case you can call :func:`nw2functions.alert()` as follows::

    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )

Here we have omitted the parameters `duration` and `percent_duration`, which means we
instructed the function `alert()` to analyze only the latest value in the time series of
the input variable. The condition function is called only once, with the value of the
latest observation in the time series, and the alert becomes *active* if this function
returns True. In this example, this means the alert activates whenever the CPU load of
any device exceeds 75%. This alert does not take into account how long the CPU load was
that high and triggers even in the case of a short spike that was measured on just one
monitoring cycle.

It is often more useful to build an alert that skips short spikes like that but
activates when the value of the input variable goes over the threshold several times
during a specified interval of time. For example, we may want to trigger the alert if
the CPU load is over 75% for more than half of the last 5 measurement cycles. Here is
how this might look:

.. aafig::
    :aspect: 60
    :scale: 100
    :proportional:
    :textual:

           ^ CPU utilization
           |
           |
           |      *                         *        *
        75%|- - - - - - - - - - - - - - - - - - - - - - - - -
           |              *        *
           |
           |
           |      1       2        3        4        5          monitoring cycle
           +------|-------|--------|--------|--------|----------------------------> time

In this example, the CPU utilization measured on monitoring cycles 1, 4 and 5 was over
the threshold, but the value measured on cycles 2 and 3 was below the threshold. We can
write an alert that activates in this situation using the parameters `duration` and
`percent_duration`. The parameter `duration` is expressed in seconds (the monitoring
interval is assumed to be 1 min in this example)::

    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )

The parameter `duration` tells the system how many monitoring cycles to use to analyze
the value of the variable. At a 1 min interval, 300 sec is 5 cycles, so the function
`alert()` is going to call the condition function five times, feeding it the values of
the five latest observations one after another.

The parameter `percent_duration` tells the system how many of these observations must
match the criteria (i.e. the condition function must return True) for the alert to
activate. In this case the value is 50, which means over 50% of the observations must
match the condition for the alert to activate. In our example 3 out of 5 observations
match, therefore the alert is going to be activated.

Note that the order in which observations match the condition does not matter, only the
percentage of matching observations matters. This means a call to `alert()` with the
parameters `duration` and `percent_duration` configured this way can identify "flapping"
variables, that is, those that quickly change their value between below and above the
threshold. Setting `percent_duration` to a value greater than the equivalent of one
polling cycle (that is, more than ``100 / N`` percent, where N is the number of cycles
covered by `duration`) makes the alert skip single spikes in the input variable but
activate when it starts flapping.

You can differentiate between the situation when the variable is "flapping" and the one
when it goes over the threshold completely by setting up two alerts: one with
`percent_duration` equal to 100 (all observations within the `duration` interval are
over the threshold) and the other with `percent_duration` equal to 50 to catch the
variable in the "flapping" state (see the sketch at the end of this section).

Matching only the latest value in the time series is easy: just call `alert()` with the
parameter `duration` equal to 0 (the default) or to the polling interval in seconds. If
`duration` is 0 or equal to the polling interval, the parameter `percent_duration` is
ignored because we are analyzing just one observation. This is the same as calling
`alert()` with the parameters `duration` and `percent_duration` omitted.
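To illustrate the two-alert approach described above, here is a sketch. The two calls
differ only in `percent_duration` and, necessarily, in the alert name; both names are
made up for illustration::

    # sustained overload: every observation in the last 5 minutes is over 75%
    alert(
        name='busyCpuSustained',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% for the entire last 5 min',
        duration=300,
        percent_duration=100,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )

    # flapping: more than half of the observations in the last 5 minutes are over 75%
    alert(
        name='busyCpuFlapping',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization crosses 75% repeatedly during the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )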