.. _alerting_rules:

Alerting rules
==============

An alert is created when an alert script calls the function :func:`nw2functions.alert()`.
To put this into context, here is how all parts of this setup look when deployed to
NetSpyGlass SaaS.

First, we create a simple script `cpu_load_alert.py` and place it in the directory
`/opt/netspyglass/home/scripts/alerts/` on the management server. All alert scripts go
into this directory and that is where the server looks for them. These scripts, together
with `__init__.py`, form the module `alerts` that is built dynamically. Here is the
script `cpu_load_alert.py`::

    from nw2functions import *

    def alert_busy_cpu(log):
        alert(
            name='busyCpu',
            input=query('FROM cpuUtil'),
            condition=lambda mvar, value: value > 75,
            description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
            duration=300,
            percent_duration=50,
            notification_time=300,
            streams=['log', 'slack_happygears'],
            fan_out=True
        )

See :ref:`rules` for more information on rule processing scripts and how to configure
them. Alert scripts can use functions from the same module :ref:`nw2functions`.

The alert is activated immediately after you upload the script to the server. There is
usually a small delay because the server scans the file system every 10 seconds or so,
but you do not need to restart anything or issue any command to force it to load your
new script.

The function `alert()` takes its input data from the monitoring variable referred to by
the parameter `input`. In this example this is `cpuUtil`, and just like everywhere else
in rule processing scripts, the variable must be "imported" by running a simple NsgQL
query with :func:`nw2functions.query`, in this case ``query('FROM cpuUtil')`` (see
:ref:`rules` and :ref:`nsgql`). In this example the query is quite simple and gets all
instances of the variable `cpuUtil`; in other words, we want the alert to analyze CPU
utilization of all CPU-like components of all devices. If necessary, you can apply
matching criteria using `WHERE` (see :ref:`nsgql`). See :func:`nw2functions.alert()` for
the description of the parameters of the function that implements the alert.

NetSpyGlass calls the function :func:`nw2functions.alert()` every time it processes
collected monitoring data, which happens periodically with the interval defined by the
configuration parameter `monitor.pollingIntervalSec`. Every time the script runs and
calls `alert()`, the function is given the latest monitoring variable instances. On the
very first call, `alert()` creates alert objects corresponding to the instances of the
input monitoring variable and assigns their state depending on the result returned by
the condition function. Existing alert objects are updated on subsequent calls to
`alert()`, and their state always reflects the result of the latest call to the
condition function.

Let's look at our example again::

    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )

The alert created by this rule has the name `busyCpu` and becomes active if the value of
the input variable `cpuUtil` is above 75. The alert has a description, which can appear
in notification messages, as well as an additional field `details` that can also be
passed to the notification stream.

.. important::
    Alert names can not contain spaces or special characters. The name can contain
    letters (upper and lowercase), numbers and underscores, and must start with a
    letter. If you want to pass additional information, e.g. priority, use the field
    `description`, which does not have these restrictions.
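Putting these pieces together, a possible variant of the same rule narrows the input
with a `WHERE` clause and carries the priority in the description. The column name
`device`, the `LIKE` operator and the name pattern below are assumptions made for
illustration only; see :ref:`nsgql` for the columns and operators actually available in
your system::

    from nw2functions import *

    def alert_busy_cpu_core_routers(log):
        alert(
            # the name follows the restrictions above: letters, digits, underscores
            name='busyCpuCoreRouters',
            # hypothetical filter; 'device' is an assumed NsgQL column name
            input=query("FROM cpuUtil WHERE device LIKE '%core%'"),
            condition=lambda mvar, value: value > 75,
            # extra information such as priority goes into the description
            description='P2: CPU utilization on a core router is over 75% for 50% of time for the last 5 min',
            duration=300,
            percent_duration=50,
            notification_time=300,
            streams=['log', 'slack_happygears'],
            fan_out=True
        )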
You can access alert objects using NsgQL (see :ref:`nsgql`); they appear in the table
`alerts`.

In addition to the alert object, the function `alert()` also creates a new monitoring
variable with a name matching the name of the alert (in this case, `busyCpu`). This
variable appears in the Graphing Workbench under the category `Alerts` and is stored in
TSDB. The value of the observations in this variable is an opaque large number when the
alert is active and zero when it is cleared. The variable has a set of tags that is a
copy of the tags of the input variable, merged with the tags supplied via the argument
`tags` in the call to the function :func:`alert()`.

There are many uses for this variable; here are just a few ideas:

#. The alert monitoring variable provides a historical view of the alert - when it was
   active and when it was cleared. You can graph the alert monitoring variable in
   combination with other monitoring variables for visual comparison and to search for
   correlations.

#. You can configure NetSpyGlass to show a variable like this in maps and configure
   color thresholds for it. This way, you can select it in the map legend and
   immediately see devices with active alerts colored red.

#. This variable can be used to pass information about alerts to Nagios (see
   :ref:`alerts_nagios`). Nagios works by polling NetSpyGlass using the provided plugin
   script. Nagios can be configured to poll NetSpyGlass to read any monitoring variable
   and then use matching rules configured in Nagios to trigger alerts. An alternative
   setup is to implement condition matching in NetSpyGlass using calls to `alert()` and
   just let Nagios poll the alert monitoring variables and check whether the value is
   greater than zero.

Alert state
-----------

Depending on the value returned by the condition function, the created alert can be in
one of the following states:

#. **cleared** - the alert is in this state when the Python function referred to by the
   parameter `condition` returns False. This means the input monitoring variable
   instance did not satisfy the condition. Note that the alert is created anyway and
   will appear in the output of the JSON API call :ref:`json_api_list_alerts`.

#. **active** - the alert is in this state when the function referred to by the
   parameter `condition` returns True. Active alerts send notifications to the
   configured notification streams; in our example these are the log and the Slack
   stream `slack_happygears`. This means that whenever the alert enters the state
   *active*, a log entry is made in the log file `/opt/netspyglass/var/logs/alerts.log`
   and a message is posted to the Slack channel. A PagerDuty stream, if configured,
   receives a *trigger* event via the PagerDuty web API. See more on notifications
   below: :ref:`alert_notifications`.

Cleared alerts, on the other hand, can resolve corresponding open incidents in PagerDuty
and ServiceNow. This is not the default behavior and, if required, must be enabled per
alert. See more on notifications below: :ref:`alert_notifications`.
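The condition does not have to be a lambda. To make the mapping between the return value
and the alert state explicit, you can pass a regular named Python function with the same
signature. A minimal sketch, equivalent to the lambda used in the example above::

    from nw2functions import *

    def cpu_over_threshold(mvar, value):
        # alert() calls this for each observation it analyzes: returning True
        # drives the alert towards the *active* state, returning False towards
        # *cleared*
        return value > 75

    def alert_busy_cpu(log):
        alert(
            name='busyCpu',
            input=query('FROM cpuUtil'),
            condition=cpu_over_threshold,
            description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
            duration=300,
            percent_duration=50,
            notification_time=300,
            streams=['log', 'slack_happygears'],
            fan_out=True
        )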
Fan Out
-------

The parameter `fan_out` allows the administrator to create an alert that "fans out",
that is, the system creates a separate alert object for each instance of the input
variable. Since each instance of a monitoring variable describes a metric collected for
a single component of a device, this way we create a separate alert object for each
component of each device.

Notification messages can include the macros `$alert.device_name` and
`$alert.component_name` to expose the device and component name in the log, email or
other notification. See more on macros in :func:`nw2functions.alert()`.

If the parameter `fan_out` has the value False, the system creates one alert object
regardless of the number of MonitoringVariable instances in the input variable. Every
time the script calls the function :func:`nw2functions.alert()`, the alert object is
updated with the list of devices and components that matched the condition. This
information is placed in the field `details` of the alert object, can appear in
notification messages (macro `$alert.details`) and can be accessed via the JSON API.
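To make the difference concrete, here is a sketch of the same rule with `fan_out` set to
False. Only one alert object is created, and the devices and components that matched the
condition are listed in its `details` field (available in notifications via the macro
`$alert.details`). The alert name used here is made up for illustration::

    alert(
        # a single aggregated alert object instead of one per device component
        name='busyCpuSummary',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% on one or more devices',
        duration=300,
        percent_duration=50,
        notification_time=300,
        # $alert.details in the notification lists the matching devices/components
        streams=['log', 'slack_happygears'],
        fan_out=False
    )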
.. _conditions_with_timing:

Conditions with timing
----------------------

As mentioned previously, the condition function is called on every monitoring cycle.
However, the logic that switches the alert state to *active* can take timing into
account in addition to the value of the input variable.

In the simplest case you can call :func:`nw2functions.alert()` as follows::

    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )

Here we have omitted the parameters `duration` and `percent_duration`, which means we
instructed the function `alert()` to analyze only the latest value in the time series of
the input variable. The condition function is called only once, with the value of the
latest observation in the time series, and the alert becomes *active* if this function
returns True. In this example, this means the alert activates whenever the CPU load of
any device exceeds 75%. This alert does not take into account how long the CPU load was
that high and triggers even in the case of a short spike that was measured on just one
monitoring cycle.

It is often more useful to build an alert that skips short spikes like that but
activates when the value of the input variable goes over the threshold several times
during a specified interval of time. For example, we may want to trigger the alert if
the CPU load is over 75% for more than half of the last 5 measurement cycles. Here is
how this might look:

.. aafig::
    :aspect: 60
    :scale: 100
    :proportional:
    :textual:

           ^ CPU utilization
           |
           |
           |      *                         *        *
        75%|- - - - - - - - - - - - - - - - - - - - - - - - -
           |              *        *
           |
           |
           |      1       2        3        4        5          monitoring cycle
           +------|-------|--------|--------|--------|----------------------------> time

In this example, the CPU utilization measured on monitoring cycles 1, 4 and 5 was over
the threshold, but the value measured on cycles 2 and 3 was below the threshold. We can
write an alert that activates in this situation using the parameters `duration` and
`percent_duration`. The parameter `duration` is expressed in seconds (the monitoring
interval is assumed to be 1 min in this example)::

    alert(
        name='busyCpu',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )

The parameter `duration` tells the system how many monitoring cycles to use to analyze
the value of the variable. At a 1 min interval, 300 sec is 5 cycles, so the function
`alert()` is going to call the condition function five times, feeding it the values of
the five latest observations one after another.

The parameter `percent_duration` tells the system how many of these observations must
match the criteria (i.e. the condition function must return True) for the alert to
activate. In this case the value is 50, which means over 50% of the observations must
match the condition for the alert to activate. In our example 3 out of 5 observations
match, therefore the alert is going to be activated.

Note that the order in which observations match the condition does not matter, only the
percentage of matching observations matters. This means a call to `alert()` with the
parameters `duration` and `percent_duration` configured this way can identify "flapping"
variables, that is, those that quickly change their value between below and above the
threshold. Setting `percent_duration` to a value greater than the equivalent of one
polling cycle (that is, more than ``100 / N`` percent, where N is the number of cycles
covered by `duration`) makes the alert skip single spikes in the input variable but
activate when it starts flapping.

You can differentiate between the situation when the variable is "flapping" and the one
when it goes over the threshold completely by setting up two alerts: one with
`percent_duration` equal to 100 (all observations within the `duration` interval are
over the threshold) and the other with `percent_duration` equal to 50 to catch the
variable in the "flapping" state (see the sketch at the end of this section).

Matching only the latest value in the time series is easy: just call `alert()` with the
parameter `duration` equal to 0 (the default) or to the polling interval in seconds. If
`duration` is 0 or equal to the polling interval, the parameter `percent_duration` is
ignored because we are analyzing just one observation. This is the same as calling
`alert()` with the parameters `duration` and `percent_duration` omitted.
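To illustrate the two-alert approach described above, here is a sketch. The two calls
differ only in `percent_duration` and, necessarily, in the alert name; both names are
made up for illustration::

    # sustained overload: every observation in the last 5 minutes is over 75%
    alert(
        name='busyCpuSustained',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% for the entire last 5 min',
        duration=300,
        percent_duration=100,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )

    # flapping: more than half of the observations in the last 5 minutes are over 75%
    alert(
        name='busyCpuFlapping',
        input=query('FROM cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization crosses 75% repeatedly during the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['log', 'slack_happygears'],
        fan_out=True
    )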