.. _alerting_rules:

Alerting rules
==============

An alert is created when the rule processing script calls function
:func:`nw2functions.alert()`. To put this into context, here is how a call to
:func:`nw2functions.alert()` may look (see :ref:`rules` for more information on
the rule processing script and how to configure it)::

    import nw2rules
    from nw2functions import *

    class UserRules(nw2rules.Nw2Rules):

        def __init__(self, log):
            super(UserRules, self).__init__(log)

        def execute(self):
            super(UserRules, self).execute()

            alert(
                name='cpu_load_high',
                input=import_var('cpuUtil'),
                condition=lambda mvar, value: value > 75,
                description='CPU utilization is over 75% for 50% of time for the last 5 min',
                details={},
                duration=300,
                percent_duration=50,
                notification_time=300,  # send notification once in 5 min
                streams=['pagerduty', 'log'],
                fan_out=True
            )

This function call takes its input data from the monitoring variable referred to
by the parameter `input`. In this example this is `cpuUtil`, and just like
everywhere else in the rule processing script, the variable must be "imported" by
calling :func:`nw2functions.import_var()` (see :ref:`rules`). In this example we
use the input monitoring variable without any transformations, but if necessary,
you can perform some calculations and feed a temporary monitoring variable to the
call to :func:`nw2functions.alert()` (see :ref:`alert_with_dependencies`).

Function :func:`nw2functions.alert()` applies the function referred to by the
parameter `condition` to every :class:`net.happygears.nw2.py.MonitoringVariable`
object it takes from `input` and creates a separate
:class:`net.happygears.nw2.alerts.Alert` object for each. The condition function
is given two arguments: the :class:`net.happygears.nw2.py.MonitoringVariable`
object and the value to be analyzed (a number). The function in our example
ignores the first argument and returns True if the value (second argument) is
greater than 75. The reference to the
:class:`net.happygears.nw2.py.MonitoringVariable` object can be useful if you
want to compare the value (second argument) to the mean or average value of all
observations in the time series of the monitoring variable instance (first
argument), or to make the comparison conditional on information about the device,
component or tags.

NetSpyGlass calls function :func:`nw2functions.alert()` every time it processes
collected monitoring data, which happens periodically with the interval defined
by the configuration parameter `monitor.pollingIntervalSec`. Every time the
script runs and calls function `alert()`, this function is given the latest
monitoring variable instances. On the very first call, function `alert()` creates
alert objects corresponding to instances of the input monitoring variable and
assigns their state depending on the result returned by the condition function.
Existing alert objects are updated on subsequent calls to `alert()` and their
state always reflects the result of the latest call to the condition function.

Let's look at our example again::

    alert(
        name='cpu_load_high',
        input=import_var('cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% for 50% of time for the last 5 min',
        details={},
        duration=300,
        percent_duration=50,
        notification_time=300,  # send notification once in 5 min
        streams=['pagerduty', 'log'],
        fan_out=True
    )

The alert created by this rule has the name `cpu_load_high` and becomes active if
the value of the input variable `cpuUtil` is above 75.
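The condition in this example fits in a lambda, but it can also be a named
function, which is convenient when the check needs the first argument. The
following is a minimal sketch under that assumption; the commented-out check of
device tags is purely illustrative and uses a hypothetical `tags` attribute of
the :class:`net.happygears.nw2.py.MonitoringVariable` object, so verify the
actual attribute names against the class documentation before relying on it::

    def cpu_condition(mvar, value):
        # `mvar` is the MonitoringVariable instance, `value` is the observation
        # being analyzed (a number). Only the numeric comparison below comes
        # from the example above.
        #
        # Hypothetical use of the first argument (the attribute name is an
        # assumption, not part of the documented API):
        # if 'Role.Lab' in mvar.tags:
        #     return False
        return value > 75

    alert(
        name='cpu_load_high',
        input=import_var('cpuUtil'),
        condition=cpu_condition,
        description='CPU utilization is over 75% for 50% of time for the last 5 min',
        details={},
        duration=300,
        percent_duration=50,
        notification_time=300,  # send notification once in 5 min
        streams=['pagerduty', 'log'],
        fan_out=True
    )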
The alert has a description which can appear in notification messages, as well as
an additional parameter `details` that can also be passed to the notification
stream. You can access alert objects using the JSON API call
:ref:`json_api_list_alerts`.

In addition to the alert object, function `alert()` also creates a new monitoring
variable with the name matching the name of the alert (in this case -
`cpu_load_high`). This variable appears in the Graphing Workbench under the
category `Alerts` and is stored in TSDB. The value of the observations in this
variable is an opaque large number when the alert is active and zero when it is
cleared. This variable has a set of tags that is a copy of the tags of the input
variable, merged with the tags supplied via argument `tags` in the call to the
function :func:`alert()`.

There are many uses for this variable; here are just a few ideas:

#. The alert monitoring variable provides a historical view of the alert - when
   it was active and when it was cleared. You can graph the alert monitoring
   variable in combination with other monitoring variables for visual comparison
   and to search for correlations.

#. You can configure NetSpyGlass to show a variable like this in maps and
   configure color thresholds for it. This way, you can select it in the map
   legend and immediately see devices with active alerts colored red.

#. This variable can be used to pass information about alerts to Nagios (see
   :ref:`alerts_nagios`). Nagios works by polling NetSpyGlass using the provided
   plugin script. Nagios can be configured to poll NetSpyGlass to read any
   monitoring variable and then use matching rules configured in Nagios to
   trigger alerts. An alternative setup is to implement condition matching in
   NetSpyGlass using calls to `alert()` and just let Nagios poll alert monitoring
   variables and check if the value is greater than zero.

Alert state
-----------

Depending on the value returned by the condition function, the created alert can
be in one of the following states:

#. **cleared** - the alert is in this state when the Python function referred to
   by the parameter `condition` returns False. This means the input monitoring
   variable instance did not satisfy the condition. Note that the alert is
   created anyway and will appear in the output of the JSON API call
   :ref:`json_api_list_alerts`.

#. **active** - the alert is in this state when the function referred to by the
   parameter `condition` returns True. Active alerts send notifications to the
   configured notification streams; in our example these are PagerDuty and log.
   This means whenever the alert enters state *active*, a log entry is made in
   the log file `/opt/netspyglass/home/logs/alerts.log` and a *trigger* event is
   sent to PagerDuty using their web API. See more on notifications below:
   :ref:`alert_notifications`.

Fan Out
-------

Parameter `fan_out` allows the administrator to create an alert that "fans out",
that is, the system creates a separate alert object for each instance of the
input variable. Since separate instances of monitoring variables describe metrics
collected for a single component of a device, this way we create a separate alert
object for each component of each device. Notification messages can include
macros `$alert.device_name` and `$alert.component_name` to expose the device and
component name in the log, email or other notification. See more on macros in
:func:`nw2functions.alert()`.

If parameter `fan_out` has value False, the system creates one alert object
regardless of the number of MonitoringVariable instances in the input variable;
a sketch of such a call is shown below.
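For example, a single aggregated alert for the same input variable could be
created with a call like the following. This is only a sketch: the alert name and
description are illustrative and not taken from the examples above::

    alert(
        name='cpu_load_high_any',        # illustrative name
        input=import_var('cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% on at least one device',
        notification_time=300,           # send notification once in 5 min
        streams=['log'],
        fan_out=False                    # one alert object for all matching instances
    )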
Every time the script calls function :func:`nw2functions.alert()`, the alert
object is updated with a list of devices and components that matched the
condition. This information is placed in the field `details` of the alert object,
can appear in notification messages (macro `$alert.details`), and can be accessed
via the JSON API.

.. _conditions_with_timing:

Conditions with timing
----------------------

As mentioned previously, the condition function is called on every monitoring
cycle. However, the logic that switches the alert state to *active* can take
timing into account in addition to the value of the input variable.

In the simplest case you can call :func:`nw2functions.alert()` as follows::

    alert(
        name='cpu_load_high',
        input=import_var('cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% for 50% of time for the last 10 min',
        notification_time=300,  # send notification once in 5 min
        streams=['pagerduty', 'log'],
        fan_out=True
    )

Here we have omitted parameters `duration` and `percent_duration`, which means we
instructed function `alert()` to only analyze the latest value in the time series
of the input variable. The condition function will be called only once with the
value of the latest observation in the time series, and the alert will become
*active* if this function returns True. In this example, this means the alert
activates whenever the CPU load of any device exceeds 75%. This alert does not
take into account how long the CPU load was that high and will trigger even in
the case of a short spike that was measured on just one monitoring cycle.

It is often more useful to build an alert that skips short spikes like that but
activates when the value of the input variable goes over the threshold several
times during a specified interval of time. For example, we may want to trigger
the alert if CPU load is over 75% for 3 measurement cycles out of 5. Here is how
this might look:

.. aafig::
    :aspect: 60
    :scale: 100
    :proportional:
    :textual:

       ^ CPU utilization
       |
       |      *                         *        *
    75%|- - - - - - - - - - - - - - - - - - - - - - - - -
       |              *        *
       |
       |      1       2        3        4        5          monitoring cycle
       +------|-------|--------|--------|--------|----------------------------> time

In this example, CPU utilization measured on monitoring cycles 1, 4 and 5 was
over the threshold, but the value measured on cycles 2 and 3 was below the
threshold. We can write an alert that would activate in this situation using
parameter `duration`, which is expressed in seconds, and parameter
`percent_duration`, which is expressed in percent (assuming the monitoring
interval is 1 min in this example)::

    alert(
        name='cpu_load_high',
        input=import_var('cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% for 50% of time for the last 5 min',
        details={},
        duration=300,
        percent_duration=50,
        notification_time=300,  # send notification once in 5 min
        streams=['pagerduty', 'log'],
        fan_out=True
    )

Parameter `duration` tells the system how many monitoring cycles to use to
analyse the value of the variable. At a 1 min interval, 300 sec is 5 cycles, so
function `alert()` is going to call the condition function five times, feeding it
the values of the five latest observations one after another. Parameter
`percent_duration` tells the system how many of the observations must match the
criteria (i.e. the condition function must return True) for the alert to
activate. In this case the value is 50, which means over 50% of the observations
must match the condition for the alert to activate. In our example 3 observations
match, therefore the alert is going to be activated.
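The arithmetic behind this decision can be illustrated with a small standalone
sketch. This is not NetSpyGlass code, only a model of how `duration` and
`percent_duration` combine, assuming a 60 second polling interval and the five
observations from the diagram above::

    POLLING_INTERVAL = 60      # seconds (monitor.pollingIntervalSec)
    duration = 300             # seconds, as in the alert() call above
    percent_duration = 50      # percent of observations that must match

    condition = lambda mvar, value: value > 75
    observations = [80, 70, 72, 81, 79]    # cycles 1..5 from the diagram

    cycles = duration // POLLING_INTERVAL                  # 300 / 60 = 5
    matching = sum(1 for v in observations[-cycles:] if condition(None, v))
    active = matching * 100 > percent_duration * cycles    # 3 of 5 = 60% > 50%

    print(cycles, matching, active)                        # 5 3 True

With these numbers the model activates the alert, which matches the behavior
described above: 3 matching observations out of 5 is more than 50%.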
Note that the order in which observations match the condition does not matter,
only the percentage of matching observations does. This means a call to `alert()`
with parameters `duration` and `percent_duration` configured this way can
identify "flapping" variables, that is, those that quickly change their value
between below and above the threshold. Setting `percent_duration` to a value that
requires more than one observation to match (i.e. greater than the equivalent of
one polling cycle within the `duration` interval) makes the alert skip single
spikes in the input variable but activate when it starts flapping. You can
differentiate between situations when the variable is "flapping" and when it goes
over the threshold completely by setting up two alerts, one with
`percent_duration` equal to 100 (all observations within the `duration` interval
are over the threshold) and the other with `percent_duration` equal to 50 to
catch the variable in the "flapping" state (see the sketch below).

Matching only the latest value in the time series is easy: just call `alert()`
with parameter `duration` equal to 0 (the default) or to the polling interval in
seconds. If `duration` is 0 or equal to the polling interval, parameter
`percent_duration` is ignored because we are analyzing just one observation. This
is the same as calling `alert()` with parameters `duration` and
`percent_duration` omitted.
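To illustrate the two-alert setup mentioned above, here is a sketch of such a
pair of `alert()` calls; the alert names and descriptions are illustrative,
everything else follows the examples earlier in this section::

    # Sustained: every observation in the last 5 minutes is over the threshold.
    alert(
        name='cpu_load_high_sustained',      # illustrative name
        input=import_var('cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization is over 75% for all of the last 5 min',
        duration=300,
        percent_duration=100,
        notification_time=300,  # send notification once in 5 min
        streams=['pagerduty', 'log'],
        fan_out=True
    )

    # Flapping: more than half of the observations in the last 5 minutes are
    # over the threshold, but not necessarily all of them.
    alert(
        name='cpu_load_flapping',            # illustrative name
        input=import_var('cpuUtil'),
        condition=lambda mvar, value: value > 75,
        description='CPU utilization crosses 75% repeatedly during the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=300,  # send notification once in 5 min
        streams=['pagerduty', 'log'],
        fan_out=True
    )

Note that when the variable stays above the threshold continuously, both alerts
become active, because a variable that matches the condition on 100% of the
observations also matches it on more than 50% of them.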