Examples
========

.. seealso:: :ref:`examples_of_tests`

busyCpuAlert
------------

A simple alert that activates when `cpuUtil` stays over the threshold of 75%
for at least 50% of a 10 min interval::

    alert(
        name='busyCpuAlert',
        input=import_var('cpuUtil'),
        condition=lambda _, value: value > 75,
        description='CPU utilization is over 75% for 50% of time for the last 10 min',
        details={'slack_channel': '#netspyglass'},
        duration=600,
        percent_duration=50,
        notification_time=300,
        streams=['slack', 'log'],
        fan_out=True
    )

This alert catches episodes when the CPU load of a device goes over 75% but
ignores short spikes when the high load does not last long. Parameter
`duration` makes the alert analyze the value of the variable for the past
10 min, while parameter `percent_duration` requires the value to be over the
threshold at least 50% of the time. In other words, the alert activates only
if at least half of the samples collected during the 10 min interval are
above the threshold.

The alert "fans out", that is, it generates a separate notification message
for each device that matches its condition. Notifications are sent to two
outgoing streams, `log` and `slack`, and no more often than every 5 min
(parameter `notification_time` has the value 300 sec).

deviceDown
----------

The next alert watches packet loss to devices measured with ping and
activates when it goes over a threshold (this is a simple way to alert on a
"device down" condition)::

    alert(
        name='deviceDown',
        input=import_var('icmpLoss'),
        condition=lambda _, value: value > 75,
        description='Packet loss to the device measured with ping is over 75% for the last 5 min',
        details={'slack_channel': '#devices_down'},
        duration=300,
        percent_duration=100,
        notification_time=300,
        streams=['slack', 'log'],
        fan_out=True
    )

The alert activates when the value of the variable `icmpLoss` is over 75%
for 5 min and sends notifications to the streams `log` and `slack`.
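The interplay of `duration` and `percent_duration` can be sketched in plain
Python. This is a hypothetical helper for illustration only, not part of the
NetSpyGlass API; `samples` stands in for the values collected over the
`duration` window:

```python
def should_activate(samples, threshold, percent_duration):
    """Return True if at least `percent_duration` percent of the samples
    exceed `threshold`. A sketch of the duration / percent_duration logic,
    not the actual NetSpyGlass implementation."""
    if not samples:
        return False
    above = sum(1 for v in samples if v > threshold)
    return above * 100.0 / len(samples) >= percent_duration

# ten samples from a 10 min window polled once a minute; 7 of them exceed 75
samples = [80, 90, 70, 85, 60, 78, 92, 50, 76, 81]
print(should_activate(samples, 75, 50))   # True: 70% of the samples are above 75
print(should_activate(samples, 75, 100))  # False: some samples are below 75
```

With `percent_duration=100` (as in `deviceDown` below), every sample in the
window must be over the threshold for the alert to activate.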
Note that the Slack channel is passed via the alert field `details`; this
way you can override the default channel configured in the
`alerts.streams.slack` configuration parameter.

bigChangeInVariables
--------------------

The following alerting rule watches monitoring variable `numVars` and
activates the alert when it notices a big change in the value::

    def compare_to_mean(mvar, value):
        """
        compare `value` to the mean value of the variable and return True
        if the difference is over 30% of the mean

        :param mvar: MonitoringVariable instance
        :param value: value
        :return: True if abs(mean - value) > 0.3*mean
        """
        assert isinstance(mvar, MonitoringVariable)
        mean = mvar.statistics.mean()
        return abs(value - mean) > 0.3*mean


    class UserRules(nw2rules.Nw2Rules):

        def execute(self):
            alert(
                name='bigChangeInVariables',
                input=import_var('numVars'),
                condition=lambda mvar, x: compare_to_mean(mvar, x),
                description='Big change in the number of monitoring variables; last value=$alert.value',
                duration=300,
                percent_duration=100,
                notification_time=300,
                streams=['pagerduty', 'log'],
                fan_out=True
            )

This alert is well suited to catch a big change in the value of a variable
that lasts a long time:

.. aafig::
    :aspect: 60
    :scale: 100
    :proportional:
    :textual:

    ---------+                 <---- old mean
             |
             |
             +-------------    <--- this value becomes the new mean after some time

    ----------------------------------------------->  time

To detect a change in the value of this variable, the rule calls function
:py:func:`compare_to_mean()`, which calculates the mean value of the
variable, compares the provided value with it, and returns True if the value
deviates from the mean by over 30%. This alert activates when the value of
variable `numVars` changes by over 30% and stays like that for 5 minutes
(because parameter `duration` has the value 300 sec). The alert ignores
changes in the value that are smaller or last shorter than that.
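The deviation check itself is easy to verify with plain numbers. The helper
below is a hypothetical restatement of the arithmetic in
:py:func:`compare_to_mean()`, decoupled from the MonitoringVariable API for
illustration:

```python
def deviates_from_mean(value, mean, fraction=0.3):
    """True if `value` deviates from `mean` by more than `fraction` of the
    mean. Restates the check abs(value - mean) > 0.3*mean from
    compare_to_mean() above, with the mean passed in directly."""
    return abs(value - mean) > fraction * mean

print(deviates_from_mean(135, 100))  # True: |135 - 100| = 35 > 30
print(deviates_from_mean(125, 100))  # False: 25 <= 30
print(deviates_from_mean(60, 100))   # True: drops below the mean count too
```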
The alert sends notifications to two streams, `log` and `pagerduty`, which
should be configured in the configuration section `alerts.streams` in the
main configuration file `nw2.conf`.

Because this alert compares the current value of the variable with the mean,
it will continue to be active for a while, as long as the new value
satisfies the condition. It can clear in two cases: either the value
returns back to the mean, or, if the value does not change much around its
new level, the new level becomes the new mean and that also clears the
alert.

Note how this alert uses the macro `$alert.value` in its description. See
:py:func:`nw2functions.alert()` for the list of supported macros.

lagPartiallyDegraded
--------------------

The next example is an alert that watches the combined bandwidth of LAG
interface groups. Monitoring variable `portAggregatorBandwidth`, computed by
the default rule processing script, has the value 100% if all configured LAG
members are online and passing traffic, that is, the LACP bits `collecting`
and `distributing` are set. The value drops below 100% if a member is down
or misconfigured, in which case one or both bits become cleared. This alert
is a good way to catch LAG groups that become degraded because of
misconfiguration.

The alert uses a condition function that checks whether the value of the
variable is below a certain limit, rather than above it as in the previous
examples::

    alert(
        name='lagPartiallyDegraded',
        input=import_var('portAggregatorBandwidth'),
        condition=lambda _, value: value < 100,
        description='$alert.deviceName:$alert.componentName :: One or more LAG has members failed, combined bundle bandwidth is below 100%',
        details={},
        duration=300,
        percent_duration=100,
        notification_time=300,
        fan_out=True
    )

The only requirement for the function passed as parameter `condition` to
:py:func:`nw2functions.alert()` is that it should accept two arguments and
return a boolean value.
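A condition does not have to be a lambda; any two-argument function that
returns a boolean works. The name `lag_degraded` below is hypothetical, used
only to illustrate the contract:

```python
def lag_degraded(mvar, value, limit=100):
    """Example condition function. The first argument is the
    MonitoringVariable object (unused here, which is why the lambdas above
    name it `_`); the second is the observed value. Returns a boolean."""
    return value < limit

# equivalent to condition=lambda _, value: value < 100 in the alert above
print(lag_degraded(None, 66.6))   # True: a LAG member is down
print(lag_degraded(None, 100))    # False: full bundle bandwidth
```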
The first argument is a :py:class:`net.happygears.nw2.py.MonitoringVariable`
object and the second argument is the observation value to be examined. If
parameter `duration` specifies an interval of time longer than the polling
interval, this function will be called multiple times with the same object
as its first argument and different values as its second argument.

interfaceDown
-------------

This alert watches variable `ifOperStatus`, which has the value 1 when the
interface is up and 2 when it is down. There are other values too, such as
3 ("testing") or 7 ("lower layer down"), but they are all greater than 1::

    alert(
        name='interfaceDown',
        input=import_var('ifOperStatus'),
        condition=lambda _, value: value > 1,
        description='$alert.deviceName:$alert.componentName :: Interface is down',
        details={'slack_channel': '#netspyglass'},
        duration=300,
        percent_duration=100,
        notification_time=300,
        streams=['slack', 'log'],
        fan_out=True
    )

Similar to the previous examples, this alert ignores cases when the
interface goes down for a short period of time and activates only if the
interface stays down for at least 5 min. An alert configured this way may
not be optimal, though, because it will not activate if the interface is
"flapping", that is, quickly going up and down all the time. A flapping
interface may never stay in the state "down" for a full 5 min without
briefly going up, and so will never satisfy the condition of this alert. To
catch flapping interfaces we can modify the alert::

    alert(
        name='interfaceDown',
        input=import_var('ifOperStatus'),
        condition=lambda _, value: value > 1,
        description='$alert.deviceName:$alert.componentName :: Interface is down',
        details={'slack_channel': '#netspyglass'},
        duration=300,
        percent_duration=50,
        notification_time=300,
        streams=['slack', 'log'],
        fan_out=True
    )

Now the alert activates if the interface is found to be down at least half
of the time during the 5 min interval, but it still ignores episodes when
the interface goes down briefly and then remains in the state "up" for a
long time.
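The difference between the two settings is easy to see on a synthetic
sample series. This is a hypothetical sketch; NetSpyGlass evaluates
`duration` / `percent_duration` internally:

```python
def down_fraction(samples):
    """Fraction of observations showing the interface down
    (ifOperStatus > 1). Illustration only, not the NetSpyGlass API."""
    return sum(1 for v in samples if v > 1) / float(len(samples))

# a flapping interface alternates between up (1) and down (2) every poll
flapping = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
print(down_fraction(flapping) >= 1.0)   # False: percent_duration=100 never fires
print(down_fraction(flapping) >= 0.5)   # True: percent_duration=50 catches it
```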
.. note:: It does not matter exactly which half of the time the interface
   was down. The alert activates if half of the observations collected
   during the 5 min interval show the interface in the state "down". It
   could have stayed down for 3 min and then come up, or bounced up and
   down several times - either way, if at least half of the observations in
   any combination show the state to be "down", the alert will activate.

.. _alert_with_dependencies:

bgpSessionDown: alert with dependencies
---------------------------------------

The following example illustrates an alert that takes into account a
dependency between monitoring variables. The alert activates when it
notices that the BGP session state becomes anything but `established` while
the corresponding peering interface is in operational state `Up`. We assume
that the "interface down" condition is tracked by another alert somewhere
(see above for the example) and we don't want to get two alerts when the
peering interface goes down - first "interfaceDown" and then
"bgpSessionDown". We want to get only the "interfaceDown" alert when the
interface goes down, and the "bgpSessionDown" alert when the interface
stays up but the BGP session disconnects. We use tag `BGP4PeerAddress` to
associate BGP state and interface state variables::

    def bgp_state_for_intf_up(self):
        '''
        this function returns instances of the variable `bgpPeerState` that
        correspond to peering interfaces in op state "up".
        '''
        if_oper_up = filter(lambda x: x == 1, import_var('ifOperStatus'))
        bgp_peer_state = import_var('bgpPeerState')
        for pair in join_by_tags(if_oper_up, bgp_peer_state, ['BGP4PeerAddress']):
            op_status, bgp_state = pair
            yield bgp_state

    def execute(self):
        # take filtered instances of variable `bgpPeerState` (only those that correspond to
        # peering interfaces in state "up") and trigger alert if the value is not 6 ("established").
        # It is assumed that "interface down" condition is tracked by another alert somewhere.
        alert(
            name='bgpSessionDown',
            input=self.bgp_state_for_intf_up(),
            condition=lambda mvar, value: value < 6,
            description='BGP Session is down but interface is up',
            details={},
            notification_time=300,
            streams=['log', 'slack'],
            fan_out=True
        )

First, we define function :py:func:`bgp_state_for_intf_up()` that filters
instances of the `ifOperStatus` variable to get only those that correspond
to interfaces in the state "Up", then matches them to instances of the
variable `bgpPeerState` that have the same tag `BGP4PeerAddress` (it calls
the standard function :py:func:`nw2functions.join_by_tags()` to do that).
This function is a generator that yields instances of the variable
`bgpPeerState`. All this filtering and matching means that the generator
yields only those instances of `bgpPeerState` that correspond to peering
interfaces in the state "Up".

We then feed the variables returned by :py:func:`bgp_state_for_intf_up()`
to an alert that checks the value and activates if the value is not 6
("established"). This alert activates immediately when it finds a match and
does not ignore short-lived events when the BGP session "bounces". The
alert can easily be modified to skip these events, just like we have done
in the other examples above.

Alert Modules
-------------

(Available beginning with NetSpyGlass v1.2.0)

The examples above describe various ways to build an alert, but they all
assume the call to the function :func:`nw2functions.alert()` happens in the
function :func:`execute()` of the class declared in your rules Python hook
script. With time, this script can grow quite big as you accumulate
different monitoring variables, code that performs calculations on them,
and code that implements alerts. It would be good to add structure to this
script and move its different parts into their own modules and classes.
This becomes especially useful if NetSpyGlass is used in a large
organization where different groups may want to manage their own sets of
rules and alerts.
It is desirable to structure the code in such a way that the groups can
edit and submit their parts independently, and especially to ensure that
when one group breaks things, they don't break rules and alerts for other
groups.

In this section, we are going to look at an example that demonstrates how
this can be done for alerts. You can use the same approach to refactor your
data processing rules as well.

Here is the structure of directories and files:

.. code-block:: none

    .
    ├── __init__.py
    ├── alerts
    │   ├── __init__.py
    │   ├── alert_busy_cpu.py
    │   ├── alert_device_down.py
    │   ├── alert_lag_partially_degraded.py
    │   ├── alert_self_monitoring.py
    ├── big_rules.py
    ├── nw2.conf
    ├── rules.py

First, the configuration file `nw2.conf` refers to the rule script::

    # rule runner script, it imports two modules: big_rules.py and alerts
    network.monitor.rules = "rules.Rules"

Script `rules.py` is simple::

    import traceback

    import nw2rules
    import big_rules
    import alerts


    class Rules(nw2rules.Nw2Rules):

        def __init__(self, log):
            super(Rules, self).__init__(log)
            self.rules = big_rules.UserRules(log)
            self.alerts = alerts.AlertRules(log)

        def execute(self):
            try:
                self.rules.execute()
            except Exception, e:
                self.log.error(traceback.format_exc(e))
            try:
                self.alerts.execute()
            except Exception, e:
                self.log.error(traceback.format_exc(e))

The purpose of this top-level module `rules.py` is to separate data
processing rules and alerts. Exceptions raised in either part won't affect
the other because they are "fenced" with try-except clauses. You do not
have to put rules and alerts into separate modules if you don't want to,
but this keeps the code tidy.

Module `big_rules.py` is our usual rule processing script.
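The "fencing" pattern can be shown in isolation. This is a hypothetical
restatement of the try/except structure used in `rules.py`, written as
standalone code (the names `run_fenced`, `broken_rules`, and `alerts_step`
are invented for this sketch):

```python
import traceback

def run_fenced(steps, log_error):
    """Run each callable in `steps`; an exception raised by one step does
    not prevent the remaining steps from running. The failure is recorded
    via `log_error`, mirroring self.log.error() in rules.py."""
    for step in steps:
        try:
            step()
        except Exception:
            log_error(traceback.format_exc())

results, errors = [], []

def broken_rules():
    raise ValueError("bug in one group's rules")

def alerts_step():
    results.append('alerts ran')

run_fenced([broken_rules, alerts_step], errors.append)
print(results)      # ['alerts ran'] - the failure did not block this step
print(len(errors))  # 1
```

This is exactly why one group's broken rules cannot take down another
group's alerts.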
It defines a class based on :class:`nw2rules.Nw2Rules` with function
:func:`execute()`::

    import datetime
    import json
    import math
    import md5
    import os
    import time
    import sys

    import nw2rules
    from nw2functions import *


    class UserRules(nw2rules.Nw2Rules):

        def __init__(self, log):
            super(UserRules, self).__init__(log)

        def execute(self):
            super(UserRules, self).execute()

Add your own code to process monitoring data here, as explained in
:ref:`rules`.

The package `alerts` is very similar and in the end calls function
:func:`nw2functions.alert()` to declare alerts; however, it "discovers"
alert definitions in modules placed in the directory `alerts`. Here is the
file `alerts/__init__.py` that does this::

    import os
    import pkgutil
    import sys
    import traceback

    import nw2rules

    __all__ = []
    alert_functions = set()

    for loader, module_name, is_pkg in pkgutil.walk_packages(__path__):
        __all__.append(module_name)
        if 'alert_' in module_name:
            module = loader.find_module(module_name).load_module(module_name)
            exec('%s = module' % module_name)
            print 'Alert module ' + module.__name__
            # find functions with name that starts with "alert_"
            for name in dir(module):
                obj = getattr(module, name)
                if hasattr(obj, '__call__') and name.startswith('alert_'):
                    print ' ' + name
                    alert_functions.add(obj)


    class AlertRules(nw2rules.Nw2Rules):

        def __init__(self, log):
            super(AlertRules, self).__init__(log)

        def execute(self):
            for afunc in alert_functions:
                try:
                    self.log.info(' Calling ' + afunc.__name__)
                    afunc.__call__(self.log)
                except Exception, e:
                    print traceback.format_exc(e)

When NetSpyGlass imports the package `alerts`, it loads the file
`alerts/__init__.py` and runs the code inside. This code scans packages in
the same directory `alerts`, looking for modules whose names match the
simple pattern "alert_*", and imports them. Inside each module it looks for
functions with names that start with "alert_" and saves references to the
functions it finds in the set `alert_functions`.
This process happens only once, when NetSpyGlass imports
`alerts/__init__.py`.

This file also declares class :class:`AlertRules`. Module `rules.py`
(above) creates an instance of it and calls its function :func:`execute()`
after it is done with the code that processes monitoring data (the call to
:func:`UserRules.execute()`). By this time, the package `alerts` has
already discovered all "alert_*" modules and the "alert_" functions
declared within, and :func:`execute()` simply calls them one at a time. The
call to these functions is also protected with a try-except clause to make
sure exceptions raised by one module do not affect others.

A module in the directory `alerts` might look like this (this is the file
`alerts/alert_busy_cpu.py`)::

    import nw2rules
    from nw2functions import *


    def alert_busy_cpu(log):
        alert(
            name='busyCpuAlert',
            input=import_var('cpuUtil'),
            condition=lambda _, value: value > 75,
            description='CPU utilization is over 75% for 20% of time for the last 10 min',
            duration=600,
            percent_duration=20,
            notification_time=600,
            streams=['log'],
            fan_out=True
        )

NetSpyGlass monitors all Python modules that get imported when it loads the
script identified by the configuration parameter `network.monitor.rules`.
In this example this means it monitors the files `rules.py`,
`big_rules.py`, `alerts/__init__.py` and all `alerts/alert_*.py`. The
server reloads all these modules if you modify any one of them; you do not
need to restart the server for this. As usual, watch the log file for
errors when you modify and save one of these script files.

.. note:: You can use the standard Python operator `print` instead of
   calling `self.log.info()`. The output goes to the server log and the
   log level depends on the output stream you print to. If you print to
   stdout, the log level is `INFO`. Output sent to stderr appears in the
   log under the log level `ERROR`.