8.10. Examples

See also

examples_of_tests

8.10.1. busyCpuAlert

A simple alert that activates when cpuUtil goes over the threshold of 75% for at least 50% of a 10 min interval:

alert(
    name='busyCpuAlert',
    input=import_var('cpuUtil'),
    condition=lambda _, value: value > 75,
    description='CPU utilization is over 75% for 50% of time for the last 10 min',
    details={'slack_channel': '#netspyglass'},
    duration=600,
    percent_duration=50,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)

This alert catches episodes when the CPU load of a device goes over 75% but ignores short spikes when this high load does not last long. The parameter duration makes the alert analyse the value of the variable for the past 10 min, while the parameter percent_duration requires the value to be over the threshold at least 50% of the time. In other words, the alert activates only if at least half of the samples collected during the 10 min interval are above the threshold. The alert “fans out”, that is, it generates a separate notification message for each device that matches its condition. Notifications are sent to two outgoing streams, ‘log’ and ‘slack’, and no more often than every 5 min (the parameter notification_time has the value 300 sec).
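
To make the arithmetic concrete, here is a minimal standalone sketch of the activation decision. It assumes a hypothetical 60-second polling interval and made-up sample values; the actual number of samples in the 10 min window depends on how often cpuUtil is really polled:

# Illustration only: the rough arithmetic behind duration/percent_duration
POLLING_INTERVAL = 60      # seconds (assumed for this example)
DURATION = 600             # alert parameter `duration`
PERCENT_DURATION = 50      # alert parameter `percent_duration`
THRESHOLD = 75             # CPU utilization threshold, %

# last 10 min of cpuUtil samples (made-up data)
samples = [80, 90, 60, 85, 78, 40, 95, 88, 70, 82]
assert len(samples) == DURATION // POLLING_INTERVAL

above = sum(1 for v in samples if v > THRESHOLD)
percent_above = 100.0 * above / len(samples)

# 7 of the 10 samples are above 75%, i.e. 70% >= 50%, so the alert activates
print(percent_above, percent_above >= PERCENT_DURATION)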

8.10.2. deviceDown

The next alert watches packet loss to devices measured with ping and activates when it goes over a threshold (this is a simple way to alert on a “device down” condition):

alert(
    name='deviceDown',
    input=import_var('icmpLoss'),
    condition=lambda _, value: value > 75,
    description='Packet loss to the device measured with ping is over 75% for the last 5 min',
    details={'slack_channel': '#devices_down'},
    duration=300,
    percent_duration=100,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)

The alert activates when the value of the variable icmpLoss is over 75% for 5 min and sends notifications to the streams log and slack. Note that the Slack channel is passed via the alert field details; this way you can override the default channel configured in the alerts.streams.slack configuration parameter.
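
Because the channel is taken from the alert field details, different alerts can share the same slack stream but post to different channels. A minimal sketch (the alert name and the channel ‘#noc’ below are made up for illustration; if slack_channel is omitted from details, the default channel from alerts.streams.slack is used):

alert(
    name='highPacketLoss',                     # hypothetical alert name
    input=import_var('icmpLoss'),
    condition=lambda _, value: value > 20,
    description='Packet loss to the device measured with ping is over 20% for the last 5 min',
    details={'slack_channel': '#noc'},         # per-alert channel override
    duration=300,
    percent_duration=100,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)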

8.10.3. bigChangeInVariables

The following alerting rule watches the monitoring variable numVars and activates the alert when it notices a big change in its value:

def compare_to_mean(mvar, value):
    """
    compare `value` to the mean value of the variable and return True if the difference is over 30% of the mean

    :param mvar:    MonitoringVariable instance
    :param value:   value
    :return:        True if abs(mean - value) > 0.3*mean
    """
    assert isinstance(mvar, MonitoringVariable)
    mean = mvar.statistics.mean()
    return abs(value - mean) > 0.3*mean


class UserRules(nw2rules.Nw2Rules):
    def execute(self):

        alert(
            name='bigChangeInVariables',
            input=import_var('numVars'),
            condition=lambda mvar, x: compare_to_mean(mvar, x),
            description='Big change in the number of monitoring variables; last value=$alert.value',
            duration=300,
            percent_duration=100,
            notification_time=300,
            streams=['pagerduty', 'log'],
            fan_out=True
        )

This alert is well suited to catch big changes in the value of the variable that last a long time.


To detect a change in the value of this variable, the rule calls the function compare_to_mean(), which calculates the mean value of the variable, compares the provided value with it and returns True if the value deviates from the mean by more than 30%. The alert activates when the value of the variable numVars changes by more than 30% and stays that way for 5 minutes (because the parameter duration has a value of 300 sec); changes that are smaller or last shorter than that are ignored. The alert sends notifications to two streams, log and pagerduty, which should be configured in the section alerts.streams of the main configuration file nw2.conf.

Because this alert compares the current value of the variable with its mean, it will stay active for a while as long as the new value satisfies the condition. It can clear in two cases: either the value returns to the mean, or, if the value does not change much around its new level, the new level eventually becomes the new mean, which also clears the alert.
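
The clearing behavior can be illustrated with a simplified, self-contained simulation. The real mean comes from MonitoringVariable.statistics and its exact window is not shown here, so the fixed-size rolling window below is an assumption made only for the sake of the example:

from collections import deque

window = deque(maxlen=12)   # pretend the mean is computed over the last 12 samples

def deviates_from_mean(value):
    """Return True if `value` differs from the rolling mean by more than 30%."""
    if not window:
        return False
    mean = sum(window) / float(len(window))
    return abs(value - mean) > 0.3 * mean

# the variable sits around 1000, then jumps to 1500 and stays there
for value in [1000] * 12 + [1500] * 20:
    print(value, 'condition satisfied:', deviates_from_mean(value))
    window.append(value)

# While the window is still dominated by the old level, 1500 deviates from the
# mean by roughly 50% and the condition holds. As 1500s fill the window, the
# mean drifts toward 1500 and the condition stops being satisfied - the second
# way the alert clears.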

Note how this alert uses the macro $alert.value in its description. See nw2functions.alert() for the list of supported macros.

8.10.4. lagPartiallyDegraded

The next example is an alert that watches the combined bandwidth of LAG interface groups. The monitoring variable portAggregatorBandwidth, computed by the default rule processing script, has a value of 100% when all configured LAG members are online and passing traffic, that is, when the LACP bits collecting and distributing are set. The value drops below 100% if a member is down or misconfigured, in which case one or both of these bits are cleared. This alert is a good way to catch LAG groups that have become degraded because of misconfiguration.
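
The variable itself is produced by the default rule processing script; purely as an illustration of the idea (this is not the actual implementation), a percentage like this could be derived from per-member LACP bits as follows:

def lag_bandwidth_percent(members):
    """
    `members` is a list of (collecting, distributing) boolean pairs, one per
    configured LAG member. A member contributes to the bundle only if both
    LACP bits are set.
    """
    if not members:
        return 0.0
    healthy = sum(1 for collecting, distributing in members if collecting and distributing)
    return 100.0 * healthy / len(members)

# 4-member LAG where one member has lost its `distributing` bit:
print(lag_bandwidth_percent([(True, True), (True, True), (True, True), (True, False)]))   # 75.0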

The alert uses a condition function that checks whether the value of the variable is below a certain limit, rather than above it as in the previous examples:

alert(
    name='lagPartiallyDegraded',
    input=import_var('portAggregatorBandwidth'),
    condition=lambda _, value: value < 100,
    description='$alert.deviceName:$alert.componentName :: One or more LAG has members failed, combined bundle bandwidth is below 100%',
    details={},
    duration=300,
    percent_duration=100,
    notification_time=300,
    fan_out=True
)

The only requirement for the function passed as the parameter condition to nw2functions.alert() is that it accepts two arguments and returns a boolean value. The first argument is a net.happygears.nw2.py.MonitoringVariable object and the second argument is the observation value to be examined. If the parameter duration specifies an interval of time longer than the polling interval, this function will be called multiple times with the same object as its first argument and different values as the second argument.
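
This means the condition does not have to be a lambda: any callable with this signature works. For example, a small helper like below() (a hypothetical function, not part of the API) can be used to keep thresholds in one place:

def below(limit):
    """Return a condition function that is True when the observed value drops below `limit`."""
    def condition(mvar, value):
        # `mvar` is the MonitoringVariable being examined; it is not used here
        # but is available for more elaborate checks
        return value < limit
    return condition

alert(
    name='lagPartiallyDegraded',
    input=import_var('portAggregatorBandwidth'),
    condition=below(100),
    description='$alert.deviceName:$alert.componentName :: One or more LAG has members failed, combined bundle bandwidth is below 100%',
    details={},
    duration=300,
    percent_duration=100,
    notification_time=300,
    fan_out=True
)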

8.10.5. interfaceDown

This alert watches the variable ifOperStatus, which has value 1 when the interface is up and 2 when it is down. There are other values too, such as 3 (“testing”) or 7 (“lower layer down”), but they are all greater than 1:

alert(
    name='interfaceDown',
    input=import_var('ifOperStatus'),
    condition=lambda _, value: value > 1,
    description='$alert.deviceName:$alert.componentName :: Interface is down',
    details={'slack_channel': '#netspyglass'},
    duration=300,
    percent_duration=100,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)

Similar to the previous examples, this alert ignores cases when the interface goes down for a short period of time and activates only if the interface stays down for at least 5 min.

An alert configured this way may not be optimal, though, because it will not activate if the interface is “flapping”, that is, quickly going up and down all the time. A flapping interface may never stay in the state “down” for a full 5 min without briefly going up, and so it will never satisfy the condition of this alert. To catch flapping interfaces we can modify the alert:

alert(
    name='interfaceDown',
    input=import_var('ifOperStatus'),
    condition=lambda _, value: value > 1,
    description='$alert.deviceName:$alert.componentName :: Interface is down',
    details={'slack_channel': '#netspyglass'},
    duration=300,
    percent_duration=50,
    notification_time=300,
    streams=['slack', 'log'],
    fan_out=True
)

Now the alert activates if the interface is found to be down at least half of the time during the 5 min interval, but it still ignores episodes when the interface goes down briefly and then remains in the state “up” for a long time.

Note

It does not matter exactly which half of the time the interface was down. The alert activates if half of the observations collected during the 5 min interval show the interface in the state “down”. It could have stayed down for 3 min and then come up, or bounced up and down several times - either way, if at least half of the observations in any combination show the state to be “down”, the alert will activate.
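
A tiny standalone illustration of this point, assuming a hypothetical 30-second polling interval (10 samples of ifOperStatus per 5 min window, 1=up, 2=down):

down_for_3_min_then_up = [2, 2, 2, 2, 2, 2, 1, 1, 1, 1]
flapping = [2, 1, 2, 1, 2, 1, 2, 1, 2, 1]

for samples in (down_for_3_min_then_up, flapping):
    down = sum(1 for s in samples if s > 1)
    # 60% and 50% respectively: both patterns satisfy percent_duration=50
    print(100.0 * down / len(samples), '% of observations show "down"')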

8.10.6. bgpSessionDown: alert with dependencies

The following example illustrates an alert that takes into account a dependency between monitoring variables. The alert activates when it notices that the BGP session state has become anything but “established” while the corresponding peering interface is in operational state Up. We assume that the “interface down” condition is tracked by another alert (see the example above) and we don’t want to get two alerts when the peering interface goes down - first “interfaceDown” and then “bgpSessionDown”. We want to get only the “interfaceDown” alert when the interface goes down, and the “bgpSessionDown” alert when the interface stays up but the BGP session disconnects. We use the tag BGP4PeerAddress to associate the BGP state and interface state variables, and the parameter ignore_if to provide an additional condition function that is called whenever the alert is about to become active. If this second condition function returns True because the interface is down, the alert does not activate. Here is the code:

def intf_down(mvar):
    # The MonitoringVariable instance passed as an argument is an instance of the
    # `bgpPeerState` variable. Find the corresponding `ifOperStatus` variable using
    # the device id and the tag `BGP4PeerAddress`.
    # ifOperStatus values: 1=Up, 2=Down

    for tag in mvar.getTagsInFacet('BGP4PeerAddress'):
        for intf_var in query('FROM ifOperStatus WHERE deviceId={0} AND BGP4PeerAddress={1}'.format(mvar.ds.deviceId, tag)):
            # the interface is down if the latest known value of ifOperStatus is greater than 1
            return intf_var.timeseries.getLastNonNaNValue() > 1

    # return False if we can't find the tag or a matching interface variable
    return False

def alert_bgp_down(log):

    # take instances of variable `bgpPeerState` and check if the value
    # is not 6 ("established"). If this condition is satisfied, check
    # operational state of the interface and ignore if the interface is down

    alert(
        name='bgpSessionDown',
        input=query('FROM bgpPeerState'),
        condition=lambda mvar, value: value < 6,
        ignore_if=lambda mvar: intf_down(mvar),
        description='BGP Session is down but interface is up',
        details={},
        notification_time=300,
        streams=['log', 'slack'],
        fan_out=True
    )

Note

You can use plain Python print instead of calling self.log.info(). The output goes to the server log, and the log level depends on the output stream you print to: if you print to stdout, the log level is INFO; output sent to stderr appears in the log at level ERROR.
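
A minimal sketch of what this looks like in a rules script, assuming the UserRules class structure shown earlier (nw2rules and self.log are provided by the rules script environment):

import sys

class UserRules(nw2rules.Nw2Rules):
    def execute(self):
        # goes to the server log at level INFO
        print('rule processing cycle started')

        # goes to the server log at level ERROR
        sys.stderr.write('something unexpected happened\n')

        # equivalent explicit logging call
        self.log.info('rule processing cycle started')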