8.12. Alerts and NetSpyGlass Cluster

8.12.1. Configuration

The alerting script used with NetSpyGlass SaaS is modular: individual alerts are isolated from each other, which makes it easier and safer to manage them one by one.

All alert scripts are located in the directory /opt/netspyglass/home/scripts/alerts/. The package-level __init__.py file is provided with the system and you do not need to edit it. This script automatically finds the other Python scripts in the same directory, looks for functions whose names start with alert_, then loads and executes them. Each individual alert should be placed in its own script with just one call to the function nw2functions.alert(). Here is an overview of the directory structure, with the file __init__.py and a few alerts:

# ls -la /opt/netspyglass/home/scripts/alerts/
total 46
drwxrwxr-x    2 nw2      nw2              0 Sep 20 01:22 .
drwxrwxr-x    2 nw2      nw2              0 Aug  1 16:10 ..
-rw-rw-r--    1 nw2      nw2           1173 Oct 10 04:38 __init__.py
-rw-rw-r--    1 nw2      nw2            614 Sep 20 01:22 big_change_in_num_vars.py
-rw-rw-r--    1 nw2      nw2            743 Sep 20 01:22 busy_server_cpu.py
-rw-rw-r--    1 nw2      nw2            640 Sep 20 01:22 file_system_full.py
-rw-rw-r--    1 nw2      nw2            689 Sep 20 01:22 server_out_of_time.py
-rw-rw-r--    1 nw2      nw2            832 Oct  4 16:04 system_memory_low.py
-rw-rw-r--    1 nw2      nw2            584 Oct  4 16:04 tsdb_errors.py
-rw-rw-r--    1 nw2      nw2            539 Sep 20 01:22 tsdb_falls_behind.py

Important

Do not edit file __init__.py

An individual alert script looks like this:

# cat /opt/netspyglass/home/scripts/alerts/busy_server_cpu.py

from nw2functions import *


def alert_busy_server_cpu(log):
    """
    This alert activates when CPU utilization on one of the NSG servers or agents
    stays above 75% for at least half of any 5 min interval (that is, for 3 out
    of any 5 consecutive measurements).

    The alert sends repeated notifications every 10 min if the problem persists.
    """
    alert(
        name='busyServerCpu',
        input=import_var('cpuUsage'),
        condition=lambda _, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=600,
        streams=['log', 'slack'],
        fan_out=True
    )

As you can see, the script contains nothing but a single function, alert_busy_server_cpu(), that defines a single alert. The top-level script __init__.py will find it because its name starts with alert_.
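The duration and percent_duration arguments in the script above can be read as a sliding-window test. The sketch below is not the nw2functions implementation; it only reproduces the arithmetic from the docstring ("3 out of any 5 consecutive measurements"). The helper name window_condition_met, the 60-second measurement interval, the rounding up, and passing None as the first argument of the condition are all assumptions made for illustration.

```python
import math


def window_condition_met(values, condition, duration=300,
                         percent_duration=50, interval=60):
    """Hypothetical helper: True if `condition` holds for at least
    `percent_duration` percent of the last `duration` seconds.

    Assumes one measurement every `interval` seconds (an assumption,
    not a documented NetSpyGlass setting).
    """
    window_size = duration // interval            # 300 s / 60 s = 5 measurements
    if len(values) < window_size:
        return False                              # not enough history yet
    window = values[-window_size:]
    # 50% of 5 measurements is 2.5, rounded up to 3
    required = math.ceil(window_size * percent_duration / 100.0)
    # The first argument of the condition is unused here, as in the lambda above
    matches = sum(1 for value in window if condition(None, value))
    return matches >= required
```

With these assumptions, the measurements 10, 80, 90, 20, 76 satisfy the condition value > 75 three times out of five, so the test passes; replacing the last value with 75 leaves only two matches and the test fails.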

If your alert logic is more complex and requires additional Python code that you want to put in a separate function, give that function a name that does not start with alert_ and the system will not try to execute it as another alert.

The top-level script __init__.py wraps each call to a function found in these alert scripts in a try-except block to catch exceptions. This isolates the alerts from each other: if you make an error in your alert script, it won’t break other alerts.
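The discovery and isolation behavior described above can be sketched as follows. The contents of the real __init__.py are not shown in this section, so this is only an illustration of the idea; the function names find_alert_functions and run_alerts and the in-memory demo module are hypothetical.

```python
import logging
import types


def find_alert_functions(module):
    """Return the callables in `module` whose names start with 'alert_'."""
    return [
        obj for name, obj in sorted(vars(module).items())
        if name.startswith('alert_') and callable(obj)
    ]


def run_alerts(modules, log):
    """Call every alert function; a failure in one does not stop the others."""
    failed = []
    for module in modules:
        for func in find_alert_functions(module):
            try:
                func(log)
            except Exception:
                # Log the traceback and continue with the next alert
                log.exception('alert %s failed', func.__name__)
                failed.append(func.__name__)
    return failed


# Demo: an in-memory stand-in for one alert script
calls = []
demo = types.ModuleType('demo_alerts')


def alert_ok(log):
    calls.append('alert_ok')


def alert_broken(log):
    raise ValueError('error in alert script')


def helper_threshold():
    # Not prefixed with alert_, so the loader never calls it
    return 75


demo.alert_ok = alert_ok
demo.alert_broken = alert_broken
demo.helper_threshold = helper_threshold

failed = run_alerts([demo], logging.getLogger('alerts'))
```

In this demo alert_broken raises an exception, which is logged and recorded, while alert_ok still runs and helper_threshold is ignored.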

8.12.2. Avoiding Duplication

The NetSpyGlass service runs in the cloud as a cluster, that is, it consists of several NetSpyGlass servers to provide redundancy and scale. Each server in the cluster runs an identical copy of the alerting Python script, but since these copies operate on different sets of data, they do not create duplicate alerts. Here is how it works:

Each NSG server in the cluster collects data from a subset of devices and stores the values of the collected monitoring variables in its in-memory data pool. A call to nw2functions.import_var() returns only the variables that are in the data pool of the server it runs on, which means the function nw2functions.alert() in turn sees only these variables. Since the variables in the data pools of different servers do not overlap, no duplicate alerts are generated.

Aggregation variables present their own separate challenge, though.

Each server can also execute a second user-provided Python script to compute aggregated values (see ref:rules). This script creates new monitoring variables and sends them back to the system by calling nw2functions.export_var(). These variables end up in the same data pool from which alerts take them. Because all servers execute the same copy of the script that generates aggregated variables, the alerting scripts on different servers operate on the same aggregated data and can therefore generate duplicate alerts.

To avoid this, use the function nw2functions.lock() as part of the alert code. This function ensures that the protected part of the Python code is executed by only one server in the cluster. The argument of this function is the name of the lock; the name of the alert function is a simple choice for it.

Note that only alerts that operate on aggregated variables need to use lock().

Here is an example of an alert that uses locking:

from nw2functions import *


def alert_aggregated_value_check(log):
    """
    this alert activates when the value of a hypothetical aggregated variable `AggregatedVar` goes over threshold
    and uses lock to make sure only one server in the cluster executes it on every cycle to
    avoid duplication
    """
    if lock('alert_aggregated_value_check'):
        alert(
            name='aggregateTooHigh',
            input=import_var('AggregatedVar'),
            condition=lambda _, value: value > 10,
            description='An aggregated variable went over threshold',
            duration=300,
            percent_duration=50,
            notification_time=600,
            streams=['log', 'slack'],
            fan_out=True
        )
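To illustrate why the lock matters only for aggregated variables, here is a toy single-process simulation of one alerting cycle on three servers. It does not use nw2functions and makes no claim about how lock() is implemented in the cluster; the ToyCluster class, the run_cycle function, the server names and the sample values are all hypothetical. In particular, the toy lock takes the server name as an explicit argument only because all "servers" here run in the same process.

```python
class ToyCluster:
    """Stand-in for cluster coordination: each named lock is granted
    to a single server per cycle."""

    def __init__(self):
        self._holders = {}

    def lock(self, name, server):
        # The first server to ask in this cycle becomes the holder
        holder = self._holders.setdefault(name, server)
        return holder == server

    def next_cycle(self):
        self._holders.clear()


def run_cycle(cluster, pools, use_lock):
    """Return the (server, variable) pairs that raised an alert in one cycle."""
    raised = []
    for server, pool in pools.items():
        # Per-device alert: each server sees only its own data pool
        for device, value in pool['cpuUsage'].items():
            if value > 75:
                raised.append((server, device))
        # Aggregated alert: every server holds the same aggregated variable
        if not use_lock or cluster.lock('alert_aggregated_value_check', server):
            if pool['AggregatedVar'] > 10:
                raised.append((server, 'AggregatedVar'))
    cluster.next_cycle()
    return raised


# Device variables do not overlap; the aggregated variable is on every server
pools = {
    'nsg1': {'cpuUsage': {'router1': 80}, 'AggregatedVar': 12},
    'nsg2': {'cpuUsage': {'router2': 40}, 'AggregatedVar': 12},
    'nsg3': {'cpuUsage': {'router3': 90}, 'AggregatedVar': 12},
}

without_lock = run_cycle(ToyCluster(), pools, use_lock=False)
with_lock = run_cycle(ToyCluster(), pools, use_lock=True)
```

In this simulation the aggregated alert is raised three times without the lock and once with it, while the per-device alerts for router1 and router3 are raised exactly once in both cases.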