8.12. Alerts and NetSpyGlass Cluster
8.12.1. Configuration
The alerting script used with NetSpyGlass SaaS is designed to be modular: individual alerts are isolated from each other, which makes it easier and safer to manage them one by one.
All alert scripts are located in the directory /opt/netspyglass/home/scripts/alerts/.
The package-level __init__.py file is provided with the system; you do not need to edit it.
This script automatically finds other Python scripts in the same directory, looks for functions whose names start with alert_, then loads and executes them. Each individual alert should be placed in its own script with just one call to the function nw2functions.alert().
Here is an overview of the directory structure with the file __init__.py and a few alerts:
# ls -la /opt/netspyglass/home/scripts/alerts/
total 46
drwxrwxr-x 2 nw2 nw2 0 Sep 20 01:22 .
drwxrwxr-x 2 nw2 nw2 0 Aug 1 16:10 ..
-rw-rw-r-- 1 nw2 nw2 1173 Oct 10 04:38 __init__.py
-rw-rw-r-- 1 nw2 nw2 614 Sep 20 01:22 big_change_in_num_vars.py
-rw-rw-r-- 1 nw2 nw2 743 Sep 20 01:22 busy_server_cpu.py
-rw-rw-r-- 1 nw2 nw2 640 Sep 20 01:22 file_system_full.py
-rw-rw-r-- 1 nw2 nw2 689 Sep 20 01:22 server_out_of_time.py
-rw-rw-r-- 1 nw2 nw2 832 Oct 4 16:04 system_memory_low.py
-rw-rw-r-- 1 nw2 nw2 584 Oct 4 16:04 tsdb_errors.py
-rw-rw-r-- 1 nw2 nw2 539 Sep 20 01:22 tsdb_falls_behind.py
Important
Do not edit the file __init__.py.
An individual alert script looks like this:
# cat /opt/netspyglass/home/scripts/alerts/busy_server_cpu.py
from nw2functions import *


def alert_busy_server_cpu(log):
    """
    this alert activates when CPU utilization on one of the NSG servers or agents
    goes over 75% for a half of any consecutive 5 min interval of time (that is,
    for 3 out of any 5 consecutive measurements)
    Alert sends repeated notifications every 10 min if the problem persists
    """
    alert(
        name='busyServerCpu',
        input=import_var('cpuUsage'),
        condition=lambda _, value: value > 75,
        description='CPU utilization on one of the servers is over 75% for 50% of time for the last 5 min',
        duration=300,
        percent_duration=50,
        notification_time=600,
        streams=['log', 'slack'],
        fan_out=True
    )
As you can see, it contains nothing but a single function, alert_busy_server_cpu(), that defines a single alert. The top-level script __init__.py will find it because its name starts with alert_.
If your alert logic is more complex and requires additional Python code that you want to put in a helper function, give that function a name that does not start with alert_ and the system will not try to execute it as another alert.
The top-level script __init__.py wraps calls to the functions found in these alert scripts in a try-except block to catch exceptions. This isolates the alerts from each other: if you make an error in one alert script, it won't break the other alerts.
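The discovery and isolation behaviour described above can be illustrated with a minimal sketch. This is not the __init__.py shipped with NetSpyGlass, whose contents are not shown here; the function run_alerts() and the sample functions are hypothetical and only model the idea: call every function whose name starts with alert_, skip helpers, and catch exceptions so that one failing alert does not stop the others.

```python
import logging


def run_alerts(namespace, log):
    """Call every function in `namespace` whose name starts with 'alert_'.

    Hypothetical stand-in for the dispatcher; returns two lists of names:
    alerts that ran to completion and alerts that raised an exception.
    """
    executed, failed = [], []
    for name in sorted(namespace):
        func = namespace[name]
        if not name.startswith('alert_') or not callable(func):
            continue  # helpers and non-functions are ignored
        try:
            func(log)
            executed.append(name)
        except Exception:
            # Isolation: record the error and move on to the next alert
            log.exception('alert %s failed', name)
            failed.append(name)
    return executed, failed


def threshold_helper(value):
    # Not prefixed with alert_, so the dispatcher never calls it directly
    return value > 75


def alert_ok(log):
    log.info('helper says: %s', threshold_helper(80))


def alert_broken(log):
    raise ValueError('simulated mistake in an alert script')


log = logging.getLogger('alerts_sketch')
executed, failed = run_alerts(
    {
        'threshold_helper': threshold_helper,
        'alert_ok': alert_ok,
        'alert_broken': alert_broken,
    },
    log,
)
print(executed, failed)
```

In this sketch alert_broken() raises an exception, yet alert_ok() still runs, and threshold_helper() is only ever called from inside an alert function.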
8.12.2. Avoiding Duplication
The NetSpyGlass service runs in the cloud as a cluster: it consists of several NetSpyGlass servers that provide redundancy and scale. Each server in the cluster runs an identical copy of the alerting Python script, but since these copies operate on different sets of data, they do not create duplicate alerts. Here is how it works:
Each NSG server in the cluster collects data from a subset of devices and stores the values of the collected monitoring variables in its in-memory data pool. A call to nw2functions.import_var() returns only the variables that are in the data pool of the local server, which means that the function nw2functions.alert() in turn sees only these variables. Since the variables in the data pools of different servers do not overlap, we get no duplicate alerts.
Aggregation variables present their own separate challenge, though.
Each server can execute a second user-supplied Python script to compute aggregated values (see the section on rules).
This script creates new monitoring variables and sends them back to the system by calling nw2functions.export_var(). These variables end up in the same data pool from which alerts take them. All servers execute the same copy of the script that generates aggregated variables. This means that the alerting scripts on different servers operate on the same set of data and can therefore generate duplicate alerts.
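The difference between collected and aggregated variables can be shown with a toy model. The sketch below is a simplification under stated assumptions: the data pools are plain dictionaries, the server and device names are made up, and fired_alerts() is a hypothetical helper that is not part of nw2functions. It only illustrates why disjoint pools produce one alert per device while an aggregated variable present on every server produces one alert per server.

```python
def fired_alerts(pools, var_name, threshold):
    """Return (server, item) pairs for which a threshold alert would fire.

    `pools` maps a server name to its local data pool, modelled as
    {variable name: {item: value}}. Each server looks only at its own pool.
    """
    fired = []
    for server, pool in sorted(pools.items()):
        for item, value in sorted(pool.get(var_name, {}).items()):
            if value > threshold:
                fired.append((server, item))
    return fired


# Collected variables: each device is polled by exactly one server
pools = {
    'server1': {'cpuUsage': {'router1': 90}},
    'server2': {'cpuUsage': {'router2': 80}},
}

# Every server runs the same aggregation script and exports the same variable
for pool in pools.values():
    pool['AggregatedVar'] = {'cluster': 42}

collected = fired_alerts(pools, 'cpuUsage', 75)
aggregated = fired_alerts(pools, 'AggregatedVar', 10)
print(collected)
print(aggregated)
```

The collected variable yields one alert per device, each from a different server. The aggregated variable yields the same alert twice, once from each server, which is the duplication described above.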
To avoid this, use the function nw2functions.lock() as part of the alert code. This function ensures that the protected part of the Python code is executed by only one server in the cluster. The argument in the call to this function is the name of the lock; the name of the alert function is a simple choice for this.
Note that only alerts that operate with aggregate variables need to use lock().
Here is an example of an alert that uses locking:
from nw2functions import *


def alert_aggregated_value_check(log):
    """
    this alert activates when the value of a hypothetical aggregated variable `AggregatedVar` goes over threshold
    and uses lock to make sure only one server in the cluster executes it on every cycle to
    avoid duplication
    """
    if lock('alert_aggregated_value_check'):
        alert(
            name='aggregateTooHigh',
            input=import_var('AggregatedVar'),
            condition=lambda _, value: value > 10,
            description='An aggregated variable went over threshold',
            duration=300,
            percent_duration=50,
            notification_time=600,
            streams=['log', 'slack'],
            fan_out=True
        )
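To make the effect of the lock concrete, here is a toy in-memory model of it. It is only a sketch and an assumption about the observable behaviour, not the implementation behind nw2functions.lock(): the class ToyClusterLock is hypothetical, and it takes the server name and the cycle number as explicit arguments, whereas the real function takes only the lock name. How the real cluster picks the server that gets the lock is not covered here; in this model the first server to ask wins.

```python
class ToyClusterLock:
    """In-memory stand-in for a cluster-wide lock (hypothetical)."""

    def __init__(self):
        # (lock name, cycle) -> name of the server that acquired the lock
        self._holders = {}

    def lock(self, name, server, cycle):
        # The first server to ask for a given name in a given cycle wins;
        # every other server gets False for that cycle
        holder = self._holders.setdefault((name, cycle), server)
        return holder == server


cluster = ToyClusterLock()
executed_by = []
for cycle in (1, 2):
    for server in ('server1', 'server2', 'server3'):
        if cluster.lock('alert_aggregated_value_check', server, cycle):
            # Only the lock holder would reach the call to alert() here
            executed_by.append((cycle, server))

print(executed_by)
```

Three servers run the same code on every cycle, but the protected branch is entered exactly once per cycle, so an alert on an aggregated variable is evaluated only once.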