Before we go over the notification logic, we should look at silences.
A “silence” is an object that exists in NetSpyGlass and acts as a filter for alert notifications. A silence can match the alert name, device id and name, component id, tags and alert key. If a matching silence object exists at the time an alert needs to send a notification, the notification is suppressed and only a record is written to the log /opt/netspyglass/home/logs/alerts.log. The record looks like this:
2015-06-22 21:24:00,212: ALERT SILENCED: lagPartiallyDegraded.9.611 | fs-cs2 | ae10 | active since: 2015-06-19 23:22:00 UTC; silence id=1
This log record lists the alert name together with the ids of the device and component that triggered the alert, the device and component names, the time when the alert became active, and an indication that it has been silenced by the matching silence with id=1.
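If you want to count or audit silenced notifications, a record like the one above can be split into its fields with a regular expression. The following is a minimal sketch, not part of NetSpyGlass: the pattern is inferred from the single sample record shown above, the function name parse_silenced_record and the field names are hypothetical, and it assumes the alert name itself contains no dots.

```python
import re
from typing import Optional

# Pattern inferred from the one sample record in this section; real logs may
# contain variations that this sketch does not handle.
SILENCED_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}): ALERT SILENCED: "
    r"(?P<alert_name>[^.\s]+)\.(?P<device_id>\d+)\.(?P<component_id>\d+) \| "
    r"(?P<device_name>[^|]+?) \| (?P<component_name>[^|]+?) \| "
    r"active since: (?P<active_since>[^;]+); silence id=(?P<silence_id>\d+)\s*$"
)


def parse_silenced_record(line: str) -> Optional[dict]:
    """Return the fields of an ALERT SILENCED record, or None if the line does not match."""
    m = SILENCED_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Numeric ids are converted to int; everything else stays a string.
    for key in ("device_id", "component_id", "silence_id"):
        fields[key] = int(fields[key])
    return fields


sample = (
    "2015-06-22 21:24:00,212: ALERT SILENCED: lagPartiallyDegraded.9.611 | fs-cs2 | ae10 "
    "| active since: 2015-06-19 23:22:00 UTC; silence id=1"
)
record = parse_silenced_record(sample)
print(record)
```

Lines that do not have this shape return None, so the function can be applied to every line of alerts.log and the non-matching ones skipped.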
At this time we provide two methods for adding, updating and listing silences: the command line script /opt/netspyglass/current/bin/silence.py and JSON API calls such as POST /api/alerts/networks/:netid/silences. The script silence.py is described separately in this documentation.
Silences are useful when you want to suppress alert notifications for known problems, such as during network maintenance. Some alerting systems that NetSpyGlass can send notifications to have a feature that allows the operator to “mute” them; for example, this is possible in PagerDuty. However, the granularity of this “alert muting” may be insufficient. In PagerDuty you can apply a “maintenance window” to a whole service at once but not to individual incidents. If all (or a large part) of NetSpyGlass alerts go to the same service in PagerDuty, you can either stop all of them from escalating or none. It is often more useful to “mute” or “silence” all alerts for just one device, or all alerts of a particular kind for a group of devices. This can be achieved by adding a silence that matches devices and other attributes.
You can add a silence ahead of time to block alert notifications before they happen.
A silence has the following basic parameters (these are accessible via the JSON API or the command line script):
- expirationTimeMs - expiration time for the silence, starting from the moment when it was created.
- key - alert key to match. This matches a specific alert, so if this attribute is provided, the other matching attributes are redundant
- varName - alert name to match, this can be a regular expression
- deviceId - device id to match
- deviceName - device name to match, this can be a regular expression
- index - component id to match
- tags - tags to match
Matching attributes can be provided in any combination; a silence matches an alert if all of the provided matching attributes match. Here are a few examples of silences:
Suppose we have an alerting rule that creates the alert lagPartiallyDegraded, and the input variable passed as the argument input to the call to alert() has devices with names fs-cs1 (id=8) and fs-cs2 (id=9). This call to the alert() function creates alerts with the name lagPartiallyDegraded that you can see in the Graphing Workbench (category Alerts). Suppose further that we are going to perform maintenance on device fs-cs2 and want to silence all alerts for it for 1 hour. This can be achieved with the script silence.py:
silence.py add --var_name='lagPartiallyDegraded' --dev_name='fs-cs2' --expiration=60
We can create a silence that matches just the alert name, in which case it applies to alerts generated for all devices:
silence.py add --var_name='lagPartiallyDegraded' --expiration=60
Since the device name can be a regular expression, we can match just specific devices like this:
silence.py add --var_name='lagPartiallyDegraded' --dev_name='fs-cs' --expiration=60
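To make the effect of these three silences concrete, here is a minimal sketch of the matching rule described above: attributes that are not provided are ignored, and all provided attributes must match. This is an illustration only, not the NetSpyGlass implementation. The classes Alert and Silence are hypothetical, the component id 610 for fs-cs1 is made up, and the sketch assumes that regular expressions are applied as a search anywhere in the name (so that 'fs-cs' matches both fs-cs1 and fs-cs2). The key and tags attributes are left out because this section does not describe how they are compared.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    var_name: str      # alert name
    device_id: int
    device_name: str
    index: int         # component id


@dataclass
class Silence:
    # None means "attribute not provided": it does not restrict the match.
    var_name: Optional[str] = None      # treated as a regular expression
    device_id: Optional[int] = None
    device_name: Optional[str] = None   # treated as a regular expression
    index: Optional[int] = None

    def matches(self, alert: Alert) -> bool:
        checks = []
        if self.var_name is not None:
            # Assumption: the pattern may match anywhere in the name.
            checks.append(re.search(self.var_name, alert.var_name) is not None)
        if self.device_id is not None:
            checks.append(self.device_id == alert.device_id)
        if self.device_name is not None:
            checks.append(re.search(self.device_name, alert.device_name) is not None)
        if self.index is not None:
            checks.append(self.index == alert.index)
        # All provided attributes must match; with no attributes this is True.
        return all(checks)


alert_cs1 = Alert("lagPartiallyDegraded", device_id=8, device_name="fs-cs1", index=610)
alert_cs2 = Alert("lagPartiallyDegraded", device_id=9, device_name="fs-cs2", index=611)

one_device = Silence(var_name="lagPartiallyDegraded", device_name="fs-cs2")
all_devices = Silence(var_name="lagPartiallyDegraded")
by_pattern = Silence(var_name="lagPartiallyDegraded", device_name="fs-cs")

for name, silence in [("one_device", one_device),
                      ("all_devices", all_devices),
                      ("by_pattern", by_pattern)]:
    print(name, silence.matches(alert_cs1), silence.matches(alert_cs2))
```

Under these assumptions the first silence suppresses only the alert for fs-cs2, while the other two suppress the alerts for both devices.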
The same can be achieved using the JSON API (see POST /api/alerts/networks/:netid/silences); in fact, the script silence.py makes these JSON API calls to manipulate silences.
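As a rough illustration of what such a call might look like, the sketch below builds (but does not send) a POST request for the first example. Only the endpoint path and the parameter names expirationTimeMs, varName and deviceName come from this section. Everything else is an assumption: the server address nsg.example.com, the network id 1, the flat JSON body, the interpretation of expirationTimeMs as a duration in milliseconds, and the helper name build_add_silence_request. Check the JSON API description before relying on any of it.

```python
import json
import urllib.request
from typing import Optional


def build_add_silence_request(base_url: str, net_id: int,
                              var_name: Optional[str] = None,
                              device_name: Optional[str] = None,
                              expiration_minutes: int = 60) -> urllib.request.Request:
    """Build a POST request that would add a silence (hypothetical body layout)."""
    # Assumption: expirationTimeMs is a duration in milliseconds.
    body = {"expirationTimeMs": expiration_minutes * 60 * 1000}
    # Only attributes that were provided are included in the body.
    if var_name is not None:
        body["varName"] = var_name
    if device_name is not None:
        body["deviceName"] = device_name
    url = f"{base_url}/api/alerts/networks/{net_id}/silences"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_add_silence_request(
    "http://nsg.example.com",   # hypothetical server address
    1,                          # hypothetical network id
    var_name="lagPartiallyDegraded",
    device_name="fs-cs2",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send the request: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the server.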