11. Integration with Nagios

NetSpyGlass comes with the Nagios plugin script check_netspyglass.py. The script checks the values of all instances of a single monitoring variable specified on the command line, then returns standard Nagios return codes and output that identify the state as “OK”, “WARNING” or “CRITICAL”.

11.1. Configuration

Command line:

Usage:

    check_netspyglass.py (-s|--server)=host [(-p|--port)=port] (-n|--network)=netid [-y|--healthcheck] [(-d|--device)=devaddress] [(-k|--component)=component] [(-t|--tags)=tags] (-v|--var)=varName (-w|--warning)=N1 (-c|--critical)=N2 [-x|--matchValue] [-u|--unknownOk] [-h|--help]
       server:         NetSpyGlass UI backend server host name or address

       port:           port number used by NetSpyGlass UI backend (default: 9100)

       network:        NetSpyGlass network id (a number, default: 1)

       healthcheck:    use /api/metrics/healthcheck API call to determine "health status" of the server.
                       Use port numbers configured in the nw2.conf configuration file to query UI
                       backend or monitor.

       device:         Address of the device to check the variable for. This must match ip address
                       configured for this device in NetSpyGlass. If this parameter is missing, plugin
                       checks value of the variable for all devices.

       component:      network interface or hardware component name

       tags:           comma separated list of tags to match (if more than one tag is specified, variable instance
                       matches only when it has all tags)

       var:            the name of the monitoring variable

       warning:        range of values of the attribute 'colorLevel' of the variable (n1:n2 or just n1:). If value
                       is outside of this range, the condition is considered a "warning"

       critical:       range of values of the attribute 'colorLevel' of the variable (n1:n2 or just n1:). If value
                       is outside of this range, the condition is considered a "critical"

       matchValue:     compare value of the monitoring variable against 'warning' and 'critical' ranges (by
                       default, script compares the value of the attribute "colorLevel"). Note that NetSpyGlass API
                       returns the value of the variable in two formats: "raw" and "scaled". Raw value is in
                       the original or unscaled units of the variable, for example, "bits/sec", while scaled value
                       is in scaled units such as "Mbit/sec". API also returns "raw" and "scaled" unit of the variable.
                       This plugin compares raw value against warning and critical ranges, but includes scaled value
                       and unit in the output which Nagios puts in $SERVICEOUTPUT$ macro.

       unknownOk:      if NetSpyGlass does not have monitoring variable for the device and component defined by
                       the --device and --component (or --tags) parameters, the plugin returns status "UNKNOWN"
                       by default. However when option --unknownOk is also present, it returns status "OK".

       -h --help:      print this usage summary

Examples

    check_netspyglass.py --server=10.1.1.1 --device='10.10.10.10' --var=ifOperStatus -c:99
    check_netspyglass.py --server=10.1.1.1 --device='10.10.10.10' --tags=ifRole.PeeringInterface --var=ifOutRate -c@2:2

The plugin can be used to monitor values of monitoring variables or self-monitoring metrics generated by NetSpyGlass servers. To monitor devices and their components, use the command line parameters --device, --component, --tags and --var. These parameters form a filter that determines which monitoring variable instances the plugin will inspect.
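The way per-instance results turn into a single Nagios status can be sketched as follows. The exit codes 0 to 3 are the standard Nagios plugin codes; the rule that the worst state among all matched instances wins, and the function name overall_status, are assumptions made for illustration and are not taken from the plugin source.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def overall_status(instance_states, unknown_ok=False):
    """Combine per-instance states into one plugin exit code.

    Assumption: the worst state among the matched instances wins.
    When the filter matches nothing, the result is UNKNOWN unless
    unknown_ok is set, which mirrors the --unknownOk option.
    """
    if not instance_states:
        return OK if unknown_ok else UNKNOWN
    if CRITICAL in instance_states:
        return CRITICAL
    if WARNING in instance_states:
        return WARNING
    return OK


print(overall_status([OK, WARNING, CRITICAL]))  # 2
print(overall_status([]))                       # 3
print(overall_status([], unknown_ok=True))      # 0
```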

11.2. Monitoring Devices

Note

In most cases you can skip parameter --network and use the default value.

The plugin calls JSON API on NetSpyGlass server to get values of the specified variable. Returned data looks like this:

[ {
  "device" : "c3560g-1",
  "address" : "10.0.14.228",
  "component" : "Gi0/17",
  "color" : "#dddddd",
  "colorLevel" : "0",
  "value" : "172846.93333333332",
  "scaledValue" : "0.173",
  "unit" : "bit/sec",
  "scaledUnit" : "Mbit/sec",
  "tags" : "Explicit.core, Link.vlan1_1, ifOperStatus.Up, ifRole.PhysicalPort, ifAdminStatus.Up, ifSpeed.1G, ifRole.UntaggedSwtichPort, ifRole.BroadcastTypeInterface"
}, {
  "device" : "c3560g-1",
  "address" : "10.0.14.228",
  "component" : "Gi0/18",
  "color" : "#ffdead",
  "colorLevel" : "1",
  "value" : "146676.53333333333",
  "scaledValue" : "146.677",
  "unit" : "bit/sec",
  "scaledUnit" : "kbit/sec",
  "tags" : "Explicit.core, Link.10.0.14.90, ifOperStatus.Up, ifRole.PhysicalPort, ifSpeed.10M, ifAdminStatus.Up, ifRole.UntaggedSwtichPort, ifRole.BroadcastTypeInterface"
}, {
     . . .
]

In addition to the device name, interface or hardware component name and value, each variable is annotated with the attributes “color” and “colorLevel”. These correspond to the colors you see in the NetSpyGlass UI. The colors are assigned by the NetSpyGlass monitor according to the thresholds defined in the configuration file, or as the result of actions taken by the Python script in NetSpyGlass.
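The filter semantics described in the usage summary (match by device address, component name, and "all tags must be present") can be illustrated against records of this shape. This is a minimal sketch: the function select_instances is hypothetical, the sample is an abbreviated copy of the two records above, and whether the real plugin filters on the client or lets the server do it is not stated in this document.

```python
import json

# Abbreviated copy of the two records shown above.
SAMPLE_RESPONSE = """
[
  {"device": "c3560g-1", "address": "10.0.14.228", "component": "Gi0/17",
   "colorLevel": "0", "value": "172846.93333333332",
   "tags": "Explicit.core, ifOperStatus.Up, ifRole.PhysicalPort, ifSpeed.1G"},
  {"device": "c3560g-1", "address": "10.0.14.228", "component": "Gi0/18",
   "colorLevel": "1", "value": "146676.53333333333",
   "tags": "Explicit.core, ifOperStatus.Up, ifRole.PhysicalPort, ifSpeed.10M"}
]
"""


def select_instances(records, device=None, component=None, tags=()):
    """Return the records that match every filter that was given."""
    selected = []
    for record in records:
        # The "tags" attribute is a comma separated string.
        record_tags = {tag.strip() for tag in record["tags"].split(",")}
        if device is not None and record["address"] != device:
            continue
        if component is not None and record["component"] != component:
            continue
        # An instance matches only when it carries all requested tags.
        if not set(tags) <= record_tags:
            continue
        selected.append(record)
    return selected


records = json.loads(SAMPLE_RESPONSE)
for record in select_instances(records, device="10.0.14.228", tags=["ifSpeed.10M"]):
    print(record["component"], record["colorLevel"])  # Gi0/18 1
```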

The meaning of the thresholds and therefore, “colorLevel” attribute, depends on the variable. For example, default configuration sets 5 levels for the interface utilization variables ifInUtilization and ifOutUtilization:

network.monitor.display.thresholds {
    ifInUtilization: [
        { value = "0",    colorLevel = 0},
        { value = "0.2",  colorLevel = 1},
        { value = "0.6",  colorLevel = 2},
        { value = "0.9",  colorLevel = 3},
    ],

    ifOutUtilization: [
        { value = "0",    colorLevel = 0},
        { value = "0.2",  colorLevel = 1},
        { value = "0.6",  colorLevel = 2},
        { value = "0.9",  colorLevel = 3},
    ],
}

Values of the variables ifInUtilization and ifOutUtilization are computed by the default Python rule processing script nw2rules.py (you can find a copy of it in the directory python of the distribution tar archive). These values are calculated as interface traffic divided by the interface speed:

ifInUtilization = ifInRate / ifSpeed
ifOutUtilization = ifOutRate / ifSpeed

If interface utilization is under 0.2 (20%), the value of the “colorLevel” attribute is “0”. If utilization is greater than or equal to 0.2 but less than 0.6, “colorLevel” is equal to 1, and so on. Color level “100” is reserved for interfaces that are in the operational state “down”.
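The mapping from a utilization value to a color level can be sketched as below. The threshold table is copied from the default configuration shown above; the function names and the rule "the highest threshold that the value reaches wins" are an illustration of the behavior described in the previous paragraph, not NetSpyGlass code.

```python
# (threshold value, colorLevel) pairs from the default ifInUtilization config.
DEFAULT_THRESHOLDS = [(0.0, 0), (0.2, 1), (0.6, 2), (0.9, 3)]


def utilization(rate_bits_per_sec, speed_bits_per_sec):
    """Interface traffic divided by interface speed."""
    return rate_bits_per_sec / speed_bits_per_sec


def color_level(value, thresholds=DEFAULT_THRESHOLDS):
    """Return the level of the highest threshold that value reaches."""
    level = thresholds[0][1]
    for threshold_value, threshold_level in thresholds:
        if value >= threshold_value:
            level = threshold_level
    return level


# 250 Mbit/sec on a 1 Gbit/sec interface is 25% utilization.
print(color_level(utilization(250e6, 1e9)))  # 1
```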

Use the command line parameter “-w” or “--warning” to tell the plugin which color levels should produce the warning state. Remember that the value of the “colorLevel” attribute is the number of the threshold rather than its value. This means colorLevel can be equal to “1” but not “0.2” or “20%” as in this example.

Note

The threshold values for variables ifInUtilization and ifOutUtilization shown above are the defaults. If you want to change them, add a network.monitor.display.thresholds section to your configuration file nw2.conf and define new values for the thresholds. If you do this and use parameter “-w” on the Nagios plugin command line, the plugin will follow the new threshold values.

Our plugin recognizes standard Nagios range definitions, such as “10”, “1:2”, “~:10”, “@2:3”. You can find their description here:

https://nagios-plugins.org/doc/guidelines.html#THRESHOLDFORMAT

Note that ranges define values of the parameter that are considered to be “OK” by Nagios. Nagios declares state “warning” or “critical” when the value falls outside of the range. Range boundaries are inclusive. Prepending range with “@” inverts the logic, that is, condition is now “OK” if the value falls outside of the range.

For example, to warn when colorLevel is “2” (this corresponds to color orange in NetSpyGlass), we can use the following range:

-w@2:2

To declare critical state when colorLevel is 3 or greater (this usually corresponds to the color red in NetSpyGlass maps and device details tables), use:

-c@3:

or:

-c:2

If your NetSpyGlass configuration defines color levels above 3, you may need to adjust the Nagios ranges as well.
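The range semantics above can be checked with a small parser. This is a sketch of the standard Nagios range format, not the plugin's own code; the function names are hypothetical, and treating an empty start (as in “:2”) as zero is an assumption that follows the way “-c:99” is described in this chapter (ok in the range 0-99).

```python
import math


def parse_range(spec):
    """Parse a Nagios range string into (low, high, inverted)."""
    inverted = spec.startswith("@")
    if inverted:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        # A bare number N means the range 0..N.
        low_text, high_text = "", spec
    if low_text == "~":
        low = -math.inf
    elif low_text == "":
        low = 0.0  # assumption: an omitted start means zero
    else:
        low = float(low_text)
    high = math.inf if high_text == "" else float(high_text)
    return low, high, inverted


def triggers_alert(spec, value):
    """True when value should raise the alert for this range."""
    low, high, inverted = parse_range(spec)
    inside = low <= value <= high  # boundaries are inclusive
    # Normally the alert fires outside the range; "@" inverts that.
    return inside if inverted else not inside


for spec, value in [("@2:2", 2), ("@3:", 1), (":99", 100)]:
    print(spec, value, triggers_alert(spec, value))
```

With these definitions “-c@3:” and “-c:2” agree for every non-negative integer color level, which is why the text offers them as alternatives.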

Alternatively, you can instruct the plugin to compare the actual values of variables against the ranges defined by the “-w” and “-c” parameters. To do this, add the command line parameter “-x” or “--matchValue”. Note that the API returns two values for each variable: “raw” and “scaled”. The raw value is in the original, unscaled units of the variable, for example “bits/sec”, while the scaled value is in scaled units such as “Mbit/sec”. The API also returns the “raw” and “scaled” unit of the variable. The plugin compares the raw value against the warning and critical ranges, but includes the scaled value and unit in the output that Nagios puts in the $SERVICEOUTPUT$ macro.
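The difference between the two modes comes down to which field of a record is compared and which fields are printed. The sketch below assumes records shaped like the JSON shown earlier; the helper names are hypothetical, and the output fragment imitates the “device:component = value unit” form seen in the plugin output examples in this chapter.

```python
def compared_value(record, match_value=False):
    """Return the number that is checked against the -w/-c ranges."""
    if match_value:
        # -x / --matchValue: compare the raw value (e.g. in bit/sec).
        return float(record["value"])
    # Default: compare the colorLevel attribute.
    return float(record["colorLevel"])


def describe(record):
    """Build an output fragment from the scaled value and unit."""
    return "{}:{} = {} {}".format(
        record["device"], record["component"],
        record["scaledValue"], record["scaledUnit"])


record = {
    "device": "c3560g-1", "component": "Gi0/18", "colorLevel": "1",
    "value": "146676.53333333333", "scaledValue": "146.677",
    "unit": "bit/sec", "scaledUnit": "kbit/sec",
}
print(compared_value(record))                    # 1.0
print(compared_value(record, match_value=True))  # raw value in bit/sec
print(describe(record))                          # c3560g-1:Gi0/18 = 146.677 kbit/sec
```

Because the comparison uses the raw value, a range for a traffic variable has to be written in bit/sec (for example 1400000000 for 1.4 Gbit/sec), even though the output shows Gbit/sec.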

11.3. Monitoring NetSpyGlass Server

There are different ways to monitor NetSpyGlass itself and its components. NetSpyGlass consists of two components that run as separate processes: UI backend and the Monitor. There can be multiple monitors that register themselves with UI backend and pass the data they gather from the devices to the server. In addition to collecting information about network devices, NetSpyGlass monitors its own components, both the server and monitors, and presents this information as a separate set of monitoring variables. These variables appear in the Graphing Workbench under category “Monitor” and track cpu and memory utilization, numbers of devices, variables and observations in the system, and more.

File netspyglass_services.cfg includes a few commands that can be used to set up alerts on low memory and high CPU utilization of the machine running NetSpyGlass components (server and monitors).

Command netspyglass_memory_low calls the plugin as follows:

/usr/lib/nagios/plugins/check_netspyglass.py --server='$HOSTADDRESS$' --var=jvmMemFree -x -c20000000:

This command checks variable jvmMemFree, which tracks the amount of free heap memory in the Java virtual machine running the NetSpyGlass server or monitor, and raises a critical alert if it drops below 20 MB.

A second, very similar command monitors CPU load by checking variable cpuUsage, which returns the average CPU load in percent:

/usr/lib/nagios/plugins/check_netspyglass.py --server='$HOSTADDRESS$' --var=cpuUsage -x -c@75:

Another way to monitor NetSpyGlass components is based on the built-in health checking and statistics JSON API. To use this approach, configure a Nagios command that calls script check_netspyglass.py with parameter --healthcheck.

With “--healthcheck”, the plugin polls special metrics that define the overall “health” of the server. These metrics are exposed at a different JSON API URL on the NetSpyGlass server. The NetSpyGlass monitor also has an embedded HTTP server that listens on a different TCP port (the default is 9200; this can be changed using command line parameters passed to it via the startup script netspyglass.sh). This embedded HTTP server provides the same health status metrics.

The command that checks the health status of a NetSpyGlass server or monitor, and the response it returns, look like this:

/usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --healthcheck
OK: 'Summary.healthy'=True

File netspyglass_services.cfg includes commands netspyglass_ui_backend_status and netspyglass_monitor_status that query health status API on the server and monitor.
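For orientation, the health check endpoint can also be queried directly. The sketch below builds the URL from the path /api/metrics/healthcheck and the default ports mentioned in this chapter. It assumes plain HTTP, assumes that the path sits at the root of the server, and assumes that the response body is JSON; the structure of that document is not described here, so the code returns it undecoded beyond JSON parsing. The function names are hypothetical.

```python
import json
import urllib.request

UI_BACKEND_PORT = 9100  # default UI backend port from the usage summary
MONITOR_PORT = 9200     # default port of the monitor's embedded HTTP server


def healthcheck_url(server, port=UI_BACKEND_PORT):
    """Build the health check URL (assumes plain HTTP and a root path)."""
    return "http://{}:{}/api/metrics/healthcheck".format(server, port)


def fetch_health(server, port=UI_BACKEND_PORT, timeout=10):
    """Fetch the health check document; its structure is not assumed."""
    url = healthcheck_url(server, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(healthcheck_url("10.1.1.1"))                # UI backend
print(healthcheck_url("10.1.1.1", MONITOR_PORT))  # monitor
```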

11.4. Installation

The plugin talks to NetSpyGlass over HTTP, so NetSpyGlass and Nagios can run on different machines. The plugin itself must be installed on the Nagios server.

The following configuration examples assume that NetSpyGlass runs on the machine with address 10.1.1.1, that the plugin script is installed in the directory /usr/lib/nagios/plugins/ on the server running Nagios, and that the Nagios configuration files are located in the /etc/nagios3 directory.

  • First, copy the plugin script check_netspyglass.py to your Nagios server. Let's assume it is going to be installed in the directory /usr/lib/nagios/plugins (but it can be anywhere). Make sure the script is executable.

  • Then copy the provided file netspyglass_services.cfg (it is part of the distribution package) to the directory /etc/nagios3/conf.d/ on your Nagios server. This file defines commands that call our plugin and Nagios services that use them. They look similar to this:

    define command{
            command_name    netspyglass_interface_down
            command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$'  --var=ifOperStatus -c:99
            }
    
    define command{
            command_name    netspyglass_in_rate
            command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=ifInUtilization -w @2:2 -c@3:3
            }
    
    define command{
            command_name    netspyglass_out_rate
            command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=ifOutRate -w @2:2 -c@3:3
            }
    
    define command{
            command_name    netspyglass_temperature
            command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=tempSensor -w:50 -c:60 --matchValue
            }
    
    define command{
            command_name    netspyglass_cpu_utilization
            command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=cpuUtil -w:50 -c:75 --matchValue
            }
    
    
    
    define service {
            hostgroup_name                  netspyglass-devices
            service_description             INTERFACE DOWN
            check_command                   netspyglass_interface_down
            use                             generic-service
            notification_interval           0 ; set > 0 if you want to be renotified
    }
    
    define service {
            hostgroup_name                  netspyglass-devices
            service_description             INTERFACE UTILIZATION INBOUND
            check_command                   netspyglass_in_rate
            use                             generic-service
            notification_interval           0 ; set > 0 if you want to be renotified
    }
    
    define service {
            hostgroup_name                  netspyglass-devices
            service_description             INTERFACE UTILIZATION OUTBOUND
            check_command                   netspyglass_out_rate
            use                             generic-service
            notification_interval           0 ; set > 0 if you want to be renotified
    }
    
    define service {
            hostgroup_name                  netspyglass-devices
            service_description             TEMPERATURE
            check_command                   netspyglass_temperature
            use                             generic-service
            notification_interval           0 ; set > 0 if you want to be renotified
    }
    
    define service {
            hostgroup_name                  netspyglass-devices
            service_description             CPU LOAD
            check_command                   netspyglass_cpu_utilization
            use                             generic-service
            notification_interval           0 ; set > 0 if you want to be renotified
    }
    
  • Edit file /etc/nagios3/resource.cfg and add definition of macro “$USER5$” with the address of your NetSpyGlass server:

    $USER5$=10.1.1.1
    
  • Services defined in the file netspyglass_services.cfg are just examples. They provide a good starting point, but you can add your own definitions to monitor various parameters using data collected and processed by NetSpyGlass. Commands that alert on interface down status and utilization use thresholds configured in NetSpyGlass. This means alerts are triggered when the color of the corresponding link in maps changes to black (interface down) or red (utilization is over threshold 3).

  • The choice of the value for the -c parameter for variable ifOperStatus is determined by the default value “100” of the color level assigned to interfaces in the state “down”. Parameter “-c:99” means the value is “ok” as long as it is in the range 0-99 (inclusive).

  • For variables ifInUtilization and ifOutUtilization, ranges are defined to warn when the color level is equal to “2” (orange) and to raise a critical alert when it is equal to “3” (red).

  • Commands that alert on CPU load and temperature match variable values; the corresponding thresholds are configured using the -w and -c parameters in the Nagios config. These commands differ from the ones for interface status and utilization only to illustrate this method.

  • Use script generate_hosts.py (part of the distribution package) to generate Nagios host and hostgroup definitions. This script uses command line parameters to determine the address and port used by NetSpyGlass. The simplest command line looks like this:

    generate_hosts.py --server=10.1.1.1 > netspyglass.cfg
    

Note that the script writes the Nagios configuration to standard output, so you’ll need to redirect its output to a file and then copy that file to the directory /etc/nagios3/conf.d on the Nagios machine.

  • Now reload Nagios and check for errors. Commands, services, hosts and host group should appear in Nagios web interface and it should start polling NetSpyGlass via the plugin.

11.5. Use Cases

Monitor and alert on operational status changes of a network interface matched by its name:

define command{
        command_name    netspyglass_interface_down
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -k ge-0/0/21 --var=ifOperStatus -c:99
        }

In the following example we monitor and alert on operational status changes of the network interface that is connected to device “hpsw1”. Note that in this case we do not know the interface name beforehand. This command keeps working correctly (it monitors the right interface) even if the link to the device “hpsw1” moves to another interface:

define command{
        command_name    netspyglass_interface_down
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -t Link.hpsw1 --var=ifOperStatus -c:99
        }

We can use tag matching to build alerts that match multiple interfaces as well. For example, the following command alerts when the outbound utilization of any peering interface goes over the threshold. Note that this command matches the device, so the alert it generates is associated with that device, but it is not tied to any predetermined interface and will pick up new interfaces whenever you expand your peering:

define command{
        command_name    netspyglass_peering_interface_overload
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -t ifRole.PeeringInterface --var=ifOutRate -c@2:2
        }

We can take it one step further and build Nagios command and service to alert when any interface that has certain tag of any device goes over the threshold. In this example, we alert when any peering interface that peers with AS174 goes over threshold, regardless of the router:

define command{
        command_name    netspyglass_as174_overload
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 -t ifBGP4Peer.AS174 --var=ifOutRate -c@2:2
        }

In the following example we monitor CPU utilization of the routing engine processor of a Juniper switch. The component (CPU in question) is identified by its name. To get the name, just copy and paste it from the device details page in NetSpyGlass UI. Notice how component name may contain white spaces and special characters:

define command{
        command_name    netspyglass_cpu_utilization_high
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -k'Routing Engine 0, CPU utilization (%%)' --var=cpuUtil -c@2:2
        }

We can do the same with temperature sensors and other hardware components:

define command{
        command_name    netspyglass_device_overheating
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -k'Slot A Temperature' --var=tempSensor -c@2:2
        }

Some devices may have just one or two temperature sensors, but some vendors equip their devices with dozens of sensors. It may be useful to alert when any of these sensors reports overheating without having to configure each sensor as a separate Nagios service with its own specialized command. This is easy to do: leave the component out of the command, and it will generate a Nagios alert when any of the sensors is over the threshold:

define command{
        command_name    netspyglass_device_overheating
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' --var=tempSensor -c@2:2
        }

The NetSpyGlass plugin can match monitoring variable values instead of the value of the “colorLevel” attribute. Both approaches have their pros and cons. If you match the value of “colorLevel” (which is the default), you can keep threshold configuration in one place, that is, in NetSpyGlass. The system uses these thresholds to set colors in maps and device details pages; the plugin then just follows the already defined colors and translates them into Nagios alerts. On the other hand, it may be convenient to set thresholds in Nagios commands if you do not want alerts to directly match colors in NetSpyGlass. The following command alerts when the temperature of any component of the given device rises to about 60C (strictly, above 59, the upper bound of the “ok” range). Option -x or --matchValue makes the plugin compare the variable value instead of the “colorLevel” attribute:

define command{
        command_name    netspyglass_device_overheating
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' --var=tempSensor -c:59 -x
        }

Alert when the amount of free memory in the monitor drops below the 20M threshold. If NetSpyGlass runs with multiple monitors, this generates an alert if any one of them is low on free memory:

define command{
        command_name    netspyglass_monitor_memory_low
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --var=freeMemory -x -c20000000:
        }

The following command uses built-in health check to monitor the status of the monitor:

define command{
        command_name    netspyglass_monitor_status
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --port=9200 --healthcheck
        }
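The command definitions in this section differ only in a handful of arguments, so they can be generated from a small template when there are many of them. The helper below is a hypothetical convenience, not part of the NetSpyGlass distribution; it only assembles plugin options that appear in the usage summary (--server, --device, -t, --var, -c, -x).

```python
PLUGIN = "/usr/lib/nagios/plugins/check_netspyglass.py"


def render_command(name, server, var, critical,
                   per_device=True, tags=None, match_value=False):
    """Render one Nagios 'define command' block as text."""
    args = ["--server={}".format(server)]
    if per_device:
        # Let Nagios substitute the address of the host being checked.
        args.append("--device='$HOSTADDRESS$'")
    if tags:
        args.append("-t {}".format(",".join(tags)))
    args.append("--var={}".format(var))
    args.append("-c{}".format(critical))
    if match_value:
        args.append("-x")
    lines = [
        "define command{",
        "        command_name    {}".format(name),
        "        command_line    {} {}".format(PLUGIN, " ".join(args)),
        "        }",
    ]
    return "\n".join(lines)


# Reproduces the AS174 example above (no --device, one tag).
print(render_command("netspyglass_as174_overload", "10.1.1.1", "ifOutRate",
                     "@2:2", per_device=False, tags=["ifBGP4Peer.AS174"]))
```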

11.6. Examples

Here is what the output of the plugin looks like:

./check_netspyglass.py --server=10.0.14.120 --port=9101 --network=2 --var=cpuUtil -x -c0:20
CRITICAL: rdsw11:unit 1 = 32.0 %; router2.rd:CPU of Switching Processor 5 = 26.0 %

./check_netspyglass.py --server=10.0.14.120 --port=9101 --network=2 --var=tempSensor -x -c0:45
CRITICAL: router1.rd:module 1 outlet temperature Sensor = 50.0 C; router2.sj:module 1 outlet temperature Sensor = 49.0 C; router2.rd:module 1 outlet temperature Sensor = 48.0 C

./check_netspyglass.py --server=10.0.14.120 --port=9101 --network=2 --var=ifOutRate --tags=ifRole.PeeringInterface,ifBGP4Peer.AS174 -x -c0:1400000000
CRITICAL: router1.rd:Te1/3 = 1.427 Gbit/sec

11.7. Using Nagios with NetSpyGlass Alerts

The Nagios plugin script can query NetSpyGlass for any monitoring variable; it just inspects its value and compares it with predefined thresholds. One interesting possibility is to build the alert in NetSpyGlass using function nw2functions.alert() as described in Alerting, and then use the Nagios plugin to query the monitoring variable that NetSpyGlass creates for the alert. The name of the variable is the same as the name of the alert, and its value is an opaque large number that is guaranteed to be >0 when the alert is active and zero when it is cleared. The advantage of this approach is that you can use the full flexibility of the alerting mechanism provided by function alert(). For example, you can build alerts with dependencies (see bgpSessionDown: alert with dependencies), use conditions with timing (see Conditions with timing) to detect flapping variables, and use the NetSpyGlass unit testing framework to build tests for your alerts (see Testing Framework). In this setup Nagios plays the role of the notification mechanism and manages the alert life cycle for you. You can also combine Nagios with other alert notification streams supported by NetSpyGlass, such as logging, email or Slack.

Suppose we have created alert in NetSpyGlass:

alert(
    name='packetLoss',
    input=import_var('icmpLoss'),
    condition=lambda _, value: value > 20 and value < 100,
    description='Packet loss to the device measured with ping is over 20% for the last 5 min',
    duration=300,
    percent_duration=100,
    notification_time=600,
    streams=['log'],
    fan_out=True
)

This alert activates when we measure over 20% packet loss to a device during a 5 min interval; it is configured to only log the event. However, it also creates the monitoring variable packetLoss, which appears in Graphing Workbench under category Alerts and can be queried by the Nagios plugin like so:

./check_netspyglass.py --server=10.0.14.120 --port=9101 --network=2 --var=packetLoss -x -c0:1

Parameters -x -c0:1 tell the plugin that values between 0 and 1 are considered to be “ok”, while anything greater than 1 is “critical”. Since monitoring variable created for the alert has value that is some very big number when alert is active, this Nagios plugin call will return CRITICAL when this condition is met.

Note that alert packetLoss is created as a “fan out” alert, that is, it creates an instance of the packetLoss variable for each device defined in NetSpyGlass. These instances track packet loss for each device separately. The Nagios plugin command above, however, does not match the device and therefore “merges” the individual packetLoss alerts back into one Nagios alert. If this is not what you want, you’ll need to build a Nagios configuration that calls the plugin for each device separately. See the examples above that include parameter --device in the Nagios plugin call.
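The effect of “merging” can be seen by grouping alert variable instances by device. The sketch below assumes that instances of the alert variable come back in the same record shape as other monitoring variables; the device names and the non-zero value are made up for illustration, and the function name is hypothetical. It relies only on the property stated above: the value is zero when the alert is cleared and greater than zero when it is active.

```python
def devices_with_active_alert(records):
    """Return the sorted names of devices whose alert instance is active."""
    active = set()
    for record in records:
        # Zero means cleared; any positive value means the alert is active.
        if float(record["value"]) > 0:
            active.add(record["device"])
    return sorted(active)


# Hypothetical fan-out instances of the packetLoss variable.
instances = [
    {"device": "router1", "value": "0"},
    {"device": "router2", "value": "1471904400000"},
    {"device": "router3", "value": "0"},
]

active = devices_with_active_alert(instances)
# One merged check goes CRITICAL as soon as any device is affected.
print("CRITICAL" if active else "OK", active)  # CRITICAL ['router2']
```

A per-device Nagios service, by contrast, would see only the instance for its own host and would stay OK for router1 and router3 in this example.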