.. _nagios:

Integration with Nagios
***********************

NetSpyGlass comes with the Nagios plugin script `check_netspyglass.py`. This script checks the values of
all instances of a single monitoring variable specified on the command line and returns
standard Nagios return codes and output to identify the state as "OK", "WARNING" or "CRITICAL".

Configuration
=============

Command line::

    Usage:

        check_netspyglass.py (-s|--server)=host [(-p|--port)=port] (-n|--network)=netid [-y|--healthcheck]
                             [(-d|--device)=devaddress] [(-k|--component)=component] [(-t|--tags)=tags]
                             (-v|--var)=varName (-w|--warning)=N1 (-c|--critical)=N2
                             [-x|--matchValue] [-u|--unknownOk] [-h|--help]

        server:      NetSpyGlass UI backend server host name or address
        port:        port number used by NetSpyGlass UI backend (default: 9100)
        network:     NetSpyGlass network id (a number, default: 1)
        healthcheck: use /api/metrics/healthcheck API call to determine "health status" of the server.
                     Use port numbers configured in the nw2.conf configuration file to query
                     UI backend or monitor.
        device:      Address of the device to check the variable for. This must match ip address
                     configured for this device in NetSpyGlass. If this parameter is missing,
                     plugin checks value of the variable for all devices.
        component:   network interface or hardware component name
        tags:        comma separated list of tags to match (if more than one tag is specified,
                     variable instance matches only when it has all tags)
        var:         the name of the monitoring variable
        warning:     range of values of the attribute 'colorLevel' of the variable (n1:n2 or just n1:).
                     If value is outside of this range, the condition is considered a "warning"
        critical:    range of values of the attribute 'colorLevel' of the variable (n1:n2 or just n1:).
                     If value is outside of this range, the condition is considered a "critical"
        matchValue:  compare value of the monitoring variable against 'warning' and 'critical' ranges
                     (by default, script compares the value of the attribute "colorLevel").
                     Note that NetSpyGlass API returns the value of the variable in two formats:
                     "raw" and "scaled". Raw value is in the original or unscaled units of the
                     variable, for example, "bits/sec", while scaled value is in scaled units
                     such as "Mbit/sec". API also returns "raw" and "scaled" unit of the variable.
                     This plugin compares raw value against warning and critical ranges, but
                     includes scaled value and unit in the output which Nagios puts in
                     $SERVICEOUTPUT$ macro.
        unknownOk:   if NetSpyGlass does not have monitoring variable for the device and component
                     defined by the --device and --component (or --tags) parameters, the plugin
                     returns status "UNKNOWN" by default. However when option --unknownOk is also
                     present, it returns status "OK".
        -h --help:   print this usage summary

    Examples

        check_netspyglass.py --server=10.1.1.1 --device='10.10.10.10' --var=ifOperStatus -c:99
        check_netspyglass.py --server=10.1.1.1 --device='10.10.10.10' --tags=ifRole.PeeringInterface --var=ifOutRate -c@2:2

The plugin can be used to monitor values of monitoring variables or self-monitoring metrics
generated by NetSpyGlass servers. To monitor devices and their components, use command line
parameters --device, --component, --tags and --var. These parameters form a filter
that determines which monitoring variable the plugin will inspect.

Monitoring Devices
==================

.. note::

    In most cases you can skip parameter --network and use the default value.

The plugin calls the JSON API on the NetSpyGlass server to get values of the specified variable.
Returned data looks like this::

    [
        {
            "device" : "c3560g-1",
            "address" : "10.0.14.228",
            "component" : "Gi0/17",
            "color" : "#dddddd",
            "colorLevel" : "0",
            "value" : "172846.93333333332",
            "scaledValue" : "0.173",
            "unit" : "bit/sec",
            "scaledUnit" : "Mbit/sec",
            "tags" : "Explicit.core, Link.vlan1_1, ifOperStatus.Up, ifRole.PhysicalPort, ifAdminStatus.Up, ifSpeed.1G, ifRole.UntaggedSwtichPort, ifRole.BroadcastTypeInterface"
        },
        {
            "device" : "c3560g-1",
            "address" : "10.0.14.228",
            "component" : "Gi0/18",
            "color" : "#ffdead",
            "colorLevel" : "1",
            "value" : "146676.53333333333",
            "scaledValue" : "146.677",
            "unit" : "bit/sec",
            "scaledUnit" : "kbit/sec",
            "tags" : "Explicit.core, Link.10.0.14.90, ifOperStatus.Up, ifRole.PhysicalPort, ifSpeed.10M, ifAdminStatus.Up, ifRole.UntaggedSwtichPort, ifRole.BroadcastTypeInterface"
        },
        {
        . . .
    ]

In addition to the device name, interface or hardware component name and value, each variable
is annotated with the attributes "color" and "colorLevel". These correspond to the colors you see in
the NetSpyGlass UI. The colors are assigned by the NetSpyGlass monitor according to the thresholds
defined in the configuration file or as the result of actions taken by the Python script
in NetSpyGlass. The meaning of the thresholds, and therefore of the "colorLevel" attribute, depends on
the variable.
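
To make the shape of this response concrete, here is a minimal Python sketch of how a client might scan such records for the instance with the highest "colorLevel". It is not the plugin's actual code: the helper names `worst_instance` and `describe` are hypothetical, and the sample is abbreviated from the response shown above. Note that the API returns numbers as strings, so they must be converted before comparison.

```python
import json

# Two records abbreviated from the sample response shown above.
SAMPLE = """
[
  {"device": "c3560g-1", "component": "Gi0/17", "colorLevel": "0",
   "value": "172846.93333333332", "scaledValue": "0.173", "scaledUnit": "Mbit/sec"},
  {"device": "c3560g-1", "component": "Gi0/18", "colorLevel": "1",
   "value": "146676.53333333333", "scaledValue": "146.677", "scaledUnit": "kbit/sec"}
]
"""


def worst_instance(records):
    """Return the record with the highest colorLevel (hypothetical helper).

    colorLevel arrives as a string, so convert it to int before comparing.
    """
    return max(records, key=lambda rec: int(rec["colorLevel"]))


def describe(rec):
    """Format one record as device:component = scaled value and unit."""
    return "{device}:{component} = {scaledValue} {scaledUnit}".format(**rec)


records = json.loads(SAMPLE)
worst = worst_instance(records)
print(describe(worst))
```

A real check would go on to compare the selected "colorLevel" (or the raw "value") against the warning and critical ranges.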
For the interface utilization variables `ifInUtilization` and `ifOutUtilization`, for example,
the default configuration sets 5 levels::

    network.monitor.display.thresholds {

        ifInUtilization: [
            { value = "0",   colorLevel = 0},
            { value = "0.2", colorLevel = 1},
            { value = "0.6", colorLevel = 2},
            { value = "0.9", colorLevel = 3},
        ],

        ifOutUtilization: [
            { value = "0",   colorLevel = 0},
            { value = "0.2", colorLevel = 1},
            { value = "0.6", colorLevel = 2},
            { value = "0.9", colorLevel = 3},
        ],

Values of the variables `ifInUtilization` and `ifOutUtilization` are computed by the default
Python rule processing script `nw2rules.py` (you can find a copy of it in the directory `python`
of the distribution tar archive). These values are calculated as interface traffic divided by
the interface speed::

    ifInUtilization = ifInRate / ifSpeed
    ifOutUtilization = ifOutRate / ifSpeed

If interface utilization is under 0.2 (20%), the value of the "colorLevel" attribute is "0".
If utilization is greater than or equal to 0.2 but less than 0.6, "colorLevel" is equal to 1, and so on.
Color level "100" is reserved for interfaces that are in the operational state "down".

Use the command line parameter "-w" or "--warning" to tell the plugin which color levels
should put the service into the warning state. Remember that the value of the "colorLevel" attribute is the
number of the threshold rather than its value. This means colorLevel can be equal to "1" but
not "0.2" or "20%" as in this example.

.. note::

    The threshold values for the variables `ifInUtilization` and `ifOutUtilization` shown above are
    the defaults. If you want to change them, add a `network.monitor.display.thresholds`
    section to your configuration file `nw2.conf` and define new values for the thresholds.
    If you do this and use the parameter "-w" on the Nagios plugin command line, the plugin will
    follow the new threshold values.

Our plugin recognizes standard Nagios range definitions, such as "10", "1:2", "~:10", "@2:3".
You can find their description here: https://nagios-plugins.org/doc/guidelines.html#THRESHOLDFORMAT

Note that ranges define values of the parameter that are considered to be "OK" by Nagios. Nagios
declares state "warning" or "critical" when the value falls outside of the range. Range boundaries
are inclusive. Prepending the range with "@" inverts the logic: the alert is raised when the value
falls inside the range, and the condition is "OK" when the value falls outside of it.

For example, to warn when colorLevel is "2" (this corresponds to the color orange in NetSpyGlass),
we can use the following range::

    -w@2:2

To declare critical state when colorLevel is 3 or greater (this usually corresponds to the color
red in NetSpyGlass maps and device details tables), use::

    -c@3:

or::

    -c:2

If your NetSpyGlass configuration uses more than 3 thresholds, you may need to change the
Nagios ranges as well.

Alternatively, you can instruct the plugin to compare actual values of variables against the ranges
defined by the "-w" and "-c" parameters. To do this, add the command line parameter "-x" or
"--matchValue". Note that the API returns two values for each variable: "raw" and "scaled". The raw value
is in the original or unscaled units of the variable, for example, "bits/sec", while the scaled value
is in scaled units such as "Mbit/sec". The API also returns the "raw" and "scaled" unit of the variable.
The plugin compares the raw value against the warning and critical ranges, but includes the scaled value
and unit in the output which Nagios puts in the `$SERVICEOUTPUT$` macro.

Monitoring NetSpyGlass Server
=============================

There are different ways to monitor NetSpyGlass itself and its components.

NetSpyGlass consists of two components that run as separate processes: the UI backend and the Monitor.
There can be multiple monitors that register themselves with the UI backend and pass the data
they gather from the devices to the server.
In addition to collecting information about network devices, NetSpyGlass monitors its own
components, both the server and monitors, and presents this information as a separate set of
monitoring variables. These variables appear in the Graphing Workbench under the category "Monitor"
and track cpu and memory utilization, numbers of devices, variables and observations in the
system, and more.

File `netspyglass_services.cfg` includes a few commands that can be used to set up alerts on low memory
and high cpu utilization of the machine running NetSpyGlass components (server and monitors).

Command **netspyglass_memory_low** calls the plugin as follows::

    /usr/lib/nagios/plugins/check_netspyglass.py --server='$HOSTADDRESS$' --var=jvmMemFree -x -c20000000:

This command checks the variable `jvmMemFree`, which tracks the amount of free heap memory in the Java virtual
machine running the NetSpyGlass server or monitor, and raises a critical alert if it drops below 20MB.

A second, very similar command monitors cpu load by checking the variable `cpuUsage`, which returns
the average cpu load in percent. It raises a critical alert when the load is 75% or higher::

    /usr/lib/nagios/plugins/check_netspyglass.py --server='$HOSTADDRESS$' --var=cpuUsage -x -c@75:

Another way to monitor NetSpyGlass components is based on the built-in health checking and
statistics JSON API. To use this approach, configure a Nagios command that calls the script
`check_netspyglass.py` with the parameter --healthcheck. This parameter makes the plugin poll
special metrics that define the overall "health" of the server. These metrics are exposed using
a different JSON API url on the NetSpyGlass server.

The NetSpyGlass monitor also has an embedded http server that listens on a different tcp port
(the default is 9200; this can be changed using command line parameters passed to it via the startup
script `netspyglass.sh`). This embedded http server provides the same health status
metrics.
The command that checks the health status of the NetSpyGlass server or monitor, and the response it returns,
look like this::

    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --healthcheck
    OK: 'Summary.healthy'=True

File `netspyglass_services.cfg` includes the commands **netspyglass_ui_backend_status** and
**netspyglass_monitor_status** that query the health status API on the server and monitor.

.. _nagios_netspyglass_installation:

Installation
============

The plugin talks to NetSpyGlass over HTTP, which means NetSpyGlass and Nagios can run on
different machines. The plugin must be installed on the Nagios server, though.

The following configuration examples assume NetSpyGlass runs on the machine with address
10.1.1.1, the plugin script was installed in the directory `/usr/lib/nagios/plugins/` on the server
running Nagios, and the Nagios configuration files are located in the `/etc/nagios3` directory.

- First, copy the plugin script `check_netspyglass.py` to your Nagios server. Let's assume it is going
  to be installed in the directory `/usr/lib/nagios/plugins` (but it can be anywhere). Make sure the
  script is executable.

- Then copy the provided file `netspyglass_services.cfg` (it is part of the distribution package)
  to the directory `/etc/nagios3/conf.d/` on your Nagios server. The contents of this file define
  commands that call our plugin and Nagios services that use them.
  They look similar to this::

      define command{
          command_name    netspyglass_interface_down
          command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=ifOperStatus -c:99
      }

      define command{
          command_name    netspyglass_in_rate
          command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=ifInUtilization -w @2:2 -c@3:3
      }

      define command{
          command_name    netspyglass_out_rate
          command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=ifOutRate -w @2:2 -c@3:3
      }

      define command{
          command_name    netspyglass_temperature
          command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=tempSensor -w:50 -c:60 --matchValue
      }

      define command{
          command_name    netspyglass_cpu_utilization
          command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server='$USER5$' --device='$HOSTADDRESS$' --var=cpuUtil -w:50 -c:75 --matchValue
      }

      define service {
          hostgroup_name          netspyglass-devices
          service_description     INTERFACE DOWN
          check_command           netspyglass_interface_down
          use                     generic-service
          notification_interval   0 ; set > 0 if you want to be renotified
      }

      define service {
          hostgroup_name          netspyglass-devices
          service_description     INTERFACE UTILIZATION INBOUND
          check_command           netspyglass_in_rate
          use                     generic-service
          notification_interval   0 ; set > 0 if you want to be renotified
      }

      define service {
          hostgroup_name          netspyglass-devices
          service_description     INTERFACE UTILIZATION OUTBOUND
          check_command           netspyglass_out_rate
          use                     generic-service
          notification_interval   0 ; set > 0 if you want to be renotified
      }

      define service {
          hostgroup_name          netspyglass-devices
          service_description     TEMPERATURE
          check_command           netspyglass_temperature
          use                     generic-service
          notification_interval   0 ; set > 0 if you want to be renotified
      }

      define service {
          hostgroup_name          netspyglass-devices
          service_description     CPU LOAD
          check_command           netspyglass_cpu_utilization
          use                     generic-service
          notification_interval   0 ; set > 0 if you want to be renotified
      }

- Edit the file `/etc/nagios3/resource.cfg` and add a definition of the macro "$USER5$" with the address
  of your NetSpyGlass server::

      $USER5$=10.1.1.1

- Services defined in the file `netspyglass_services.cfg` are just examples. They provide a
  good starting point, but you can add your own definitions to monitor various parameters
  using data collected and processed by NetSpyGlass. Commands that alert on interface down status
  and utilization use thresholds configured in NetSpyGlass. This means alerts will be
  triggered when the color of the corresponding link in maps changes to black (interface down)
  or red (utilization is over threshold 3).

- The choice of the value for the `-c` parameter for the variable `ifOperStatus` is determined
  by the default value "100" of the color level assigned to interfaces in the state "down".
  Parameter "-c:99" means the value is "ok" as long as it is in the range 0-99 (inclusive).

- For the variables `ifInUtilization` and `ifOutUtilization`, ranges are defined to warn when
  the color level is equal to "2" (orange) and raise a critical alert when it is equal to 3 (red).

- Commands that alert on CPU load and temperature match variable values; the corresponding
  thresholds are configured using the -w and -c parameters in the Nagios config. They differ
  from the commands for the interface status and utilization only to illustrate
  this method.

- Use the script `generate_hosts.py` (part of the distribution package) to generate Nagios host
  and hostgroup definitions. This script uses command line parameters to determine the address
  and port used by NetSpyGlass. The simplest command line looks like this::

      generate_hosts.py --server=10.1.1.1 > netspyglass.cfg

  Note that the script writes the Nagios configuration to standard output, so you'll need to
  redirect its output to a file and then copy it to the directory
  `/etc/nagios3/conf.d` on the Nagios machine.

- Now reload Nagios and check for errors.
  Commands, services, hosts and the host group should appear in the Nagios web interface, and
  Nagios should start polling NetSpyGlass via the plugin.

Use Cases
=========

Monitor and alert on operational status changes of a network interface matched
by its name::

    define command{
        command_name    netspyglass_interface_down
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -k ge-0/0/21 --var=ifOperStatus -c:99
    }

In the following example we monitor and alert on operational status changes
of the network interface that is connected to device "hpsw1". Note that in this case, we do
not know the interface name beforehand. This command is going to work correctly (it will monitor
the right interface) even if the link to the device "hpsw1" moves to another interface::

    define command{
        command_name    netspyglass_interface_down
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -t Link.hpsw1 --var=ifOperStatus -c:99
    }

We can use tag matching to build alerts that match multiple interfaces as well. For example,
the following command alerts when outbound interface utilization of any peering interface
goes over the threshold. Note that this command matches the device, so it will generate an alert
associated with this device, but it is not tied to any predetermined interface and will pick
up new interfaces whenever you expand your peering::

    define command{
        command_name    netspyglass_peering_interface_overload
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -t ifRole.PeeringInterface --var=ifOutRate -c@2:2
    }

We can take it one step further and build a Nagios command and service to alert when any
interface of any device that has a certain tag goes over the threshold.
In this example, we alert when any peering interface that peers with AS174 goes over threshold, regardless of the router::

    define command{
        command_name    netspyglass_as174_overload
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 -t ifBGP4Peer.AS174 --var=ifOutRate -c@2:2
    }

In the following example we monitor CPU utilization of the routing engine processor
of a Juniper switch. The component (the CPU in question) is identified by its name. To get the name,
just copy and paste it from the device details page in the NetSpyGlass UI. Notice how the component
name may contain white spaces and special characters::

    define command{
        command_name    netspyglass_cpu_utilization_high
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -k'Routing Engine 0, CPU utilization (%%)' --var=cpuUtil -c@2:2
    }

We can do the same with temperature sensors and other hardware components::

    define command{
        command_name    netspyglass_device_overheating
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' -k'Slot A Temperature' --var=tempSensor -c@2:2
    }

Some devices may have just one or two temperature sensors, but some vendors equip their
devices with dozens of sensors. It may be useful to alert when any of these sensors reports
overheating without having to configure each sensor as a separate Nagios service with its
own specialized command. This is very easy to do: just do not match the component in the
command, and it will generate a Nagios alert when any of the sensors is over the threshold::

    define command{
        command_name    netspyglass_device_overheating
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' --var=tempSensor -c@2:2
    }

The NetSpyGlass plugin can match monitoring variable values instead of the value of the "colorLevel"
attribute. Both approaches have their pros and cons.
If you match the value of "colorLevel"
(which is the default), you can keep the threshold configuration in one place, that is, in NetSpyGlass.
The system uses these thresholds to set colors in maps and device details pages; the
plugin then just follows the already defined colors and translates them into Nagios alerts.
On the other hand, it may be convenient to set thresholds in Nagios commands if you do not
want alerts to directly match colors in NetSpyGlass. The following command alerts when
the temperature of any component of the given device rises above 59C (option -x or --matchValue
makes the plugin compare the variable value instead of the "colorLevel" attribute)::

    define command{
        command_name    netspyglass_device_overheating
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --device='$HOSTADDRESS$' --var=tempSensor -c:59 -x
    }

Alert when the amount of free memory in the monitor goes below the 20M threshold. If NetSpyGlass
runs with multiple monitors, this generates an alert if any one of them has low free memory::

    define command{
        command_name    netspyglass_monitor_memory_low
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --var=freeMemory -x -c20000000:
    }

The following command uses the built-in health check to monitor the status of the monitor::

    define command{
        command_name    netspyglass_monitor_status
        command_line    /usr/lib/nagios/plugins/check_netspyglass.py --server=10.1.1.1 --port=9200 --healthcheck
    }

Examples
========

Here is what the output of the plugin looks like::

    ./check_netspyglass.py --server=10.0.14.120 --port=9101 --network=2 --var=cpuUtil -x -c0:20
    CRITICAL: rdsw11:unit 1 = 32.0 %; router2.rd:CPU of Switching Processor 5 = 26.0 %

    ./check_netspyglass.py --server=10.0.14.120 --port=9101 --network=2 --var=tempSensor -x -c0:45
    CRITICAL: router1.rd:module 1 outlet temperature Sensor = 50.0 C; router2.sj:module 1 outlet temperature Sensor = 49.0 C; router2.rd:module 1 outlet temperature Sensor = 48.0 C

    ./check_netspyglass.py --server=10.0.14.120 --port=9101 --network=2 --var=ifOutRate --tags=ifRole.PeeringInterface,ifBGP4Peer.AS174 -x -c0:1400000000
    CRITICAL: router1.rd:Te1/3 = 1.427 Gbit/sec

.. _using_alerts_with_nagios:

Using Nagios with NetSpyGlass Alerts
====================================

The Nagios plugin script can query NetSpyGlass for any monitoring variable; it just inspects
its value and compares it with predefined thresholds. One interesting possibility is to
build the alert in NetSpyGlass using the function :func:`nw2functions.alert()` as described in
:ref:`alerting` and then use the Nagios plugin to query for the monitoring variable created by
NetSpyGlass for the alert. The name of the variable is the same as the name of the alert, and
its value is an opaque large number that is guaranteed to be >0 when the alert is active and
zero when it is cleared.

The advantage of this approach is that you can use the full flexibility of the alerting
mechanism provided by the function `alert()`. For example, you can build alerts
with dependencies (see :ref:`alert_with_dependencies`), use conditions with timing
(see :ref:`conditions_with_timing`) to detect flapping variables, and use the NetSpyGlass unit testing
framework to build tests for your alerts (see :ref:`testing`). In this setup Nagios plays the
role of the notification mechanism and manages the alert life cycle for you. You can also
combine Nagios with other alert notification streams supported by NetSpyGlass, such as
logging, email or Slack.

Suppose we have created an alert in NetSpyGlass::

    alert(
        name='packetLoss',
        input=import_var('icmpLoss'),
        condition=lambda _, value: value > 20 and value < 100,
        description='Packet loss to the device measured with ping is over 20% for the last 5 min',
        duration=300,
        percent_duration=100,
        notification_time=600,
        streams=['log'],
        fan_out=True
    )

This alert activates when we measure over 20% packet loss to a device during a 5 min interval;
it is configured to only log the event.
However, it also creates the monitoring variable `packetLoss`,
which appears in the Graphing Workbench under the category `Alerts` and can be queried by the Nagios
plugin like so::

    ./check_netspyglass.py --server=10.0.14.120 --port=9101 --network=2 --var=packetLoss -x -c0:1

Parameters `-x -c0:1` tell the plugin that values between 0 and 1 are considered to be "ok",
while anything greater than 1 is "critical". Since the monitoring variable created for the alert
has a very large value when the alert is active, this Nagios plugin call
returns CRITICAL when this condition is met.

Note that the alert `packetLoss` is created as a "fan out" alert, that is, it creates an instance
of the `packetLoss` variable for each device defined in NetSpyGlass. These instances
track packet loss for each device separately. The Nagios plugin command above, however,
does not match the device and therefore "merges" the individual `packetLoss` alerts back
into one Nagios alert. If this is not what you want, you'll need to build a Nagios configuration
that calls the plugin for each device separately. See the examples above that include
the parameter `--device` in the Nagios plugin call.
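
The range arguments used throughout this document (`-c0:1` here, and `-c:99`, `-w@2:2`, `-c@3:` or `-c20000000:` in earlier sections) all follow the same rule. The following Python sketch illustrates that rule as described in the Nagios plugin guidelines; it is not the code of `check_netspyglass.py`, and the function names `parse_range` and `alerts` are hypothetical. Treating an empty start (as in ":99") as 0 is an assumption based on the examples in this document.

```python
import math


def parse_range(spec):
    """Parse a Nagios-style range such as "10", "1:2", "~:10", "@2:3" or "20:".

    Returns (low, high, inside). The alert fires when the value is outside
    [low, high], or, when inside is True (leading "@"), when it is within it.
    Assumption: an empty start, as in ":99", is treated as 0.
    """
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start, end = spec.split(":", 1)
    else:
        # A bare number N means the range 0..N.
        start, end = "", spec
    if start == "~":
        low = -math.inf
    elif start == "":
        low = 0.0
    else:
        low = float(start)
    high = math.inf if end == "" else float(end)
    return low, high, inside


def alerts(spec, value):
    """Return True when `value` should raise the alert for range `spec`."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high  # boundaries are inclusive
    return in_range if inside else not in_range


# -c0:1 on an alert variable: zero is fine, a large "active" value is critical.
print(alerts("0:1", 0), alerts("0:1", 1.5e12))
# -c:99 on ifOperStatus: color level 100 ("down") is critical.
print(alerts(":99", 100))
# -w@2:2 warns only when the color level is exactly 2.
print(alerts("@2:2", 2), alerts("@2:2", 3))
```

Under these assumptions, a cleared alert variable (value zero) stays "ok" with `-c0:1`, while any active alert, whose value is a large positive number, falls outside the range and is reported as critical.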