1.18. Release Notes 1.5.1

NetSpyGlass v1.5.1

1.18.1. Improvements and New Features

  • this release reduces the time it takes for collected monitoring data to propagate through cluster members when NetSpyGlass is configured to work as a cluster.

    In versions prior to 1.5.1, all monitors started data collection, and all servers collected data from monitors and secondary servers, at the same time. For example, in a system running with a 1 min polling interval, the sequence started at the beginning of each minute: suppose monitors started their cycle at 00:00:00 and completed it a few seconds before 00:01:00 (the monitor spreads polling over almost the full polling interval to reduce load on the device). Secondary servers collected data at the beginning of the next cycle, at 00:01:00, and then ran the Python hook script to perform calculations on it. Calculations usually take a few seconds, after which results are available to be collected by the primary server; however, this collection did not happen until the beginning of the next cycle, at 00:02:00. The primary server requires some time to perform its own calculations too, which means data became available for the UI and alerts a few seconds past 00:02:00, a delay of over 2 min after the start of the cycle. If the primary server fed data to a dedicated alerts server, that added a delay of one more polling cycle, making alerts over 3 min late. In clusters running at a 30 sec polling interval, the total delay from a monitor to the primary server was over 1 min. This delay was most pronounced in systems running with longer polling intervals. For example, in a system built with just one monitor and one server and running with a 5 min polling interval, data was available to the UI and alerts 5 min late. In a three-tier cluster built from monitors, secondary servers and the primary server, data was available over 10 min late.

    This release improves data flow speed by making servers perform calculations immediately after the monitors or lower level servers that feed data to them signal the end of the data push. If the cluster is running with a 1 min polling interval and monitors spread SNMP polling over 50 sec, secondary servers start their calculations as soon as monitors finish uploading data, approximately on the 51st second of the cycle. Secondary servers push data to the primary immediately after completing their calculations, and the primary, in turn, begins its calculations as soon as secondary servers complete their push. With this improvement, data becomes available in the primary server a minute and a few seconds after the beginning of the cycle, compared to a delay of over 2 minutes in older versions. In clusters running with a 5 min polling interval the effect is even bigger: data that used to arrive over 10 min late now becomes available shortly after the end of the 5 min cycle.
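
    As a rough illustration, here is an approximate timeline for a three-tier cluster running at a 1 min polling interval (the second marks are illustrative and depend on the polling spread and on how long calculations take):

        Before 1.5.1:
          00:00:00  monitors start SNMP polling
          00:00:50  monitors finish SNMP polling
          00:01:00  secondary servers collect the data and run Python hooks
          00:02:00  primary server collects results from secondary servers
          00:02:05  primary calculations done; data reaches the UI and alerts

        In 1.5.1:
          00:00:00  monitors start SNMP polling
          00:00:51  monitors finish pushing data; secondary servers calculate
                    and push to the primary immediately
          00:01:05  primary calculations done; data reaches the UI and alerts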

  • New configuration parameter monitor.snmpPollingSpreadOutTimeSec controls the amount of time the monitor uses to spread out SNMP queries it sends to devices. Queries are spread out, instead of being sent as fast as possible, to avoid overloading the device CPU. The default value of this parameter is 50 sec, chosen to match the default polling interval of 1 min. The actual time used to spread queries out is the lesser of monitor.snmpPollingSpreadOutTimeSec and monitor.pollingIntervalSec minus 10 sec; see the example after the note below.

    Note

    Parameter monitor.snmpPollingSpreadOutTimeSec can have the value 0, which makes the monitor send queries to the device as fast as possible. Negative values are not allowed.
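
    For example, here is a minimal sketch of the relevant nw2.conf fragment (the values are illustrative, and the dotted-key form assumes the HOCON-style syntax of NetSpyGlass configuration files):

        # poll every 30 sec
        monitor.pollingIntervalSec = 30
        # the effective spread time is the lesser of this value and
        # pollingIntervalSec minus 10 sec, i.e. min(50, 30 - 10) = 20 sec here
        monitor.snmpPollingSpreadOutTimeSec = 50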

  • NSGDB-30: added monitoring variables pythonErrors (a counter of exceptions raised by all Python hook scripts) and pythonErrorsRate, calculated as rate(pythonErrors), to track errors that occur in Python hook scripts. This includes exceptions raised both when the server loads a script and when it executes it. The default configuration displays the variable pythonErrorsRate in the Graphing Workbench but hides pythonErrors.

    Note

    Exceptions that happen when the server tries to load an updated script look like transient spikes in the graph of pythonErrorsRate, because the server continues to use the old version of the script if the new one fails to load. The server registers one error, but since it keeps using the old script, the error does not repeat. This is why pythonErrorsRate has a non-zero value for only one polling cycle and then returns to zero.
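
    As a rough illustration, assuming rate() yields a per-second rate computed over one polling interval: at a 1 min polling interval, a single failed script reload increments pythonErrors by 1, so pythonErrorsRate shows roughly 1/60 ≈ 0.017 errors per second for that one cycle and 0 afterwards.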

  • NSGDB-67: Added support for discovery and monitoring of the Dell Z9100 switch.

    Note

    A known bug in the switch software causes it to return bogus values when we walk the SNMP table dot1qTpFdbPort; these numbers are supposed to be bridge port numbers matching values from the SNMP table dot1dBasePortIfIndex, but they do not match. This breaks network topology discovery using switch CAM tables. Network topology can still be discovered using other means, such as LLDP, if the device runs it.

  • NET-1176: Improved discovery of interface VLAN membership on Cisco Nexus 3000 devices

  • default SNMP timeout changed to 1 sec and default number of retries to 2. This change affects only new installations, because these values come from the configuration file nw2.conf and existing installations already have this file with their previously configured values; see the sketch below.
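
    The fragment below is purely illustrative: the key names snmpTimeoutSec and snmpRetries are hypothetical placeholders, so check the nw2.conf shipped with this release for the actual parameter names:

        # hypothetical nw2.conf fragment; both key names are placeholders
        monitor.snmpTimeoutSec = 1
        monitor.snmpRetries = 2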

  • multiple improvements in the code that handles ZooKeeper disconnects

  • improvements in the code that handles RPC client timeouts and disconnects

  • support for third-party RPC clients and TrafficPredictor has been removed

  • NET-1212: additional logging in the tag updater

  • the size of the RPC client pool is now configurable via the configuration parameter rpc.connectionsPoolSize; see the sketch below
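
    A minimal nw2.conf sketch with an illustrative value (only the parameter name rpc.connectionsPoolSize comes from this release note):

        # maximum size of the RPC client connection pool; 10 is illustrative
        rpc.connectionsPoolSize = 10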

  • NET-1210: use single SNMP queries to discover the System table; add logging to reveal OIDs and responses for debugging; try twice if queries time out

  • The monitor sets the receive buffer size to a large value (26214400 bytes) on the socket used to send and receive SNMP queries. The actual receive buffer size depends on the maximum value configured in the kernel (on Linux, the net.core.rmem_max parameter).

1.18.2. Bug fixes

  • NET-1215: the server now retries the RPC call used to start device discovery on the monitor side when this call fails because of a timeout or because the RPC client could not be found. The monitor now retries the RPC call used to report completion of device discovery in case of errors, such as a timeout or the RPC client factory failing to create a new client.
  • NSGDB-69: when the cluster configuration file cluster.conf uses an include statement to include the contents of another .conf file, the server now watches the included file for changes and reloads the cluster configuration when a change is detected. Prior to this release, included files were not monitored for changes. See the example below.
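
    For example, assuming cluster.conf uses the usual HOCON include syntax (the file name devices.conf is an illustrative placeholder), changes to the included file now also trigger a cluster configuration reload:

        # cluster.conf
        include "devices.conf"   # edits to devices.conf are now detected too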