Release Notes 1.5.4
===================

NetSpyGlass v1.5.4

Important
---------

This version of NetSpyGlass (1.5.4) can work with both Java 7 and Java 8;
however, performance and memory handling are better when running on Java 8.
The next release of NetSpyGlass (1.6.0) will require Java 8. Please upgrade
the machines you use to run NetSpyGlass.

Support for InfluxDB 0.8 has been deprecated. This version still works with
InfluxDB 0.8, but support for it will be removed in the next release of
NetSpyGlass.

Improvements and New Features
-----------------------------

- This release introduces a distributed repository of device objects based on
  ZooKeeper. Secondary servers can now create pseudo-devices too, for example
  when they compute aggregate variables.

- Numerous improvements have been made to reduce Java heap usage and garbage
  collection pauses in large installations (several thousand devices and up to
  1 million variables in the primary server). See :ref:`performance_tuning`
  for more details.

- The following fields have been removed or modified to reduce memory
  footprint:

  * `DataSource.constraints` has been removed
  * `DataSource.oid` now has type String (NET-1224)
  * `MonitoringVariable.statistics` is now created on demand and is not stored
    permanently with the monitoring variable object

- Several improvements in the standard Python rules script significantly
  reduce the time needed to process monitoring data in large NetSpyGlass
  installations. For example, the operation that copies the latest value of
  the `ifAlias` variable to the `description` field of other interface-related
  variables is now implemented in Java instead of Python. Another operation
  that has moved to Java is the function that copies the tags
  `ifOperStatus.Up` and `ifOperStatus.Down` from the variable `ifOperStatus`
  to other interface-related variables.

- The Graphite connector no longer tries to maintain a persistent connection
  to the Carbon collector server; instead, it opens a connection to upload
  data and closes it when done.

- Configuration parameter `monitor.storage.expireVariablesForOneDeviceAtATime`
  has been deprecated.

- New configuration parameter `monitor.storage.graphite.uploadSpreadTime` can
  be used to control the time interval over which the Graphite connector
  spreads data upload. The interval is defined as a fraction of the polling
  interval; the value of this parameter is a floating-point number between
  0 and 1.

- Beginning with this version it is possible to run NSG monitors in an
  active-active configuration. Two or more monitors can be configured with
  identical or overlapping `allocation` lists in the `cluster.conf`
  configuration file; in this case NetSpyGlass assigns devices that match the
  `allocation` specification evenly between these monitors to spread the work.
  When one monitor goes offline, its devices are automatically reallocated to
  other monitors with a matching `allocation` configuration. Here is an
  example of a `cluster.conf` file using this feature. Monitors `mon1` and
  `mon2` have identical values for their `allocation` parameter, which means
  the server is going to distribute devices that fall into the subnets
  `${SUBNETS}` between these two monitors.

  Each monitor pushes collected variables to its respective secondary server
  (`leaf1` and `leaf2`), which in turn push to the primary server::

    PUSH_VARS = ${graphingWorkbench.variables}
    SUBNETS = [ "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24" ]

    cluster {
        members = [
            {
                name = PrimaryServer
                role = primary
            },
            {
                name = leaf1
                role = secondary,
                push = [
                    { server = PrimaryServer, variables = ${PUSH_VARS} }
                ]
            },
            {
                name = leaf2
                role = secondary,
                push = [
                    { server = PrimaryServer, variables = ${PUSH_VARS} }
                ]
            },
            {
                name = mon1
                role = monitor,
                allocation = ${SUBNETS},
                push = [
                    { server = leaf1, variables = ["*"] }
                ]
            },
            {
                name = mon2
                role = monitor,
                allocation = ${SUBNETS},
                push = [
                    { server = leaf2, variables = ["*"] }
                ]
            },
        ]
    }

- This version introduces a new mechanism for data push from secondary servers
  to the primary server, based on subscription. To activate it, add the
  parameter ``subscribe`` to the definition of the primary server in the file
  `cluster.conf`::

    cluster {
        members = [
            {
                name = PrimaryServer
                role = primary
                subscribe = ${graphingWorkbench.variables}
            },
            {
                name = leaf1
                role = secondary,
            },
            {
                name = leaf2
                role = secondary,
            },
        ]
    }

  The value of the parameter `subscribe` is a list of variable names, e.g.
  ``subscribe = [ ifInRate, ifOutRate ]``. The value shown in the example is a
  copy of the variables that appear in the Graphing Workbench, which is a
  reasonable default for the primary server. Note how in the example above the
  secondary servers no longer have the parameter `push`. This parameter is
  unnecessary because the primary can find variables automatically and ask
  their "owners" (servers "leaf1" and "leaf2") to push them.

  Subscription differs from the regular data push configured via the parameter
  `push` in the secondary server configuration in that secondary servers do
  not push variables to the primary unless these variables are used by UI or
  JSON API queries. In this case the primary starts with no variables and
  subscribes to them whenever the UI tries to access them. This helps reduce
  the number of variables in the data pool of the primary server, which is
  useful when a NetSpyGlass cluster works with several million monitoring
  variables.

  Subscription-based push is activated only when the parameter `subscribe` is
  present and its value is not an empty list. If this parameter is absent, the
  primary server relies on the static data push configuration in the secondary
  servers. If the parameter `subscribe` is missing and at the same time the
  secondary servers are not configured to push to the primary, the primary
  server will not have access to monitoring variables at all and the UI will
  appear broken. You need to restart the server when you add or remove the
  parameter `subscribe`, but changes to its value after it has been added do
  not require a restart.

  .. note::

     At the time of this release (v1.5.4) only the primary server can use
     subscription-based push.

- NET-1241: the monitor compares the number returned by OID RFC1213:ifNumber
  with the number of interfaces it actually discovered by walking various
  tables in the RFC1213 MIB, and retries discovery if the numbers do not
  match. This helps work around a corner-case failure when a device reports
  the number of interfaces via ifNumber but then silently fails without a
  timeout when we walk the RFC1213 MIB tables.

- Added configuration parameter `push.segmentSize`. Its value sets the maximum
  number of monitoring variables that can be pushed in one RPC call from
  monitor to server and from server to server. Pushing many variables in one
  call requires very large data structures to be created on both sides and
  increases the memory footprint of the server. Recommended values are in the
  range between 10 and 400. Changes to the value of this parameter require a
  server restart.

- Added configuration parameter `push.threads`. This parameter sets the number
  of threads used to make data push RPC calls in parallel. It can be used in
  combination with `push.segmentSize` to tune data push so that it does not
  require a lot of memory while all data can still be transferred within a
  time interval shorter than the polling interval (see the sketch after this
  list). Changes to the value of this parameter require a server restart.

- The size of the batch "put" operation for InfluxDB 0.9 is configurable and
  can be changed using configuration parameter
  `monitor.storage.influxdb.putBlockSize`. The default value is 1000; changes
  to this parameter require a server restart.
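
The push and storage parameters described above can be tuned together in the
configuration file. The following sketch shows one possible combination; it
assumes the parameters may be written as dotted paths at the top level of the
configuration file, and the values other than the documented default and
recommended range are illustrative assumptions, not recommendations::

    # Illustrative tuning sketch; adjust values for your installation.
    push.segmentSize = 200                            # recommended range is 10-400
    push.threads = 4                                  # assumed value; parallel push RPC calls
    monitor.storage.influxdb.putBlockSize = 1000      # documented default for InfluxDB 0.9
    monitor.storage.graphite.uploadSpreadTime = 0.5   # fraction of the polling interval (0..1)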

Bug fixes
---------

- The primary server now makes all device objects available to all cluster
  members rather than only those allocated to them. This includes
  pseudo-devices created when Python hook scripts create aggregate variables.
  When these pseudo-devices were not pushed to secondary servers, the
  corresponding aggregate variables were not accepted by them.

- Fixed a bug that sometimes made self-monitoring variables disappear.

- NSGDB-72: fixed autoscaling and prefix display for QoS-related variables.

- Fixed a bug that caused the server to throw ConcurrentModificationException
  in Python code that called ``filter_by_tags(import_var('someVariable'), tags)``,
  then created a new device by calling `new_var()`, and after that called
  `aggregate()` to iterate over the variables returned by `filter_by_tags()`
  (the sequence is sketched below). The exception was thrown only when this
  sequence ran for the very first time, when the call to `new_var()` actually
  created a new device; all subsequent calls worked as expected. The bug was
  introduced with the new feature that allows running Python code in multiple
  threads.
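
For reference, a minimal sketch of the rule script sequence that used to
trigger this exception is shown below. It is meant to run inside the
NetSpyGlass Python rules environment, which provides `import_var()`,
`filter_by_tags()`, `new_var()` and `aggregate()`; the variable names, the tag
list and the exact `aggregate()` call shown here are illustrative assumptions,
not copies of any shipped script::

    # Illustrative sketch of the formerly failing call sequence. The helper
    # functions are provided by the NetSpyGlass rules environment; the names,
    # tags and aggregate() signature used here are assumptions.

    tags = ['ifOperStatus.Up']                           # example tag selection
    selected = filter_by_tags(import_var('someVariable'), tags)

    # On the very first run this call also created a new pseudo-device.
    total = new_var('someAggregateVariable')             # example variable name

    # aggregate() iterates over the variables returned by filter_by_tags();
    # before this fix, the first run of this sequence could raise
    # ConcurrentModificationException.
    aggregate(total, selected)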