2.11. Distributed Monitoring and Discovery Strategies

NetSpyGlass can be configured to run in a distributed configuration (as a cluster) where the work is divided between individual NetSpyGlass processes to improve performance and scalability. We recommend running NetSpyGlass in cluster configuration when the number of monitored devices exceeds 1000 and the number of monitoring variables exceeds a few hundred thousand. See Cluster configuration to learn how to configure a NetSpyGlass cluster. This section discusses various cluster deployment strategies.

2.11.1. SNMP Polling

Only NetSpyGlass servers running with role monitor perform SNMP polling of the devices. At the same time, servers running with role monitor do not communicate with databases; only the primary server does that. Because of this design, we recommend running monitors close to the devices, while the primary server should run close to the users and databases. In networking terms, “close” means lower latency.

If you have multiple remote data centers, it makes sense to run separate monitor servers in each of them. These monitors still connect to the same primary server to pass collected data and receive configuration updates.

It is safe to “move monitors around” and try different configurations. When a monitor moves from one machine to another, it obtains its configuration from the primary server. There is very little state on the monitor side; all you lose is a couple of polling cycles' worth of data while the monitor restarts. This means you can try different configurations, moving the monitor from place to place, and all monitoring data collected in the process remains consistent and accessible through the UI and JSON API.

2.11.2. Configuration

Cluster configuration is defined in the file cluster.conf. Only one copy of this file is needed; it is located in the home directory of the primary NetSpyGlass server. See Cluster configuration for more details.

You do not need to restart any server or monitor when you change the cluster configuration, except when you delete a cluster member or add a new one.

2.11.3. Device Allocation

Devices are allocated to regions rather than to actual NetSpyGlass servers. Each server joins a region according to its -DREGION command line parameter, and the primary server then allocates devices to the servers with role monitor in each region. There can be multiple servers with role monitor in the same region; in this case, devices allocated to the region are evenly divided between the servers. If the servers in a region are overloaded (for example, they are running short on memory and the machines they run on do not allow you to add more), you can simply start NetSpyGlass on a new machine with the same -DREGION parameter and role monitor to expand capacity in the region, as shown in the sketch below. There is no need to restart any other server or change anything in the configuration of the primary server.
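For example, a minimal sketch of the shell variables in /etc/default/netspyglass for such an additional monitor (the region name “dfw” is only an example and must match a region defined in cluster.conf; this sketch assumes the primary runs the embedded zookeeper):

# additional monitor joining the existing region "dfw" to expand capacity
NAME=$(uname -n)                        # must be unique in the cluster
ROLE="monitor"
REGION="dfw"                            # same region as the existing monitors
ZK=${PRIMARY_SERVER_IP_ADDRESS}:2181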

Allocation of devices to regions is governed by the configuration file cluster.conf, see Cluster Configuration.

2.11.4. Zookeeper

NetSpyGlass cluster uses Zookeeper for service discovery and coordination. There are two options:

  • you can make the primary NetSpyGlass server start an embedded zookeeper server if you provide the command line parameter -DZK=embedded
  • if you already run zookeeper as part of your infrastructure, you can make the NetSpyGlass servers connect to it using the parameter -DZK. In this case, the value of the parameter should be a “zookeeper connect string”, which is a comma-separated list of IP addresses or names of the servers in your zookeeper cluster.

Important

If you choose to run zookeeper as an embedded server in the primary NetSpyGlass server, make sure the parameter -DZK of all other NetSpyGlass servers points to the primary. If you use external zookeeper, all NetSpyGlass servers in the cluster must use the same zookeeper cluster.
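For example, the value of the shell variable ZK for an external zookeeper cluster might look like this (the host names are hypothetical; port 2181 is the standard zookeeper client port):

# external zookeeper: comma-separated connect string of host:port pairs
ZK="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"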

2.11.5. Single server, single monitor

This is the default setup “out of the box” that works well for evaluations or small installations. The server has the name PrimaryServer and the combined roles primary,monitor. In this configuration the server uses the default built-in region name and cluster configuration, so the file cluster.conf is not needed. All devices are allocated to the same server since there is only one.
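A minimal sketch of the relevant shell variables in /etc/default/netspyglass for this setup, following the conventions of the full examples below (JVM_CLI is defined there; exact defaults may vary):

# single server combining both roles; the default built-in region is used,
# so cluster.conf and -DREGION are not needed
NAME=PrimaryServer
ROLE="primary,monitor"

# -DZK=embedded starts the embedded zookeeper server
SERVER_CLI="$JVM_CLI -DZK=embedded -DNAME=$NAME -DROLE=$ROLE"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"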

2.11.6. Single server, multiple monitor servers in one location

In this setup the NetSpyGlass cluster runs in one location (i.e. one data center). There is one primary server and several monitoring servers that perform SNMP polling and standard calculations on the collected data. All servers run in the same region; the name of the region does not matter. The primary server receives collected data from the monitors and stores it in the TSDB. It does not really matter which monitor out of the pool each device is assigned to, because the network latency between all monitors and all devices is approximately the same. All monitors push collected data directly to the primary server.

The configuration file cluster.conf is trivial:

cluster {
    regions = [
        {
            name = world
            allocation = [ "0.0.0.0/0" ]
        }
    ]
}

Each monitoring server should have the following parameters in its command line (defined in the file /etc/default/netspyglass):

# Server name must be unique
NAME=$(uname -n)

# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="monitor"

# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="world"

# monitor should know where to find zookeeper to connect to. This assumes
# the primary runs embedded zookeeper server
ZK=${PRIMARY_SERVER_IP_ADDRESS}:2181

#----------------------------------------------------------
#   Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=$ZK -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"

We use the machine name for the NetSpyGlass monitor server name because it is likely to be unique. You can use any other unique name you want. Note that the region name matches the one in the file cluster.conf and the role of the monitoring server is monitor. Each monitor server connects to the zookeeper server used to coordinate the work of the cluster. Assuming we use embedded zookeeper in the primary server, the value of the shell variable ZK should be its address and port “2181”.

The primary server has these parameters set as follows:

# Server name must be unique
NAME=$(uname -n)

# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="primary"

# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="world"

#----------------------------------------------------------
#   Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=embedded -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"

As you can see, only the role and the value of the command line parameter -DZK are different. The parameter -DZK=embedded tells the primary server to start an embedded zookeeper server.

2.11.7. Multiple servers and monitors in one region

In this setup we run a primary server, two monitors and a dedicated “compute” server in the same region.

Since we still have one region, the file cluster.conf looks the same as in the previous example:

cluster {
    regions = [
        {
            name = world
            allocation = [ "0.0.0.0/0" ]
        }
    ]
}

Monitoring servers perform basic calculations using the default Python rule set and upload the results to the compute and primary servers. In a NetSpyGlass cluster, servers automatically discover monitoring variables and the other servers that collect and prepare them.

The difference is in the command line of the compute server. Monitor servers have the same command line as in the previous example:

# Server name must be unique
NAME=$(uname -n)

# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="monitor"

# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="world"

# monitor should know where to find zookeeper to connect to. This assumes
# the primary runs embedded zookeeper server
ZK=${PRIMARY_SERVER_IP_ADDRESS}:2181

#----------------------------------------------------------
#   Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=$ZK -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"

The compute server has role secondary, but otherwise its command line is the same:

# Server name must be unique
NAME=$(uname -n)

# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="secondary"

# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="world"

# monitor should know where to find zookeeper to connect to. This assumes
# the primary runs embedded zookeeper server
ZK=${PRIMARY_SERVER_IP_ADDRESS}:2181

#----------------------------------------------------------
#   Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=$ZK -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"

The main difference in this setup is in the Python rule script that the compute server uses. Monitors perform basic calculations, such as processing the counter values they collect from devices via SNMP to compute rate variables, or normalizing variables for things like power supply and fan state, memory and CPU utilization, and so on. The results of these calculations are assigned to new monitoring variables, which are then pushed to the primary server to make them available to the UI and to send their values to the TSDB. Unlike monitor servers, which work with a subset of devices, the compute server receives data from all monitors and therefore has a “full picture of the world”. This means it can compute aggregate variables using information obtained from all devices.

Another use for this setup is to make the compute server generate alerts, that is, to be a dedicated alerts server. Depending on the condition function, alerts can be expensive and take a lot of CPU cycles. If your NetSpyGlass cluster runs with a fast polling interval (e.g. 30 sec), it may be beneficial to move alert generation to a dedicated server, separate from the primary server. Just as with the calculation of aggregate values, the difference is in the Python rule script used with the server.

It is also possible to run both a dedicated compute server and a dedicated alerts server in addition to the primary and some number of monitor servers. The compute, alerts and primary servers will find monitoring variables and arrange data transfer from the monitors automatically.

The rule processing script used by the compute server is configured in its nw2.conf configuration file (see Data Processing Rules). You can find more information about alerts and alerting rules in Alerting.

2.11.8. Multiple regions

In this scenario we run a single primary server and have multiple remote monitor servers that send data to it. These monitor servers are grouped in different regions. Each region may correspond to a data center or metro area. The idea is to have one or more monitors per region to keep SNMP polling as local as possible. In other words, we poll devices from the server that is located in the same data center or metro area and therefore has the lowest network latency to the devices it is responsible for. However, even if there is just one monitor server per region, other monitor servers will take over the load if it goes down, although polling may become suboptimal because of the added latency.

Suppose we have three locations, “SJC”, “IAD” and “DFW”, and want to run monitors in each. The file cluster.conf looks like this:

# this will probably list many more subnets
SJC_SUBNETS = [ "10.101.11.0/24", "10.101.6.0/24", ]
IAD_SUBNETS = [ "10.102.11.0/24", "10.102.12.0/24", ]
DFW_SUBNETS = [ "10.103.11.0/24", "10.103.13.0/24", ]

cluster {

    regions = [
        {
            name = sjc
            allocation = ${SJC_SUBNETS}
        }

        {
            name = iad
            allocation = ${IAD_SUBNETS}
        }

        {
            name = dfw
            allocation = ${DFW_SUBNETS}
        }
    ]
}

To make monitors “join” their respective regions, supply the command line parameter -DREGION. For example, the shell variables used in the file /etc/default/netspyglass look like this for the monitors in region “sjc”:

NAME=$(uname -n)
ROLE="monitor"
REGION="sjc"

If you want to run two or more monitors per region, they should have the same value of the variable REGION, but the value of the variable NAME must be unique, as in the sketch below.
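For example, two monitors in the region “sjc” could use the following values (the names are hypothetical; using $(uname -n) is usually sufficient as long as the machine names are unique):

# monitor 1 in region "sjc"
NAME=sjc-mon-1
ROLE="monitor"
REGION="sjc"

# monitor 2 in region "sjc"
NAME=sjc-mon-2
ROLE="monitor"
REGION="sjc"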