2.10. Cluster configuration

Note

The ability to run NetSpyGlass in a cluster configuration is currently in beta

NetSpyGlass can be configured to run as a cluster. The architecture is three-tier: “primary - secondary - monitor”. A NetSpyGlass instance can run in any one of these three roles. We refer to a NetSpyGlass instance as a “cluster member” when it runs as part of the cluster, regardless of its role.

The primary server loads the configuration from the config file and “owns” it. It creates and stores device objects in the database and manages views and maps. In essence, the primary server is identical to, and its configuration and data formats are mostly backwards compatible with, the NetSpyGlass server running in standalone mode. When the system works as a cluster, the primary server assigns devices to monitors using the parameter allocation from the configuration file cluster.conf. The value of this parameter is a list of subnet definitions in CIDR notation. A device is assigned to a monitor if its address belongs to one of the subnets in the allocation parameter of that monitor. More on this below.

The primary server runs the JSON API and users connect to it with their browsers. Secondary servers and monitors also answer JSON API queries, but they have a limited view of the world and some parts of the UI may not work.

There must be exactly one primary server, but there can be multiple monitors and secondary servers.

Secondary servers and monitors do not require an SQL database to store information about devices, and normally do not communicate with the TSDB, but they can be configured to upload data to the TSDB if needed.

Collected monitoring data flows from monitors to secondary servers, which can also exchange it among themselves. The primary server collects monitoring data from other cluster members and caches it, making it available for the UI. All cluster members (monitors, secondary and primary servers) can perform calculations on collected data if configured with a Python rules script, but monitors operate only on data collected from their limited set of devices. It is possible to use a different Python rules script with different cluster members.

Transfer of collected monitoring data from one cluster member to others is done by means of data push. A cluster member that creates monitoring variables pushes them to other members once it completes its monitoring cycle. Cluster members use a “subscription” mechanism to tell other cluster members what variables they need. This “subscription” mechanism is dynamic; it is designed to allow cluster members to find monitoring variables without any prior manual configuration.

All cluster members share their view of the full list of cluster members. Synchronization is achieved via ZooKeeper.

2.10.1. Command line parameters

2.10.1.1. Zookeeper

We use the command line parameter -DZK to pass the ZooKeeper connection string to NetSpyGlass servers and monitors.

2.10.1.1.1. Primary Server

Examples:

-DZK=embedded
-DZK=10.0.0.1:2181
-DZK=zk1:2181,zk2:2181,zk3:2181

2.10.1.1.2. Primary Server with embedded zookeeper

If the value of this parameter is embedded, the primary server launches an in-process ZooKeeper server that other cluster members will use. If you already run a ZooKeeper cluster as part of your infrastructure, you can use it by setting the value of this parameter to the corresponding connection string.

If you run embedded ZooKeeper (using the CLI argument -DZK=embedded), the ZooKeeper server binds to the address taken from the CLI argument -DzkAddress and the port taken from -DzkPort. If the argument -DzkAddress is not provided, the address is taken from the parameter ui.url in the configuration file nw2.conf, that is, it is the same as the one used for the UI url. The port defaults to 2181 if the argument -DzkPort is not provided. The value of the argument -DZK of other NetSpyGlass cluster members should match the configuration of the embedded ZooKeeper server in the primary.
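
For example, a primary server could run the embedded ZooKeeper bound to an explicit address and port (the address below is hypothetical); other cluster members would then use the same address and port in their -DZK argument:

-DZK=embedded -DzkAddress=10.0.0.1 -DzkPort=2181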

2.10.1.1.3. Secondary Servers and Monitors

Secondary servers and monitors cannot use -DZK=embedded and must be provided with a connection string containing the actual address of the ZooKeeper server or ensemble.
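
For example, if the primary server runs the embedded ZooKeeper at 10.0.0.1 (a hypothetical address), every secondary server and monitor would be started with:

-DZK=10.0.0.1:2181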

2.10.1.2. Namespace

ZooKeeper znodes created and monitored by NetSpyGlass cluster members are located in the namespace “netspyglass” by default. This means the znode path for all znodes begins with “/netspyglass”. The optional command line parameter -DZKNAMESPACE can be used to change this. This way, you can run multiple NetSpyGlass clusters using the same infrastructure ZooKeeper ensemble. Example:

-DZKNAMESPACE=netspyglass2 -DZK=zk1:2181,zk2:2181,zk3:2181

2.10.1.3. Name, Role and Region

Each cluster member must have three properties that define its role in the cluster and control device allocation to it. These properties are its name, its role and the region it belongs to.

The name is any word (no white space is allowed). The name is defined using the command line argument -DNAME. By default, the name of the primary server, or of the server running in single server configuration, is PrimaryServer. For secondary servers and monitors the name can be any word as long as it is unique. For example:

-DNAME=SJC1
-DNAME=SJC1-monitor
-DNAME=$(uname -n)

The last example shows how the machine name can be used as the cluster member name. Using shell command substitution ($(uname -n)) is possible because the command line of the NetSpyGlass process is defined in the file /etc/default/netspyglass, which is interpreted by the startup shell script.

Note

Cluster member name must be unique

Each member also has a role, which can be “primary”, “secondary” or “monitor”. Information about the role is passed via the command line argument -DROLE. The parameter -DROLE is mandatory when NetSpyGlass runs in a cluster configuration and optional when it runs in a single-server configuration.
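
For example, a monitor is started with:

-DROLE=monitor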

Each cluster member also belongs to a region. The region name appears in two places in the configuration: in the file cluster.conf and in the command line of each cluster member. A region name can be any word (no spaces are allowed) and must be unique throughout the cluster. For example:

-DREGION=sjc

See below for examples of how regions are configured in the cluster.conf configuration file.

Note

The primary server uses information found in the file cluster.conf to match devices to regions and then distributes devices evenly across all cluster members within each region. If the number of devices in a region makes the workload too heavy, you can just add a new server to the region to spread the work. There is no need to change anything in the central configuration files or to restart other cluster members when you add a new one.

CLI arguments are configured in the file /etc/default/netspyglass (or the launch.conf file, if you installed the system using the distribution tar archive).

A typical entry for the primary server with the name PrimaryServer, role primary and region world in /etc/default/netspyglass looks like this:

# Server name must be unique
NAME="PrimaryServer"

# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="primary"

# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="world"

#----------------------------------------------------------
#   Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=embedded -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"

For the secondary server that is a member of region SJC it could look like this:

# Server name must be unique
NAME="sjc-secondary"

# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="secondary"

# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="SJC"

ADDRESS_OF_PRIMARY=$HOST_NAME_OR_ADDRESS_OF_PRIMARY_SERVER

#----------------------------------------------------------
#   Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=$ADDRESS_OF_PRIMARY:2181 -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"

and for the monitor in region SJC:

# Server name must be unique
NAME="sjc-mon"

# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="monitor"

# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="SJC"

ADDRESS_OF_PRIMARY=$HOST_NAME_OR_ADDRESS_OF_PRIMARY_SERVER

#----------------------------------------------------------
#   Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=$ADDRESS_OF_PRIMARY:2181 -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"

The variables NAME, ROLE and REGION change in these configuration examples to reflect the server’s name, role and region. Note also that the parameter -DZK in the secondary server and the monitor points to the primary server’s address and port 2181. If you use an infrastructure ZooKeeper ensemble, the parameter -DZK should point to it in the command line configuration of all servers.

Cluster member names and roles do not need to appear in any other configuration file in the NetSpyGlass cluster because members discover each other automatically. This means you don’t need to make any changes in the configuration when you add a new cluster member: just configure its name, role and region in its own /etc/default/netspyglass file and launch the process, as in the sketch below. The NetSpyGlass primary server will notice the new cluster member and reallocate devices across members in the same region to rebalance the workload. This happens completely automatically.
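
For example, a minimal fragment of /etc/default/netspyglass for a new monitor joining the existing region SJC could look like this (the name is hypothetical; the rest of the file is the same as for the other monitors in that region):

# new monitor: unique name, existing role and region
NAME="sjc-mon-2"
ROLE="monitor"
REGION="SJC"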

2.10.2. Cluster Configuration

2.10.2.1. Configuration file cluster.conf

The configuration file cluster.conf is required only by the primary server. This file describes regions and defines the rules for device allocation to regions.

Here is what the file cluster.conf looks like:

REGION_SJC = [ "10.1.1.0/24", "10.23.1.0/24", ]
REGION_IAD = [ "10.101.11.0/24", "10.101.5.0/24", ]

cluster {
    regions = [
        {
            name = sjc
            allocation = ${REGION_SJC}
        }

        {
            name = iad
            allocation = ${REGION_IAD}
        }

    ]
}

Parameters:

  • The list regions defines the cluster regions. Each element in this list is a dictionary with keys name and allocation. The parameter -DREGION used on the command line of the cluster members must match a region name defined here.
  • allocation - a list of subnet addresses in CIDR notation. Network devices are allocated to regions according to this list (see the example below).
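
For example, with the cluster.conf above, a device with address 10.1.1.5 falls into the subnet 10.1.1.0/24 and is therefore allocated to the region sjc; it will be polled by one of the monitors started with a matching region (the name below is hypothetical):

-DNAME=sjc-mon-1 -DROLE=monitor -DREGION=sjc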

2.10.2.2. Configuration file nw2.conf

As mentioned above, cluster members automatically find monitoring variables and use the “subscription” mechanism to make the servers that produce these variables push them. In addition to this, there is a filter that can be used to limit the set of variables any given cluster member is allowed to subscribe to. This filter is configured in the file nw2.conf using the parameter subscription.filter. The value is a list of variable names this server is allowed to subscribe to. The default value is the same as the list of variables that can appear in the Graphing Workbench.

Note

You can make a server subscribe to all variables from all other servers in the cluster by assigning the value ["*"] to the parameter subscription.filter.
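
For example, in nw2.conf:

subscription.filter = ["*"]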

This filter can be used to further shard the workload across several secondary servers. For example, one can configure a dedicated secondary server to process QoS related data. To achieve this, add the following to its nw2.conf file:

subscription.filter = [
    tailDropsRateQueue0,
    tailDropsRateQueue1,
    tailDropsRateQueue2,
    tailDropsRateQueue3,
    tailDropsRateQueue4,
    tailDropsRateQueue5,
    tailDropsRateQueue6,
    tailDropsRateQueue7,

    redDropsRateQueue0,
    redDropsRateQueue1,
    redDropsRateQueue2,
    redDropsRateQueue3,
    redDropsRateQueue4,
    redDropsRateQueue5,
    redDropsRateQueue6,
    redDropsRateQueue7,

    totalDropsRateQueue0,
    totalDropsRateQueue1,
    totalDropsRateQueue2,
    totalDropsRateQueue3,
    totalDropsRateQueue4,
    totalDropsRateQueue5,
    totalDropsRateQueue6,
    totalDropsRateQueue7,

    txedPacketRateQueue0,
    txedPacketRateQueue1,
    txedPacketRateQueue2,
    txedPacketRateQueue3,
    txedPacketRateQueue4,
    txedPacketRateQueue5,
    txedPacketRateQueue6,
    txedPacketRateQueue7,

    txedBitRateQueue0,
    txedBitRateQueue1,
    txedBitRateQueue2,
    txedBitRateQueue3,
    txedBitRateQueue4,
    txedBitRateQueue5,
    txedBitRateQueue6,
    txedBitRateQueue7,

]

A server configured with a filter like this will not subscribe to any variables except those listed here. This can significantly reduce the amount of data it holds in its memory buffers and has to process with its Python rules script, which can be useful when there are millions of QoS-related variables. Another secondary server can be configured with a list that includes all other variables except these; that server will then perform calculations on everything except the QoS-related variables.

2.10.3. Known limitations as of time of this release

  • A secondary server cannot generate reports at this time; reports should be generated on the primary server.

2.10.4. Examples of cluster configurations

2.10.4.1. Single server configuration (the default)

The default NetSpyGlass configuration, when it runs with just one process, is to run the server with the combined role primary,monitor. In this configuration the same process performs SNMP polling, data collection and processing, and serves the UI. This is the default “out of the box” configuration; it does not require the file cluster.conf at all, and the appropriate CLI parameters come as part of the default /etc/default/netspyglass file.
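
A minimal sketch of the relevant lines in /etc/default/netspyglass for this mode, following the earlier examples (JVM_CLI is defined as in those examples):

# single server: combined role, embedded ZooKeeper, default name and region
NAME="PrimaryServer"
ROLE="primary,monitor"
REGION="world"
SERVER_CLI="$JVM_CLI -DZK=embedded -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION"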

2.10.4.2. Primary server and several monitors

This setup is recommended when you need NetSpyGlass to monitor over 2000 devices and the total number of monitoring variables reaches several hundreds of thousands. Sharding the workload across multiple monitors allows us to perform basic calculations there. These calculations include computing the rate of change of variables, as well as normalization and other standard calculations performed by the default Python rules script. Since each monitor server in this setup runs with only a subset of devices, it should not try to calculate any aggregate variables. This should be done in the primary server, which has the full set of variables from all devices.

In this configuration the primary server still performs two very different functions:

  • it controls network discovery, creates and manages device objects, builds and serves network maps and supports the UI.
  • it performs aggregate calculations with monitoring data.

These two functions place very different demands on the CPU and memory of the primary server, and as the number of devices and variables grows, they will eventually have to be separated. Up to a point, though, the primary server can do both kinds of work, especially if it runs on a powerful machine with many CPU cores and lots of RAM.

Here is what the cluster.conf file will look like:

REGION_SJC = [ "10.1.0.0/24", "10.2.0.0/24", ]
REGION_IAD = [ "10.101.0.0/24", "10.102.0.0/24", ]
REGION_ORD = [ "10.10.0.0/24", "10.11.0.0/24", ]

cluster {
    regions = [
        {
            name = sjc
            allocation = ${REGION_SJC}
        }

        {
            name = iad
            allocation = ${REGION_IAD}
        }

        {
            name = ord
            allocation = ${REGION_ORD}
        }
    ]
}

The primary server uses the default settings for its name, role and region (these are defined in its /etc/default/netspyglass file):

NAME="PrimaryServer"
ROLE="primary"
REGION="world"

Monitors should use a unique name and declare the region they want to join:

NAME=$(uname -n)
ROLE="monitor"
REGION="sjc"

This configuration uses the machine name for the cluster member name to reduce the number of configuration parameters that need to be managed. If you want to add another monitor to the same region, just launch a new NetSpyGlass process on a different machine with exactly the same file /etc/default/netspyglass. In other words, you need to manage a different copy of this file for each region, regardless of how many servers you run inside the region.

2.10.4.3. Dedicated compute server

As our cluster grows, the primary server becomes busy. At some point it becomes too busy to perform calculations within the allocated monitoring cycle and do its other tasks at the same time. The time it spends in JVM garbage collection grows and begins to interfere with its near-real-time functions, such as calculations on the monitoring data. To improve its performance we can split its functions and move calculations to a dedicated “compute” server. Here are the steps required to do this:

  • launch a new server with the role secondary and a region name that is not one of the regions defined in the cluster.conf file (so no devices are going to be directly allocated to it). In this example we use the region name world for this:

    NAME=$(uname -n)
    ROLE="secondary"
    REGION="world"
    
  • move your current Python rules script that performs aggregate calculations from the primary to this new compute server and disable it in the primary. To disable it, just comment out the parameter network.monitor.rules in its nw2.conf file (see the sketch below).
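
A minimal sketch of this change in the primary server’s nw2.conf (the script file name below is hypothetical):

# aggregate calculations are now performed by the dedicated compute server,
# so the rules script is disabled here:
# network.monitor.rules = "rules.py"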

That is it. Once the compute server starts up and joins the cluster, the primary will find it and subscribe to the variables the compute server creates. Data will begin to flow from the monitors to the compute server and then to the primary.

2.10.4.4. Dedicated compute and alerts servers

This is really just a variation of the previous configuration. Here, you launch two secondary servers in the region world. Both servers have the role secondary and unique names, but otherwise their /etc/default/netspyglass files are the same as the others. The main difference should be in their Python rules scripts: the compute server should calculate aggregate variables and the alerts server should calculate alerts. The subscription mechanism will be used to find variables, and the servers will arrange data push to each other automatically.
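
A minimal sketch of the name, role and region settings for the two servers (the names are hypothetical; the rest of their /etc/default/netspyglass files follows the earlier secondary server example):

# compute server
NAME="compute-1"
ROLE="secondary"
REGION="world"

# alerts server (on a different machine)
NAME="alerts-1"
ROLE="secondary"
REGION="world"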