2.11. High Availability
NetSpyGlass can be configured to run in a High Availability (HA) configuration where each role in the cluster (see Cluster configuration) is performed by a pair (or more) of servers rather than a single machine. For example, an HA configuration can have two primary servers, two alert servers, two compute servers, and so on. These servers form HA “groups”; a group can contain any number of servers, but only one of them is the “leader” at any given time. The leader performs all the functions of a NetSpyGlass server with the corresponding role, while the other servers are up and running with some of their functions suspended. These standby servers still receive all monitoring data just like the leader, but they do not run Python scripts and do not push data to TSDB. Because they receive monitoring data, their UI still works: they serve maps and graphs and respond to API queries, which opens interesting possibilities for load balancing.
2.11.1. Configuration
The command line parameter -DHA_ID=word turns on HA mode for the NetSpyGlass process and places it in the HA group with id word. This parameter should be added to the file /etc/default/netspyglass.
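A minimal fragment might look like this (the group id is just an illustration; complete example files are shown later in this section):

# fragment of /etc/default/netspyglass
HA_ID="nsg-development-primary-ha"
# pass the group id to the server process on its command line
SERVER_CLI="$SERVER_CLI -DHA_ID=$HA_ID"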
When the servers that belong to an HA group start up, they automatically elect the “leader”, which becomes active, completes the startup sequence, and begins to serve. If the servers in the HA group have the role “primary”, the leader of this group coordinates the work of the whole cluster by signalling the start of the polling cycle. This server also runs network discovery and generates scheduled reports. The other servers in an HA group with the role “primary” also connect to the SQL and TSDB databases, but they do not update objects in either one; they can only read them.
Non-leader servers in an HA group go into standby mode, in which they maintain their connection to Zookeeper so that they can take over immediately should the current leader go down. Because they already receive monitoring data and run most of the internal components, they take over very quickly when Zookeeper signals that the current leader has gone down.
Important
NetSpyGlass servers running in an HA group must use an external Zookeeper cluster. The HA configuration will not work if you try to run servers with embedded Zookeeper.
Which server in an HA group becomes active is undefined. Most often it is the one that was started first.
Note
Servers in an HA group must be configured to perform the same role in the NetSpyGlass cluster; that is, they must have the same region, role, configuration files, and scripts. However, their names must be different because server names in the cluster must be unique: the value of the -DNAME command line parameter must be unique across all servers in the NetSpyGlass cluster.
For example, here are the contents of the /etc/default/netspyglass files for two primary NetSpyGlass servers in an HA configuration:
# Server name must be unique
NAME="primary-1"
# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="primary"
# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="world"
# HA group id. This must be identical on all servers in the same HA group.
HA_ID="nsg-development-primary-ha"
#----------------------------------------------------------
# Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=embedded -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION -DHA_ID=$HA_ID"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"
And the other one:
# Server name must be unique
NAME="primary-2"
# supported roles are: "primary", "secondary", "monitor". Single-server configuration
# runs with combined role "primary,monitor"
#
ROLE="primary"
# regions are defined in the file cluster.conf; primary server allocates devices
# to regions according to the parameter "allocations" in the file cluster.conf.
# In single server configuration there is only one default region "world" and file
# cluster.conf is unnecessary
#
REGION="world"
# HA group id. This must be identical on all servers in the same HA group.
HA_ID="nsg-development-primary-ha"
#----------------------------------------------------------
# Server command line parameters
JVM_CLI="-XX:+UseG1GC -XX:MaxPermSize=256m -XX:G1HeapRegionSize=32M -XX:+ParallelRefProcEnabled"
SERVER_CLI="$JVM_CLI -DZK=embedded -DNAME=$NAME -DROLE=$ROLE -DREGION=$REGION -DHA_ID=$HA_ID"
SERVER_CLI="$SERVER_CLI -DCONFIG=${HOME}/nw2.conf -DLOG_DIR=${HOME}/logs"
All parameters are exactly the same except for NAME.
Not only must the command line parameters be the same on these two servers; the configuration files and scripts they use should also be identical.
Note
NetSpyGlass does not synchronize configuration files and other resources between servers in an HA configuration. You will probably need to employ an automated system to deploy and synchronize these files across the servers in an HA pair, along the lines of the sketch below.
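For example, a minimal sketch using rsync, run on the first server of the pair (the host name primary-2 and the scripts directory are assumptions; ~/nw2.conf matches the -DCONFIG path in the examples above):

# push the main configuration file referenced by -DCONFIG to the standby server
rsync -az ~/nw2.conf primary-2:
# push Python scripts and other shared resources the same way
# (the directory name is illustrative)
rsync -az --delete ~/scripts/ primary-2:scripts/

Running such a script from cron, or using a configuration management tool such as Ansible or Puppet, keeps both members of the pair consistent.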
An HA configuration makes sense for mission-critical NetSpyGlass clusters where the cost of running the machines that host standby NetSpyGlass servers is justified by the high cost of downtime. If the cluster is built with dedicated compute and alert servers in addition to the primary server and monitors, then the primary server and each compute and alert server should be configured as separate HA pairs. In a configuration like this, each HA pair should have its own separate HA id string used with the command line parameter -DHA_ID. Make sure the HA ids are not mixed up; that is, the same HA id should never be used with servers that have different roles.
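For example, a sketch of the HA_ID values in such a cluster (the compute and alert group id strings are illustrative; only the primary pair's id appears in the examples above):

# /etc/default/netspyglass on both servers of the primary HA pair
HA_ID="nsg-development-primary-ha"
# on both servers of the compute HA pair
HA_ID="nsg-development-compute-ha"
# on both servers of the alert HA pair
HA_ID="nsg-development-alert-ha"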
2.11.2. Health Checks
You can use the API call GET /v2/ping/net/:netid/ to quickly check whether the server is up and running and whether it is the leader of its HA pair. There are two versions of this call:

GET /v2/ping/net/:netid/
: this returns HTTP response 200 and “ok” in the response body if the server is up and running and is the leader of its HA pair. If the server is not the leader, this call still returns HTTP response 200, but the body says “standby”.

GET /v2/ping/net/:netid/se/
: this returns HTTP response 200 and “ok” in the response body if the server is up and running and is the leader of its HA pair. If the server is not the leader, this call returns HTTP response 503 (Service Unavailable) and the body says “standby”.
Your load balancer can use /v2/ping/net/:netid/ or /v2/ping/net/:netid/se/ to determine whether it should send requests to the server.
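For example, a quick manual check with curl (the host name primary-1, port 8080, and network id 1 are assumptions, not part of the product; substitute the values for your installation):

# ask the /se/ variant for the HTTP status code only
curl -s -o /dev/null -w '%{http_code}\n' http://primary-1:8080/v2/ping/net/1/se/
# prints 200 on the leader and 503 on a standby server

The /se/ variant is the natural choice for a load balancer health check because standby servers fail it. A sketch of an HAProxy backend built on it (server names and port are again assumptions):

# haproxy.cfg fragment: send traffic only to the current HA leader
backend netspyglass
    option httpchk GET /v2/ping/net/1/se/
    server primary-1 primary-1:8080 check
    server primary-2 primary-2:8080 check

If you would rather spread read-only UI and API load across the standby servers as well, health-check the variant without /se/, which returns 200 on every healthy server.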