.. _syslog_alerts: Syslog alerts ============== In contrast to :ref:`monitoring variables` that are built around time series, which are evenly distributed along the time line, Syslog messages are sporadic because monitored devices send them at any moment. Therefore, Syslog alerts are based on number of similar messages received from a particular monitoring device during certain period of time. Because Syslog messages are irregular events, they do not fit the model of evenly spaced time series data. NetSpyGlass stores syslog messages in ElasticSearch Database but provides consistent NsgQL API for access that can be used to build alerts in exactly the same way as alerts built around time series data. Simple log record match ^^^^^^^^^^^^^^^^^^^^^^^ In this example, we build an alert that counts number of times message matching certain simple pattern appears in the log during specified interval. The alert activates if the number is greater than threshold, which means we have found lots of log records like that in a short amount of time. Consider the following log record that was generated by Ubiquiti wifi access point: .. code-block:: none <30>Apr 20 19:51:11 unifi-2-office 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=77 anomalies=tcp_latency We could build an alert to watch this kind of message and notify us when the quality of WiFi connection deteriorates. The simplest case is to just count messages with word `anomalies=tcp_latency` and notify when the number of these inside of 10 min interval goes above a threshold. The log record appears in ElastiSearch as a document with following fields (some fields have been removed for brevity): .. code-block:: json { "logSyslogSeverityCode": 6, "deviceId": 4, "productName": "UAP-AC-LR", "sysDescr": "UAP-AC-LR 6.0.15.13647", "chassisId": "80:2a:a8:10:09:16", "SoftwareRev": [ "6.0.15.13647" ], "logSyslogFacilityName": "daemon", "sysName": "unifi-2-office", "prio": "30", "VlanId": [ "1" ], "Role": [ "Router", "WirelessAP" ], "SerialNumber": [ "80:2a:a8:10:09:16" ], "device": "unifi-2-office", "physicalDevice": true, "boxDescr": "UAP-AC-LR", "logText": "802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=77 anomalies=tcp_latency", "logTimestamp": "2022-04-21T02:51:11.383Z", "VlanName": [ "default" ], "logSource": "unifi-2-office", "Vendor": [ "Ubiquiti" ], "logSyslogPriority": 30, "logSyslogSeverityName": "info", "lastSuccessfulDiscoveryTime": "2022-04-20T01:04:17.153Z", "logSyslogText": "<30>Apr 20 19:51:11 unifi-2-office 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=77 anomalies=tcp_latency", "Model": [ "UAP-AC-LR" ], "time": 1650509471384, "agentUuid": "9a8d2ae6-c042-11ec-914f-7d41e27d963e", } NsgQL query to match this log record and others like it, can look like this .. code-block:: none SELECT time,deviceId,device,logText FROM syslog WHERE MATCH_PHRASE(logText, 'anomalies=tcp_latency') and time between 'now-1h' and now you can try this query using command line tool `nsgql` to get an idea of what it finds A note about use of function `MATCH_PHRASE()` to match log message text. NetSpyGlass stores syslog records in Elastic Search where they are tokenized and indexed. The log record is split and processed as set of words. You can use NsgQL operators `LIKE` and `REGEXP` to match log records, but they are going to operate on words rather than log record as a whole. The log record from our example has many words separated by spaces. An attempt to match this text using regular expression `wireless_agg_stats.log_sta_anomalies.*anomalies=tcp_latency` won't work: .. code-block:: none logText REGEXP 'wireless_agg_stats.log_sta_anomalies.*anomalies=tcp_latency' .. code-block:: none # nsgql "SELECT time,deviceId,device,logText FROM syslog WHERE logText REGEXP 'wireless_agg_stats.log_sta_anomalies.*anomalies=tcp_latency'" -------------+----------+--------+-------- time (local) | deviceId | device | logText -------------+----------+--------+-------- This query does not match anything because REGEXP operates on words rather than entire string. Instead, you need to use function MATCH_PHRASE to match entire string. This query works: .. code-block:: none MATCH_PHRASE(logText, 'wireless_agg_stats.log_sta_anomalies*') .. code-block:: none # nsgql "SELECT time,deviceId,device,logText FROM syslog WHERE MATCH_PHRASE(logText, 'wireless_agg_stats.log_sta_anomalies*') limit 10" ---------------------------+----------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- time (local) | deviceId | device | logText ---------------------------+----------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 2022-04-20 01:49:56.308000 | 11 | unifi-6 | U2HSR,802aa82ca0f6,v4.3.28.11361 : mcad[1112]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:2d:a0:f6 radio=wifi0 vap=ath0 sta=00:23:38:10:73:df satisfaction_now=78 anomalies=tcp_latency 2022-04-20 01:49:59.850000 | 11 | unifi-6 | U2HSR,802aa82ca0f6,v4.3.28.11361 : mcad[1112]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:2d:a0:f6 radio=wifi0 vap=ath0 sta=00:23:38:10:73:df satisfaction_now=78 anomalies=tcp_latency 2022-04-20 01:50:10.169000 | 11 | unifi-6 | U2HSR,802aa82ca0f6,v4.3.28.11361 : mcad[1112]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:2d:a0:f6 radio=wifi0 vap=ath0 sta=00:23:38:10:73:df satisfaction_now=78 anomalies=tcp_latency 2022-04-20 01:51:46.249000 | 11 | unifi-6 | U2HSR,802aa82ca0f6,v4.3.28.11361 : mcad[1112]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:2d:a0:f6 radio=wifi0 vap=ath0 sta=00:23:38:10:73:df satisfaction_now=73 anomalies=tcp_latency 2022-04-20 01:52:54.987000 | 11 | unifi-6 | U2HSR,802aa82ca0f6,v4.3.28.11361 : mcad[1112]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:2d:a0:f6 radio=wifi0 vap=ath0 sta=00:23:38:10:73:df satisfaction_now=79 anomalies=tcp_latency 2022-04-20 01:53:03.488000 | 11 | unifi-6 | U2HSR,802aa82ca0f6,v4.3.28.11361 : mcad[1112]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:2d:a0:f6 radio=wifi0 vap=ath0 sta=00:23:38:10:73:df satisfaction_now=79 anomalies=tcp_latency 2022-04-20 01:53:08.700000 | 11 | unifi-6 | U2HSR,802aa82ca0f6,v4.3.28.11361 : mcad[1112]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:2d:a0:f6 radio=wifi0 vap=ath0 sta=00:23:38:10:73:df satisfaction_now=79 anomalies=tcp_latency 2022-04-20 01:53:37.502000 | 3 | unifi-1 | 44d9e790d8e3,UAP-AC-Lite-6.0.15+ mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=44:d9:e7:92:d8:e3 radio=wifi1 vap=ath2 sta=3e:64:3f:88:1d:c6 satisfaction_now=69 anomalies=tcp_latency 2022-04-20 01:55:53.790000 | 11 | unifi-6 | U2HSR,802aa82ca0f6,v4.3.28.11361 : mcad[1112]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:2d:a0:f6 radio=wifi0 vap=ath0 sta=00:23:38:10:73:df satisfaction_now=79 anomalies=tcp_latency 2022-04-20 01:55:54.259000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=79 anomalies=tcp_latency ---------------------------+----------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Function `MATCH_PHRASE` implements ElasticSearch phrase query ( see https://opendistro.github.io/for-elasticsearch-docs/docs/sql/sql-full-text/#match-phrase) and has the same restrictions as ElasticSearch phrase query ( https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html ). Essentially it searches substring but also allows you to use wildcard '*'. We could extend `MATCH_PHRASE` match to make it more specific by including the next term in the original log message: .. code-block:: none MATCH_PHRASE(logText, 'wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16') .. code-block:: none # nsgql "SELECT time,deviceId,device,logText FROM syslog WHERE MATCH_PHRASE(logText, 'wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16') and time between 'now-10m' and now " ---------------------------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- time (local) | deviceId | device | logText ---------------------------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 2022-04-21 03:23:34.969000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=67 anomalies=tcp_latency 2022-04-21 03:23:46.782000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=67 anomalies=tcp_latency 2022-04-21 03:23:50.025000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=67 anomalies=tcp_latency 2022-04-21 03:24:05.089000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=67 anomalies=tcp_latency 2022-04-21 03:24:35.197000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=78 anomalies=tcp_latency 2022-04-21 03:24:52.724000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=78 anomalies=tcp_latency 2022-04-21 03:25:05.346000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=78 anomalies=tcp_latency 2022-04-21 03:25:06.541000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=78 anomalies=tcp_latency 2022-04-21 03:25:35.612000 | 4 | unifi-2 | 802aa8100916,UAP-AC-LR-6.0.15+13 mcad: mcad[1330]: wireless_agg_stats.log_sta_anomalies(): bssid=80:2a:a8:11:09:16 radio=wifi0 vap=ath0 sta=00:23:38:10:73:01 satisfaction_now=76 anomalies=tcp_latency ---------------------------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- now the query matches only one access point using its specific BSSID. Once you refine the query to match exactly log records you need, you can build the alert: .. code-block:: python from nw2functions import * def alert_interface_flapping(log): alert( name='wifiAnomalies', input=query("FROM syslog WHERE MATCH_PHRASE(logText, 'wireless_agg_stats.log_sta_anomalies*') GROUP BY deviceId"), condition=lambda _, value: value >= 10, description='WiFi AP reports anomalies', duration=600, percent_duration=100, notification_time=300, streams=['log'], fan_out=True ) This alert will match log records and group them by device ID. Device name and ID in this case refer to the device that generated log record, that is, WiFi access point. The alert analyzes log records sent by devices in the most recent 10 minutes and activates if any of the devices sent more than 10 records. This alert is rather simple and does not differentiate log records by other parameters that can be found there. For example, we could find the value of parameter `satisfaction_now` and compare it to a threshold. Or we could extract the name of the radio interface and build alert that counts anomalies per device-radio pair. These are advanced techniques that require Grok expressions. Alerts that match device and component ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Consider an alert that has to trigger on network interface flapping. The input for this alert is a syslog message that reports interface state transition. The message usually includes some keywords that can be used to identify it, for example "interface down" or "changed state to down", depending on the vendor. The message also includes device name and interface name. The alert we want to build should react to the number of times an interface of a device changes state over specified interval of time. In other words, we should group log records by (device,interface) pair and count the number of times the same pair appears in the log during given interval of time. The first step is to identify device and interface names. Both appear as words in the log message which we need to recognize and extract. This is done with the help of Grok expressions. We assume that the system has been configured with Grok pattern to match these messages (see :ref:`Interface down`). Grok pattern used in referenced example produces at least two keys relevant to our alert: it finds interface name and assigns value of the key `logIfLocalName` and it finds its new state which is assigned to the key `logIfLocalOperStatus`. syslog message .. code-block:: none <13>May 18 11:22:43 172.16.1.10 Line protocol on Interface GigabitEthernet0/10, changed state to down produces document that look like this: .. code-block:: json { "logTimestamp": "2021-05-18T11:22:43.000Z", "logSource": "172.16.1.10", "logSyslogPriority": 13, "logSyslogFacilityCode": 1, "logSyslogFacilityName": "user", "logSyslogSeverityCode": 5, "logSyslogSeverityName": "notice", "logSyslogText": "<13>May 18 11:22:43 172.16.1.10 Line protocol on Interface GigabitEthernet0/10, changed state to down", "logText": "Line protocol on Interface GigabitEthernet0/10, changed state to down", "logIfLocalName": "GigabitEthernet0/10", "logIfLocalOperStatus": "down" } Our new alerts is going to count how many messages like this we have seen for each unique pair device-interface over specified interval of time. If the number is greater than some predefined threshold, the alert activates to indicate interface flapping event. To count events like that, we would need to group syslog events by device name-interface pairs; in SQL this is done with a `GROUP BY` clause in the NsgQL expression passed as argument to Python function `query`. We use GROUP BY to group by device ID and component (interface) index. In the simplest scenario the python function would look like this: .. code-block:: python from nw2functions import * def alert_interface_flapping(log): alert( name='interfaceFlapping', input=query("FROM syslog WHERE logIfLocalOperStatus = 'down' GROUP BY deviceId,ifCompIndex"), condition=lambda _, value: value >= 10, description='Interface flapping', duration=60, percent_duration=100, notification_time=300, streams=['log'], fan_out=True ) Alerting framework recognizes that input query has to be executed against syslog data and transforms original query into the aggregated one in order to get the number of messages that indicating that an interface went down, per unique device-interface pair during specified time interval defined by the parameter `duration`. In this example time interval is 1 min and condition function triggers the alert when returned value is greater or equal to 10. This means the alert triggers when it finds 10 or more log lines that say the same interface of the same device went down during most recent 1 min interval. NetSpyGlass expands partial NsgQL query passed as a parameter to function query() by adding `SELECT` and time interval matchHere is the query similar to the one NetSpyGlass service would generate for the alert (note that it is not exactly the same though): .. code-block:: sql select count(*), deviceId, ifCompIndex from syslog where logIfLocalOperStatus = down and logTimestamp between 'now-1m' and now GROUP BY deviceId, ifCompIndex The value of the ``count(*)`` column produced by the query is passed to the alert condition function as its second argument. In the example shown above alert is going to activate only when number of messages indicating change of interface state to down is greater or equal to 10.