Tech blog: How check_mk monitors logfiles

Monitoring the contents of logfiles is an especially challenging task for a Nagios administrator. The key difficulty is that log messages are event-based by nature, whereas Nagios is based on states. Check_mk's logwatch mechanism overcomes this problem by defining the state of a logfile in terms of acknowledgement: a logfile is OK as long as there are no unacknowledged critical log messages.

At the beginning of monitoring, a logfile starts in the state OK, regardless of its contents. When a new critical message is seen in the file, it is stored on the Nagios server for reference by the administrator. The state of the logfile changes to CRITICAL and stays there until the administrator acknowledges the messages. New critical messages arriving while in the CRITICAL state are simply stored and do not change the state.
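
The state logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Check_mk's actual code; the class name `LogfileState` and its methods are hypothetical.

```python
class LogfileState:
    """Hypothetical model of one monitored logfile on the monitoring server."""

    def __init__(self):
        # Unacknowledged critical messages kept for the administrator.
        self.stored_messages = []

    def add_critical(self, message):
        # New critical messages are only stored; the state is derived below.
        self.stored_messages.append(message)

    def acknowledge(self):
        # Acknowledging means deleting all stored messages.
        self.stored_messages = []

    @property
    def state(self):
        # OK means "no unacknowledged critical messages".
        return "CRITICAL" if self.stored_messages else "OK"
```

Because the state is derived from the stored messages, a second critical message arriving in the CRITICAL state changes nothing, and only an acknowledgement returns the logfile to OK.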

Check_mk provides a web page, logwatch.php, that displays log messages and allows you to delete (and thus acknowledge) them easily.

Logwatch on Linux and UNIX

Installing the logwatch extension

Logfiles on Linux and UNIX are monitored with the logwatch extension for the check_mk_agent. In the directory /usr/share/check_mk/agents you will find the file mk_logwatch. It is a small Python program that must be installed into the plugins directory of the agent (you specify that directory while running setup.sh). The default path for the plugins directory is /usr/lib/check_mk_agent/plugins. Please make sure that your host has Python version 2.3 or later installed. On Linux this is most probably the case. On UNIX you will probably have to install it.

On Linux, you can alternatively install the logwatch extension via its RPM or DEB package.

Configuration

Logwatch needs to know which files to monitor and which patterns to look for. This is done in the configuration file logwatch.cfg on each target host. That file is searched for in the following directories:

In the directory specified by the environment variable LOGWATCH_DIR.
In the check_mk configuration directory you specified during setup.sh.
If LOGWATCH_DIR is not set and mk_logwatch is called manually, then it looks in the current directory for that file.

It is also possible to split the configuration into multiple files. Just create a folder named logwatch.d in LOGWATCH_DIR. Inside this directory you can place multiple files ending with .cfg.

If you've used the DEB or RPM package for installation or used the default settings for setup as root, the path to the file is /etc/check_mk/logwatch.cfg. That file lists all relevant logfiles and defines patterns that should indicate a critical or warning level if found in a log line. The following example defines some patterns for /var/log/messages:

/etc/check_mk/logwatch.cfg

/var/log/messages
 C Fail event detected on md device
 O Backup created*
 I mdadm.*: Rebuild.*event detected
 W mdadm\[

Each pattern is a regular expression and must be prefixed with one space, one of C, W, O or I, and another space. The example above means:

Lines containing Fail event detected on md device are critical.
Lines containing Backup created, such as Backup created on 2012-11-02, are OK.
Lines containing mdadm, then something, then Rebuild, then something else and then event detected will be ignored.
All other lines containing mdadm[ are warnings.
All other lines will also be ignored.
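
To make the evaluation order concrete, here is a minimal Python sketch of how such a pattern list could be applied. It assumes that patterns are tried from top to bottom, that the first one found anywhere in the line wins, and that unmatched lines are ignored, which is what the list above implies. The function name `classify` is hypothetical and this is not the code of mk_logwatch itself.

```python
import re

# Patterns from the example for /var/log/messages, in file order.
PATTERNS = [
    ("C", r"Fail event detected on md device"),
    ("O", r"Backup created*"),
    ("I", r"mdadm.*: Rebuild.*event detected"),
    ("W", r"mdadm\["),
]


def classify(line, patterns=PATTERNS):
    """Return the level of the first pattern found in the line."""
    for level, pattern in patterns:
        if re.search(pattern, line):
            return level
    return "I"  # lines matching no pattern are ignored
```

The order matters: a rebuild message from mdadm also contains mdadm[, but it never reaches the W pattern because the I pattern is listed first.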

You may list several logfiles separated by spaces:

/etc/check_mk/logwatch.cfg

/var/log/kern /var/log/kern.log
 C panic
 C Oops

It is also allowed to use shell globbing patterns in file names:

/etc/check_mk/logwatch.cfg

/sapdata/*/saptrans.log
 C critical.*error
 C some.*other.*thingy

An arbitrary number of such chunks can be listed in logwatch.cfg. Empty lines and comment lines will be ignored. This example defines different patterns for several logfiles:

/etc/check_mk/logwatch.cfg

# This is a comment: monitor system messages
/var/log/messages
 C Fail event detected on md device
 I mdadm.*: Rebuild.*event detected
 W mdadm\[

# Several instances of SAP log into different subdirectories
/sapdata/*/saptrans.log
 C critical.*error
 C some.*other.*thingy

Rewriting lines

It is possible to rewrite lines by simply adding a second rule, beginning with R, after the matching pattern:

/etc/check_mk/logwatch.cfg

/var/log/messages
 C Error: (.*)
 R There is error: \1

You can group multiple matches with ( ) and use them in the rewrite with \1, \2, and so on.
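
The group mechanism works like backreferences in Python's `re` module. The following sketch shows the idea with the C and R rules from the example; the helper `rewrite` is hypothetical, and it assumes that the whole line is replaced by the expanded rewrite text.

```python
import re


def rewrite(line, pattern, template):
    """Replace a matching line by the template, filling in \\1, \\2, ..."""
    match = re.search(pattern, line)
    if match is None:
        return line  # non-matching lines are left untouched
    return match.expand(template)


# Pattern and rewrite text taken from the configuration example.
print(rewrite("Error: disk full", r"Error: (.*)", r"There is error: \1"))
```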

Merging multiple lines

Logwatch can be configured to process multiple lines together as one log line. This is useful, for example, for processing Java stack traces. To configure this, add a second rule, beginning with A, after the matching pattern. For example:

/etc/check_mk/logwatch.cfg

/var/log/messages
 C Error
 A ^\s

This appends every line that begins with a space or tab and directly follows a line containing Error to that line, so that they form a single log line.
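
A small Python sketch of this merging step, using the two patterns from the example, might look as follows. The function name `merge_continuations` is hypothetical, and joining the lines without any separator is an assumption made for the illustration.

```python
import re


def merge_continuations(lines, match_pattern=r"Error", append_pattern=r"^\s"):
    """Append continuation lines to the preceding matching line."""
    merged = []
    appending = False  # True while the current logical line matched
    for line in lines:
        if appending and re.search(append_pattern, line):
            merged[-1] += line  # continuation of the previous match
        else:
            merged.append(line)
            appending = re.search(match_pattern, line) is not None
    return merged
```

An indented line that does not follow a matching line stays a line of its own, since the A rule only applies directly after a match.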

Logfile options

mk_logwatch allows you to limit the time needed to parse the new messages in a logfile. This helps in cases where logfiles are growing very fast (e.g. due to a recurring error, an endless loop or similar). Those cases often arise in the context of Java application servers logging long stack traces.

You can limit the number of new lines to be processed in a logfile as well as the time spent parsing the file. This is done by appending options to the filename lines:

/etc/check_mk/logwatch.cfg

/var/log/foobar.log maxlines=10000 maxtime=3 overflow=W
 C critical.*error
 C some.*other.*thingy

There are also options for limiting the length of the lines in a logfile and for getting a warning if the size of a logfile is too large (e.g. because of a failed logfile rotation).

The options have the following meanings:

maxlines	The maximum number of new log messages that will be parsed in one run in this logfile.
maxtime	The maximum time in seconds that will be spent parsing the new lines in this logfile.
overflow	When either the number of lines or the time is exceeded, an artificial logfile message is appended so that you will be warned. You can set the class of that message to `C`, `W` or `I`. Setting `overflow=I` will silently ignore any succeeding messages. If you leave out this option, `C` is assumed.
nocontext	This option can be used to disable processing of context log messages, which occur together with a pattern matched line. To disable processing, add `nocontext=1` as option.
maxlinesize	(since 1.2.6) The maximum number of characters that are processed of each line of the file. If a line is longer than this, the rest of the line is truncated and the word `[TRUNCATED]` is appended to the line. You can filter for that word in the expressions if you like.
maxfilesize	(since 1.2.6) The maximum number of bytes the logfile is expected to be in size. If the size is exceeded, an artificial logfile message with the classification `W` is created once. The text of this warning will be: `Maximum allowed logfile size (12345 bytes) exceeded.` You cannot reclassify this line in the configuration of the plugin. If you need a reclassification, please do this on the Check_MK server.

Note (1): When the number of new messages or the processing time is exceeded, the unprocessed new log messages will be skipped and not parsed even in the next run. That way the agent always keeps in sync with the current end of the logfile. It follows that you might have to check the contents of the logfile manually if an overflow happened. We recommend leaving the overflow level set to C.

Note (2): It is not necessary to specify both maxlines and maxtime. It is also allowed to specify only one limit. The default is not to impose any limit at all.
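
The limits can be pictured with the following Python sketch. It is not taken from mk_logwatch: the function names `read_limited` and `truncate_line` and the texts of the overflow notes are made up for illustration, and only the option names and the `[TRUNCATED]` marker come from the description above.

```python
import time


def truncate_line(line, maxlinesize):
    """Cut a line after maxlinesize characters and mark it as truncated."""
    if len(line) <= maxlinesize:
        return line
    return line[:maxlinesize] + "[TRUNCATED]"


def read_limited(lines, maxlines=None, maxtime=None, overflow="C"):
    """Return the lines processed within the limits and an overflow note.

    The note is None if no limit was hit, otherwise a tuple of the
    overflow class and a (hypothetical) message text.
    """
    started = time.monotonic()
    processed = []
    for line in lines:
        if maxlines is not None and len(processed) >= maxlines:
            return processed, (overflow, "line limit reached")
        if maxtime is not None and time.monotonic() - started > maxtime:
            return processed, (overflow, "time limit reached")
        processed.append(line)
    return processed, None
```

In this sketch the remaining lines are simply dropped once a limit is hit, mirroring Note (1): they are not picked up again in the next run.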

Filtering filenames with regular expressions

Sometimes the file matching patterns with * and ? are not specific enough to select the right logfiles. In such a case you can use the options regex or iregex to further filter the filenames found by the pattern. Here is an example:

/etc/check_mk/logwatch.cfg

/var/log/*.log regex=/[A-Z]+\.log$
 C foo.*bar
 W some.*text

This only includes files whose path ends with a /, followed by one or more upper case letters, followed by .log, such as /var/log/FOO.log. The file /var/log/bar.log would be ignored by this line.

regex	Extended regular expression that must be found in the file name. Otherwise the file will be ignored. Use `^` for matching the beginning of the path and `$` for matching the end.
iregex	The same as `regex`, but the match is made case insensitive.

Note: In each logfile line you can use regex and iregex at most once.
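
The two-stage selection can be sketched in Python as follows. The function `select_logfiles` is hypothetical and works on a plain list of path names so that it can be tried without touching the file system. Note that `fnmatch` lets * match a / as well, which a real shell glob would not, and that Python's regular expression syntax is used here as a stand-in for the plugin's.

```python
import fnmatch
import re


def select_logfiles(paths, glob_pattern, regex=None, iregex=None):
    """Keep paths matching the glob pattern and the optional regex filters."""
    selected = []
    for path in paths:
        if not fnmatch.fnmatchcase(path, glob_pattern):
            continue
        if regex is not None and not re.search(regex, path):
            continue
        if iregex is not None and not re.search(iregex, path, re.IGNORECASE):
            continue
        selected.append(path)
    return selected


paths = ["/var/log/FOO.log", "/var/log/bar.log", "/var/log/messages"]
print(select_logfiles(paths, "/var/log/*.log", regex=r"/[A-Z]+\.log$"))
```

With `iregex` instead of `regex`, the same expression would also accept /var/log/bar.log.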

State Persistency

In order to send only new messages, mk_logwatch remembers the current byte offset of each logfile seen so far. It keeps that information in /etc/check_mk/logwatch.state. If a logfile is scanned for the very first time, all existing messages are considered to be historic and are ignored, regardless of any patterns. This behaviour is important. Otherwise you would be bombarded with thousands of ancient messages when check_mk runs for the first time.
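
The offset bookkeeping can be illustrated with a short Python sketch. It keeps the offsets in a plain dictionary instead of a state file, and the function name `read_new_data` is hypothetical; the real plugin may differ in details such as how it handles rotated files.

```python
import os
import tempfile


def read_new_data(path, offsets):
    """Return the bytes appended to path since the last call."""
    if path not in offsets:
        # First scan: everything already in the file is historic.
        offsets[path] = os.path.getsize(path)
        return b""
    with open(path, "rb") as logfile:
        logfile.seek(offsets[path])
        data = logfile.read()
    offsets[path] += len(data)
    return data


# Usage: the first scan ignores old content, the second sees only new lines.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_log = os.path.join(tmpdir, "demo.log")
    demo_offsets = {}
    with open(demo_log, "wb") as f:
        f.write(b"ancient message\n")
    first_scan = read_new_data(demo_log, demo_offsets)
    with open(demo_log, "ab") as f:
        f.write(b"fresh message\n")
    second_scan = read_new_data(demo_log, demo_offsets)
```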

Context

When something bad happens, it usually leaves more than one single line in the logfile. In order to make error diagnosis easier, logwatch always sends all new lines seen in a logfile if at least one of those lines is classified as warning or critical. If you monitor each host once a minute (a quasi-standard with Nagios), you will then see all messages that appeared in that last minute.
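
Expressed as a Python sketch, the context rule is an all-or-nothing decision per run. The function name `lines_to_send` and the representation of lines as (level, text) pairs are assumptions made for the illustration.

```python
def lines_to_send(classified_lines):
    """Return all new lines if any of them is a warning or critical.

    classified_lines is a list of (level, text) pairs for one run.
    """
    if any(level in ("W", "C") for level, _text in classified_lines):
        return list(classified_lines)  # send everything as context
    return []  # nothing noteworthy happened in this run
```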

Logwatch on Windows

The check_mk_agent.exe for Windows automatically monitors the Windows Eventlog. Its output is fully compatible with that of the logwatch extension for Linux/UNIX. The main difference is that Windows already classifies its messages with Warning or Error. Furthermore, the agent automatically monitors all existing event logs it finds, so no configuration at all is needed on the target host. It is, however, possible to reclassify messages to a higher or lower level via the configuration variable logwatch_patterns. Messages classified as informational by Windows cannot be reclassified since they are not sent by the agent. Please refer to the article about the Windows agent for details on logwatch_patterns.

The Windows agent now also supports the monitoring of custom text files, just like the Linux/UNIX agent. For details please refer to the article Windows logfiles monitoring.

The logwatch web page

Whenever check_mk detects new log messages, it stores them on the Nagios host in a directory that defaults to /var/lib/check_mk/logwatch. Each host gets a subdirectory, and the messages of each logfile are stored in one file.

The Nagios service that reflects a logfile is in warning or critical state if that file exists and contains at least one warning or critical message, respectively.

The /check_mk/logwatch.py web page allows you to browse the messages in that file and to acknowledge them if you consider the problem to be solved. Acknowledgement means deletion of the file. Shortly afterwards the service of the logfile enters the OK state in Nagios.

The default Nagios templates of Check_MK automatically create notes_url entries for all logwatch based services to that page.

Limiting the size of unacknowledged messages

In some situations the number of error messages can get quite large in a short time. To keep the web pages usable, the logwatch check stops storing new error messages on the monitoring server once the stored file for a logfile reaches a maximum size. The default maximum is 500000 bytes. This can be overridden in main.mk by setting logwatch_max_filesize to another number:

main.mk

# Limit maximum size of stored message per file to 10 KB
logwatch_max_filesize = 10000
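
The effect of the limit can be pictured with a few lines of Python. This is a simplified sketch, not the code of the logwatch check: the function `append_if_room` is hypothetical, and it assumes that new messages are dropped as a whole once the stored data has reached the limit.

```python
LOGWATCH_MAX_FILESIZE = 500000  # default limit in bytes, as stated above


def append_if_room(stored, new_messages, max_filesize=LOGWATCH_MAX_FILESIZE):
    """Return the stored bytes after trying to append new_messages."""
    if len(stored) >= max_filesize:
        return stored  # limit reached: new messages are not stored
    return stored + new_messages
```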


Tuesday, April 12, 2016
