Tuesday, April 12, 2016

Monitoring VMWare ESX with Check_MK

Introduction

Monitoring VMWare ESX host systems and virtual machines has always been difficult, but with the introduction of ESXi and the strangulation of the command line and the /proc filesystem it is now impossible to install any (useful) software on a host system.
Some people have tried SNMP, and there is even an SNMP agent that can be activated using the command line. But SNMP is deprecated, unsupported and - worst of all - provides only a small amount of mostly useless data.
What remains is the "vSphere API". This is in fact an HTTP based protocol for querying and managing either a standalone host system or a vCenter. For use with Linux there exists a Perl API and one in Python, both of which make the underlying protocol available to scripts. Traditional Nagios plugins like check_esx3.pl or check_vmware_api.pl make use of the Perl API. They all have one major drawback: they consume immense CPU resources on your monitoring host. Some people even set up a second monitoring server just to offload the ESX checks! The problem is rooted in the Perl API not being very fast and - most important - in the fact that each single metric has to be queried with a separate call of the check plugin.
As of version 1.2.3i1 Check_MK now overcomes these problems by implementing a completely new monitoring plugin for VMWare ESX and ESXi: the Check_MK vSphere Agent. Of course it is not only much faster but also supports automatic service discovery (inventory). As usual with Check_MK one call of the plugin per check interval per host system (or per vCenter) is sufficient.
Currently the agent supports the following metrics:
Check type                          Checked item
esx_vsphere_counters.diskio         Disk throughput
esx_vsphere_counters.if             Performance of network interfaces
esx_vsphere_counters.ramdisk        Usage of internal RAM disks
esx_vsphere_counters.uptime         System uptime
esx_vsphere_datastores              Used space, growth and provisioning of data stores
esx_vsphere_hostsystem.cpu_usage    CPU usage of the host system
esx_vsphere_hostsystem.mem_usage    Memory usage of the host system
esx_vsphere_hostsystem.multipath    Multipath state
esx_vsphere_hostsystem.state        General state of the host system
esx_vsphere_objects                 General information about VMs
esx_vsphere_sensors                 All hardware sensors (fans, temperatures, etc.)
esx_vsphere_vm.cpu                  CPU usage of virtual machines
esx_vsphere_vm.heartbeat            Guest heartbeat of virtual machines
esx_vsphere_vm.mem_usage            Memory usage of virtual machines
esx_vsphere_vm.guest_tools          State of VMware guest tools
More might follow in future versions, keep in touch...

Monitoring Host Systems

1. Prerequisites

Before you can start you need to make sure that...
  • ... your Check_MK has at least version 1.2.3i1 (or a version from GIT from Apr 30, 2013 or later).
  • ... you have pysphere-0.1.7 (or later) installed somewhere in your Python path (or use OMD).
  • ... you have ESX version 5.0 or later.
  • ... you have created a VMWare user for the monitoring in vSphere - ideally with read-only permissions. Let's assume that its name is harri and its secret is EnIgMa.
Note1: Currently the agent does not support ESX in version 4.1 or earlier. Those versions lack some of the information. It's probably not a big deal to account for that in the agent. Patches are welcome ;-)
Note2: You can get PySphere from http://pysphere.googlecode.com. Please read the enclosed README file for how to install PySphere. If you are using OMD then the good news is: everything is already in place, do not download PySphere, do not install it.

Doing the configuration with WATO

Now you can add your ESX hosts (not the VMs for now) to Check_MK. If you are using WATO then please specify Check_MK Agent (Server) as Agent type, even if there is no Check_MK agent on your ESX host. Command line users simply add the host to all_hosts. Let's assume that its name is esxhost01.
Then you need to configure Check_MK to use the vSphere special agent as a datasource program instead of the normal Check_MK access. This "special agent" is a Python program that runs locally on your Check_MK server. In WATO this is done with the rule Datasource Programs / Check state of VMWare ESX via vSphere.
Besides the obvious user name and password for your VMware user you can select an alternative TCP port (rarely used), a timeout for connecting to the ESX host and - most important - the list of data sources that you want to monitor. Please note that even if the new Check_MK ESX agent might be the fastest way to monitor ESX with Nagios, the check will take some time anyway - especially if your ESX host is busy and has many VMs running on it. In that case you could decide to remove some of the information in order to speed up the monitoring.
Also important is Type of query. Here you have the following choices:
Queried host is a host system: Most common case: you directly query an ESX host system.
Queried host is the vCenter: You query a vCenter. You'll get information about all host systems and VMs in this case - at the price of a longer check execution time.
Queried host is the vCenter with Check_MK Agent installed: Same, but also the Check_MK agent on the host is being queried. Note: the vCenter most probably is running on a Windows machine. If you want to monitor that machine with Check_MK as well, then select this option.
After saving this rule you can proceed to the inventory of esxhost01.

Setup without WATO

If you do not like WATO you can also set things up on the command line. In that case create a rule in datasource_programs. The program to call is agent_vsphere and you'll find it in /usr/share/check_mk/agents/special (default path for manual installations) or share/check_mk/agents/special (OMD users).
You can call this program manually with --help, if you want:
OMD[mysite]:~$ share/check_mk/agents/special/agent_vsphere --help
Check_MK vSphere Agent

USAGE: agent_vsphere [OPTIONS] HOST
       agent_vsphere -h

ARGUMENTS:
  HOST                          Host name or IP address of vCenter or VMWare HostSystem

OPTIONS:
  -h, --help                    Show this help message and exit
  -u USER, --user USER          Username for vSphere login
  -s SECRET, --secret SECRET    Secret/Password for vSphere login
  -D, --direct                  Assume a directly queried host system (no vCenter). In
                                This we expect data about only one HostSystem to be
                                Found and do not create piggy host data for that host.
  -H, --hostname                Specify a hostname. This is neccessary if this is
                                different from HOST. It is being used in --direct
                                mode as the name of the host system when outputting
                                its power state.
  -a, --agent                   Also retrieve data from the normal Check_MK Agent.
                                This makes sense if you query a vCenter that is
                                Installed on a Windows host that you also want to
                                Monitor with Check_MK.
  -t, --timeout SECS            Set the network timeout to vSphere to SECS seconds.
                                This is also used when connecting the agent (option -a).
                                Default is 60 seconds. Note: the timeout is not only
                                applied to the connection, but also to each individual
                                subquery.
  --debug                       Debug mode: let Python exceptions come through

  -i MODULES, --modules MODULES Modules to query. This is a comma separated list of
                                hostsystem, virtualmachine and storage. Default is to
                                query all modules.
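
If you generate such calls from a script, the options listed above can be assembled like this. This is only a rough sketch: the function name build_vsphere_command and its parameters are made up for this example and are not part of Check_MK. Only the option names are taken from the help text.

```python
import shlex


def build_vsphere_command(host, user, secret, direct=False, hostname=None,
                          with_agent=False, timeout=None, modules=None,
                          program="share/check_mk/agents/special/agent_vsphere"):
    """Assemble an agent_vsphere argument list (hypothetical helper)."""
    args = [program, "-u", user, "-s", secret]
    if modules:
        # -i expects a comma separated list of modules
        args += ["-i", ",".join(modules)]
    if direct:
        # the queried host is an ESX host system, not a vCenter
        args.append("--direct")
    if hostname:
        args += ["--hostname", hostname]
    if with_agent:
        # also query the normal Check_MK agent on the same host
        args.append("--agent")
    if timeout is not None:
        args += ["--timeout", str(timeout)]
    args.append(host)
    return args


cmd = build_vsphere_command(
    "10.1.1.111", "harri", "EnIgMa",
    direct=True, hostname="esxhost01", timeout=5,
    modules=["hostsystem", "virtualmachine", "datastore", "counters"],
)
# Print a shell-safe version of the command line
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building a list instead of one long string keeps the quoting of user name and secret out of your hands until the very last step.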

Here is an example of how to call this program:
OMD[mysite]:~$ share/check_mk/agents/special/agent_vsphere  -u 'harri' -s 'EnIgMa' \
    -i hostsystem,virtualmachine,datastore,counters --direct \
    --hostname 'esxhost01' --timeout 5 10.1.1.111
<<<check_mk>>>
Version: 5.0
AgentOs: VMware ESXi
<<<esx_vsphere_datastores:sep(9)>>>
[esxabc01-lds]
accessible  True
capacity    578478407680
freeSpace   388398841856
type    VMFS
uncommitted 51973812224
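
If you want to look at this output from a script, the section structure is easy to split up. The following is a minimal sketch and not the parser that Check_MK itself uses. It assumes that sep(9) in a section header means "columns are separated by ASCII character 9 (tab)" and that sections without such a hint are split at whitespace. The function name parse_agent_output is made up for this example.

```python
import re

# Matches headers like <<<check_mk>>> or <<<esx_vsphere_datastores:sep(9)>>>
HEADER = re.compile(r"^<<<(?P<name>[^:>]+)(?::sep\((?P<sep>\d+)\))?>>>$")


def parse_agent_output(text):
    """Return a dict mapping section name -> list of split lines."""
    sections = {}
    current = None
    separator = None
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            current = sections.setdefault(match.group("name"), [])
            # Assumption: sep(N) names the ASCII code of the column separator
            sep = match.group("sep")
            separator = chr(int(sep)) if sep else None
            continue
        if current is None or not line.strip():
            continue  # skip anything before the first header and blank lines
        current.append(line.split(separator))
    return sections


sample = (
    "<<<check_mk>>>\n"
    "Version: 5.0\n"
    "AgentOs: VMware ESXi\n"
    "<<<esx_vsphere_datastores:sep(9)>>>\n"
    "[esxabc01-lds]\n"
    "accessible\tTrue\n"
    "capacity\t578478407680\n"
    "freeSpace\t388398841856\n"
)
for name, rows in parse_agent_output(sample).items():
    print(name, rows)
```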
If you've got your command line right you can add this to main.mk:
main.mk
datasource_programs.append((
  "share/check_mk/agents/special/agent_vsphere  -u 'harri' -s 'EnIgMa' "
  "-i hostsystem,virtualmachine,datastore,counters --direct "
  "--hostname '<HOST>' --timeout 5 <IP>", [ "esxhost01" ]
))
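The rule above uses the placeholders <HOST> and <IP>, which Check_MK fills in with the name and the IP address of the host being checked. The effect can be pictured like this. It is just an illustration with a made-up function name, not the code Check_MK actually runs:

```python
def expand_datasource_command(template, hostname, ipaddress):
    """Fill in the <HOST> and <IP> placeholders of a datasource command line."""
    return template.replace("<HOST>", hostname).replace("<IP>", ipaddress)


template = (
    "share/check_mk/agents/special/agent_vsphere  -u 'harri' -s 'EnIgMa' "
    "-i hostsystem,virtualmachine,datastore,counters --direct "
    "--hostname '<HOST>' --timeout 5 <IP>"
)
print(expand_datasource_command(template, "esxhost01", "10.1.1.111"))
```

Because of the placeholders the same rule can be used for several ESX hosts: just add their names to the host list of the rule.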
After that you should be able to do an inventory as usual:
OMD[mysite]:~$ cmk -I esxhost01
Check_mk version 2013.04.25
Calling external program /omd/sites/esx/share/check_mk/agents/special/agent_vsph
CPU utilization      OK - 1.5% used, 15min average: 0.7%, 0.48GHz/32.78GHz, 2 so
Disk IO SUMMARY      OK - 12.00kB/sec read, 45.00kB/sec write, IOs: 79.00/sec
Hardware Sensors     CRIT - VMware Rollup Health State: Red (Sensor is operating
HostSystem esx       OK - power state: poweredOn
Interface 0          OK - [vmnic0] (up) speed unknown, in: 8.2KBit/s(0.0%/1GBit/
Interface 1          OK - [vmnic1] (up) speed unknown, in: 0Bit/s(0.0%/1GBit/s),
Interface 2          OK - [vmnic2] (up) speed unknown, in: 0Bit/s(0.0%/1GBit/s),
Interface 3          OK - [vmnic3] (up) speed unknown, in: 0Bit/s(0.0%/1GBit/s),
Memory used          OK - 59% used - 14.21GB/23.99GB
Overall state        OK - Enity state: green, Power state: poweredOn
Uptime               OK - up since Mon Apr  8 09:10:00 2013 (22d 05:37:24)
VM LinuxI            WARN - power state: poweredOff, running on [esx.mathias-ket
VM LinuxII.foobar.de WARN - power state: poweredOff, running on [esx.mathias-ket
VM LinuxIII          WARN - power state: poweredOff, running on [esx.mathias-ket
VM LinuxIV           WARN - power state: poweredOff, running on [esx.mathias-ket
VM LinuxV            WARN - power state: poweredOff, running on [esx.mathias-ket
VM OpenSUSE_I        OK - power state: poweredOn, running on [esx.mathias-kettne
VM OpenSUSE_II       OK - power state: poweredOn, running on [esx.mathias-kettne
VM OpenSUSE_III      OK - power state: poweredOn, running on [esx.mathias-kettne
VM OpenSUSE_IV       OK - power state: poweredOn, running on [esx.mathias-kettne
VM OpenSUSE_V        OK - power state: poweredOn, running on [esx.mathias-kettne
VM WindowsXP         OK - power state: poweredOn, running on [esx.mathias-kettne
fs_zmucvm99-lds      OK - 32.9% used (177.03 of 538.8 GB), (levels at 80.0/90.0%
OK - Agent version 5.0, execution time 5.0 sec|execution_time=5.034

Monitoring Virtual Machines

So far we've just monitored the physical host systems. But if you have set up the Check_MK vSphere agent like in our example then you're just a small step away from monitoring the VMs. You just need to know the names of the VMs (as configured in vCenter, not the DNS names) and then add hosts with exactly those names to the monitoring.
Note: if no Check_MK agent is installed in the virtual machines then you need to set the agent type to No Agent. main.mk users do this by adding the tag |ping to the host.
Then you just do an inventory on the VMs and you are done. Check_MK will use the data that has come piggyback from the ESX host. But the services themselves will be attached to the corresponding VM hosts - just as you most probably want it. Here is an example from the command line:
OMD[mysite]:~$ cmk -I vm_guest01
esx_vsphere_vm.cpu 1 new checks
esx_vsphere_vm.heartbeat 1 new checks
esx_vsphere_vm.mem_usage 1 new checks
esx_vsphere_vm.name 1 new checks
OMD[mysite]:~$ cmk -v vm_guest01
Check_mk version 2013.04.25
ESX CPU              OK - demand is 0.009 Ghz, 1 virtual CPUs
ESX Heartbeat        OK - Heartbeat status is green
ESX Memory           OK - Host: 2.21GB, Guest: 0.00B, Ballooned: 0.00B, Private:
ESX Name             OK - OpenSUSE_V
OK - execution time 0.0 sec|execution_time=0.001

Piggy back translation

Sometimes the naming scheme of your virtual machines does not really match that of your hosts in Check_MK. For that case Check_MK provides a rule called Hostname translation for piggybacked hosts. In WATO simply search for piggy in the rule search box. The rule translates ESX names into other names in a flexible way. Configure this rule for the vCenter/ESX host that is queried - not for the VMs!
The translation is done in four optional steps:
  1. Case translation: Here you can simply switch the case to lower or upper in general.
  2. Convert FQHN: As a next optional step everything after the first dot is removed from the hostname. For example vm_guest01.foo.bar is converted to vm_guest01.
  3. Regular expression substitution: This is a bit like the stream editor sed: You can specify an extended regular expression - possibly with subgroups. In the replacement you can use \1 for the first subgroup, \2 for the second and so on. The following example will translate a name of the form vm_01_foobar into foobar01:
    Regular expression: vm_(.*)_(.*)
    Replacement: \2\1
  4. Explicit host name mapping: If this all is not enough you can specify an explicit table of host names and replacement names.
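
To get a feeling for how the four steps play together, here is a small sketch in Python. It is not the code of Check_MK. The function name and its parameters are made up, and it assumes that the regular expression has to match the complete name and that the steps run in the order listed above.

```python
import re


def translate_piggyback_name(name, case=None, drop_domain=False,
                             regex=None, mapping=None):
    """Apply the four optional translation steps to one VM name."""
    # 1. Case translation
    if case == "lower":
        name = name.lower()
    elif case == "upper":
        name = name.upper()
    # 2. Convert FQHN: drop everything after the first dot
    if drop_domain:
        name = name.split(".", 1)[0]
    # 3. Regular expression substitution (assumption: must match the whole name)
    if regex is not None:
        pattern, replacement = regex
        match = re.fullmatch(pattern, name)
        if match:
            name = match.expand(replacement)
    # 4. Explicit host name mapping
    if mapping:
        name = mapping.get(name, name)
    return name


print(translate_piggyback_name("vm_01_foobar",
                               regex=(r"vm_(.*)_(.*)", r"\2\1")))
print(translate_piggyback_name("VM_Guest01.Foo.Bar",
                               case="lower", drop_domain=True))
```

The first call reproduces the example from step 3, the second one combines steps 1 and 2.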

Debugging piggy back


If you wonder why some host is missing in your monitoring you can have a look at the directory where the piggybacked data is stored. In OMD this is tmp/check_mk/piggyback. In manual installations it should be next to the cache directory of Check_MK. Below piggyback there should be one directory for each VM that resides on one of your monitored ESX hosts.
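
If you have many VMs, a few lines of Python can do the comparison for you. This is only a sketch: the function name is made up, the VM names are the ones from the examples in this article, and the relative path assumes that you run it from the home directory of your OMD site.

```python
from pathlib import Path


def missing_piggyback_hosts(piggyback_dir, expected_hosts):
    """Return the expected host names that have no piggyback directory."""
    base = Path(piggyback_dir)
    if base.is_dir():
        present = {entry.name for entry in base.iterdir() if entry.is_dir()}
    else:
        present = set()  # directory does not exist: nothing has arrived yet
    return sorted(set(expected_hosts) - present)


if __name__ == "__main__":
    # Path as used in OMD, relative to the site's home directory
    missing = missing_piggyback_hosts("tmp/check_mk/piggyback",
                                      ["vm_guest01", "OpenSUSE_V"])
    print("No piggyback data for:", ", ".join(missing) or "(none)")
```

A name that shows up here either is not known to the queried ESX host or vCenter under that name, or needs a translation rule.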
