Difference between revisions of "Nagios"

From vwiki





Revision as of 12:35, 31 August 2011

Introduction

Nagios is an open source monitoring tool. Its standard (Core) version is free to download and use with no real limitations; its premium (XI) version offers additional features, most notably a GUI with which to configure it. Configuring Nagios is a bit of a head-scratcher at first, as you seem to have to make lots of config changes in different places to get things working. But once you've got the concepts in your head, it's relatively straightforward.

Nagios is centred around device polling (it can receive SNMP traps, but that's a more advanced feature) and the presentation of state data. The first thing to appreciate, though, is that Nagios doesn't actually do any monitoring; at its core it's a task scheduling and state management engine. It needs third party plugins, which do the actual monitoring and report the state of the host you're monitoring back to it. There are plugins provided out-of-the-box, which will probably achieve most (if not all) of what you want.

This introduction is intended to explain the basic terminology, and get you going by demonstrating how to get a device or two monitored.

Terminology

  • host - A host is any network device that you want to monitor, be it a server, router, switch or SAN; anything that has an IP address.
  • hostgroup - This is a collection of similar devices that you want to apply similar monitoring to. A host can be in more than one hostgroup.
  • plugin - A plugin is a monitoring module, built to monitor/interface with a specific device, application etc. It shields Nagios from the specifics of whatever it's interfacing with.
  • command - A command is a command line call of a plugin with one or more parameters, which defines how you might use a plugin to test a host.
  • service - A service is something that you care about on a host, that you want to test (eg web server response, ping, disk space, CPU, etc).

Useful Paths etc

Path or command                        Description
/etc/nagios3/conf.d                    Config files
/etc/nagios-plugins/config             Plugin commands
/usr/lib/nagios/plugins                Plugin executables
nagios3 -v /etc/nagios3/nagios.cfg     Config check - do before a restart to check a new config makes sense
service nagios3 restart                Restart service (reloads config - will fail if config is invalid!)


define service{
        use                     generic-service         ; Inherit default values from a template
        hostgroup_name          zimbra-servers
        service_description     IMAP
        check_command           check_imap
        }

define service{
        use                     generic-service         ; Inherit default values from a template
        hostgroup_name          zimbra-servers
        service_description     SMTP
        check_command           check_smtp
        }

# check that MySQL services are up
define service {
       hostgroup_name                  mysql-servers
       service_description             MySQL
       check_command                   check_mysql
       use                             generic-service
       notification_interval           0 ; set > 0 if you want to be renotified
}


define command{
       command_name    check_http_auth
       command_line    /usr/lib/nagios/plugins/check_http -H '$HOSTADDRESS$' -I '$HOSTADDRESS$' -a '$ARG1$'
       }


define service{
       use                             generic-service         ; Name of service template to use
       host_name                       localhost
       service_description             HTTP
       check_command                   check_http_auth!user:pass  ; Enter actual user/pass
       }


define hostextinfo{
       hostgroup_name   debian-servers
       notes            Debian GNU/Linux servers
#      notes_url        http://webserver.localhost.localdomain/hostinfo.pl?host=netware1
       icon_image       base/debian.png
       icon_image_alt   Debian GNU/Linux
       vrml_image       debian.png
       statusmap_image  base/debian.gd2
       }

define hostextinfo{
       hostgroup_name   ubuntu-servers
       notes            Ubuntu servers
       icon_image       base/ubuntu.png
       icon_image_alt   Ubuntu
       vrml_image       ubuntu.png
       statusmap_image  base/ubuntu.gd2
       }


Create SNMP Checks

This section walks through creating various checks for my QNAP NAS, which I've used as an example.

Define OIDs to Poll

Before you start you need to know which SNMP OIDs you want to poll, and what their values should be. For common devices and metrics you can often get by with a Google search or two, but it doesn't take much before you need to get a bit more involved.

When it comes to investigating which OIDs you can poll for a specific device, your friend is GetIf.

Having downloaded the MIB and done some probing with GetIf, I've decided I need to monitor the following OIDs...

OID                                  Description                     Example Return Data
.1.3.6.1.4.1.24681.1.2.6.0           System Temperature              41 C/105 F
.1.3.6.1.4.1.24681.1.2.17.1.6.1      System Volume 1 Status          Ready
.1.3.6.1.4.1.24681.1.2.17.1.5.1      System Volume 1 Space           1.74 TB
.1.3.6.1.4.1.24681.1.2.11.1.7.1      Physical Disk 1 SMART Status    GOOD
.1.3.6.1.4.1.24681.1.2.11.1.3.1      Physical Disk 1 Temperature     35 C/95 F


Create Commands

Each type of check needs a command defined for it, which is where the SNMP OID to be checked is specified. Commands are not specific to a particular host, so they can be run against any system for which the check is valid. There is some flexibility: if certain checks will be similar (eg checks for the status of disk 1, disk 2 etc), you can give the command arguments whose values are supplied later, in the service definitions.

I created a new file, called /etc/nagios3/conf.d/commands_qnap.cfg and added the following...

define command{
        command_name    check_qnap_sys_temp
        command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.6.0 -w 45 -c 55 -l Temp -u C
        }
  • -H '$HOSTADDRESS$' - This is a standard macro used in all check commands; Nagios substitutes the device's IP address
  • -o .1.3.6.1.4.1.24681.1.2.6.0 - The SNMP OID being checked
    • .iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemTemperature.0
  • -w 45 - The warning threshold
  • -c 55 - The critical threshold
  • -l Temp - A label for the check (appears in the check's Status Information column in Nagios display)
  • -u C - The units of the metric being checked (appears in the check's Status Information column in Nagios display)


define command{
        command_name    check_qnap_sysvol_status
        command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$ -l "Volume Status"
        }
  • -o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$ - The SNMP OID being checked. $ARG1$ is used as a command parameter so that, if I had more than one volume, I could repeat the check for volume 1, 2 etc without creating a separate check command for each.
    • .iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemVolumeTable.SysVolumeEntry.SysVolumeStatus.$ARG1$


define command{
        command_name    check_qnap_sysvol_space
        command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$ -w $ARG2$: -c $ARG3$: -l "Volume Space" -u TB
        }
  • -o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$ - The SNMP OID being checked. As above, $ARG1$ is used as a command parameter so that, if I had more than one volume, I could repeat the check for volume 1, 2 etc without creating a separate check command for each.
    • .iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemVolumeTable.SysVolumeEntry.SysVolumeFreeSize.$ARG1$
  • -w $ARG2$: - The warning threshold. Defining it as a command parameter allows me to alter the service threshold without altering the command definition. The trailing : makes it a "should be more than" check rather than the normal "should be less than" check.
  • -c $ARG3$: - The critical threshold. Defining it as a command parameter allows me to alter the service threshold without altering the command definition. The trailing : makes it a "should be more than" check rather than the normal "should be less than" check.


define command{
        command_name    check_qnap_disk_status
        command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.7.$ARG1$ -l "SMART Info State"
        }
  • -o .1.3.6.1.4.1.24681.1.2.11.1.7.$ARG1$ - The SNMP OID being checked. Similar to above, $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.
    • .iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdSmartInfo.$ARG1$


define command{
        command_name    check_qnap_disk_temp
        command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.3.$ARG1$ -w 45 -c 55 -l Temp -u C
        }
  • -o .1.3.6.1.4.1.24681.1.2.11.1.3.$ARG1$ - The SNMP OID being checked. As above, $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.
    • .iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdTemperature.$ARG1$
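As a sketch of the argument flexibility described above, the disk temperature command could take its thresholds as parameters too, in the same way that check_qnap_sysvol_space does. The command name check_qnap_disk_temp_args is made up for illustration and is not part of my config.

```cfg
# Hypothetical variant of check_qnap_disk_temp (name is an assumption):
#   $ARG1$ = disk number, $ARG2$ = warning threshold, $ARG3$ = critical threshold
define command{
        command_name    check_qnap_disk_temp_args
        command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.3.$ARG1$ -w $ARG2$ -c $ARG3$ -l Temp -u C
        }
```

A service would then call it with something like check_qnap_disk_temp_args!1!40!50, where 40 and 50 are arbitrary example thresholds. The trade-off is that every service using the command has to supply the thresholds itself.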


Create Services

A service applies a generic check command in a specific context. For example, you could define two separate disk space checks that use the same command definition but have different alerting thresholds, depending on your requirements.

Services need to be defined with...

  • hostgroup_name - The hostgroup defines which hosts will have the service check applied to them. For a host to be checked for the service it needs to be a member of the hostgroup, see #Create Hostgroup for further info.
  • service_description - A name for the service check; this is what is displayed in the Service field on the Nagios display
  • check_command - The command (and its parameters, if any) to perform the check.

I created a new file, called /etc/nagios3/conf.d/services_qnap.cfg, in which to add service definitions, examples of which are below...

define service{
        use                     generic-service
        hostgroup_name          qnap-nas
        service_description     Temp Sys
        check_command           check_qnap_sys_temp
        }
define service{
        use                     generic-service
        hostgroup_name          qnap-nas
        service_description     Status SysVol 1
        check_command           check_qnap_sysvol_status!1
        }
  • Note the !1 at the end of the command in order to pass a parameter of 1 (ie 1st volume) to the command
define service{
       use                     generic-service 
       hostgroup_name          qnap-nas
       service_description     Space SysVol 1
       check_command           check_qnap_sysvol_space!1!.5!.25
       }
  • Note the !1!.5!.25 at the end of the command in order to pass parameters for volume 1, warning threshold of .5TB, and critical threshold of .25TB to the command
define service{
       use                     generic-service 
       hostgroup_name          qnap-nas
       service_description     Status Disk 1
       check_command           check_qnap_disk_status!1
       }
define service{
       use                     generic-service 
       hostgroup_name          qnap-nas
       service_description     Temp Disk 1
       check_command           check_qnap_disk_temp!1
       }
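Because the commands take the volume or disk number as a parameter, covering further volumes or disks only needs more service definitions, not more commands. A minimal sketch, assuming the NAS has a second disk and a second volume (the index numbers and thresholds are illustrative, not taken from my setup):

```cfg
# Assumes a second physical disk exists at index 2
define service{
        use                     generic-service
        hostgroup_name          qnap-nas
        service_description     Status Disk 2
        check_command           check_qnap_disk_status!2
        }

# Assumes a second volume exists at index 2;
# warning below 1 TB free, critical below .5 TB free
define service{
        use                     generic-service
        hostgroup_name          qnap-nas
        service_description     Space SysVol 2
        check_command           check_qnap_sysvol_space!2!1!.5
        }
```

Each service_description needs to be unique on a given host, which is why the disk or volume number is included in the name.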


Create Hostgroup

The hostgroup definition allows you to group one or more hosts together, in order to have service checks run against them. So in the above I created services that would apply to hosts in the qnap-nas hostgroup. I can then add my NAS server to this hostgroup in order for it to be monitored (hostgroup definitions are normally found in /etc/nagios3/conf.d/hostgroups_nagios2.cfg).

define hostgroup {
       hostgroup_name  qnap-nas
       alias           QNAP NAS
       members         nas
       }

If I wanted to monitor more than one NAS I could just add further members (comma separated, no spaces). Note that any hosts specified in a hostgroup must themselves have a host definition (normally found in /etc/nagios3/conf.d/hosts.cfg), for example...

define host{
       use                     generic-host  
       host_name               nas
       alias                   NAS
       address                 192.168.1.200
       }
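
To illustrate adding further members, here is a sketch with a second NAS. The host name nas2 and its address are made up for the example.

```cfg
# Hypothetical second NAS; name and address are placeholders
define host{
        use                     generic-host
        host_name               nas2
        alias                   NAS 2
        address                 192.168.1.201
        }

# Takes the place of the single-member qnap-nas hostgroup definition
define hostgroup{
        hostgroup_name          qnap-nas
        alias                   QNAP NAS
        members                 nas,nas2
        }
```

Both hosts then pick up every service defined against the qnap-nas hostgroup, with no changes needed to the commands or services.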