Difference between revisions of "Nagios"

8,846 bytes added, 14:54, 3 October 2013
m
Reverted edits by Ipodsoft (talk) to last revision by Sstrutt
(Update table formatting)
m (Reverted edits by Ipodsoft (talk) to last revision by Sstrutt)
 
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Introduction ==
== Introduction ==
Nagios is an open source monitoring tool. Its standard (Core) version is free for download and use with no real limitations, its premium (XI) version offers additional features, most notably a GUI interface with which to configure it.  Configuring Nagios is a bit of head scratcher at first, you seem to have to make lots of config changes in different places to get things working.  But once you've got the concepts in your head, its relatively straight forward.
Nagios is an open source monitoring tool. Its standard (Core) version is free to download and use with no real limitations; its premium (XI) version offers additional features, most notably a GUI with which to configure it.  Configuring Nagios can be challenging at first, requiring edits to multiple config files to get new monitoring working, but once you've got the logic and the pattern understood it becomes quite flexible.


Nagios is centred around device polling (it can receive SNMP traps, but that's a more advanced feature) and the presentation of state data.  The first thing to appreciate is that Nagios doesn't actually do any monitoring itself; at its core it's a task scheduling and state management engine.  It needs third party '''plugins''', which do the actual monitoring and report the state of the host you're monitoring back to it.  There are plugins provided out-of-the-box, which will probably achieve most (if not all) of what you want.
Nagios is centred around device polling (it can receive SNMP traps, but that's a more advanced feature) and the presentation of state data.  The first thing to appreciate is that Nagios doesn't actually do any monitoring itself; at its core it's a task scheduling and state management engine.  It needs third party '''plugins''', which do the actual monitoring and report the state of the host you're monitoring back to it.  There are plugins provided out-of-the-box, which will probably achieve most (if not all) of what you want.
This introduction is intended to explain the basic terminology, and get you going by demonstrating how to get a device or two monitored.
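To make the plugin idea a bit more concrete, here is a minimal sketch of what a plugin boils down to, written in Python. It relies only on the standard Nagios plugin convention: one line of status text on stdout, and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The function names and the sample reading are my own illustrative assumptions, not taken from any shipped plugin.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_value(value, warn, crit):
    """Return (exit_code, status_line) for a reading where higher is worse."""
    if value is None:
        code = UNKNOWN          # no reading could be obtained
    elif value > crit:
        code = CRITICAL
    elif value > warn:
        code = WARNING
    else:
        code = OK
    return code, "%s - value is %s" % (STATE_NAMES[code], value)


def main():
    # Hypothetical reading; a real plugin would poll the device here.
    code, line = check_value(40, warn=45, crit=55)
    print(line)   # Nagios displays this text as the Status Information
    return code   # Nagios derives the service state from the exit code
```

A real plugin script would finish with <code>sys.exit(main())</code> so that the exit code reaches Nagios.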


== Terminology ==
== Terminology ==
Line 18: Line 16:
! Path  !! Description
! Path  !! Description
|-
|-
| <code> /etc/nagios3/conf.d </code>  || Config files
| <code> /etc/nagios3/conf.d </code>  || Config files - any <code>.cfg</code> file in here is parsed as config; the filenames are for your convenience and are otherwise irrelevant to Nagios
|-
|-
| <code> /etc/nagios-plugins/config </code>  || Plugin commands
| <code> /etc/nagios-plugins/config </code>  || Plugin commands
Line 28: Line 26:
| <code> service nagios3 restart </code>  || Restart service (reloads config - will fail if config is invalid!)
| <code> service nagios3 restart </code>  || Restart service (reloads config - will fail if config is invalid!)
|}
|}


== Create SNMP Checks ==
== Create SNMP Checks ==
Line 34: Line 31:


=== Define OID's to Poll ===
=== Define OID's to Poll ===
Before you start you need to know what SNMP OID's you want to poll, and what they're values should be.  For common devices and metrics you can often get by with a Google search or two, but it doesn't take much for you to need to get a bit more involved.
Before you start you need to know what SNMP OID's you want to poll, and what their values should be.  For common devices and metrics you can often get by with a Google search or two, but it doesn't take much for you to need to get a bit more involved.


When it comes to investigating what OID's you can poll for a specific device your friend is [http://www.wtcs.org/snmp4tpc/getif.htm GetIf].
When it comes to investigating what OID's you can poll for a specific device, your friend is [http://www.wtcs.org/snmp4tpc/getif.htm GetIf].


Having downloaded the MIB and done some probing with GetIf, I've decided I need to monitor the following OID's...
Having downloaded the MIB and done some probing with GetIf, I've decided I need to monitor the following OID's...
Line 49: Line 46:
|-
|-
| <code> .1.3.6.1.4.1.24681.1.2.17.1.5.1 </code>  || System Volume 1 Space || <code> 1.74 TB </code>
| <code> .1.3.6.1.4.1.24681.1.2.17.1.5.1 </code>  || System Volume 1 Space || <code> 1.74 TB </code>
|-
| <code> .1.3.6.1.4.1.24681.1.2.11.1.4.1 </code>  || Physical Disk 1 Status || <code> ready </code>
|-
|-
| <code> .1.3.6.1.4.1.24681.1.2.11.1.7.1 </code>  || Physical Disk 1 SMART Status || <code> GOOD </code>
| <code> .1.3.6.1.4.1.24681.1.2.11.1.7.1 </code>  || Physical Disk 1 SMART Status || <code> GOOD </code>
Line 61: Line 60:
I created a new file, called <code>/etc/nagios3/conf.d/commands_qnap.cfg</code> and added the following...
I created a new file, called <code>/etc/nagios3/conf.d/commands_qnap.cfg</code> and added the following...


==== System Temperature ====
  define command{
  define command{
         command_name    check_qnap_sys_temp
         command_name    check_qnap_sys_temp
Line 66: Line 66:
         }
         }
* <code> -H '$HOSTADDRESS$' </code> - This is a standard macro available to all check commands; Nagios substitutes the device's IP address
* <code> -H '$HOSTADDRESS$' </code> - This is a standard macro available to all check commands; Nagios substitutes the device's IP address
* <code> -o .1.3.6.1.4.1.24681.1.2.6.0 </code> - The SNMP OID being checked ** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemTemperature.0</code>
* <code> -o .1.3.6.1.4.1.24681.1.2.6.0 </code> - The SNMP OID being checked  
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemTemperature.0</code>
* <code> -w 45 </code> - The warning threshold
* <code> -w 45 </code> - The warning threshold
* <code> -c 55 </code> - The critical threshold
* <code> -c 55 </code> - The critical threshold
Line 72: Line 73:
* <code> -u C </code> - The units of the metric being checked (appears in the check's Status Information column in Nagios display)
* <code> -u C </code> - The units of the metric being checked (appears in the check's Status Information column in Nagios display)


 
==== Volume Status ====
  define command{
  define command{
         command_name    check_qnap_sysvol_status
         command_name    check_qnap_sysvol_status
         command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$ -l "Volume Status"
         command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$ -l "Volume Status" -r "Ready"
         }
         }
* <code> -o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$ </code> - The SNMP OID being checked, $ARG1$ is used as a wildcard so that if I had more than one volume I could repeat the check for volume 1, 2 etc without creating a separate check command for each.  
* <code> -o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$ </code> - The SNMP OID being checked, $ARG1$ is used as a wildcard so that if I had more than one volume I could repeat the check for volume 1, 2 etc without creating a separate check command for each.  
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemVolumeTable.SysVolumeEntry.SysVolumeStatus.$ARG1$</code>
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemVolumeTable.SysVolumeEntry.SysVolumeStatus.$ARG1$</code>
* <code> -r "Ready" </code> - The text expected back from the poll, anything else causes a critical error
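The way <code>$HOSTADDRESS$</code> and <code>$ARG1$</code> get filled in can be sketched in Python as below. This is a simplified illustration of the substitution, not Nagios's actual code: it assumes the <code>check_command</code> value is split on <code>!</code> (command name first, then <code>$ARG1$</code>, <code>$ARG2$</code> and so on) and ignores escaping and every other macro. The function name <code>expand_check_command</code> is hypothetical.

```python
def expand_check_command(check_command, command_lines, host_address):
    """Expand e.g. 'check_qnap_sysvol_status!2' into a full command line.

    command_lines maps command_name -> command_line, as in 'define command'.
    """
    # The first field is the command name, the rest become $ARG1$, $ARG2$, ...
    name, *args = check_command.split("!")
    line = command_lines[name].replace("$HOSTADDRESS$", host_address)
    for position, arg in enumerate(args, start=1):
        line = line.replace("$ARG%d$" % position, arg)
    return line


commands = {
    "check_qnap_sysvol_status":
        "/usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' "
        "-o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$ -l \"Volume Status\" -r \"Ready\"",
}

# Volume 2 on the NAS, reusing the same command definition.
expanded = expand_check_command("check_qnap_sysvol_status!2", commands, "192.168.1.200")
```

So one command definition serves any number of volumes; only the number after the <code>!</code> in each service changes.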


 
==== Volume Space ====
  define command{
  define command{
         command_name    check_qnap_sysvol_space
         command_name    check_qnap_sysvol_space
         command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$ -w $ARG2$: -c $ARG3$: -l "Volume Space" -u TB
         command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$ -w $ARG2$: -c $ARG3$: -l "Volume Space" -u GB
         }
         }
* <code> -o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$ </code> - The SNMP OID being checked, as above $ARG1$ is used as a command parameter so that if I had more than one volume I could repeat the check for volume 1, 2 etc without creating a separate check command for each.  
* <code> -o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$ </code> - The SNMP OID being checked, as above $ARG1$ is used as a command parameter so that if I had more than one volume I could repeat the check for volume 1, 2 etc without creating a separate check command for each.  
Line 90: Line 92:
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemVolumeTable.SysVolumeEntry.SysVolumeFreeSize.$ARG1$</code>
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemVolumeTable.SysVolumeEntry.SysVolumeFreeSize.$ARG1$</code>
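The trailing colon in <code>-w $ARG2$:</code> and <code>-c $ARG3$:</code> matters, as it turns the threshold into a minimum (alert when free space drops ''below'' the value) rather than a maximum. The Python sketch below illustrates the threshold range format as described in the Nagios plugin development guidelines; it is my own illustration rather than the code <code>check_snmp</code> runs, and the function names are assumptions.

```python
def parse_range(spec):
    """Parse a Nagios threshold range into (low, high, inside).

    inside is True for '@' ranges, which alert when the value falls
    inside the range rather than outside it.
    """
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
        # '~' means negative infinity; an omitted start means 0.
        low = float("-inf") if low_text == "~" else float(low_text or 0)
        # An omitted end means positive infinity.
        high = float("inf") if high_text == "" else float(high_text)
    else:
        # A bare number N is shorthand for the range 0..N.
        low, high = 0.0, float(spec)
    return low, high, inside


def alerts(value, spec):
    """Return True if value should raise an alert for the given range."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if inside else not in_range
```

For example, <code>alerts(0.5, "1:")</code> is true (less than 1 unit free), whereas <code>alerts(50, "45")</code> is true because 50 is above the 0 to 45 range used by the temperature check.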


==== Disk Status ====
define command{
        command_name    check_qnap_disk_status
        command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.4.$ARG1$ -m /etc/nagios3/mibs/QNAP-NAS.mib -l "Disk Status" -r 0
        }
* <code> -o .1.3.6.1.4.1.24681.1.2.11.1.4.$ARG1$ </code> - The SNMP OID being checked; as above, $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdStatus.$ARG1$</code>
* <code> -m /etc/nagios3/mibs/QNAP-NAS.mib </code> - Path to the QNAP MIB file.  The value returned is an integer, 0 for ready/good, a negative value for a fault.  In order to translate the value (eg <code>-9</code>) to its actual meaning (eg <code>rwError</code>), Nagios needs access to the MIB file.  You will need to download it from your NAS (from the Network Services | SNMP Settings page), and copy it to the path indicated on your Nagios server.
* <code> -r 0 </code> - The data expected back from the poll, 0 maps to <code>ready</code>, anything else causes a critical error


==== Disk SMART Status ====
  define command{
  define command{
         command_name    check_qnap_disk_status
         command_name    check_qnap_disk_smart_status
         command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.7.$ARG1$ -l "SMART Info State"
         command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.7.$ARG1$ -l "SMART Info State" -r "GOOD"
         }
         }
* <code> -o .1.3.6.1.4.1.24681.1.2.11.1.7.$ARG1$ </code> - The SNMP OID being checked, similar to above $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.  
* <code> -o .1.3.6.1.4.1.24681.1.2.11.1.7.$ARG1$ </code> - The SNMP OID being checked, similar to above $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.  
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdSmartInfo.$ARG1$</code>
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdSmartInfo.$ARG1$</code>
* <code> -r "GOOD" </code> - The text expected back from the poll, anything else causes a critical error


==== Disk Temperature ====
  define command{
  define command{
         command_name    check_qnap_disk_temp
         command_name    check_qnap_disk_temp
Line 104: Line 118:
* <code> -o .1.3.6.1.4.1.24681.1.2.11.1.3.$ARG1$ </code> - The SNMP OID being checked, as above $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.
* <code> -o .1.3.6.1.4.1.24681.1.2.11.1.3.$ARG1$ </code> - The SNMP OID being checked, as above $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdTemperature.$ARG1$</code>
** <code>.iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdTemperature.$ARG1$</code>


=== Create Services ===
=== Create Services ===
Line 144: Line 157:
         service_description    Status Disk 1
         service_description    Status Disk 1
         check_command          check_qnap_disk_status!1
         check_command          check_qnap_disk_status!1
        }
define service{
        use                    generic-service
        hostgroup_name          qnap-nas
        service_description    SMART Disk 1
        check_command          check_qnap_disk_smart_status!1
         }
         }


Line 170: Line 190:
         alias                  NAS
         alias                  NAS
         address                192.168.1.200
         address                192.168.1.200
        }
== Check Tuning ==
It's unlikely that you really want everything checked every 5 mins, 24 hours a day.  Some services might get a bit flaky in the middle of the night when there are maintenance tasks running, or don't warrant being checked so frequently.
In general it's better to make such changes to generic templates, which can then be applied to one or more service checks.  You can then make changes centrally, rather than going round and updating individual services.  Templates can be daisy chained so that subsequent templates override or add to config (see http://nagios.sourceforge.net/docs/3_0/objectinheritance.html for further info).
=== Check Frequency ===
For services that don't need to be checked as often as every 5 mins, create a new ''service template'' for the check interval, and apply it to the appropriate ''services''.
Create a new service template for the check interval in <code>/etc/nagios3/conf.d/generic-service_nagios2.cfg</code>.  See the example below, which changes the check interval to every 30 mins.  Note that the template uses the normal <code>generic-service</code>, and then overrides the <code>normal_check_interval</code>.
# Service template for low frequency checks
define service {
        name                            low-freq-svc-template
        use                            generic-service
        normal_check_interval          30
        }
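The override behaviour of a chained template can be sketched in Python as follows. It is a simplified model of single inheritance via <code>use</code> only (no multiple parents, no additive <code>+</code> values), and the <code>generic-service</code> settings shown are assumed values for illustration, not copied from a real config.

```python
def resolve(name, templates):
    """Return the effective settings for a template, following 'use' chains."""
    template = templates[name]
    parent = template.get("use")
    # Start from the parent's effective settings (if any)...
    merged = resolve(parent, templates) if parent else {}
    # ...then let this template's own directives override them.
    # 'name' and 'use' describe the template itself and are not inherited.
    merged.update({key: value for key, value in template.items()
                   if key not in ("name", "use")})
    return merged


templates = {
    # Assumed values, for illustration only.
    "generic-service": {"name": "generic-service",
                        "normal_check_interval": "5",
                        "check_period": "24x7"},
    "low-freq-svc-template": {"name": "low-freq-svc-template",
                              "use": "generic-service",
                              "normal_check_interval": "30"},
}

effective = resolve("low-freq-svc-template", templates)
```

Here <code>effective</code> keeps the parent's <code>check_period</code> but carries the overridden 30 minute interval, which is the whole point of layering the templates.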
Update the service config for any services that you want to have the new check interval.  Change the <code>use</code> config line to use the new template name, for example...
define service{
        use                    low-freq-svc-template
        host_name              wib1.domain.com
        service_description    Wibble Sys
        check_command          check_wib_svc
        }
=== Maintenance Windows / Hours of Service ===
For services that don't need to be checked 24x7, create a new ''time period'', and a new ''service template'' for that time period, and apply them to the appropriate ''services''.
Define a new time period in <code>/etc/nagios3/conf.d/timeperiods_nagios2.cfg</code>.  The example below excludes 02:00 - 03:00 hrs Sunday, and 02:30 - 03:00 hrs all other days (to not check/notify on a particular day, simply leave that day out of the config)...
define timeperiod{
        timeperiod_name wibblehours
        alias          Wibble Hours
        sunday          00:00-02:00,03:00-24:00
        monday          00:00-02:30,03:00-24:00
        tuesday        00:00-02:30,03:00-24:00
        wednesday      00:00-02:30,03:00-24:00
        thursday        00:00-02:30,03:00-24:00
        friday          00:00-02:30,03:00-24:00
        saturday        00:00-02:30,03:00-24:00
        }
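To sanity check a set of ranges before loading them, the day entries can be tested with a few lines of Python. This is only a sketch of how I read the <code>HH:MM-HH:MM</code> comma separated format; treating each range as including its start and excluding its end is my assumption, and the function names are hypothetical.

```python
def to_minutes(hhmm):
    """Convert 'HH:MM' (including '24:00') to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_timeperiod(day_ranges, hhmm):
    """Return True if the time hhmm falls inside any of the day's ranges."""
    now = to_minutes(hhmm)
    for span in day_ranges.split(","):
        start, end = span.split("-")
        if to_minutes(start) <= now < to_minutes(end):
            return True
    return False


weekday = "00:00-02:30,03:00-24:00"
sunday = "00:00-02:00,03:00-24:00"
```

For example <code>in_timeperiod(weekday, "02:45")</code> is false (inside the maintenance window), whereas <code>in_timeperiod(weekday, "01:00")</code> is true.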
Then create a new service template for the time period in <code>/etc/nagios3/conf.d/generic-service_nagios2.cfg</code>.  See the example below.  Note that the template uses the normal <code>generic-service</code>, and then overrides the <code>check_period</code> and <code>notification_period</code> settings.
# Service template for Wibble system services
define service {
        name                            wibble-svc-template
        use                            generic-service
        check_period                    wibblehours
        notification_period            wibblehours
        }
Finally, update the service config for any services that you want to have the new time period.  Change the <code>use</code> config line to use the new template name, for example...
define service{
        use                    wibble-svc-template
        host_name              wib1.domain.com
        service_description    Wibble Sys
        check_command          check_wib_svc
         }
         }


Line 252: Line 329:


== NRPE ==
== NRPE ==
The Nagios Remote Plugin Executor allows Nagios checks to completed on remote servers in a similar fashion to performing checks on the Nagios server.  Whilst its not always necessary, as many remote checks can be performed by probing remotely accessible services such as SNMP or HTTP, there are times when such checks are not suitable, for example...
The '''Nagios Remote Plugin Executor''' allows Nagios checks to be completed on remote servers in a similar fashion to performing checks on the Nagios server.  Whilst it's not always necessary, as many remote checks can be performed by probing remotely accessible services (such as SNMP or HTTP), there are times when such checks are not suitable, for example...
* Running checks that aren't easily achievable via SNMP
* Running checks that aren't easily achievable via SNMP
* Checking services such as MySQL that should only be accessible local to the server
* Checking local services such as MySQL that aren't accessible remotely from the server
* Running HTTP checks to test your web servers from more than one location
* Running HTTP checks to test your web servers from more than one location
** EG local to the server to ensure the web-server itself is OK, and remotely to check that access is likely to be OK for global users
** EG local to the server to ensure the web-server itself is OK, and remotely to check that access is likely to be OK for global users


The NRPE server that runs on remote monitored machines does require quite a few additional packages to be installed (see below for in-exhaustive list), and if you are concerned you try the alternative approach of getting data back from your remote server via SNMP as described in this example [[#Ubuntu_Software_Updates_Monitor|Ubuntu Software Updates Monitor]].  This can make for a more lightweight solution, but will require you to write your own monitoring scripts to be called by the SNMP daemon. Swings and roundabouts.
The NRPE server that runs on remote monitored machines does require quite a few additional packages to be installed (see below for a non-exhaustive list), and if you are concerned you can try the alternative approach of getting data back from your remote server via SNMP as described in this example [[#Ubuntu_Software_Updates_Monitor|Ubuntu Software Updates Monitor]].  This can make for a more lightweight solution, but will require you to write your own monitoring scripts to be called by the SNMP daemon.
 
Additional packages required by NRPE...
* mysql-common
* mysql-common
* radiusclient1
* radiusclient1
Line 265: Line 344:
* snmp
* snmp


=== Setup ===
The procedures below will get NRPE running to monitor disk space, load and MySQL service availability on a remote server.
The procedures below will get NRPE running to monitor disk space, load and MySQL service availability on a remote server.


Line 334: Line 414:
  }
  }


== Web Site Content and Response Time Monitoring ==
The stock <code>[http://nagiosplugins.org/man/check_http check_http]</code> is very good at basic web server checks, but once you host multiple sites, or want to verify that your site is actually returning good pages, it starts to fall short.  There are also plenty of plugins available for monitoring websites from the [http://exchange.nagios.org/directory/Plugins/Websites%2C-Forms-and-Transactions Nagios Plugin exchange].  However, none seemed to match the following requirements (I may have missed one that did)...
* Page content checking
** This makes it possible to verify that the whole LAMP stack is working as expected.
* Web page response time
** Important because even if your site(s) are delivering good content, nobody is going to hang around to look at it if it takes over 3 secs to load
* Ability to monitor user/password protected pages
Therefore I took one that almost did, <code>[http://exchange.nagios.org/directory/Plugins/Websites%2C-Forms-and-Transactions/check_http_content/details check_http_content]</code>, and modified it to match my requirements (which I'll upload to the exchange once I've got it working with the <code>Nagios::Plugin</code> Perl module), and called it <code>[http://dl.sandfordit.com/scripts/check_url_content check_url_content]</code> (for the time being it's available via the previous link).
=== Script Options ===
{|class="vwikitable"
|-
! Option !! Purpose !! Default
|-
! -U <url>
| URL to retrieve (http or https)
|-
! -m <text>
| Text to match in the output of the URL
|-
! -w <secs>
| Warning time threshold || 3 secs
|-
! -c <secs>
| Critical time threshold || 10 secs
|-
! -t <secs>
| Timeout in seconds to wait for the URL to load || 30 secs
|-
! -u <user>
| Username (only required if server requires authentication)
|-
! -p <pass>
| Password (required if Username specified)
|-
! -r <realm>
| Realm (required if Username specified), when accessing a protected site an Authentication pop-up will display 'The site says "realm"'.
|-
! -h <host>
| Host (optional when Username specified), should be in the following format 'www.domain.com:443'
|}
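The logic behind the content match and the two time thresholds can be sketched in Python as below. This is not the <code>check_url_content</code> Perl script itself, just an illustration of the same idea using the defaults from the table (3, 10 and 30 secs). Treating a missing match or a failed request as CRITICAL is my assumption, the function names are hypothetical, and authentication is left out.

```python
import time
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2


def classify(body, elapsed, match, warn=3.0, crit=10.0):
    """Return (exit_code, status_line) for a fetched page."""
    if match not in body:
        # Assumed behaviour: wrong content is treated as a hard failure.
        return CRITICAL, "CRITICAL - text %r not found" % match
    if elapsed >= crit:
        return CRITICAL, "CRITICAL - page took %.1f secs" % elapsed
    if elapsed >= warn:
        return WARNING, "WARNING - page took %.1f secs" % elapsed
    return OK, "OK - text found in %.1f secs" % elapsed


def check_url(url, match, warn=3.0, crit=10.0, timeout=30.0):
    """Fetch url, time the request and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # covers URLError, HTTPError and socket timeouts
        return CRITICAL, "CRITICAL - request failed: %s" % exc
    return classify(body, time.monotonic() - start, match, warn, crit)
```

Keeping <code>classify</code> separate from the fetch means the threshold logic can be exercised without touching the network.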
=== Examples ===
* '''Basic check example'''
define command{
        command_name    check_http_content
        command_line    /usr/lib/nagios/plugins/check_http_content -U $ARG1$ -m $ARG2$
        }
define service{
        use                    generic-service
        host_name          www.sandfordit.com
        service_description    www - vWiki
        check_command          check_http_content!http://www.sandfordit.com/vwiki!'Monitoring system'
        }
* '''All options example'''
define command{
        command_name    check_url_content_opt
        command_line    /usr/lib/nagios/plugins/check_url_content -U $ARG1$ -m $ARG2$ -r $ARG3$ -u $ARG4$ -p $ARG5$ -w $ARG6$ -c $ARG7$ -t $ARG8$
        }
define service{
        use                    generic-service
        host_name              www.sandfordit.com
        service_description    www - Secure
        check_command          check_url_content_opt!'http://www.sandfordit.com/secure'!'This page works'!Realm!user!password!5!10!60
        }
[[Category:Monitoring]]
[[Category:Nagios]]
[[Category:Nagios]]
[[Category:Ubuntu]]
[[Category:Ubuntu]]
[[Category:SNMP]]
[[Category:SNMP]]
[[Category:QNAP]]
[[Category:QNAP]]
[[Category:Applications]]
