Nagios
Introduction
Nagios is an open source monitoring tool. Its standard (Core) version is free to download and use with no real limitations; its premium (XI) version offers additional features, most notably a GUI with which to configure it. Configuring Nagios can be challenging at first, requiring edits to multiple config files to get new monitoring working, but once you've understood the logic and the pattern it becomes quite flexible.
Nagios is centred around device polling (it can receive SNMP traps, but that's a more advanced feature) and the presentation of state data. The first thing to appreciate, though, is that Nagios doesn't actually do any monitoring itself; at its core it's a task scheduling and state management engine. It needs third party plugins, which do the actual monitoring and report the state of the host you're monitoring back to it. There are plugins provided out-of-the-box, which will probably achieve most (if not all) of what you want.
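To make the plugin contract concrete, here is a minimal sketch of what a plugin hands back to Nagios: a one-line status message plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Python is used purely for illustration; the function name, the `DEMO` label and the sample reading are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_value(value, warn, crit):
    """Return (exit_code, status_line) for a metric where higher is worse."""
    if value > crit:
        state = CRITICAL
    elif value > warn:
        state = WARNING
    else:
        state = OK
    return state, "DEMO %s - value is %s" % (STATE_NAMES[state], value)


# Hypothetical reading; a real plugin would poll a device here, print the
# status line, and exit with the code so that Nagios can record the state.
code, line = check_value(41, warn=45, crit=55)
print(line)
```

Nagios only looks at the exit code and the printed line, which is why a plugin can be written in any language.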
Terminology
- host - A host is any network device that you want to monitor, be it a server, router, switch or SAN; anything that has an IP address.
- hostgroup - A collection of similar devices to which you want to apply similar monitoring; a host can be in more than one hostgroup.
- plugin - A plugin is a monitoring module, built to monitor/interface with a specific device, application, etc. It shields Nagios from the specifics of whatever it's interfacing with.
- command - A command is a command-line call of a plugin with one or more parameters, which defines how you might use a plugin to test a host.
- service - A service is something that you care about on a host and want to test (eg web server response, ping, disk space, CPU).
Useful Paths etc
Path | Description |
---|---|
/etc/nagios3/conf.d | Config files - anything in here is parsed as config; filenames are for your convenience and are irrelevant to Nagios |
/etc/nagios-plugins/config | Plugin commands |
/usr/lib/nagios/plugins | Plugin executables |
nagios3 -v /etc/nagios3/nagios.cfg | Config check - do before a restart to check a new config makes sense |
service nagios3 restart | Restart service (reloads config - will fail if config is invalid!) |
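The "check before you restart" habit can be scripted. The sketch below is one possible wrapper, assuming that `nagios3 -v` exits non-zero when the config is invalid; the function name is hypothetical and the runner is injectable so the logic can be tried without Nagios installed.

```python
import subprocess

# Commands taken from the table above.
VERIFY_CMD = ["nagios3", "-v", "/etc/nagios3/nagios.cfg"]
RESTART_CMD = ["service", "nagios3", "restart"]


def safe_restart(run=subprocess.call):
    """Restart Nagios only if the config check exits with status 0."""
    if run(VERIFY_CMD) != 0:
        return False  # config invalid: leave the running service alone
    return run(RESTART_CMD) == 0
```

Called with no arguments it runs the real commands (as root); passing a fake `run` lets you test the decision logic on its own.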
Create SNMP Checks
Everything here creates various checks for my QNAP NAS, which I've used as an example.
Define OIDs to Poll
Before you start you need to know which SNMP OIDs you want to poll, and what their values should be. For common devices and metrics you can often get by with a Google search or two, but it doesn't take much for you to need to get a bit more involved.
When it comes to investigating which OIDs you can poll for a specific device, your friend is GetIf.
Having downloaded the MIB and done some probing with GetIf, I've decided I need to monitor the following OIDs...
OID | Description | Example Return Data |
---|---|---|
.1.3.6.1.4.1.24681.1.2.6.0 | System Temperature | 41 C/105 F |
.1.3.6.1.4.1.24681.1.2.17.1.6.1 | System Volume 1 Status | Ready |
.1.3.6.1.4.1.24681.1.2.17.1.5.1 | System Volume 1 Space | 1.74 TB |
.1.3.6.1.4.1.24681.1.2.11.1.4.1 | Physical Disk 1 Status | ready |
.1.3.6.1.4.1.24681.1.2.11.1.7.1 | Physical Disk 1 SMART Status | GOOD |
.1.3.6.1.4.1.24681.1.2.11.1.3.1 | Physical Disk 1 Temperature | 35 C/95 F |
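Several of the example values above are strings that mix a number with its units, so a threshold comparison can only work on the numeric part, while the pure text values need a string match instead. As a rough illustration of that split (not a description of how check_snmp is implemented internally), a hypothetical helper might pull out the first number like this:

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def first_number(snmp_value):
    """Return the first number in an SNMP string value, or None if there is none."""
    match = _NUMBER.search(snmp_value)
    return float(match.group()) if match else None


print(first_number("41 C/105 F"))  # numeric part usable for -w/-c thresholds
print(first_number("Ready"))       # no number: needs a text match instead
```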
Create Commands
Each type of check needs a command defined for it, which is where the SNMP OID to be checked is defined. Commands are not specific to a particular host, so can be run against any system for which the check would be valid. There is some flexibility: if certain checks are going to be similar (eg checks for the status of disk 1, disk 2, etc) you can add arguments to the command, whose values are supplied later on in the service definitions.
I created a new file, called /etc/nagios3/conf.d/commands_qnap.cfg, and added the following...
System Temperature
define command{
    command_name    check_qnap_sys_temp
    command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.6.0 -w 45 -c 55 -l Temp -u C
}
-H '$HOSTADDRESS$'
- This is a standard macro for all check commands; Nagios substitutes the device's IP address
-o .1.3.6.1.4.1.24681.1.2.6.0
- The SNMP OID being checked (iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemTemperature.0)
-w 45
- The warning threshold
-c 55
- The critical threshold
-l Temp
- A label for the check (appears in the check's Status Information column in Nagios display)
-u C
- The units of the metric being checked (appears in the check's Status Information column in Nagios display)
Volume Status
define command{
    command_name    check_qnap_sysvol_status
    command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$ -l "Volume Status" -r "Ready"
}
-o .1.3.6.1.4.1.24681.1.2.17.1.6.$ARG1$
- The SNMP OID being checked (iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemVolumeTable.SysVolumeEntry.SysVolumeStatus.$ARG1$). $ARG1$ is an argument macro, so that if I had more than one volume I could repeat the check for volume 1, 2 etc without creating a separate check command for each.
-r "Ready"
- The text expected back from the poll; anything else causes a critical error
Volume Space
define command{
    command_name    check_qnap_sysvol_space
    command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$ -w $ARG2$: -c $ARG3$: -l "Volume Space" -u GB
}
-o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$
- The SNMP OID being checked (iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemVolumeTable.SysVolumeEntry.SysVolumeFreeSize.$ARG1$). As above, $ARG1$ is used as a command parameter so that if I had more than one volume I could repeat the check for volume 1, 2 etc without creating a separate check command for each.
-w $ARG2$:
- The warning threshold; defining it as a command parameter allows me to alter the service threshold without altering the command definition. The trailing : makes it a "should be more than" check rather than the normal "should be less than" check.
-c $ARG3$:
- The critical threshold; defining it as a command parameter allows me to alter the service threshold without altering the command definition. The trailing : again makes it a "should be more than" check rather than the normal "should be less than" check.
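The trailing colon is part of the standard Nagios plugin threshold range syntax: a bare `45` alerts when the value is outside 0 to 45, `.5:` alerts when the value drops below 0.5, and `~:` and `@` cover the remaining cases. The sketch below is an illustrative re-implementation of those rules, not the plugin's own code; the function names are hypothetical.

```python
import math


def parse_range(spec):
    """Parse a Nagios threshold range into (low, high, alert_inside)."""
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_s, high_s = spec.split(":", 1)
    else:
        low_s, high_s = "0", spec          # "45" means 0..45
    low = -math.inf if low_s == "~" else float(low_s or 0)
    high = math.inf if high_s == "" else float(high_s)
    return low, high, inside


def alerts(value, spec):
    """True if value should raise an alert for the given range."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if inside else not in_range


def state(value, warn, crit):
    """Critical takes precedence over warning."""
    if alerts(value, crit):
        return "CRITICAL"
    if alerts(value, warn):
        return "WARNING"
    return "OK"


print(state(1.74, ".5:", ".25:"))  # volume space, "should be more than"
print(state(41, "45", "55"))       # temperature, "should be less than"
```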
Disk Status
define command{
    command_name    check_qnap_disk_status
    command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.4.$ARG1$ -m /etc/nagios3/mibs/QNAP-NAS.mib -l "Disk Status" -r 0
}
-o .1.3.6.1.4.1.24681.1.2.11.1.4.$ARG1$
- The SNMP OID being checked (iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdStatus.$ARG1$). Similar to above, $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.
-m /etc/nagios3/mibs/QNAP-NAS.mib
- Path to the QNAP MIB file. The value returned is an integer, 0 for ready/good, a negative value for a fault. In order to translate the value (eg -9) to its actual meaning (eg rwError), Nagios needs access to the MIB file. You will need to download it from your NAS (from the Network Services | SNMP Settings page), and copy it to the path indicated on your Nagios server.
-r 0
- The data expected back from the poll; 0 maps to ready, anything else causes a critical error
Disk SMART Status
define command{
    command_name    check_qnap_disk_smart_status
    command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.7.$ARG1$ -l "SMART Info State" -r "GOOD"
}
-o .1.3.6.1.4.1.24681.1.2.11.1.7.$ARG1$
- The SNMP OID being checked (iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdSmartInfo.$ARG1$). Similar to above, $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.
-r "GOOD"
- The text expected back from the poll; anything else causes a critical error
Disk Temperature
define command{
    command_name    check_qnap_disk_temp
    command_line    /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -o .1.3.6.1.4.1.24681.1.2.11.1.3.$ARG1$ -w 45 -c 55 -l Temp -u C
}
-o .1.3.6.1.4.1.24681.1.2.11.1.3.$ARG1$
- The SNMP OID being checked (iso.org.dod.internet.private.enterprises.storage.storageSystem.SystemInfo.SystemHdTable.HdEntry.HdTemperature.$ARG1$). As above, $ARG1$ is used as a command parameter so that I can create separate checks for the individual disks without creating a separate check command for each.
Create Services
A service applies a generic check command to a particular set of hosts, with its own parameters. So for example, you could define two separate disk space checks, using the same command definition, but with different alerting thresholds depending on your requirements.
Services need to be defined with...
hostgroup_name
- The hostgroup defines which servers will have the service check applied to them. For a host to be checked for the service it needs to be a member of the hostgroup, see Create Hostgroup for further info.
service_description
- A name for the service check; this is what is displayed in the Service field on the Nagios display
check_command
- The command (and its parameters, if any) to perform the check.
I created a new file, called /etc/nagios3/conf.d/services_qnap.cfg, in which to add service definitions, examples of which are below...
define service{
    use                  generic-service
    hostgroup_name       qnap-nas
    service_description  Temp Sys
    check_command        check_qnap_sys_temp
}
define service{
    use                  generic-service
    hostgroup_name       qnap-nas
    service_description  Status SysVol 1
    check_command        check_qnap_sysvol_status!1
}
- Note the !1 at the end of the command in order to pass a parameter of 1 (ie 1st volume) to the command
define service{
    use                  generic-service
    hostgroup_name       qnap-nas
    service_description  Space SysVol 1
    check_command        check_qnap_sysvol_space!1!.5!.25
}
- Note the !1!.5!.25 at the end of the command in order to pass parameters for volume 1, warning threshold of .5TB, and critical threshold of .25TB to the command
define service{
    use                  generic-service
    hostgroup_name       qnap-nas
    service_description  Status Disk 1
    check_command        check_qnap_disk_status!1
}
define service{
    use                  generic-service
    hostgroup_name       qnap-nas
    service_description  SMART Disk 1
    check_command        check_qnap_disk_smart_status!1
}
define service{
    use                  generic-service
    hostgroup_name       qnap-nas
    service_description  Temp Disk 1
    check_command        check_qnap_disk_temp!1
}
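To see how the `!`-separated parameters end up in the command line, the sketch below mimics the substitution Nagios performs: the text before the first `!` selects the command, and the remaining pieces fill `$ARG1$`, `$ARG2$` and so on. This is an illustration only; the function name is hypothetical and real Nagios handles many more macros.

```python
COMMAND_LINE = (
    "/usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' "
    "-o .1.3.6.1.4.1.24681.1.2.17.1.5.$ARG1$ -w $ARG2$: -c $ARG3$:"
)


def expand(command_line, check_command, host_address):
    """Substitute $HOSTADDRESS$ and $ARGn$ macros into a command line."""
    name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for n, arg in enumerate(args, start=1):
        expanded = expanded.replace("$ARG%d$" % n, arg)
    return name, expanded


name, line = expand(COMMAND_LINE, "check_qnap_sysvol_space!1!.5!.25", "192.168.1.200")
print(name)
print(line)
```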
Create Hostgroup
The hostgroup definition allows you to group one or more hosts together, in order to have service checks run against them. So in the above I created services that would apply to hosts in the qnap-nas hostgroup. I can then add my NAS server to this hostgroup in order for it to be monitored (hostgroup definitions are normally found in /etc/nagios3/conf.d/hostgroups_nagios2.cfg)...
define hostgroup {
    hostgroup_name  qnap-nas
    alias           QNAP NAS
    members         nas
}
If I wanted to monitor more than one NAS I could just add further members (comma separated, no spaces). Note that any hosts specified in a hostgroup must themselves have a host definition (normally found in /etc/nagios3/conf.d/hosts.cfg), for example...
define host{
    use        generic-host
    host_name  nas
    alias      NAS
    address    192.168.1.200
}
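The relationship between hosts, hostgroups and services can be pictured as a simple lookup: each service names a hostgroup, and every member of that hostgroup gets the check. The sketch below models that with plain dictionaries, using the names from the examples above; the function name is hypothetical.

```python
hostgroups = {"qnap-nas": ["nas"]}   # hostgroup_name -> members
hosts = {"nas": "192.168.1.200"}     # host_name -> address
services = [                         # (hostgroup_name, service_description)
    ("qnap-nas", "Temp Sys"),
    ("qnap-nas", "Status SysVol 1"),
]


def scheduled_checks(hostgroups, hosts, services):
    """List (host_name, address, service_description) for every check to run."""
    checks = []
    for group, description in services:
        for member in hostgroups.get(group, []):
            checks.append((member, hosts[member], description))
    return checks


for check in scheduled_checks(hostgroups, hosts, services):
    print(check)
```

Adding a second NAS to the `members` list would double the number of checks without touching any service definition.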
Check Tuning
It's unlikely that you really want everything checked every 5 mins, 24 hours a day. Some services might get a bit flaky in the middle of the night when there are maintenance tasks running, or don't warrant being checked so frequently.
In general it's better to make such changes to generic templates, which can then be applied to one or more service checks. You can then make changes centrally, rather than going round and updating individual services. Templates can be daisy chained so that subsequent templates override or add to config (see http://nagios.sourceforge.net/docs/3_0/objectinheritance.html for further info).
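Daisy chaining can be thought of as merging dictionaries, with the most specific definition winning. The sketch below flattens a `use` chain that way; it is a simplification (single parent only), the directive values are hypothetical, and it reflects the documented rule that `name` and `register` are not inherited.

```python
NOT_INHERITED = {"name", "register"}

objects = {
    "generic-service": {"normal_check_interval": "5", "check_period": "24x7", "register": "0"},
    "low-freq-svc-template": {"use": "generic-service", "normal_check_interval": "30", "register": "0"},
    "wibble-sys": {"use": "low-freq-svc-template", "service_description": "Wibble Sys"},
}


def resolve(name, objects):
    """Flatten a template chain: inherited values first, local values override."""
    definition = objects[name]
    merged = {}
    parent = definition.get("use")
    if parent is not None:
        inherited = resolve(parent, objects)
        merged.update({k: v for k, v in inherited.items() if k not in NOT_INHERITED})
    merged.update({k: v for k, v in definition.items() if k != "use"})
    return merged


print(resolve("wibble-sys", objects))
```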
Check Frequency
For services that don't need to be checked as often as every 5 mins, create a new service template for the check interval, and apply it to the appropriate services.
Create a new service template for the check interval in /etc/nagios3/conf.d/generic-service_nagios2.cfg. See the example below, which changes the check interval to every 30 mins. Note that the template uses the normal generic-service, and then overrides the normal_check_interval. It also sets register 0, because that directive is not inherited and every template has to be explicitly marked as not being a real service.
# Service template for low frequency checks
define service {
    name                   low-freq-svc-template
    use                    generic-service
    normal_check_interval  30
    register               0   ; template only, not a real service
}
Update the service config for any services that you want to have the new check interval. Change the use config line to use the new template name, for example...
define service{
    use                  low-freq-svc-template
    host_name            wib1.domain.com
    service_description  Wibble Sys
    check_command        check_wib_svc
}
Maintenance Windows / Hours of Service
For services that don't need to be checked 24x7, create a new time period and a new service template for that time period, and apply the template to the appropriate services.
Define a new time period in /etc/nagios3/conf.d/timeperiods_nagios2.cfg. The example below excludes 02:00 - 03:00 hrs on Sunday, and 02:30 - 03:00 hrs on all other days (to not check/notify on a particular day, simply leave that day out of the config)...
define timeperiod{
    timeperiod_name  wibblehours
    alias            Wibble Hours
    sunday           00:00-02:00,03:00-24:00
    monday           00:00-02:30,03:00-24:00
    tuesday          00:00-02:30,03:00-24:00
    wednesday        00:00-02:30,03:00-24:00
    thursday         00:00-02:30,03:00-24:00
    friday           00:00-02:30,03:00-24:00
    saturday         00:00-02:30,03:00-24:00
}
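A quick way to sanity-check a time period before deploying it is to test a few times against the ranges. The sketch below is a simplified model of the weekday ranges only (Nagios time periods support more than this); the function names are hypothetical and only two days are included for brevity.

```python
wibblehours = {
    "sunday": "00:00-02:00,03:00-24:00",
    "monday": "00:00-02:30,03:00-24:00",
}


def to_minutes(hhmm):
    """'02:30' -> 150 minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_period(period, weekday, hhmm):
    """True if the time falls inside one of the weekday's ranges."""
    if weekday not in period:
        return False  # a day left out of the config is never checked
    now = to_minutes(hhmm)
    for part in period[weekday].split(","):
        start, end = part.split("-")
        if to_minutes(start) <= now < to_minutes(end):
            return True
    return False


print(in_period(wibblehours, "sunday", "02:30"))  # inside the Sunday gap
print(in_period(wibblehours, "monday", "02:15"))  # before the Monday gap
```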
Then create a new service template for the time period in /etc/nagios3/conf.d/generic-service_nagios2.cfg. See the example below. Note that the template uses the normal generic-service, and then overrides the check_period and notification_period settings. It also sets register 0, because that directive is not inherited and every template has to be explicitly marked as not being a real service.
# Service template for Wibble system services
define service {
    name                 wibble-svc-template
    use                  generic-service
    check_period         wibblehours
    notification_period  wibblehours
    register             0   ; template only, not a real service
}
Finally, update the service config for any services that you want to have the new time period. Change the use config line to use the new template name, for example...
define service{
    use                  wibble-svc-template
    host_name            wib1.domain.com
    service_description  Wibble Sys
    check_command        check_wib_svc
}
Ubuntu Software Updates Monitor
I've spent a fair amount of time faffing around to find a method of checking my Ubuntu servers for updates. The inbuilt check_apt doesn't give me the same results as I see when logging in to a server, and other methods I've found don't seem to work at all.
Given that I only really care about my servers that run Ubuntu 10 LTS, I decided to knock up a quick script that makes use of the same mechanism that's used to generate the MotD that you see when you log in to the console. As it's a simple locally run script, you also need to have NRPE running.
- Set up the check script on all servers to be monitored
  - In /usr/lib/nagios/plugins download the script
  - Make the file executable
    chmod +x check_apt_upgrade
  - Update /etc/nagios/nrpe.cfg to include the check
    - Add command[check_apt_upgrade]=/usr/lib/nagios/plugins/check_apt_upgrade
  - Restart the NRPE server service
    service nagios-nrpe-server restart
- Update the Nagios server to poll the check
  - Add the section below to the appropriate service config file
  - Check your Nagios config is valid
    nagios3 -v /etc/nagios3/nagios.cfg
  - Restart Nagios
    service nagios3 restart
define service {
    hostgroup_name         nrpe-std
    service_description    Updates
    check_command          check_nrpe_port!check_apt_upgrade
    use                    generic-service
    notification_interval  0 ; set > 0 if you want to be renotified
}
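The check script itself isn't reproduced here. As a rough sketch of the idea only, and assuming the MotD mechanism is `/usr/lib/update-notifier/apt-check`, which reports pending updates as `updates;security`, the interesting part is turning those two counts into a Nagios state. The function name and the choice of which counts map to which state are hypothetical, not necessarily what check_apt_upgrade does.

```python
def updates_state(apt_check_output):
    """Map an 'updates;security' string to (exit_code, status_line)."""
    total, security = (int(n) for n in apt_check_output.strip().split(";"))
    if security > 0:
        return 2, "CRITICAL - %d updates (%d security)" % (total, security)
    if total > 0:
        return 1, "WARNING - %d updates" % total
    return 0, "OK - no updates pending"


print(updates_state("12;3")[1])
```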
SNMP Based (Michal Ludvig)
The check script that is called by SNMP doesn't work! I've left this here for the time being as the remote SNMP exec mechanism does work, and I expect to use it at some point. When I do, I'll remove this, and document that instead.
This check uses some scripts developed by Michal Ludvig. I've downloaded the scripts to my site, but the originals, complete with his supporting notes, can be found here - http://www.logix.cz/michal/devel/nagios. Note that I've updated the check_snmp_extend.sh script (it didn't work for me; I suspect Nagios file locations have changed since the script was originally written), but all kudos should still go to Michal.
To summarise how it works...
- Nagios uses a local script to SNMP query a remote server you want to check
- The SNMP query triggers another script to be run on the remote server which queries whether there are any updates to install
- The result is returned via SNMP to the calling script on the Nagios server, which in turn passes the data to Nagios
To set it up...
- On your Nagios server...
  - Download check_snmp_extend.sh to /usr/lib/nagios/plugins
    - EG when in folder /usr/lib/nagios/plugins do wget http://dl.sandfordit.com/scripts/check_snmp_extend.sh
  - Make the script executable
    - EG chmod +x check_snmp_extend.sh
  - Define a command for the check in /etc/nagios3/conf.d/commands.cfg (see below - Nagios command)
- On your monitored servers (do one 1st to test)...
  - Download check-apt-upgrade.pl to /usr/local/bin/
    - EG when in folder /usr/local/bin/ do wget http://dl.sandfordit.com/scripts/check-apt-upgrade.pl
  - Make the script executable
    - EG chmod +x check-apt-upgrade.pl
  - Make the server's SNMP daemon aware of it; edit /etc/snmp/snmpd.conf and add the following
    extend sw-updates /usr/local/bin/check-apt-upgrade.pl --run
  - Restart the SNMP daemon
    service snmpd restart
- Back on the Nagios server...
  - Define a service for the check in /etc/nagios3/conf.d/services_nagios2.cfg (see below - Nagios service)
  - Check your Nagios config is valid
    nagios3 -v /etc/nagios3/nagios.cfg
  - Restart Nagios
    service nagios3 restart
- Nagios command...
define command{
    command_name    check_snmp_extend
    command_line    /usr/lib/nagios/plugins/check_snmp_extend.sh $HOSTADDRESS$ $ARG1$
}
- Nagios service...
# SNMP check for Ubuntu server package updates
define service {
    hostgroup_name         ubuntu-servers
    service_description    Updates SNMP
    check_command          check_snmp_extend!sw-updates
    use                    generic-service
    notification_interval  0 ; set > 0 if you want to be renotified
}
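When debugging this kind of set-up it helps to query the extend entry by hand. With Net-SNMP, the output of an `extend` entry is exposed under NET-SNMP-EXTEND-MIB, indexed by the name given in snmpd.conf. The sketch below only builds the snmpget command; the function name is hypothetical, the `public` community and SNMP v2c are assumptions, and this is not necessarily the query that check_snmp_extend.sh itself issues.

```python
def extend_query_argv(host, extend_name, community="public"):
    """Build an snmpget command that reads the full output of an snmpd 'extend' entry."""
    oid = 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."%s"' % extend_name
    # -Oqv prints just the value, without the OID or type prefix.
    return ["snmpget", "-v", "2c", "-c", community, "-Oqv", host, oid]


print(" ".join(extend_query_argv("192.168.1.200", "sw-updates")))
```

The resulting list can be passed to `subprocess.run` on a machine with the Net-SNMP tools installed.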
NRPE
The Nagios Remote Plugin Executor (NRPE) allows Nagios checks to be completed on remote servers in a similar fashion to performing checks on the Nagios server. Whilst it's not always necessary, as many remote checks can be performed by probing remotely accessible services such as SNMP or HTTP, there are times when such checks are not suitable, for example...
- Running checks that aren't easily achievable via SNMP
- Checking services such as MySQL that should only be accessible local to the server
- Running HTTP checks to test your web servers from more than one location
  - EG local to the server to ensure the web server itself is OK, and remotely to check that access is likely to be OK for global users
The NRPE server that runs on remote monitored machines does require quite a few additional packages to be installed (see below for a non-exhaustive list). If that concerns you, you could try the alternative approach of getting data back from your remote server via SNMP, as described in the Ubuntu Software Updates Monitor example. This can make for a more lightweight solution, but will require you to write your own monitoring scripts to be called by the SNMP daemon. Swings and roundabouts.
- mysql-common
- radiusclient1
- samba-common
- smbclient
- snmp
The procedures below will get NRPE running to monitor disk space, load and MySQL service availability on a remote server.
- Install the NRPE Plugin on the main Nagios server
  apt-get install nagios-nrpe-plugin
- Install the NRPE Server on the remote/monitored server
  apt-get install nagios-nrpe-server
- On the remote/monitored server update the config /etc/nagios/nrpe.cfg for
  - Nagios communications...
    - EG server_port=5700 (change the port if your monitored server is on the internet)
    - EG allowed_hosts=192.168.1.25 (change to the address of your Nagios server)
  - Checks (some may already exist in config)...
    - Load: command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
    - Disk space: command[check_disks]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/mapper/svr-root
      - The disk path must be valid, do a df -h and update /dev/mapper/svr-root as required
    - MySQL: command[check_mysql]=/usr/lib/nagios/plugins/check_mysql -H 127.0.0.1 -u nagios -p poller
      - Assumes you have added the nagios user to MySQL, EG mysql -u root -p -e "create user nagios identified by 'poller';"
- Restart service on remote/monitored server to apply config
  service nagios-nrpe-server restart
- If necessary (if you're using the server's firewall), open the NRPE port on the remote/monitored server
  ufw allow proto tcp from 192.168.1.25 to any port 5700
- On the Nagios server create a hostgroup for the checks (see Nagios Hostgroup below)
  - Edit hostgroups_nagios2.cfg file
- On the Nagios server create a custom NRPE command (see Nagios Command below)
  - Edit commands.cfg file
- On the Nagios server create an NRPE service file (see NRPE Services below)
  - Edit services_nrpe.cfg file
- On the Nagios server, validate the config, and assuming all OK, restart service to apply
  nagios3 -v /etc/nagios3/nagios.cfg
  service nagios3 restart
- Nagios Hostgroup (hostgroups_nagios2.cfg)
define hostgroup {
    hostgroup_name  nrpe-std
    alias           NRPE Standard servers
    members         wiki
}
- Nagios Command (commands.cfg)
define command {
    command_name    check_nrpe_port
    command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -p 5700 -c $ARG1$
}
- NRPE Services (services_nrpe.cfg)
# NRPE Standard checks
define service {
    hostgroup_name         nrpe-std
    service_description    Load
    check_command          check_nrpe_port!check_load
    use                    generic-service
    notification_interval  0 ; set > 0 if you want to be renotified
}
define service {
    hostgroup_name         nrpe-std
    service_description    Disk Space
    check_command          check_nrpe_port!check_disks
    use                    generic-service
    notification_interval  0 ; set > 0 if you want to be renotified
}
define service {
    hostgroup_name         nrpe-std
    service_description    MySQL
    check_command          check_nrpe_port!check_mysql
    use                    generic-service
    notification_interval  0 ; set > 0 if you want to be renotified
}
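The name passed after `-c` on the Nagios server has to match a `command[...]` entry in nrpe.cfg on the remote server, which is an easy thing to get wrong. The sketch below parses those entries from config text so that the names can be compared; it is an illustration only and the function name is hypothetical.

```python
import re

_COMMAND = re.compile(r"^command\[(?P<name>[^\]]+)\]=(?P<line>.+)$")

SAMPLE = """\
# NRPE config excerpt
server_port=5700
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disks]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/mapper/svr-root
"""


def parse_nrpe_commands(cfg_text):
    """Return {command_name: command_line} for every command[...] entry."""
    commands = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _COMMAND.match(line)
        if match:
            commands[match.group("name")] = match.group("line")
    return commands


commands = parse_nrpe_commands(SAMPLE)
print(sorted(commands))
print("check_mysql" in commands)  # a service using this name would fail
```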
Web Site Content and Response Time Monitoring
The stock check_http is very good at basic web server checks, but once you host multiple sites, or want to check that your site is actually returning good pages, it starts to fall short. There are also plenty of plugins available for monitoring websites from the Nagios Plugin exchange. However none seemed to match the following requirements (I may have missed one that did)...
- Page content checking
  - This makes it possible to verify that the whole LAMP stack is working as expected.
- Web page response time
  - Important because even if your site(s) are delivering good content, if a page takes over 3 secs nobody is going to hang around to look at it
- Ability to monitor user/password protected pages
Therefore I took one that almost did, check_http_content, and modified it to match my requirements (which I'll upload to the exchange once I've got it working with the Nagios::Plugin Perl module), and called it check_url_content (for the time being it's available via the previous link).
Script Options
Option | Purpose | Default |
---|---|---|
-U <url> | URL to retrieve (http or https) | |
-m <text> | Text to match in the output of the URL | |
-w <secs> | Warning time threshold | 3 secs |
-c <secs> | Critical time threshold | 10 secs |
-t <secs> | Timeout in seconds to wait for the URL to load | 30 secs |
-u <user> | Username (only required if server requires authentication) | |
-p <pass> | Password (required if Username specified) | |
-r <realm> | Realm (required if Username specified), when accessing a protected site an Authentication pop-up will display 'The site says "realm"'. | |
-h <host> | Host (optional when Username specified), should be in the following format 'www.domain.com:443' | |
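The core of a content and response time check is small. The sketch below is not the check_url_content script (which is Perl); it is a minimal Python illustration of the same idea using the defaults from the table above, without the authentication options, and the function names are hypothetical.

```python
import time
import urllib.request


def evaluate(body, elapsed, match, warn=3.0, crit=10.0):
    """Return (exit_code, status_line) from page text and response time."""
    if match not in body:
        return 2, "CRITICAL - text %r not found" % match
    if elapsed > crit:
        return 2, "CRITICAL - response took %.2fs" % elapsed
    if elapsed > warn:
        return 1, "WARNING - response took %.2fs" % elapsed
    return 0, "OK - text found in %.2fs" % elapsed


def check_url(url, match, warn=3.0, crit=10.0, timeout=30.0):
    """Fetch the URL, time the request, and evaluate the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return 2, "CRITICAL - %s" % exc
    return evaluate(body, time.monotonic() - start, match, warn, crit)
```

Keeping `evaluate` separate from the fetch means the threshold logic can be tested without a network connection.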
Examples
- Basic check example
define command{
    command_name    check_http_content
    command_line    /usr/lib/nagios/plugins/check_http_content -U $ARG1$ -m $ARG2$
}
define service{
    use                  generic-service
    host_name            www.sandfordit.com
    service_description  www - vWiki
    check_command        check_http_content!http://www.sandfordit.com/vwiki!'Monitoring system'
}
- All options example
define command{
    command_name    check_url_content_opt
    command_line    /usr/lib/nagios/plugins/check_url_content -U $ARG1$ -m $ARG2$ -r $ARG3$ -u $ARG4$ -p $ARG5$ -w $ARG6$ -c $ARG7$ -t $ARG8$
}
define service{
    use                  generic-service
    host_name            www.sandfordit.com
    service_description  www - Secure
    check_command        check_url_content_opt!'http://www.sandfordit.com/secure'!'This page works'!Realm!user!password!5!10!60
}