Part of my work consists of managing the servers on which we do our data analysis. At the moment we’ve got two servers and one virtual machine running. The VM is used as a management server; it runs things like Nagios, Cacti, Subversion, etc.
Today I implemented Nagios event handlers in this setup. The idea behind an event handler is the following: If e.g. a service goes down, Nagios should try to solve this problem itself before notifying the administrator (me). It should, in this case, simply try to restart the service.
The Nagios documentation [1] describes how to do this for a service that runs on the same machine as the Nagios service. In my case, however, the services are running on the two real servers. To me it seemed logical to use NRPE to execute the necessary commands on the remote hosts (since NRPE was already running on those machines anyway).
In order to adapt the scheme from the Nagios docs to work on remote servers as well, three things need to be done:
- The command that is executed by the event handler script should be changed to use NRPE.
- On the remote machine the nagios user (under which the NRPE service is running) should be given some sudo rights so that it is actually allowed to start a service.
- The NRPE configuration on the remote machine should of course be changed to include the new command(s) for starting services.
So here we go! First, the Nagios configuration on the management host. In the service definition file I added one line for the event handler to each service. The definition of one service now looks like this (the last line was added):
define service {
        use                     generic-service
        hostgroup_name          sge-exec-servers
        service_description     SGE execd
        check_command           check_nrpe_1arg!check_sge_execd
        notification_interval   0 ; set > 0 if you want to be renotified
        event_handler           restart-service!sge-execd
}
Next, the restart-service command must be defined. I did that in a file that I called /etc/nagios3/conf.d/event-handlers.cfg:
define command {
        command_name    restart-service
        command_line    /etc/nagios3/conf.d/event_handler_script.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$ $ARG1$ $SERVICEDESC$
}
The variable $ARG1$ here is the name of the service that needs to be restarted. In this example it is sge-execd from the event_handler line in the service definition. The $HOSTADDRESS$ macro will be used in the event handler script to send the right host name to NRPE.
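To make the mapping concrete, here is a minimal sketch of how the six macros from command_line arrive in the handler as positional parameters, and which NRPE call they would lead to. The helper name nrpe_restart_command and the address 192.0.2.10 are made up for illustration; the plugin path is the one my handler script uses. The sketch only prints the command and does not contact any host.

```shell
#!/bin/sh
# Sketch: show how the macros from command_line become $1..$6 in the handler.
# nrpe_restart_command is a hypothetical helper, not part of Nagios or NRPE.
nrpe_restart_command() {
    state=$1        # $SERVICESTATE$      e.g. CRITICAL
    state_type=$2   # $SERVICESTATETYPE$  SOFT or HARD
    attempt=$3      # $SERVICEATTEMPT$    number of the current check attempt
    host=$4         # $HOSTADDRESS$       address of the remote host
    service=$5      # $ARG1$              e.g. sge-execd
    desc=$6         # $SERVICEDESC$       e.g. "SGE execd" (quote the macro in
                    #                     command_line if it contains spaces)
    # Only print the command; nothing is executed here.
    echo "/usr/lib/nagios/plugins/check_nrpe -H $host -c restart-$service"
}

nrpe_restart_command CRITICAL SOFT 3 192.0.2.10 sge-execd "SGE execd"
```

The name after restart- has to match a command defined in the NRPE configuration on the remote host.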
The event_handler_script.sh referenced here is almost identical to the one in the Nagios documentation. As mentioned in the plan above, I changed it slightly so that it uses NRPE.
#!/bin/sh
#
# Event handler script for restarting the nrpe server on the local machine
# Taken from the Nagios documentation and
# http://www.techadre.com/sites/techadre.com/files/event_handler_script_0.txt
# Adapted by L.C. Karssen
# Time-stamp: <2010-09-14 15:24:33 (root)>
#
# Note: This script will only restart the nrpe server if the service is
#       retried 3 times (in a "soft" state) or if the web service somehow
#       manages to fall into a "hard" error state.
#

date=`date`

# What state is the NRPE service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;
WARNING)
    # We don't really care about warning states, since the service is probably still running...
    ;;
UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;
CRITICAL)
    # Aha! The BLAH service appears to have a problem - perhaps we should restart the server...

    # Is this a "soft" or a "hard" state?
    case "$2" in

    # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
    # check before it turns into a "hard" state and contacts get notified...
    SOFT)

        # What check attempt are we on? We don't want to restart the web server on the first
        # check, because it may just be a fluke!
        case "$3" in

        # Wait until the check has been tried 3 times before restarting the web server.
        # If the check fails on the 4th time (after we restart the web server), the state
        # type will turn to "hard" and contacts will be notified of the problem.
        # Hopefully this will restart the web server successfully, so the 4th check will
        # result in a "soft" recovery. If that happens no one gets notified because we
        # fixed the problem!
        3)
            echo -n "Restarting service $6 (3rd soft critical state)...\n"
            # Call NRPE to restart the service on the remote machine
            /usr/lib/nagios/plugins/check_nrpe -H $4 -c restart-$5
            echo "$date - restart $6 - SOFT" >> /tmp/eventhandlers
            ;;
        esac
        ;;

    # The service somehow managed to turn into a hard error without getting fixed.
    # It should have been restarted by the code above, but for some reason it didn't.
    # Let's give it one last try, shall we?
    # Note: Contacts have already been notified of a problem with the service at this
    # point (unless you disabled notifications for this service)
    HARD)
        case "$3" in
        4)
            echo -n "Restarting $6 service...\n"
            # Call the init script to restart the NRPE server
            echo "$date - restart $6 - HARD" >> /tmp/eventhandlers
            /usr/lib/nagios/plugins/check_nrpe -H $4 -c restart-$5
            ;;
        esac
        ;;
    esac
    ;;
esac
exit 0
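The nested case statements boil down to a small decision table, which can be sketched as a single function. This is only an illustration of the logic, not a replacement for the script: should_restart is a hypothetical helper that prints yes or no and never calls NRPE.

```shell
#!/bin/sh
# Sketch of the decision table in the handler script.
# should_restart is a hypothetical helper; it prints "yes" or "no".
should_restart() {
    # $1: service state, $2: state type, $3: check attempt
    case "$1/$2/$3" in
        CRITICAL/SOFT/3) echo yes ;;   # third soft failure: try a restart
        CRITICAL/HARD/4) echo yes ;;   # hard state on attempt 4: one last try
        *)               echo no  ;;   # OK, WARNING, UNKNOWN, other attempts
    esac
}

for args in "OK HARD 1" "CRITICAL SOFT 1" "CRITICAL SOFT 3" "CRITICAL HARD 4"; do
    # Intentional word splitting: each entry holds three arguments.
    set -- $args
    printf '%s %s %s -> %s\n' "$1" "$2" "$3" "$(should_restart "$1" "$2" "$3")"
done
```

The attempt numbers 3 and 4 presumably assume that max_check_attempts is 4 for the service. With a different value the HARD branch would most likely need a different number.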
Now Nagios can be restarted and should continue its work as usual. Time to make the changes on the remote hosts.
First, we’ll grant the necessary sudo rights to the nagios user. Run visudo and add these lines:
## Allow NRPE to restart services
User_Alias NAGIOS = nagios,nagcmd
Cmnd_Alias NAGIOSCOMMANDS = /usr/sbin/service
Defaults:NAGIOS !requiretty
NAGIOS ALL=(ALL) NOPASSWD: NAGIOSCOMMANDS
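Note that an alias of just /usr/sbin/service lets the nagios user run that program with any arguments, so it could start or stop any service on the machine. A tighter rule names the exact command line. The sketch below prints such a rule for one service; narrow_sudoers_rule is a hypothetical helper, and its output is meant to be reviewed and then added through visudo, not written to /etc/sudoers automatically.

```shell
#!/bin/sh
# Sketch: print a narrower sudoers rule, one exact command line per service.
# narrow_sudoers_rule is a hypothetical helper; paste its output via visudo.
narrow_sudoers_rule() {
    # $1: init script name, e.g. gridengine-exec
    printf 'nagios ALL=(root) NOPASSWD: /usr/sbin/service %s start\n' "$1"
}

narrow_sudoers_rule gridengine-exec
```

When a sudoers entry lists arguments after the command, sudo only allows the command with exactly those arguments, so this rule would permit starting gridengine-exec and nothing else.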
And finally add the required lines in the NRPE config file (/etc/nagios/nrpe.cfg):
command[restart-sge-execd]=/usr/bin/sudo /usr/sbin/service gridengine-exec start
Restart the NRPE daemon and it should all work. Test it by manually stopping the service.
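One way to confirm that the handler actually fired is to look at the log file the script appends to on the management host. The following is a minimal sketch, assuming the log format written by the script (date, then "- restart <description> - SOFT" or HARD); count_restarts is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count how often the handler logged a restart for a service.
# count_restarts is a hypothetical helper; the default log path is the one
# used in the event handler script.
count_restarts() {
    # $1: service description as logged, $2: log file (optional)
    log=${2:-/tmp/eventhandlers}
    [ -r "$log" ] || { echo 0; return 0; }
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c -F -- "- restart $1 -" "$log" || true
}

count_restarts "SGE execd"
```

After stopping the service on the remote host and waiting for the third failed check, the count should go up by one if the handler ran.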
[1] Nagios documentation on Event Handlers
[2] Two blog posts that describe a similar set up. I used these as a starting point for my own set up.
Nice guide. But I have a doubt: isn't creating a user and giving it sudo rights without a password a huge security hole?
Cheers.
Only the commands that the user actually needs to run should be allowed without a password (although I would be far more prescriptive than /usr/sbin/service).
Awesome, thank you very much, got a mysql restart implemented with this 🙂
Thanks. This will give me uninterrupted sleep. 🙂
Thanks for the post, very useful.
Thank you for these explanations.
It works like a charm !
check_command check_nrpe_1arg!check_sge_execd
Confused with the above argument, particularly check_nrpe_1… what does that precisely mean?
check_nrpe and check_nrpe_1arg are normally defined in your /etc/nagios/objects/commands.cfg file. You should have a couple of definitions in there similar to these:
define command {
command_name check_nrpe
command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}
# this command runs a program $ARG1$ with no arguments
define command {
command_name check_nrpe_1arg
command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
So with these definitions, check_nrpe accepts a command and additional arguments. But check_nrpe_1arg will only run a command. No additional arguments.
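As a rough illustration of what happens to check_nrpe_1arg!check_sge_execd: Nagios splits the check_command value on "!", uses the first part as the command_name and passes the rest as $ARG1$. The sketch below mimics that for a single argument and prints the resulting command line. expand_check_command is a hypothetical helper and 192.0.2.10 is a placeholder address.

```shell
#!/bin/sh
# Sketch: split a check_command value on "!" the way Nagios does.
# expand_check_command is a hypothetical helper; it handles one argument only.
expand_check_command() {
    # $1: check_command value, $2: host address
    name=${1%%!*}     # part before the first "!": the command_name
    arg1=${1#*!}      # part after it: becomes $ARG1$
    echo "$name -> /usr/lib64/nagios/plugins/check_nrpe -H $2 -c $arg1"
}

expand_check_command 'check_nrpe_1arg!check_sge_execd' 192.0.2.10
```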
I want to write an event handler for free space on /mnt. On this alert it should delete unnecessary logs:
find /mnt/log/frengo/openx_custom_log/ -type f -mmin +1000 -name "*" | perl -nle 'unlink;'
find /mnt/log/nginx/ -type f -mmin +500 -name "*" | perl -nle 'unlink;'
Will the procedure be the same? Also, it should not run on all servers, only on the web servers.
PS: I have created separate host groups for web servers and DB servers, so it should not run on the DB servers!
Yes, the procedure should be the same.
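For the cleanup itself, a small script on the web servers could serve as the NRPE command. Here is a minimal sketch, assuming GNU or BSD find; it uses find's own -delete action instead of the perl unlink pipeline from the question, and clean_old_logs is a hypothetical helper name. The demo runs on a scratch directory so that no real logs are touched.

```shell
#!/bin/sh
# Sketch: delete regular files older than N minutes below a directory.
# clean_old_logs is a hypothetical helper; -mmin and -delete are supported
# by GNU and BSD find but are not part of POSIX.
clean_old_logs() {
    # $1: directory, $2: minimum age in minutes
    [ -d "$1" ] || return 0
    find "$1" -type f -mmin +"$2" -delete
}

# Demo on a scratch directory rather than on real logs.
demo=$(mktemp -d)
touch -t 200001010000 "$demo/old.log"   # modification time in the year 2000
touch "$demo/new.log"
clean_old_logs "$demo" 500
ls "$demo"
rm -rf "$demo"
```

The demo should leave only new.log behind. To keep the handler away from the DB servers, the event_handler line would go only into the service definition for the web server host group.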