Notes about open source software, computers, other stuff.

Nagios event handlers for services on remote machines

Part of my work consists of managing the servers on which we do our data analysis. At the moment we’ve got two servers and one virtual machine running. The VM is used as a management server; it runs things like Nagios, Cacti, Subversion, etc.

Today I implemented Nagios event handlers in this setup. The idea behind an event handler is the following: If e.g. a service goes down, Nagios should try to solve this problem itself before notifying the administrator (me). It should, in this case, simply try to restart the service.

The Nagios documentation [1] describes how to do this for a service that runs on the same machine as the Nagios service. In my case, however, the services are running on the two real servers. To me it seemed logical to use NRPE to execute the necessary commands on the remote hosts (since NRPE was already running on those machines anyway).
In order to adapt the scheme from the Nagios docs to work on remote servers as well, three things need to be done:

  • The command that is executed by the event handler script should be changed to use NRPE
  • On the remote machine the nagios user (under which the NRPE service is running) should be given some sudo rights so that it is actually allowed to start a service.
  • The NRPE configuration on the remote machine should of course be changed to include the new command(s) for starting services.

So here we go! First, the Nagios configuration on the management host. In the service definition file I added one line for the event handler to each service. The definition of one service now looks like this (the last line was added):

define service {
       use                      generic-service
       hostgroup_name           sge-exec-servers
       service_description      SGE execd
       check_command            check_nrpe_1arg!check_sge_execd
       notification_interval    0 ; set > 0 if you want to be renotified
       event_handler            restart-service!sge-execd
}

Next, the restart-service command must be defined. I did that in a file that I called /etc/nagios3/conf.d/event-handlers.cfg:

define command {
       command_name     restart-service
       command_line     /etc/nagios3/conf.d/event_handler_script.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$ $ARG1$ $SERVICEDESC$
}

The variable $ARG1$ here is the name of the service that needs to be restarted. In this example it is sge-execd from the event_handler line in the service definition. The $HOSTADDRESS$ macro will be used in the event handler script to send the right host name to NRPE.
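
As a rough illustration of that mapping (a sketch, not part of the original setup): the macros in command_line reach the script as the positional parameters $1 to $6, in the order they are written. The helper name describe_args and the address 192.0.2.10 are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print how the handler script sees its arguments.
describe_args() {
    printf 'state=%s type=%s attempt=%s host=%s nrpe_command=restart-%s description=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Same argument order as in the restart-service command_line above.
describe_args CRITICAL SOFT 3 192.0.2.10 sge-execd "SGE execd"
```

Note that a description containing a space, like "SGE execd", only arrives as a single $6 if it is quoted on the command line. Unquoted, it is split into two words and $6 would be just "SGE". Since $6 is only used in log messages here, that is cosmetic.
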
The event_handler_script.sh referenced here is almost identical to the one in the Nagios documentation. As mentioned in the plan above, I changed it slightly so that it uses NRPE.

#!/bin/sh
#
# Event handler script for restarting a service on a remote machine via NRPE
# Taken from the Nagios documentation and
# http://www.techadre.com/sites/techadre.com/files/event_handler_script_0.txt
# Adapted by L.C. Karssen
# Time-stamp: <2010-09-14 15:24:33 (root)>
#
# Arguments: $1 service state, $2 state type, $3 check attempt,
#            $4 host address, $5 service name, $6 service description
#
# Note: This script will only restart the service if the check has been
#       retried 3 times (in a "soft" state) or if the service somehow
#       manages to fall into a "hard" error state.
#
#
 
date=`date`
 
# What state is the monitored service in?
case "$1" in
OK)
        # The service just came back up, so don't do anything...
        ;;
WARNING)
        # We don't really care about warning states, since the service is probably still running...
        ;;
UNKNOWN)
        # We don't know what might be causing an unknown error, so don't do anything...
        ;;
CRITICAL)
        # Aha!  The service appears to have a problem - perhaps we should restart it...
 
        # Is this a "soft" or a "hard" state?
        case "$2" in
 
        # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
        # check before it turns into a "hard" state and contacts get notified...
        SOFT)
                # What check attempt are we on?  We don't want to restart the service on the first
                # check, because it may just be a fluke!
                case "$3" in
 
                # Wait until the check has been tried 3 times before restarting the service.
                # If the check fails on the 4th time (after we restart the service), the state
                # type will turn to "hard" and contacts will be notified of the problem.
                # Hopefully this will restart the service successfully, so the 4th check will
                # result in a "soft" recovery.  If that happens no one gets notified because we
                # fixed the problem!
                3)
                        printf 'Restarting service %s (3rd soft critical state)...\n' "$6"
                        # Call NRPE to restart the service on the remote machine
                        /usr/lib/nagios/plugins/check_nrpe -H "$4" -c "restart-$5"
                        echo "$date - restart $6 - SOFT"  >> /tmp/eventhandlers
                        ;;
                esac
                ;;
 
        # The service somehow managed to turn into a hard error without getting fixed.
        # It should have been restarted by the code above, but for some reason it didn't.
        # Let's give it one last try, shall we?
        # Note: Contacts have already been notified of a problem with the service at this
        # point (unless you disabled notifications for this service)
        HARD)
                case "$3" in
 
                4)
                        printf 'Restarting %s service...\n' "$6"
                        # Call NRPE to restart the service on the remote machine
                        echo "$date - restart $6 - HARD"  >> /tmp/eventhandlers
                        /usr/lib/nagios/plugins/check_nrpe -H "$4" -c "restart-$5"
                        ;;
                esac
                ;;
        esac
        ;;
esac
exit 0
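
Stripped of its comments, the script restarts the service in exactly two situations: on the third soft CRITICAL check and on the fourth CRITICAL check once the state has turned hard. A compact sketch of that decision, handy for trying out the logic without touching any service (should_restart is a name invented here, not something Nagios provides):

```shell
#!/bin/sh
# Hypothetical helper mirroring the nested case statements of the handler:
# succeeds (exit status 0) only when the handler would call NRPE.
# Usage: should_restart STATE STATETYPE ATTEMPT
should_restart() {
    case "$1:$2:$3" in
        CRITICAL:SOFT:3|CRITICAL:HARD:4) return 0 ;;
        *) return 1 ;;
    esac
}

# Walk through a few state combinations and show the decision for each.
for args in "CRITICAL SOFT 2" "CRITICAL SOFT 3" "CRITICAL HARD 4" "WARNING HARD 4"; do
    # Word splitting of $args into three arguments is intended here.
    if should_restart $args; then
        echo "$args -> restart"
    else
        echo "$args -> no action"
    fi
done
```

The attempt numbers 3 and 4 are taken from the script above; if your services use a different maximum number of check attempts, adjust them accordingly.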

Now Nagios can be restarted and should continue its work as usual. Time to make the changes on the remote hosts.

First, we’ll grant the necessary sudo rights to the nagios user. Run visudo and add these lines:

## Allow NRPE to restart services
User_Alias NAGIOS = nagios,nagcmd
Cmnd_Alias NAGIOSCOMMANDS = /usr/sbin/service
Defaults:NAGIOS !requiretty
NAGIOS    ALL=(ALL)    NOPASSWD: NAGIOSCOMMANDS

And finally add the required lines in the NRPE config file (/etc/nagios/nrpe.cfg):

command[restart-sge-execd]=/usr/bin/sudo /usr/sbin/service gridengine-exec start

Restart the NRPE daemon and it should all work. Test it by manually stopping the service.
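
If you'd rather not wait for Nagios to reach the third soft state, the handler can also be exercised by hand with the same six arguments Nagios would pass. A minimal sketch, assuming the script path from the command definition above; the function name simulate_handler and the address 192.0.2.10 are placeholders, so substitute one of your own hosts. Run it as the nagios user on the management host.

```shell
#!/bin/sh
# Hypothetical helper: call the event handler the way Nagios would on the
# third soft CRITICAL check. Usage: simulate_handler SCRIPT HOST SERVICE
simulate_handler() {
    handler=$1 host=$2 service=$3
    if [ ! -x "$handler" ]; then
        echo "handler not found or not executable: $handler" >&2
        return 1
    fi
    # The service name doubles as the description ($6) in this dry run.
    "$handler" CRITICAL SOFT 3 "$host" "$service" "$service"
}

# 192.0.2.10 is a placeholder address.
if simulate_handler /etc/nagios3/conf.d/event_handler_script.sh 192.0.2.10 sge-execd; then
    # The handler appends one line per restart attempt to this file.
    tail -n 1 /tmp/eventhandlers
fi
```

If the call went through, the last line of /tmp/eventhandlers should show the current date followed by "restart sge-execd - SOFT".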

[1] Nagios documentation on Event Handlers
[2] Two blog posts that describe a similar setup. I used these as a starting point for my own setup.

11 Comments

  1. Hector

    Nice guide, but I have a doubt: isn't creating a user and giving it sudo rights without a password a huge security hole?

    Cheers.

    • Andrew

      Only the commands the user actually needs should be allowed without a password (although I would be far more prescriptive than /usr/sbin/service).

  2. Jason Robinson

    Awesome, thank you very much, got a mysql restart implemented with this 🙂

  3. Daniel

    Thanks. This will give me uninterrupted sleep. 🙂

  4. John

    Thanks for the post, very useful.

  5. Anthony

    Thank you for these explanations.
    It works like a charm !

  6. Hasan

    check_command check_nrpe_1arg!check_sge_execd
    I'm confused by the above argument, particularly check_nrpe_1…: what does that mean precisely?

    • John

      check_nrpe and check_nrpe_1arg are normally defined in your /etc/nagios/objects/commands.cfg file. You should have a couple of definitions in there similar to these:


      define command {
      command_name check_nrpe
      command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
      }

      # this command runs a program $ARG1$ with no arguments
      define command {
      command_name check_nrpe_1arg
      command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
      }

      So with these definitions, check_nrpe accepts a command and additional arguments. But check_nrpe_1arg will only run a command. No additional arguments.

  7. Ashish Karpe

    I want to write an event handler for free space on /mnt. On this alert it should delete unnecessary logs:
    find /mnt/log/frengo/openx_custom_log/ -type f -mmin +1000 -name "*" | perl -nle 'unlink;'
    find /mnt/log/nginx/ -type f -mmin +500 -name "*" | perl -nle 'unlink;'

    Will the procedure be the same?

    • Ashish Karpe

      And not on all servers, only on the web servers.
      PS: I have created separate host groups for web servers and DB servers, so it should not run on the DB servers!

    • LCK

      Yes, the procedure should be the same.

© 2024 Lennart's weblog
