

Linux Watchdog Daemon - Testing



Testing the watchdog before you go live (i.e. before it is configured to start at boot time) is essential; otherwise you risk leaving the machine in an unusable loop of booting, triggering the watchdog, and rebooting.

There are several stages to consider when testing the watchdog:
  • Check the watchdog runs with no test options and successfully opens and refreshes the watchdog device (i.e. like wd_keepalive).
  • Check that each option you enable in the configuration file works as expected. Do so one at a time.
  • Check that each test/repair script works as expected when you run it from the command line.
  • Check that each test/repair script works when the watchdog runs it (probably with a different working directory & PATH).
Some of this is part of the normal installation, some of it customisation.
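
The last check in the list is the easiest one to overlook. As a rough approximation, a script can be run from the command line with an emptied environment and "/" as the working directory before it is handed to the daemon. This is only a sketch: the function name run_like_daemon and the PATH value are assumptions for illustration, and the environment the watchdog really provides may differ.

```shell
#!/bin/sh
# Run a test/repair script with a stripped environment and "/" as the
# working directory, roughly imitating a daemon started at boot time.
# Assumption: the PATH below is only a plausible guess; adjust as needed.
run_like_daemon() {
    script=$1
    shift
    (
        cd / || exit 1
        # env -i clears every inherited variable before running the script
        env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin "$script" "$@"
    )
}
```

For example, 'run_like_daemon /usr/local/sbin/my-test.sh; echo $?' (a hypothetical script path) prints the exit code the script returns under these conditions. Use an absolute path for the script, since the working directory is changed first.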

Precautions

 Before you start installing and testing the watchdog you have to consider the potential consequences. These include:
  • The machine gets in to an endless reboot loop.
  • The file system(s) get corrupted by a hard reset.
  • The machine runs, but is trigger-happy and reboots when not expected.

Reboot Loop

 In the first case, if you get in to such a state you may need to boot in to safe mode, or use a "live CD" (or USB stick) to boot up and edit the machine's settings to disable the watchdog until you figure out what went wrong. To make this easier, you may want to have the grub boot loader show you the options before booting the normal system.

To do this on a typical Ubuntu 12.04 machine, edit the /etc/default/grub file so that the hidden-timeout lines are commented out and a visible timeout is set:

#GRUB_HIDDEN_TIMEOUT=0
#GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=5

       
Finally, run 'update-grub' (as root or with sudo) to apply the changes. On reboot you should now get a 5 second countdown and a choice of kernel, safe mode, or memtest.

File System Corruption


The second risk, that of an unexpected reset not unmounting the file systems cleanly and causing corruption, is quite small with modern journalling file systems (e.g. ext3/4) but should not be ignored, particularly if you need to use something fragile for one or more mounted file systems (such as FAT32 for compatibility reasons).

Hence there are a few precautions you should consider:

  • Use a test computer, and not a live or important computer, for practising your installation, configuration and test.
  • You can test most things, except for hardware drivers of course, in a virtual machine (VMware, xen, etc) with little risk, as they often allow snapshots of the file system to provide a pain-free way of rolling back changes.
  • Have an up-to-date backup if you must use something important for testing. You do have a backup?
  • If fragile file systems are in use, can you unmount them first for testing? If only certain tests need those file systems, then try to run as much of the testing as possible without them mounted and test those last.
  • Try to use the sync command just before you run any test that might provoke an event.
  • Initially try testing at times of low disk activity; less I/O means less risk of significant trouble.
  • Make sure no other users are logged on and working on the test computer; they will not appreciate such a rude interruption!
  • Consider editing /etc/default/rcS to enable automatic fixing of file system problems at boot time.
However, a spurious reboot (kernel panic, hardware fault, power outage, etc) is always a possibility, and as far as possible you should configure the operating system and software in such a way that file systems are checked & repaired automatically, and that data & processes have integrity tests and locks to allow a clean re-start/roll-back from any critical phases of operations.

And test them! You should be able to reboot at any time and recover a functional system, but only with testing will you find out if this really is the case.
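
The /etc/default/rcS item in the list above can be checked rather than assumed. The sketch below assumes the relevant setting is FSCKFIX=yes, as found on Debian-style sysvinit systems such as Ubuntu 12.04; check the comments in your own file, as other distributions may use a different mechanism. The function name fsckfix_enabled is illustrative.

```shell
#!/bin/sh
# Succeed (exit 0) if the given rcS-style file enables automatic fsck repair.
# Assumption: the setting is FSCKFIX=yes, as on Debian/Ubuntu sysvinit.
# A commented-out line such as "#FSCKFIX=no" does not count as enabled.
fsckfix_enabled() {
    grep -Eq '^[[:space:]]*FSCKFIX="?yes"?[[:space:]]*(#.*)?$' "${1:-/etc/default/rcS}"
}
```

A call such as 'fsckfix_enabled || echo "automatic repair is not enabled"' gives a quick reminder before any reset testing starts.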

Trigger Happy

It can be difficult to configure some of the watchdog's tests in such a manner that they will rescue a hung computer, but are not triggered by normal activity. In particular, the load averages and the memory limits need quite a lot of insight into the machine's operational behaviour. Of course, such behaviour can also change without warning if the users alter what they do, when they are all logged in, etc.

The best advice here is to monitor the machine for a while before configuring the tests, and if that is not possible, to choose limits that are far from the normal use-case so only extreme loads will risk a reboot.
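
One low-effort way to do that monitoring is to log the load averages and free memory at intervals and look at the extremes afterwards. The following is a minimal sketch, assuming the usual Linux /proc/loadavg and /proc/meminfo layouts; the function name sample_load_mem and the choice of MemFree as the memory figure are illustrative, and the figure is not necessarily the same quantity the watchdog's own memory test uses.

```shell
#!/bin/sh
# Print one sample line: epoch time, 1/5/15-minute load averages, MemFree (kB).
# The two file arguments default to the real /proc entries.
sample_load_mem() {
    loadavg_file=${1:-/proc/loadavg}
    meminfo_file=${2:-/proc/meminfo}
    # /proc/loadavg starts with the three load averages; ignore the rest
    read -r l1 l5 l15 rest < "$loadavg_file"
    mem_free=$(awk '/^MemFree:/ {print $2}' "$meminfo_file")
    echo "$(date +%s) $l1 $l5 $l15 $mem_free"
}
```

Running 'while sleep 60; do sample_load_mem >> load-mem.log; done' for a few days of normal use gives a record from which limits well outside the usual range can be chosen.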


Basic Watchdog Operation

The first test you need to perform is with the "basic installation" of the watchdog daemon (no config file tests enabled, no auto-load scripts) to establish that it can open the watchdog device, and that said device is capable of resetting the PC.

Warning: Triggering the reset action is a risk to your machine's file system and your applications' data integrity! Hence you should make sure as little as possible is running (e.g. email client closed, no one else logged in, etc) and run the 'sync' command just before you execute any test. You should also check that your machine is not rebuilding or scrubbing any software RAID before doing the tests, using:

# more /proc/mdstat


Where '#' is the root command prompt (this check also works as a normal user, typically shown with '$' as the command prompt). If there are no MD devices, or they are all showing OK, then proceed.
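
If the reset tests are going to be scripted, the same check can be automated. This is a hedged sketch: the keywords and the degraded-array pattern it looks for are based on typical /proc/mdstat output and are assumptions, so read the file by eye as well the first time. The function name md_arrays_idle is illustrative.

```shell
#!/bin/sh
# Succeed (exit 0) if no md array looks busy or degraded.
# Assumptions: a busy array shows one of the words resync, recovery,
# reshape or check, and a degraded array shows "_" in its [UU] status.
md_arrays_idle() {
    mdstat=${1:-/proc/mdstat}
    [ -r "$mdstat" ] || return 0   # no md driver loaded: nothing to wait for
    ! grep -Eq 'resync|recovery|reshape|check|\[[U_]*_[U_]*\]' "$mdstat"
}
```

A test script could then begin with 'md_arrays_idle || { echo "RAID busy or degraded, not testing"; exit 1; }'.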

Checking for the Watchdog Hardware

If you have successfully loaded the watchdog hardware's driver module (or the 'softdog' emulator) then you should see the entry in /dev corresponding to this. For example:

# ls -l /dev/watch*
crw------- 1 root root 10, 130 May 13 16:27 /dev/watchdog


In this case you edit the test copy of your watchdog.conf file (say ./test.conf, in the working directory where you are testing the system) and add or modify the line to match. For example:

watchdog-device = /dev/watchdog
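
The two steps above (confirm the device node, then point the test configuration at it) can be combined in a small helper. This is a sketch only: the function name write_test_conf is illustrative, and it writes a file containing nothing but the watchdog-device line, overwriting any existing file of that name.

```shell
#!/bin/sh
# Write a minimal test configuration naming the watchdog device, but only
# if that device node exists and is a character device.
# Usage: write_test_conf [device] [config-file]
write_test_conf() {
    dev=${1:-/dev/watchdog}
    conf=${2:-./test.conf}
    if [ ! -c "$dev" ]; then
        echo "$dev is not a character device (is the module loaded?)" >&2
        return 1
    fi
    printf 'watchdog-device = %s\n' "$dev" > "$conf"
}
```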

You can check the device using the wd_identify utility, or look in syslog after starting the watchdog to see if it agrees with your expectations:

# wd_identify --config-file ./test.conf
W83627HF WDT


In this case we have an Itox EL620 motherboard and it is using the w83627hf_wdt watchdog driver module for the Winbond W83627DHG-P chip. This chip provides system monitoring (temperature, supply voltage, etc) as well as the watchdog timer. So here the test looks good.

Testing the Watchdog Hardware

Next we need to check that the hardware will run and trigger a reboot if the daemon fails. The simplest option here is to run the watchdog daemon first, and check that it is happy with the driver:

# watchdog --config-file ./test.conf


Then check the results in syslog:

# grep watchdog /var/log/syslog
May 13 17:47:03 test0 watchdog[12089]: starting daemon (6.00):
May 13 17:47:03 test0 watchdog[12089]: int=1s realtime=yes sync=no load=0,0,0
May 13 17:47:03 test0 watchdog[12089]: memory not checked
May 13 17:47:03 test0 watchdog[12089]: ping: no machine to check
May 13 17:47:03 test0 watchdog[12089]: file: no file to check
May 13 17:47:03 test0 watchdog[12089]: pidfile: no server process to check
May 13 17:47:03 test0 watchdog[12089]: interface: no interface to check
May 13 17:47:03 test0 watchdog[12089]: temperature: no sensors to check
May 13 17:47:03 test0 watchdog[12089]: no test binary files
May 13 17:47:03 test0 watchdog[12089]: no repair binary files
May 13 17:47:03 test0 watchdog[12089]: error retry time-out = 60 seconds
May 13 17:47:03 test0 watchdog[12089]: alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
May 13 17:47:03 test0 watchdog[12089]: watchdog now set to 60 seconds
May 13 17:47:03 test0 watchdog[12089]: hardware watchdog identity: W83627HF WDT


Again we see the same hardware identity as "W83627HF WDT" and no error messages, so we should have the daemon running OK. A quick check of that should confirm it:

# ps -Af | grep watch
root         7     2  0 16:27 ?        00:00:00 [watchdog/0]
root        12     2  0 16:27 ?        00:00:00 [watchdog/1]
root        16     2  0 16:27 ?        00:00:00 [watchdog/2]
root        20     2  0 16:27 ?        00:00:00 [watchdog/3]
root     12089     1  0 17:47 ?        00:00:00 watchdog --config-file ./test.conf
root     12128 11598  0 17:51 pts/0    00:00:00 grep --color=auto watch


Here we can see all processes that have 'watch' in their names, and the watchdog daemon is the 5th entry (the first four are the kernel's own per-CPU threads, shown in square brackets). It can be seen running as a daemon due to the parent ID being 1, as 'init' has taken over. The process ID, 12089 in this example, is also shown in the syslog entry above.
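
Because the kernel threads also contain 'watch' in their names, a pattern match on the ps output has to be read by eye. A hedged alternative, assuming the procps pgrep utility is installed, is to match the process name exactly. The function name daemon_pids is illustrative.

```shell
#!/bin/sh
# Print the PIDs of processes whose name is exactly "$1" (default: watchdog).
# The -x option requires an exact name match, so kernel threads such as
# "watchdog/0" are not listed. Exit status is non-zero if nothing matches.
daemon_pids() {
    pgrep -x "${1:-watchdog}"
}
```

With the daemon running, 'daemon_pids' should print a single PID that agrees with the one in the syslog entries.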

To stop the watchdog cleanly we could use 'pkill watchdog' to send SIGTERM. However, in this case we want to kill the process without closing the watchdog device driver, so instead we execute the following commands:

# touch /forcefsck
# sync
# pkill -9 watchdog
# for n in $(seq 1 60); do echo $n; sleep 1; sync; done


Then we wait. In approximately 60 seconds (the figure reported in syslog here as "watchdog now set to 60 seconds") the machine should reboot as the hardware timer expires.

The command 'touch /forcefsck' tells the machine to check its file systems on reboot even if it thinks they are OK (they won't be, but with a journalling file system they should be recovered automatically and so clean enough). The sync commands are intended to make sure the file system remains as clean as possible when the reset kicks in.

You can get a slightly fancier version of this test in the form of the test-watchdog-reset.sh script in the example scripts download.

Once your machine has rebooted and completed any file system recovery, you should check syslog or boot.log to see it went OK and no real problems were encountered.

If you don't get a reboot, then you need to check:
  • Whether the driver saw the watchdog exit badly; in syslog you should see something like: "w83627hf/thf/hg/dhg WDT: Unexpected close, not stopping watchdog!".
  • Whether the watchdog module was the correct one for your hardware, and;
  • Whether there are any BIOS or IPMI settings to enable/disable the watchdog hardware.
If the hardware works OK, then you can concentrate on configuring and testing the health monitoring options.

Testing Accounts

Testing as Root

Normally the watchdog daemon runs as root and so it has the authority to perform all tests (e.g. ping) and, if a fault is detected, it will bring down all processes and reboot the machine. During testing this can be a bit tiresome, and you can run many of the tests under a different user account from your normal one, saving the risk and wasted time of the forced reboots.

If you need to test as root, for example, to check the network ping test is working as planned, you should consider using the '--no-action' command line option so detected faults do not bring the machine down.

Even so, take considerable care when doing anything as root, because a fault in a test script, etc, could cause serious damage to the machine if run as root. When possible, start your testing as a normal user (see below).
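
As a concrete form of the '--no-action' advice above, the invocation can be wrapped so the option is never forgotten during a root test session. This is a sketch using only the two options named in this article; the function name start_watchdog_no_action is illustrative.

```shell
#!/bin/sh
# Start the watchdog so that detected faults are reported but not acted on.
# Usage: start_watchdog_no_action [config-file]
# Run as root only when the test really needs it (e.g. the ping test).
start_watchdog_no_action() {
    watchdog --no-action --config-file "${1:-./test.conf}"
}
```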

Testing as Normal User

The advantage of testing as a normal user is you can't take the machine down (assuming the watchdog hardware is not in use). However, you can and will kill off all of your own processes if the daemon attempts a shut-down, leading to a fairly brutal logging off!

So when testing as a user-privilege process you should use a separate dummy test account. For example, log in to a terminal window (e.g. open a window and use "su test") before you start the watchdog, and then you can monitor it and trigger test events (e.g. by the example wd_test_action.sh script) from your own log-in to test how it responds.

Even though it will kill off the test-user's log-in if triggered, you will still have the information in syslog and anything you can still see in the terminal window. Again, the '--no-action' command line option can be used to stop it going that far.


Foreground vs Daemon Operation

Normally when you start the watchdog daemon it reads the config file, sets up certain actions (e.g. opening sockets to 'ping' if specified) and then becomes a daemon by forking itself and re-opening the stdin|out|err paths to /dev/null so you see nothing more from it.

To deal with any child processes ("test binary" and "repair binary" actions) it re-directs their stdout & stderr to files in the log directory (/var/log/watchdog/ by default), again so you see nothing coming from them even if they output messages.

The command line option '--foreground' skips the daemonization step, so the watchdog continues to run as a normal program. In addition, it continues to send all status messages to stderr (as well as to syslog) so the operation is visible in real-time. Since it has not closed the normal outputs, any child processes' messages are interleaved with any watchdog messages.

When testing in the foreground, the natural thing to do when stopping the program is to use the Ctrl+C key stroke. Unfortunately this will not stop the watchdog module (if used), so it could lead to an unexpected reboot! If you are doing foreground testing then the better option is to send SIGTERM to the process from another terminal window.

However, you usually have some grace period after Ctrl+C, so you could run the wd_identify program, as that will open and then properly close the configured module, thus stopping any reboot. The exception is a module configured with "no way out", in which case testing is tricky and you have to keep starting wd_keepalive to prevent a reboot (just as the normal 'service watchdog start|stop' command does).
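
The SIGTERM route mentioned above might look like the following when run from the second terminal. It is a sketch that assumes the procps pkill utility; the function name stop_watchdog_cleanly is illustrative.

```shell
#!/bin/sh
# Ask a running watchdog to exit cleanly by sending it SIGTERM.
# The -x option matches the process name exactly, so unrelated processes
# with "watchdog" somewhere in their name are left alone.
stop_watchdog_cleanly() {
    pkill -TERM -x "${1:-watchdog}"
}
```

You need to run it as the same user that started the watchdog (or as root), and a non-zero exit status means no matching process was found.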

[to be continued...]


Last Updated on 2-Dec-2015 by Paul Crawford
Copyright (c) 2014-15 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.

Linux Watchdog Daemon - Overview

  1. Linux Watchdog Daemon - Overview
    1. Introduction
    2. The Watchdog Module
    3. The Watchdog Daemon
    4. Do I need a Watchdog?

Introduction

A watchdog in computer terms is something, usually hardware-based, that monitors a complex system for “normal” behaviour and if it fails, performs a system reset to hopefully recover normal operation.

You can read more on this at the Wikipedia entry on WDT.

It is intended as a last resort for maintaining a system's availability and, at the very least, to ensure that the administrator can remotely log in to diagnose and fix faults of a non-persistent nature. Obviously it won't stop a hardware fault from breaking a system, nor is it any good against a persistent software problem, but for a system that is generally well behaved (and particularly if it is located at a remote site and/or is otherwise essential for operations) it serves to improve the overall availability of the system.

If your application cannot tolerate a short outage, then a watchdog alone is not going to solve it; you need to look at other high-availability solutions for hardware (e.g. RAID for disk error protection) and software (clustering & application mirroring) that will provide an acceptable degree of overall system availability.

With the Linux operating system there are two parts to the watchdog:
  • The actual hardware timer and kernel driver module that can force a hard reset, and;
  • The user-space background daemon that refreshes the timer and provides a wider range of health monitoring and recovery options.
Both can function independently, but clearly they are designed to operate together for maximum protection.

The Watchdog Module

Normally the hardware support for a watchdog is simply a timer that is set to some reasonable time-out, and then periodically refreshed by the running software. If for any reason the software stops refreshing the hardware (and has not explicitly shut it down) then it times-out and performs a hardware reset of the computer. In this way even kernel panic type of faults can usually be recovered. Often the chip sets that provide system monitoring (temperature, supply voltages, fan speeds, etc) have a watchdog timer, though one can never be sure if the motherboard manufacturer will have used it!

In the context of the Linux operating system, the corresponding kernel device driver (module) provides a standard interface to the watchdog hardware as /dev/watchdog (checking for this is a simple test of the module being loaded). However, such a driver is not usually loaded by default so you may have to manually configure your system to load it. Typically this is done by adding the module name to /etc/modules or (better still, so it is loaded on demand) to /etc/default/watchdog by editing watchdog_module="none" to have the module name.

Linux also provides a software watchdog by means of the 'softdog' module. While this is better than nothing, it is far less effective than hardware! Basically, if the kernel fails, so does your means of recovery in this case.

The watchdog hardware + driver module provides the most basic of protection. It is started by anything that can periodically write to /dev/watchdog, and if that fails for any reason the watchdog hardware times out and the machine is rebooted by means of a hard reset.

However, a hard reset is something that is normally undesirable as it risks file system corruption, so it is much better if you can perform a clean reboot instead.
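
The periodic-write mechanism described in this section can be illustrated with a few lines of shell. This is a sketch for understanding only, not a replacement for wd_keepalive: the function name keepalive_sketch and its defaults are assumptions, and the final 'V' relies on the "magic close" convention of the kernel watchdog interface, which depends on the driver and has no effect if the module is loaded with "no way out". Opening a real watchdog device starts the timer, so try it against an ordinary file first.

```shell
#!/bin/sh
# Open a watchdog device, refresh it a fixed number of times, then close it
# in a way that asks the driver to stop the timer.
# Usage: keepalive_sketch [device] [interval-seconds] [count]
keepalive_sketch() {
    dev=${1:-/dev/watchdog}
    interval=${2:-10}
    count=${3:-6}
    (
        exec 3> "$dev" || exit 1   # opening the device starts the timer
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3         # any write refreshes the timer
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3             # "magic close": request a clean stop
        exec 3>&-                  # close the device
    )
}
```

The interval must be comfortably shorter than the hardware time-out (60 seconds in the testing example); if the loop is killed before the 'V' is written, the timer keeps running and the machine resets when it expires.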

The Watchdog Daemon

To operate the watchdog device, there is normally a background daemon that can open the device and provide the periodic refresh activity. However, a machine can also get in to a very unusable state without actually terminating the background daemon's operation; therefore the watchdog daemon for Linux can be configured to periodically run a number of basic tests to verify that the machine looks OK.

On failing such tests (possibly with a certain amount of re-try behaviour to avoid being too "trigger happy") the daemon can reboot the machine in a moderately orderly manner in order to keep a log of why it happened, and hopefully avoid file system problems, etc. While doing so, it also has the "insurance" of the hardware timer so if it fails to reboot nicely, there is a hardware reset to follow that up.

This “moderately orderly” shut down is not the normal init-based shut down approach where the proper sequence of shut-down scripts is executed, as that is very likely to fail in a number of the conditions for which watchdog action is needed (e.g. system out of memory, out of process table space, etc).

So instead it performs the “blunderbuss approach” to stopping all processes by signalling everything with SIGTERM and then after 5 seconds with the non-ignorable SIGKILL, then it tries to update wtmp (so the shut down is recorded), update the random seed (to preserve entropy), sync the CMOS clock to system time (to help ensure the system time is reasonable on reboot), and finally sync and un-mount the file systems before it attempts reset by means of the hardware timer (if that is possible).

The hardware reset approach is preferred over the kernel's reboot API as the kernel stops the watchdog hardware on a normal shut-down or reboot, and thus could hang just after that point without any means of automatic recovery (e.g. a hung RAID card or similar).

There are in fact two daemons used for the watchdog hardware support:

  • 'wd_keepalive' provides only the hardware driver open/refresh/close actions.
  • 'watchdog' provides the driver open/refresh/close actions along with various other system checks.

When the system boots, it starts wd_keepalive as early as possible to protect against serious faults during booting, then, once other services are up, switches to running the full watchdog. The normal watchdog cannot be started early because some of the tests it could perform might depend on resources that start later in the chain (e.g. network file system, other daemons to monitor, etc). Similarly, on shut-down the main watchdog is stopped early and wd_keepalive started in its place to deal gracefully with the stopping of services that might be monitored.

Do I need a Watchdog?

From the introduction it can be seen that most systems that are used "interactively", like a home PC, don't really need it. Basically if it crashes while you are using it then you typically try Ctrl+Alt+Del (maybe also Ctrl+Alt+F1 to try text-mode login) and, if that fails, then simply push the reset button (or hold down the power button for 5 sec) to recover the machine.

Where the watchdog is most useful is situations like ours where you have hardware control computers running continuously or, more commonly, servers operating at remote sites. Both are situations where you may be sleeping or on holiday when it goes wrong and/or recovery involves a tiresome trip to the site. In such cases the last resort of an automated reboot is quite valuable.

Last Updated on 26-Jan-2016 by Paul Crawford
Copyright (c) 2013-16 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.
