To recover automatically from guest lockups, Vultr supports the Linux watchdog service. When you install, configure, and run the watchdog service on your instance, it interacts with a special virtual device that Vultr monitors. If the watchdog stops responding, for example because the guest has locked up, the Vultr control plane automatically reboots your instance. This document focuses on the wd_keepalive service, a simplified daemon that deals exclusively with the watchdog hardware device.
These instructions use Debian 11 as an example.
To use the watchdog, you need to:
Run a Vultr cloud server instance with the appropriate watchdog driver kernel module loaded
Install the publicly maintained watchdog software suite
Configure and run the wd_keepalive daemon
Verify your instance recognizes the watchdog device:
ls -al /dev/watchdog*
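The same check can be scripted. The following is a minimal sketch, not part of the Vultr tooling: the function name has_watchdog_device is hypothetical, and it only tests that the path is a character device.

```shell
# Hypothetical helper: succeeds if the given path is a character device.
# Defaults to /dev/watchdog, the device node used in this guide.
has_watchdog_device() {
    dev="${1:-/dev/watchdog}"
    [ -c "$dev" ]
}

if has_watchdog_device; then
    echo "watchdog device found"
else
    echo "watchdog device missing; check that the kernel module is loaded"
fi
```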
Install the watchdog software from standard repositories:
apt-get install watchdog
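Edit /etc/watchdog.conf and uncomment the watchdog-device line by removing the leading # so that it reads watchdog-device = /dev/watchdog. The stock entry looks like this: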
# Uncomment this to use the watchdog device driver access "file".
#watchdog-device = /dev/watchdog
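If you prefer not to edit the file by hand, the change can be sketched with sed. The function name enable_watchdog_device is hypothetical, and the sketch assumes the line appears exactly as shown above, with a leading #. It keeps a .bak copy of the original file.

```shell
# Hypothetical helper: uncomment the watchdog-device line in a config file.
enable_watchdog_device() {
    conf="$1"
    # Keep a backup, then strip the leading '#' from the
    # watchdog-device line only; other comments are left alone.
    cp "$conf" "$conf.bak" &&
    sed 's|^#[[:space:]]*\(watchdog-device[[:space:]]*=\)|\1|' "$conf.bak" > "$conf"
}

# Example usage (as root):
#   enable_watchdog_device /etc/watchdog.conf
```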
Edit the systemd unit file /lib/systemd/system/wd_keepalive.service and make sure it contains the following [Install] section:
[Install]
WantedBy=multi-user.target
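As a rough sketch, the edit can be made repeatable so that running it twice does not duplicate the section. The function name add_install_section is hypothetical, and the sketch assumes the unit file has no [Install] section yet; it appends one only when the WantedBy line is missing.

```shell
# Hypothetical helper: append the [Install] section to a unit file
# unless a WantedBy=multi-user.target line is already present.
add_install_section() {
    unit="$1"
    if grep -q '^WantedBy=multi-user.target$' "$unit"; then
        return 0
    fi
    printf '\n[Install]\nWantedBy=multi-user.target\n' >> "$unit"
}

# Example usage (as root):
#   add_install_section /lib/systemd/system/wd_keepalive.service
```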
Reload the systemd configuration, start the watchdog service, and set it to run at system boot:
systemctl daemon-reload; systemctl start wd_keepalive; systemctl enable wd_keepalive; systemctl status wd_keepalive
To observe the behavior of the watchdog, you can hang the system by deliberately crashing the kernel as follows. Your instance should reboot automatically about a minute after you run this command:
sync; sleep 2; sync; echo c > /proc/sysrq-trigger
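Once the instance is reachable again, a low uptime is a quick sign that the reboot happened. The helper below is an illustrative sketch: the name uptime_seconds is hypothetical, and it simply prints the first field of /proc/uptime as whole seconds.

```shell
# Hypothetical helper: print the system uptime in whole seconds.
# Reads /proc/uptime by default; a different file can be passed for testing.
uptime_seconds() {
    awk '{ print int($1) }' "${1:-/proc/uptime}"
}

# Example usage after the instance comes back:
#   uptime_seconds
```

A value of a few minutes or less shortly after the test suggests the watchdog reset the instance.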
Verify wd_keepalive is running:
root@vultr:~# systemctl status wd_keepalive
wd_keepalive.service - watchdog keepalive daemon
Loaded: loaded (/lib/systemd/system/wd_keepalive.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-05-30 16:36:49 UTC; 5s ago
Process: 1831 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exited, status=0/SUCCESS)
Process: 1832 ExecStartPre=/bin/systemctl reset-failed watchdog.service (code=exited, status=0/SUCCESS)
Process: 1833 ExecStart=/usr/sbin/wd_keepalive $watchdog_options (code=exited, status=0/SUCCESS)
Process: 1836 ExecStartPost=/bin/sh -c ln -s /var/run/wd_keepalive.pid /run/sendsigs.omit.d/wd_keepalive.pid (code=exited, status=0/SUCCESS)
Main PID: 1835 (wd_keepalive)
Tasks: 1 (limit: 2233)
Memory: 540.0K
CPU: 14ms
CGroup: /system.slice/wd_keepalive.service
|--1835 /usr/sbin/wd_keepalive
May 30 16:36:49 vultr systemd[1]: Starting watchdog keepalive daemon...
May 30 16:36:49 vultr wd_keepalive[1835]: starting watchdog keepalive daemon (5.16):
May 30 16:36:49 vultr wd_keepalive[1835]: int=1 alive=/dev/watchdog realtime=yes
May 30 16:36:49 vultr wd_keepalive[1835]: watchdog now set to 60 seconds
May 30 16:36:49 vultr wd_keepalive[1835]: hardware watchdog identity: i6300ESB timer
May 30 16:36:49 vultr systemd[1]: Started watchdog keepalive daemon.
root@vultr:~#
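The log line "watchdog now set to 60 seconds" in the output above reports the timeout that the daemon configured. As a small sketch, that value can be pulled out of the log text. The function name watchdog_timeout_from_log is hypothetical, and it assumes the message wording matches the output shown here.

```shell
# Hypothetical helper: read log text on stdin and print the timeout in
# seconds from the last "watchdog now set to N seconds" line.
watchdog_timeout_from_log() {
    sed -n 's/.*watchdog now set to \([0-9][0-9]*\) seconds.*/\1/p' | tail -n 1
}

# Example usage on a systemd host:
#   journalctl -u wd_keepalive --no-pager | watchdog_timeout_from_log
```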
Confirm the watchdog device is present:
root@vultr:~# ls -al /dev/watchdog*
crw------- 1 root root 10, 130 May 30 15:57 /dev/watchdog
crw------- 1 root root 244, 0 May 30 15:57 /dev/watchdog0
root@vultr:~#
Identify the watchdog device type:
root@vultr:~# lspci -v | grep -i watch
02:01.0 System peripheral: Intel Corporation 6300ESB Watchdog Timer
root@vultr:~#
Load the appropriate watchdog kernel module:
root@vultr:~# modprobe i6300esb
root@vultr:~#
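Because modprobe prints nothing on success, you may want to confirm that the module is actually loaded. The sketch below is illustrative only: the function name module_loaded is hypothetical, and it matches the module name against the first column of /proc/modules.

```shell
# Hypothetical helper: succeeds if the named kernel module appears in the
# first column of the modules list (default: /proc/modules).
module_loaded() {
    mod="$1"
    modules_file="${2:-/proc/modules}"
    awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Example usage:
#   module_loaded i6300esb && echo "i6300esb is loaded"
```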
Note: only the i6300esb watchdog device is supported; the iTCO watchdog is not. The softdog type is also not recommended.
If your instance was deployed a long time ago, you may need to reboot it through the Vultr API or control panel to update the underlying configuration.