Detecting lockups and CPU stalls in the kernel
The meaning of lockup is fairly obvious: the system, or one or more of its CPU cores, remains in an unresponsive state for a significant period of time. In this section, we'll first briefly look at watchdogs and then move on to learn how to leverage the kernel to detect both hard and soft lockups.
A short note on watchdogs
A watchdog or watchdog timer (WDT) is essentially a component, implemented in hardware or software, that monitors a system's health and, on finding it lacking in some way, has the ability to reboot the system. Hardware watchdogs are wired into the board circuitry and thus have the ability to reset the system when required. Their drivers tend to be very board-specific.
The Linux kernel provides a generic watchdog driver framework, allowing driver authors to fairly easily implement watchdog drivers for specific hardware watchdog chipsets. You can find the framework explained in some detail in the official kernel documentation here: The Linux WatchDog Timer...
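To make this a little more concrete, here's a minimal user space sketch of how a monitoring program typically talks to a watchdog that a driver has registered. It assumes the watchdog is exposed as a character device (conventionally /dev/watchdog) and uses the ioctl codes from the <linux/watchdog.h> header. The helper names (wdt_pet(), wdt_get_timeout(), and wdt_disarm_and_close()) are hypothetical ones, made up for this illustration; whether the "magic close" actually disarms the timer depends on the particular driver and its configuration.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/*
 * Hypothetical helpers (names invented for this sketch); fd is an open
 * file descriptor on a watchdog device node such as /dev/watchdog.
 */

/* "Pet" the watchdog once so that its timer restarts.
 * Returns 0 on success, -1 (with errno set) on failure. */
int wdt_pet(int fd)
{
	return ioctl(fd, WDIOC_KEEPALIVE, 0);
}

/* Query the currently configured timeout.
 * Returns the timeout in seconds, or -1 if the query fails. */
int wdt_get_timeout(int fd)
{
	int secs = 0;

	if (ioctl(fd, WDIOC_GETTIMEOUT, &secs) < 0)
		return -1;
	return secs;
}

/* Write the "magic close" character 'V' and then close the device.
 * On drivers that support it, this asks the watchdog to stop rather
 * than reset the system once the device is closed.
 * Returns 0 if both the write and the close succeed, -1 otherwise. */
int wdt_disarm_and_close(int fd)
{
	int ret = 0;

	if (write(fd, "V", 1) != 1)
		ret = -1;
	if (close(fd) < 0)
		ret = -1;
	return ret;
}
```

A health-monitoring daemon would typically open the device, then call wdt_pet() in a loop at an interval comfortably shorter than the value returned by wdt_get_timeout(). If the daemon (or the whole system) hangs and the petting stops, the watchdog expires and resets the system. Be careful: merely opening /dev/watchdog usually arms the timer, so try this only on a test system.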