Tuesday, April 02, 2013

Watchdog Reset Diagnostics

A watchdog reset occurs when the system detects a fault condition that it deems potentially dangerous. When such a fault occurs, the system immediately drops to the PROM monitor without taking a core dump. If the watchdog-reboot? parameter is set to true, the system then reboots, and no further diagnostics are possible unless an error message appears either in the system logs (from immediately before the watchdog reset) or on the console (from the hardware diagnostics run during the reboot).

If the watchdog-reboot? parameter is set to false, some limited diagnostics are available that may point to a culprit in the reset.
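As a rough sketch, the parameter can be inspected and changed from a running Solaris system with the eeprom command (the PROM monitor equivalent is setenv watchdog-reboot? false at the ok prompt). The function names and the EEPROM_CMD override below are hypothetical, added only so the snippet can be tried without touching NVRAM; they are not part of any Sun tooling.

```shell
#!/bin/sh
# EEPROM_CMD is a hypothetical override for dry runs (e.g. EEPROM_CMD=echo).
# When it is unset, the real eeprom command is used, which requires root.

# Print the current value of the watchdog-reboot? parameter.
show_watchdog_reboot() {
    "${EEPROM_CMD:-eeprom}" 'watchdog-reboot?'
}

# Set watchdog-reboot? to false so the system stays at the PROM monitor
# after a watchdog reset instead of rebooting.
# The quotes keep the shell from treating "?" as a wildcard.
disable_watchdog_reboot() {
    "${EEPROM_CMD:-eeprom}" 'watchdog-reboot?=false'
}
```

Running EEPROM_CMD=echo disable_watchdog_reboot prints the argument that would be passed to eeprom without changing anything.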

Further complicating the issue, watchdog resets may be caused by hardware or software problems. A software-triggered watchdog reset occurs when two trap errors take place so close together that the first one does not have time to complete before the second one is received by the system. This type of watchdog reset is sometimes called a "CPU" watchdog reset, since it occurs when the CPU receives a trap while the register bit to receive traps is not set.

Since hardware faults may cause traps, a CPU watchdog reset may be caused by either hardware or software failures.

A second type of watchdog reset is a "system" watchdog reset. These are almost always caused by a hardware fault.

If the system is still at the PROM monitor prompt following the watchdog reset, it is possible to execute the following commands to attempt to gather some information about the system state prior to the reset. If at all possible, the system should be observed through some sort of console or tip session that can be used to preserve the output of the PROM monitor session.

Post-Reset Diagnostics

.registers: Displays kernel internal registers.
.locals: Displays the registers in the current register window.
.psr: Displays the Processor Status Register.
f8002010 wector p: (Note: That word is not vector.) Displays messages similar to those in dmesg. They represent any final messages that may have been logged before the reset. See the Sun web site for more information on Watchdog Reset. Note that we have not had much success with this command, but it is recommended by Sun, and hope does spring eternal...
ctrace: Displays the trace of the current thread.

Additional debugging information can be made available to the ctrace command via a module called obpsym. This can be loaded in one of two ways:

  1. modload /platform/sun4x/kernel/misc/obpsym (where x is m, u or d, depending on the system architecture) from the root command line. This method loads the module for this boot only.
  2. forceload: misc/obpsym in the /etc/system file. This method loads the module during future reboots.

Sun recommends using both methods so that the obpsym module is reloaded on each reboot until the problem is diagnosed and resolved.
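The /etc/system half of that recommendation could be scripted along these lines. This is a minimal sketch: the function name ensure_obpsym_forceload and its optional file argument are assumptions made for illustration, and it only appends the forceload line when it is not already present. Back up /etc/system before editing it on a real system.

```shell
#!/bin/sh
# ensure_obpsym_forceload [FILE]
# Hypothetical helper: appends "forceload: misc/obpsym" to FILE
# (default /etc/system) unless that exact line is already there.
ensure_obpsym_forceload() {
    system_file="${1:-/etc/system}"
    line='forceload: misc/obpsym'
    # grep succeeds if the line exists; only append when it fails.
    grep "^${line}\$" "$system_file" >/dev/null 2>&1 ||
        echo "$line" >> "$system_file"
}
```

Calling the function twice leaves a single forceload entry, so it should be safe to run repeatedly while the problem is being diagnosed.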

Once the PROM monitor diagnostics have been run, use sync at the ok> prompt to generate a core dump. This can be analyzed using the suggestions from the Crash Dump Analysis page. If a core is not saved, check the Savecore Troubleshooting page.

Watchdog resets are often caused by a hardware failure, usually requiring a system board or CPU replacement. Less frequently, replacing memory has cleared up the problem. Shortening the SCSI bus will sometimes eliminate the watchdog resets. Any hardware that can send a trap is potentially responsible for a watchdog reset.

Hardware faults may leave traces in log or console error messages. In particular, check for the following:

  • Asynchronous memory error: Indicates a memory problem.
  • Asynchronous memory fault: May be a bus problem between memory and CPU. Try replacing the system board first, then the CPU, then the memory.
  • Ecache parity error: Indicates a problem with the CPU's onboard cache. Replace the CPU.
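A quick way to look for these three messages is to search the system log for them. The sketch below assumes the log lives in /var/adm/messages (the usual Solaris location) and that the grep in use accepts -E; on older Solaris releases that may mean /usr/xpg4/bin/grep. The function name scan_watchdog_hints is hypothetical.

```shell
#!/bin/sh
# scan_watchdog_hints FILE...
# Hypothetical helper: prints any lines containing the hardware-fault
# messages listed above, matched case-insensitively.
# Assumes a grep that supports -E (extended regular expressions).
scan_watchdog_hints() {
    grep -i -E 'asynchronous memory (error|fault)|ecache parity error' "$@"
}
```

For example, scan_watchdog_hints /var/adm/messages /var/adm/messages.0 would search the current log and the most recently rotated one, if the latter exists.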
