Sunday, April 21, 2013

Troubleshooting Intermittent Problems

Intermittent problems are extremely difficult to troubleshoot. A reproducible problem can always be troubleshot, if only because each individual component can be ruled out in turn through experimentation. Problems that are not reproducible cannot be approached in the same way.

Problems present as intermittent for one of two reasons:

  1. We have not identified the real cause of the problem.
  2. The problem is being caused by failing or flaky hardware.

The first possibility should be addressed by going back to brainstorming hypotheses.

It may be helpful to bring a fresh perspective into the brainstorming session, either by bringing in different people, or by sleeping on the problem.

The second possibility is tougher. Hardware diagnostic tests can be run to try to identify the failing piece of hardware.

The first thing to do is to perform general maintenance on the system. Re-seat memory chips, processors, expansion boards and hard drives.

Once general maintenance has been performed, test suites like SunVTS can perform stress-testing on a system to try to trigger the failure and identify the failing part.

It may be the case, however, that the costs associated with this level of troubleshooting are prohibitive. In this case, we may want to attempt to shotgun the problem.

Shotgunning is the practice of replacing potentially failing parts without having identified them as actually being flaky. In general, parts are replaced by price point, with the cheapest parts being replaced first.

Though we are likely to inadvertently replace working parts, the cost of the replacements may be lower than the cost of the alternatives (such as the downtime associated with stress testing).
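The trade-off above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the part names, the prices, and the diagnosis cost estimate are all made up, and the rule of stopping once the running total would exceed the diagnosis cost is one possible policy, not a standard one.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    price: float  # hypothetical replacement cost, including labor


def shotgun_plan(parts, diagnosis_cost):
    """Return the parts to replace, cheapest first, for as long as the
    running total stays at or below the estimated cost of full diagnosis."""
    plan = []
    total = 0.0
    for part in sorted(parts, key=lambda p: p.price):
        if total + part.price > diagnosis_cost:
            break  # further replacement would cost more than diagnosing
        plan.append(part)
        total += part.price
    return plan, total


# Hypothetical candidate parts and prices, for illustration only.
candidates = [
    Part("system board", 900.0),
    Part("memory module", 120.0),
    Part("power supply", 200.0),
    Part("disk cable", 15.0),
]

# Assumed cost of the alternative (downtime plus labor for stress testing).
plan, total = shotgun_plan(candidates, diagnosis_cost=500.0)
print([p.name for p in plan], total)
```

With these invented numbers, the three cheapest parts are replaced and the system board is left for proper diagnosis, since replacing it would push the total past the assumed diagnosis cost.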

When parts are removed during shotgunning, it is important to discard them rather than keep them as spares. Any part you remove as part of a troubleshooting exercise is questionable. (After all, what if a power surge caused multiple parts to fail? Or what if there was a cascading failure?) It does not make sense to have questionable parts in inventory; such parts would be useless for troubleshooting, and putting questionable parts into service just generates additional downtime down the road.

Shotgunning may violate your service contract if performed without the knowledge and consent of your service provider.

Regardless of the method used to deal with intermittent problems, it is essential to keep good records. Relationships between our problem and other events may only become clear when we look at patterns over time. We can only be confident that we have really resolved the problem if we can demonstrate that we have gone well beyond the usual recurrence interval without the problem re-emerging.
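A failure log makes the "well beyond the usual interval" test concrete. The sketch below is a rough illustration, not a rigorous method: the dates are invented, the function names are hypothetical, and the probability estimate assumes failures arrive at a constant random rate (an exponential model), which real hardware faults may not follow.

```python
import math
from datetime import datetime


def mean_interval_days(failures):
    """Average gap, in days, between consecutive logged failures."""
    if len(failures) < 2:
        raise ValueError("need at least two logged failures")
    ordered = sorted(failures)
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def chance_of_quiet_period(quiet_days, mean_gap_days):
    """Probability of seeing no failure for quiet_days if the problem were
    still present, ASSUMING failures occur at a constant random rate."""
    return math.exp(-quiet_days / mean_gap_days)


# Invented failure log and fix date, for illustration only.
failure_log = [
    datetime(2013, 1, 3),
    datetime(2013, 1, 13),
    datetime(2013, 1, 23),
    datetime(2013, 2, 2),
]
last_fix = datetime(2013, 2, 5)
today = datetime(2013, 3, 17)

gap = mean_interval_days(failure_log)
quiet = (today - last_fix).days
print(gap, quiet, round(chance_of_quiet_period(quiet, gap), 3))
```

In this made-up log the problem recurred about every 10 days, and 40 quiet days have passed since the fix. Under the stated assumption, a quiet stretch that long would happen by luck only about 2% of the time if the fault were still present, which is the kind of evidence good records make available.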
