Monday, April 01, 2013

A Troubleshooting Methodology

Troubleshooting generally consists of the following steps. Different methodologies may call them by slightly different names, but the similarities are pretty obvious.
  • Investigation
    • Problem Statement: Create a clear, concise statement of the problem.
    • Problem Description: Identify the symptoms. What works? What doesn't?
    • Identify Differences and Changes: What has changed recently? What is unique about this system?
  • Analysis
    • Brainstorm: Gather Hypotheses: What might have caused the problem?
    • Identify Likely Causes: Which hypotheses are most likely?
    • Test Possible Causes: Schedule the testing for the most likely hypotheses. Perform any non-disruptive testing immediately.
  • Implementation
    • Implement the Fix: Complete the repair.
    • Verify the Fix: Is the problem really fixed?
    • Document the Resolution: What did we do? Get a sign-off from the system owner.

Problem Statement

The problem statement must be broad enough to describe the problem, but narrow enough to focus the investigation. It should not contain value judgements. It should be a factual answer to the question "What is wrong?"

Problem Description

Gather all symptoms, including error messages, core dumps, descriptions of any service outages, and contrasting descriptions of what still works. We also need to pin down the time of the incident as precisely as we can.

Identify Differences and Changes

Identify differences between the faulted system and any similar working systems. Also identify any recent changes to the system.

Brainstorm

In this stage, we need to come up with as many explanations for the problem as we can. It is sometimes helpful (especially in a group setting) to use an Ishikawa diagram to organize our thoughts so that we don't leave any possibilities unconsidered.

Generate an Ishikawa diagram by drawing a “backbone” arrow pointing to the right at the problem statement. Then attach 4-6 “ribs,” each of which represents a broad category of items which may contribute to the problem. Each candidate explanation should fit on one of these ribs.

Identify Likely Causes

We need to consider how likely each potential cause is. We should only eliminate hypotheses when they are absolutely disproven.

For more complex problems, something like an Interrelationship Diagram may be useful in identifying which potential cause might be a root cause.

Interrelationship Diagrams use boxes containing phrases describing the potential causes. Arrows between the potential causes demonstrate influence relationships between these issues. Each relationship can only have an arrow pointing in one direction. (Where the relationship's influence runs in both directions, the troubleshooters must decide which one is predominant.) Items with more “out” arrows than “in” arrows are causes. Items with more “in” arrows are effects.
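The arrow-counting rule can be sketched in a few lines of Python. This is only an illustration: the `classify` helper and the cause names are made up for the example, and items with equal “in” and “out” counts are marked "undetermined" because the rule above does not say how to treat ties.

```python
from collections import Counter


def classify(arrows):
    """Label each item as a cause or an effect from (influencer, influenced) pairs."""
    out_count = Counter(src for src, _ in arrows)
    in_count = Counter(dst for _, dst in arrows)
    result = {}
    for item in sorted(set(out_count) | set(in_count)):
        outs, ins = out_count[item], in_count[item]
        if outs > ins:
            result[item] = "cause"
        elif ins > outs:
            result[item] = "effect"
        else:
            # Tie: the counting rule alone cannot decide (assumption of this sketch).
            result[item] = "undetermined"
    return result


# Hypothetical diagram: each tuple is one arrow, pointing from influencer to influenced.
arrows = [
    ("disk full", "log writes fail"),
    ("disk full", "database stalls"),
    ("database stalls", "web timeouts"),
    ("log writes fail", "web timeouts"),
]

for item, label in classify(arrows).items():
    print(f"{item}: {label}")
```

In this made-up diagram, "disk full" has only outgoing arrows, so it is the candidate to look at first as a root cause.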

Test Possible Causes

We need to perform testing in the least disruptive fashion possible. Data should be backed up if possible before testing proceeds.

The best approach is to schedule testing of the most likely hypotheses immediately. Then start to perform any non-disruptive or minimally disruptive testing of hypotheses. If several of the most likely hypotheses can be tested non-disruptively, so much the better. Start with them.
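One way to picture this ordering is the short Python sketch below. The `Hypothesis` class, the numeric likelihood estimates, and the example hypotheses are all assumptions made for illustration; in practice the ranking is a judgment call rather than a computed score.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    likelihood: float  # rough estimate between 0 and 1 (hypothetical scale)
    disruptive: bool   # True if testing it would interrupt service


def plan_tests(hypotheses):
    """Split hypotheses into tests to run now and tests to schedule, most likely first."""
    ranked = sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)
    run_now = [h.name for h in ranked if not h.disruptive]
    schedule = [h.name for h in ranked if h.disruptive]
    return run_now, schedule


# Hypothetical hypotheses for a network outage.
hypotheses = [
    Hypothesis("failing network card", 0.6, True),
    Hypothesis("full filesystem", 0.3, False),
    Hypothesis("stale DNS entry", 0.5, False),
]

run_now, schedule = plan_tests(hypotheses)
print("Test immediately:", run_now)
print("Schedule a window for:", schedule)
```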

In some cases, it may be possible to test the hypothesis directly in some sort of test environment. This may be as simple as running an alternative copy of a program without overwriting the original. Or it may be as complex as setting up a near copy of the faulted system in a test lab. If a realistic test can be carried out without too great a cost in terms of money or time, it can really help nail down whether we have identified the root cause of the problem.

Depending on the situation, it may even be appropriate to test a hypothesis by directly applying the fix associated with it. If this approach is used, it is important to perform only one test at a time, and to back out the changes made for each failed hypothesis before trying the next one. Otherwise, we will not have a good handle on the root cause of the problem, and we may never be confident that it will not re-emerge at the worst possible moment.
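The one-change-at-a-time discipline can be sketched as a simple loop. Everything here is hypothetical: `try_fixes`, the `state` dictionary standing in for a system's configuration, and the two candidate fixes exist only to show the apply, check, back-out pattern.

```python
def try_fixes(candidates, problem_resolved):
    """Apply candidate fixes one at a time, backing out each one that does not help.

    candidates: list of (name, apply, back_out) where apply and back_out are callables.
    Returns the name of the fix that resolved the problem, or None.
    """
    for name, apply, back_out in candidates:
        apply()
        if problem_resolved():
            return name
        back_out()  # restore the previous state before testing the next hypothesis
    return None


# Hypothetical system state; the "problem" is resolved once dns is "good".
state = {"mtu": 9000, "dns": "bad"}

candidates = [
    ("lower MTU", lambda: state.update(mtu=1500), lambda: state.update(mtu=9000)),
    ("repair DNS", lambda: state.update(dns="good"), lambda: state.update(dns="bad")),
]

fixed_by = try_fixes(candidates, lambda: state["dns"] == "good")
print("Resolved by:", fixed_by)
print("Final state:", state)
```

Because the failed MTU change is backed out before the DNS change is tried, the final state differs from the original in exactly one setting, which is what lets us attribute the fix to a single cause.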

Implement the Fix

The fix needs to be implemented in the least-disruptive, lowest-cost manner possible. Ideally, the fix should be performed in a way that will completely verify that the fix itself has resolved the problem.

Verify the Fix

We need to check that the problem is resolved, and also that we have not introduced any new problems. Each service in our environment should have a test suite associated with it so that we can quickly rule out the possibility that the fix has introduced a new problem.

Part of this verification should include a root-cause analysis to make sure that the real problem has been resolved. Band-Aid solutions are not really solutions.

Document the Resolution

Over time, the collection of data on resolved problems can become a valuable resource. It can be referenced to deal with similar problems. It can be used to track recurring problems over time, which can help with a root cause analysis. Or it can be used to continue the troubleshooting process if it turns out that the problem was not really resolved after all.
