Red Hat Enterprise Linux Diagnostics and Troubleshooting
- Using the Scientific Method
- Guided Exercise: Using the Scientific Method to Solve a Login Issue
- Collecting Information to Support Troubleshooting
- Guided Exercise: Collecting Information to Support Troubleshooting
- Troubleshooting with Red Hat Resources
- Guided Exercise: Troubleshooting with Red Hat Resources
- Lab: Introducing Troubleshooting Strategy
- Summary
Abstract
Goal: Describe effective troubleshooting methods and data collection strategies.
Efficient and timely troubleshooting skills can be developed through practice of the widely recognized scientific method. The scientific method is an empirical process for using logic to hypothesize and test theories through observation, refined by experimentation and validation of deductions that are drawn from those hypotheses. With experience, scientific method users acquire knowledge and develop useful conclusions through refinement and elimination of tested hypotheses.
Many technical professionals solve problems by using past experience, having previously seen the same problem, or having been taught how to solve similar problems. Consequently, it is efficient to query colleagues and other knowledge sources about the scenario, after the problem is accurately defined. However, when a problem is difficult to define and is not recognized through experience, making unscientific guesses about the problem cause can waste significant time and effort.
The scientific method is a proven technique for resolving new and complex problems, but it might not be the quickest route to a resolution in every scenario. To solve technical problems, you must discover and gather information, discard what does not fit the observations, and draw logical conclusions to uncover root causes. The scientific method consists of these steps:
1. Collect relevant information.
2. Create an accurate problem statement.
3. Formulate testable hypotheses.
4. Test each hypothesis.
5. Record and analyze the test results.
6. Fix and verify the problem resolution.
In many scenarios, you might repeat the scientific method, either in whole or by iterating on certain steps, to discover and verify the root cause of the problem.
Collect relevant information.
The first step, before theorizing a problem statement, is to collect reliable and factual information. Common reasons for failed troubleshooting include incomplete information or misunderstood problem observations. Start by asking questions of the person who reported the problem and other relevant users or support personnel. Focus on, and record in a readable form, the verifiable facts that are related to the problem. Avoid opinions and judgments, but allow for suggestions from others who have proven to be successful with similar troubleshooting.
Useful information can also be found in screen outputs, system and application log files, error messages, and diagnostic tools. Diagnostic or error messages can be entered in Internet search engines to locate reports or resolutions for similar problems.
When applications or systems are known to have previously worked properly, then logic dictates that something must have changed. Use file, application, or system validation or comparison tools to locate changed files or to compare an errant file or system to a known good file or system that is expected to be configured the same.
When you have a clear perception of the problem, try to reproduce the error or failure. Use verbose logging or tool diagnostic modes to provide additional information about the errant process or observed behavior.
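As a minimal sketch of the comparison idea described above, the following Python snippet lists the lines that differ between a known good configuration and an errant one. It uses only the standard `difflib` module. The function name `changed_lines` and the sample configuration contents are hypothetical and exist only for illustration; in practice you would read the two texts from real files.

```python
import difflib


def changed_lines(known_good: str, errant: str) -> list[str]:
    """Return the removed ('-') and added ('+') lines between two texts."""
    diff = difflib.unified_diff(
        known_good.splitlines(),
        errant.splitlines(),
        fromfile="known_good",
        tofile="errant",
        lineterm="",
    )
    # The first two diff lines are the '---' and '+++' file headers.
    body = list(diff)[2:]
    # Keep only changed lines; skip '@@' hunk markers and unchanged context.
    return [line for line in body if line.startswith(("+", "-"))]


# Hypothetical configuration contents, for illustration only.
known_good = "port = 993\nstore = XYZ\ntls = on\n"
errant = "port = 993\nstore = XYZ\ntls = off\n"

for line in changed_lines(known_good, errant):
    print(line)
```

Any line that this comparison reports is a candidate for "something that changed" and can feed directly into the problem statement and hypotheses.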
Create an accurate problem statement.
The process of creating a problem statement results in a specific definition of the problem in words, preferably written. Writing the statement as a grammatically correct, easily understood sentence helps to clarify the problem accurately. If you are unable to state the problem in a clear sentence that others agree defines that specific problem, then your problem statement needs work. An accurate problem statement explains the problem to be solved, such that a successful problem resolution is the inverse of that statement.
A problem statement includes answers to factual queries about the problem:
What specific system, application, process, or function is failing, degraded, or down?
What actions or steps can reproduce the problem?
When was the problem first noticed or reported?
Where does the problem occur or where is the behavior observed?
Who experiences the problem? Not who reported it, but what is the scope of its effect?
If the problem is reproducible, the problem statement should include the steps that cause the problem to occur. Here are examples of well-defined problem statements:
Problem 1
Beginning last Friday, all marketing department users are reporting that they are unable to successfully launch or use the mail application, which displays the error message "Data store XYZ is not available." The problem can be consistently reproduced by selecting the mail icon from any marketing department user's menu.
Problem 2
Today, userX reported that they are unable to print from application ABC to printer123, but can print to printer123 from any other application on the same system.
When the problem is resolved and fixed, the result is the inverse of the original problem statement. For example, the inverse of the previous problem statements would be:
Result for problem 1
Currently, all marketing department users are able to successfully launch and use the mail application, without any displayed error messages.
Result for problem 2
Currently, userX is able to print from application ABC to printer123, and can also print to printer123 from any other application on the same system.
Formulate testable hypotheses.
By using the problem statement that you created and the information that you collected and recorded, formulate one or more hypotheses as to the cause of the problem. This step is significantly more productive when performed in a brainstorming group that is composed of individuals who are capable of effectively using the scientific method.
Do not rush this step. Although any single hypothesis might seem promising, it is more efficient to formulate, at the same time, all possible, practical hypotheses about the cause of the problem. No relevant, sincere suggestion should be dismissed without being tested.
When formulating and recording each hypothesis, also record a validation test method for each. Although it might appear faster to jump to performing each test as each hypothesis occurs, it is more productive to stay in brainstorming mode until all ideas are exhausted and each hypothesis and test method is recorded in an organized, readable form. Here are examples of hypotheses and test methods for the mail application problem:
Data store XYZ is on a disk that failed. Test by locating data store XYZ and accessing other objects on that disk.
Data store XYZ is on a network share that no longer exists or works properly. Test by locating the share and accessing the share directly by using a proper client.
Data store XYZ is on a storage server that has stopped services or has frozen. Test by locating the correct server and accessing it with management tools.
Data store XYZ is on a storage server that cannot be reached due to a network problem. Test by locating the correct network and accessing the interfaces on that network by using network tools.
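One possible way to keep hypotheses and their test methods in an organized, readable form is a small record per hypothesis, as in the following Python sketch. The `Hypothesis` class, the `prioritized` helper, and the `likelihood` rankings are all assumptions made for illustration; the rankings in particular are invented and would normally come from your group's judgment.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    cause: str
    test_method: str
    likelihood: int            # hypothetical 1-5 ranking; higher is tested first
    status: str = "untested"   # later set to "confirmed" or "eliminated"


# The four hypotheses for the mail application problem, with invented rankings.
hypotheses = [
    Hypothesis("Data store XYZ is on a disk that failed",
               "Locate data store XYZ and access other objects on that disk", 2),
    Hypothesis("Data store XYZ is on a network share that no longer exists",
               "Locate the share and access it directly with a proper client", 4),
    Hypothesis("The storage server has stopped services or has frozen",
               "Locate the correct server and access it with management tools", 3),
    Hypothesis("The storage server cannot be reached due to a network problem",
               "Access the interfaces on that network with network tools", 5),
]


def prioritized(items):
    """Return hypotheses ordered from most to least likely."""
    return sorted(items, key=lambda h: h.likelihood, reverse=True)


for h in prioritized(hypotheses):
    print(f"[{h.status}] {h.cause} -> {h.test_method}")
```

Recording the test method next to each cause keeps the brainstorming output complete before any testing begins, and the `status` field gives a place to mark each hypothesis as confirmed or eliminated later.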
Test each hypothesis.
Perform each of the tests that you recorded for your hypotheses. Prioritize the tests in the order that you or your group decide is the most likely to quickly find the problem's root cause. Performing each test should result in either discovering the problem's cause or in eliminating that hypothesis from your list.
If a test requires configuration changes or another form of system modification, follow this single, inviolate rule:
Only one change may be made during any single test run.
Never change more than one parameter at a time. If the test fails to verify the problem cause, reset that changed parameter to its original value, and then change only one new parameter before performing the next test run.
Record information about each test run, including the changed parameters, distinguishing test characteristics, and the observed result, in an organized, readable form. Without such records, it is easy to confuse or forget previous test results, which might force you to repeat one or more tests from your hypotheses list.
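The one-change-per-run rule and the record keeping can be sketched together in Python. This is an illustrative model only, not a real configuration tool: the function `run_single_change_tests`, the parameter names, and the `check` function that stands in for "is the problem resolved?" are all hypothetical.

```python
import copy


def run_single_change_tests(baseline, candidate_changes, check):
    """Try each candidate change alone and record every test run."""
    log = []
    for parameter, new_value in candidate_changes:
        # Start from the original values, so the previous change is reverted.
        trial = copy.deepcopy(baseline)
        # Change exactly one parameter for this test run.
        trial[parameter] = new_value
        resolved = check(trial)
        log.append({
            "parameter": parameter,
            "old": baseline.get(parameter),
            "new": new_value,
            "problem_resolved": resolved,
        })
        if resolved:
            break  # cause found; remaining changes are not needed
    return log


# Hypothetical settings and check, for illustration only.
baseline = {"share_path": "/mnt/old", "timeout": 30}
candidate_changes = [("timeout", 60), ("share_path", "/mnt/xyz")]


def check(config):
    return config["share_path"] == "/mnt/xyz"


for entry in run_single_change_tests(baseline, candidate_changes, check):
    print(entry)
```

Because each trial starts from a fresh copy of the baseline, a result can always be attributed to the single parameter that changed, and the log preserves what was tried and what was observed.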
Record and analyze the test results.
During testing, you record the results of each test run, including any observed behaviors and relevant test characteristics. You should also record any new information that you collect that appears to relate to the problem. This information is useful for creating system reliability methods for permanently mitigating this problem scenario.
If the problem cause is not discovered after performing all the hypothesis tests in your list, you could decide to repeat this scientific method, including any newly collected information in the brainstorming process.
You could also decide that you or your group has insufficient knowledge of the application or systems that you are troubleshooting. Accordingly, you could choose to obtain further training or perform additional research before continuing to troubleshoot this problem. If fixing this problem is time-critical, you might need to escalate the problem to an appropriate higher level of support.
Fix and verify the problem resolution.
If you discover the problem cause while performing a hypothesis test, you must still decide how to fix the problem. Some problems might require a temporary fix or workaround, with a permanent fix that is applied after preparation or during a maintenance window. Similar to the earlier steps, every change in your fix plan must include a validation test that conclusively proves that the change is working.
Fixes that you apply should follow the same inviolate rule used during hypothesis testing: make only one change at a time, and validate that change before continuing with the next one. Red Hat recommends using a change management system, such as Red Hat Ansible Automation Platform, for applying and tracking changes. Change management systems provide records that can verify earlier changes, and methods for accurately reverting changes or tested configurations.
After the temporary or permanent fix is applied, test the scenario again against the original problem statement. If the inverse of the problem statement is conclusively true, then you have successfully completed troubleshooting of your problem scenario.
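Verification against the inverse of the problem statement can be modeled as a list of checks, one per clause of the inverse statement, as in this Python sketch based on problem 2. The helper `unresolved_clauses` is hypothetical, and the two check functions simply return fixed values where a real verification would run an actual print test.

```python
def unresolved_clauses(checks):
    """Return the inverse-statement clauses that are not yet true."""
    return [description for description, check in checks if not check()]


# Placeholder checks; real ones would perform the actual print tests.
def prints_from_abc():
    return True


def prints_from_other_apps():
    return True


checks = [
    ("userX can print from application ABC to printer123", prints_from_abc),
    ("userX can print to printer123 from other applications", prints_from_other_apps),
]

remaining = unresolved_clauses(checks)
print("Resolved" if not remaining else f"Still failing: {remaining}")
```

Checking every clause, rather than only the one that originally failed, also confirms that the fix did not break behavior that was working before.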