This is part 3 of a 3-part piece on risk management and error trapping in software and hardware development. The first post is located here (and should be read first to provide context on the content below), and part 2 is located here.
Root Cause Analysis and Process Improvement
Once a bug has been discovered and the risk analysis and decision-making have been completed (see below), the cycle closes with a retrospective-style analysis of the engineering practices that failed to trap the bug.
The purpose of the retrospective is not to assign blame or find fault, but rather to understand the cause of the failure to trap the bug, inspect the layers of the system, and determine if any additional layers, procedures, or process changes could effectively improve collective engineering surety and help to prevent future bugs emerging from similar causes.
Review sequence of events that led to the anomaly / bug.
Determine root cause.
Map the root cause to our defense-in-depth (Swiss cheese) model.
Decide if there are remediation efforts or improvements which would be effective in supporting or restructuring the system to increase its effectiveness at error trapping.
Implement any changes identified, sharing them publicly to ensure everyone understands the changes and the reasoning behind them.
Monitor the changes, adjusting as necessary.
Review sequence of events
With appropriate representatives from engineering teams, certification, hardware, operations, customer success, etc., review the discovery path which led to finding the bug. The point is to understand the processes used, which ones worked, and which let the bug pass through.
Determine root cause and analyze the optimum layers for improvement
What caused the bug? There are many enablers and contributing factors, but typically only one or two root causes. The root cause is one, or possibly a combination, of Organization, Communication, Knowledge, Experience, Discipline, Teamwork, or Leadership.
Organization – typically latent, organizational root causes include things like existing processes, tools, practices, habits, customs, etc., which the company or organization as a whole employs in carrying out its work.
Communication – a failure to convey necessary, important, or vital information to or among an individual or team who required it for the successful accomplishment of their work.
Knowledge – an individual, team, or organization did not possess the knowledge necessary to succeed. This is the root cause for knowledge-based errors.
Experience – an individual, team, or organization did not possess the experience necessary to successfully accomplish a task (as opposed to the knowledge about what to do). Experience is often a root cause in skill-based errors of omission.
Discipline – an individual, team, or organization did not possess the discipline necessary to apply their knowledge and experience to solving a problem. Discipline is often a root cause in skill-based errors of commission.
Teamwork – individuals, possibly at multiple levels, failed to work together as a team, support one another, and check one another against errors. Additional root causes may be knowledge, experience, communication, or discipline.
Leadership – less often seen at smaller organizations, a Leadership failure is typically a root cause when a leader and/or manager has not effectively communicated expectations or empowered execution regarding those expectations.
Map the root cause to the layer(s) which should have trapped the error
Given the root cause analysis, determine where in the system (which layer or layers) the bug should have been trapped. Often there will be multiple locations at which the bug should or could have been trapped; however, the best location to identify is the one which most closely corresponds to the root cause of the bug. Consideration should also be given to timeliness. The earlier an error can be caught or prevented (trapped), the less costly it is in terms of both time (to find, fix, and eliminate the bug) and effort (a bug in production requires more effort from more people than a developer discovering a bug while checking their own unit test).
While we should seek to apply fixes at the locations best suited for them, the earliest point at which a bug could have been caught and prevented will often be the optimum place to improve the system.
For example, if a bug was traced back to a team’s discipline in writing and using tests (root cause: discipline and experience), then it would map to layers dealing with testing practices (TDD/ATDD), pair programming, acceptance criteria, definition of “Done,” etc. Those layers to which the team can most readily apply improvements and which will trap the error sooner rather than later should be the focus for improvement efforts.
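The mapping step can be sketched in a few lines of Python. This is only an illustration: the layer names come from the example above, but the order in which they appear in `LAYERS_EARLIEST_FIRST` and the contents of `LAYERS_BY_ROOT_CAUSE` are assumptions, and your own defense-in-depth model will have its own layers and ordering.

```python
# Hypothetical ordering of layers, earliest in the workflow first.
# The names come from the example above; the order is an assumption.
LAYERS_EARLIEST_FIRST = [
    "acceptance criteria",
    "pair programming",
    "testing practices (TDD/ATDD)",
    "definition of Done",
]

# Hypothetical mapping from a root cause to the layers it relates to.
# Both causes from the example are mapped to the same four layers.
LAYERS_BY_ROOT_CAUSE = {
    "discipline": set(LAYERS_EARLIEST_FIRST),
    "experience": set(LAYERS_EARLIEST_FIRST),
}


def candidate_layers(root_causes):
    """Return the layers related to the given root causes, earliest first."""
    related = set()
    for cause in root_causes:
        # Unknown causes simply contribute no layers.
        related |= LAYERS_BY_ROOT_CAUSE.get(cause, set())
    return [layer for layer in LAYERS_EARLIEST_FIRST if layer in related]
```

The first element of `candidate_layers(["discipline", "experience"])` is then the earliest layer at which the team could look to improve, which still needs to be weighed against how readily the team can change that layer.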
Decide on improvements to increase system effectiveness
Based on the knowledge gained through analyzing and mapping the root cause, decisions are made on how to improve the effectiveness of the system at the layers identified. Using the testing example above, a team could decide that they need to adjust their definition of Done to include listing which tests a story has been tested against and their pass/fail conditions.
Implement the changes identified, and monitor them for effectiveness.
Should our preventative measures fail to stop a bug from escaping into a production environment, an analysis of the level of risk needs to be explicitly completed. (This is often done, but in an implicit way.) The analysis of the level of risk derives from two areas.
Risk Severity – the degree of impact the bug can be expected to have to the data, operations, or functionality of affected parties (the company, vendors, customers, etc.).
A bug that is so bad, or a feature that is so important, that we would not ship the next release until it is fixed/completed. Could also signify a bug that is currently impacting a customer’s operations, or one that is blocking development.
A bug that needs to be resolved ASAP, but for which we wouldn’t stop everything. Bugs in this category are not impacting operations (a customer’s, or ours), but they are sufficiently challenging to warrant attention.
Best judgement should be used to determine how this stacks against other work. The bug is serious enough that it needs to be resolved, but the value of other work and timing should be considered. If a bug sits in major for too long, its categorization should be reviewed and either upgraded or downgraded.
A bug that is known, but which we have explicitly de-prioritized. Such a bug will be fixed as time allows.
We should seriously consider closing bugs at this level. At best these should be put into the “Long Tail” for tracking.
Risk Probability – the likelihood, expressed as a percentage, that those potentially affected by the bug will actually experience it (e.g., always, only if they have a power outage, or only if the sun aligns with Jupiter during the slackwater phase of a diurnal tide in the northeastern hemisphere between 44 and 45 degrees Latitude).
100% – issue will occur in every case
60-99% – issue will occur in most cases
30-60% – coin-flip; issue may or may not occur
2-30% – issue will occur in a minority of cases
1% – occurrence of the issue will be exceptionally rare
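If you want to apply these bands consistently, a small helper can do the bucketing. The sketch below is a minimal one under stated assumptions: the band names (`certain`, `likely`, `coin_flip`, `unlikely`, `rare`) are hypothetical labels, each shared edge (30% and 60%) is assumed to belong to the higher band, and anything under 2% is assumed to count as exceptionally rare.

```python
def probability_band(percent: float) -> str:
    """Map an estimated probability (0-100) to one of the five bands above.

    Band names are hypothetical labels. Shared edges (30 and 60) are
    assigned to the higher band, and anything below 2 is treated as rare.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 100:
        return "certain"    # issue will occur in every case
    if percent >= 60:
        return "likely"     # issue will occur in most cases
    if percent >= 30:
        return "coin_flip"  # issue may or may not occur
    if percent >= 2:
        return "unlikely"   # the 2-30% band
    return "rare"           # occurrence will be exceptionally rare
```

For example, `probability_band(45)` returns `"coin_flip"`. The point is less the code than forcing the estimate to be stated explicitly rather than left implicit.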
Given Risk Severity and Probability, the risk can be assessed according to the following matrix and assigned a Risk Assessment Code (RAC).
The Risk Assessment Codes are a significant factor in Risk decision-making.
Strategic – the risk to the business or customers is significant enough that its realization could threaten operations, basic functioning, and/or professional reputation to the point that the basic survival of the business could be in jeopardy. As Arnold said in Predator: “We make a stand now, or there will be nobody left to go to the chopper!”
Significant – the risk poses considerable, but not life-threatening, challenges for the business or its customers. If left unchecked, these risks may elevate to strategic levels.
Moderate – the risk to business operations, continuity, and/or reputation is significant enough to warrant consideration against other business priorities and issues, but not significant enough to trigger higher responses.
Low – the risk to the business is not significant enough to warrant special consideration of the risk against other priorities. Issues should be dealt with in routine, predictable, and business-as-usual ways.
Negligible – the risk to the business is not significant enough to warrant further consideration except in exceptional circumstances (i.e., we literally have nothing better to do).
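Assigning a RAC is a simple table lookup, which makes it easy to automate. The sketch below assumes severity and probability have each been reduced to a rank from 1 (worst or most likely) to 5 (least severe or rarest). The five RAC names come from the list above, but the cell values in `PLACEHOLDER_MATRIX` are invented purely for illustration and are not a recommended matrix; substitute the values your organization has agreed on.

```python
from enum import IntEnum


class Rac(IntEnum):
    """Risk Assessment Codes, most serious first."""
    STRATEGIC = 1
    SIGNIFICANT = 2
    MODERATE = 3
    LOW = 4
    NEGLIGIBLE = 5


# PLACEHOLDER values only: rows are severity ranks 1 (worst) to 5,
# columns are probability ranks 1 (most likely) to 5 (rarest).
# These cells are invented for illustration; replace them with your own.
PLACEHOLDER_MATRIX = [
    [1, 1, 2, 2, 3],
    [1, 2, 2, 3, 4],
    [2, 2, 3, 4, 4],
    [2, 3, 4, 4, 5],
    [3, 4, 4, 5, 5],
]


def risk_assessment_code(severity: int, probability: int,
                         matrix=PLACEHOLDER_MATRIX) -> Rac:
    """Look up the RAC for a severity rank and a probability rank (1-5 each)."""
    if not (1 <= severity <= 5 and 1 <= probability <= 5):
        raise ValueError("severity and probability ranks must be 1 through 5")
    return Rac(matrix[severity - 1][probability - 1])
```

Passing the matrix as a parameter keeps the lookup separate from the policy, so the same function works unchanged once the real values are filled in.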
The risk decision is the point at which a decision is made about the risk. Typically, risk decisions take the form of:
Accept – accept the risk as it is and do not mitigate or take additional steps.
Delay – for less critical issues or dependencies, a decision about whether to accept or mitigate a risk may be delayed until additional information, research, or steps are completed.
Mitigate – establish a mitigation strategy and deal with the risk.
For risk mitigation, feasible Courses of Action (CoAs) should be developed to assist in making the mitigation plan. These potential actions comprise the mitigation and/or reaction plan. Specifically, given a bug’s risk severity, probability, and resulting RAC, the courses of action are the possible mitigation options for the risk. Examples include:
— Pre-release —
Apply software fix / patch
Release without the code integrated (re-build)
Hold the release and await code fix
Cancel the release
— In production —
Add to normal backlog and prioritize with normal workflow
Pull / create a team to triage and fix
Swarm / mob multiple teams on fix
Pull back / recall release
Release an additional fix as a micro-upgrade
All risk decisions should be recorded, and those which remain active need to be tracked. There are many methods available for logging and tracking risk decisions, from spreadsheets to documentation to support tickets. There are entire software platforms expressly designed to track and monitor risk status and record decisions taken (or not) about risks.
Decisions to delay risk mitigation are the most important to track: they still require action, and at the speed most businesses move today there is a real risk of losing track of them. A Risk Log or Review should therefore be used to routinely review the status of pending risk decisions and reevaluate them. Risk changes constantly, and a risk’s severity and probability may change significantly overnight. By reviewing risk decisions regularly, leadership can ensure both that emerging risks are mitigated and that effort is not wasted (as when effort is put against a risk which has significantly declined in impact due to changes external to the business).
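A risk log does not need a dedicated platform to be useful. As a minimal sketch, the following Python keeps each decision as a record and surfaces the delayed ones that are due for review. The `RiskEntry` fields and the `due_for_review` helper are hypothetical names chosen for illustration, and the sketch assumes every entry is given a review date when the decision is made.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    DELAY = "delay"
    MITIGATE = "mitigate"


@dataclass
class RiskEntry:
    """One recorded risk decision (field names are illustrative)."""
    title: str
    decision: Decision
    decided_on: date
    review_by: date       # assumed to be set when the decision is made
    closed: bool = False  # True once the risk no longer needs tracking


def due_for_review(log, today):
    """Return open, delayed decisions whose review date has arrived, oldest first."""
    due = [
        entry for entry in log
        if not entry.closed
        and entry.decision is Decision.DELAY
        and entry.review_by <= today
    ]
    return sorted(due, key=lambda entry: entry.review_by)
```

Running `due_for_review(log, date.today())` at the start of each regular review gives the list of delay decisions to reevaluate, so none of them silently age out of view.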
I hope you’ve enjoyed this 3-part series. Risk management and error trapping is a complicated and – at times – complex topic. There are many ways to approach these types of systems and many variations on the defense-in-depth model.
The specific implementation your business or organization chooses to adopt should reflect the reality and environment in which you operate, but the basic framework has proven useful across many domains and industries, and is directly adapted from Operational Risk Management as I used to practice and teach it in the military.
Understanding the root cause of your errors, where they slipped through your system, and how to improve your system’s resiliency and robustness are critical skills which you need to develop if they are not already functional. A mindful, purposeful approach to risk decision-making throughout your organization is also critical to your business operations.