The way we conceptualize and analyze risk and error management in technology projects has never received the same degree of scrutiny as business process frameworks and methodologies such as Scrum, Lean, or traditional project management. Yet risk is inherent in everything we do, every day, regardless of our industry, sector, work domain, or process.
We practice risk management in our everyday lives, often without consciously realizing that what we are doing is designed to weigh levels of risk against degrees of potential reward, and to either prevent errors from occurring or minimize their impact when they do.
For example, I recently took a trip to the San Juan Islands with my wife and parents. I woke up early, made coffee, and roused the troops. I’d filled up the gas tank the day before and booked our ferry tickets online. I checked the weather and recommended we each take an extra layer. We departed the house a bit earlier than really necessary, but ended up encountering a detour along the way due to a traffic accident on the interstate. Nevertheless, we made it to the ferry terminal with about 10 minutes to spare, just in time to drive onto the ferry and depart for Friday Harbor.
My personal example is relatively simple but, with a little analysis, demonstrates how intuitively we assess and manage risk:
Wake up early: mitigates risk of oversleeping and departing late (which could result further in forgetting important things, leaving coffee pot/equipment on, etc.), waiting on others in the bathroom, and not being able to prepare and enjoy some morning coffee (serious risk).
Check the weather: understanding the environment we are entering into is critical to mitigating environment-related risks, in this case real environmental concerns such as temperature, weather, wind, and precipitation, enabling us to mitigate potentially negative effects and capitalize on positives. Bad weather may even result in our changing our travel plans entirely – a clear form of risk mitigation in which we determine that our chance for a successful journey is low compared against the value we would derive from undertaking the journey in the first place, and decide the goal is not sufficient to accept present risk levels.
Book ferry tickets online: a mitigation against the risk of arriving late and having to wait in line to purchase tickets, which could result in us missing the ferry due to either running out of time or the ferry already being completely booked.
Departing earlier than necessary: a mitigation against unforeseen and unknowable specific risk, in this case the generic risk of en route delays, which we did encounter on this occasion.
As you can see, as a story my preparations for our trip seem rather routine and unremarkable, but when viewed through the lens of risk mitigation and error management, each action and decision can be seen as specifically targeted to mitigate one or more specific risks or minimize the potential effects of an error. Unfortunately, our everyday intuitive actions and mental processes seldom translate into our work environments in such direct and meaningful ways.
Risk and Error Management in Software and Hardware Development – Defense-in-Depth and the Swiss Cheese Model
Any risk management system can be seen as a series of layers designed to employ a variety of means to mitigate risk and prevent errors from progressing further through the system. We call this “trapping errors.” Additionally, each of these layers is often just one part of a larger system. A system constructed with these layers is referred to as having “defense-in-depth.”
Defense-in-depth reflects the simple idea that instead of employing one single, catch-all solution for eliminating risk and trapping errors, a layered approach that employs both latent and active controls in different areas throughout the system will be far more effective at both detecting errors and preventing them from escaping.
These layers are often envisioned as slices of Swiss cheese, with each slice representing a different part of the larger system. As a potential risk or error progresses through holes in the system’s layers, it should eventually be trapped in one of the layers.
Risks and errors can then impact the system only when all the holes in the system’s Swiss cheese layers “line up.”
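To make the "holes lining up" idea concrete, here is a minimal Python sketch. Everything in it is illustrative rather than taken from a real system: the layer names, the error labels, and the miss rates are hypothetical, and the probability calculation assumes the layers fail independently of one another, which real layers often do not.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Layer:
    """One slice of cheese: a named check that either traps an error or misses it."""
    name: str
    traps: Callable[[str], bool]  # returns True if this layer catches the error


def first_trapping_layer(error: str, layers: List[Layer]) -> Optional[str]:
    """Return the name of the first layer that traps the error, or None if it escapes."""
    for layer in layers:
        if layer.traps(error):
            return layer.name
    return None  # every hole lined up: the error escaped


def escape_probability(miss_rates: List[float]) -> float:
    """Chance an error passes every layer, ASSUMING layers miss independently."""
    return math.prod(miss_rates)


# Hypothetical layers and the (made-up) kinds of errors each one catches.
layers = [
    Layer("code review", lambda e: e == "naming bug"),
    Layer("unit tests", lambda e: e in {"off-by-one", "null deref"}),
    Layer("release checklist", lambda e: e == "missing migration"),
]

print(first_trapping_layer("off-by-one", layers))      # unit tests
print(first_trapping_layer("race condition", layers))  # None

# Hypothetical miss rates of 20%, 10%, and 50% give roughly a 1% escape chance.
print(escape_probability([0.2, 0.1, 0.5]))
```

The point of the sketch is that no single layer needs to be perfect: even leaky layers, stacked together, leave only a small chance that every hole lines up.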
Latent and Active Layers
There are two basic types of layers (or traps) in any system: latent and active. In your day-to-day life, latent traps are things such as the tires on your car or the surface of the road. Active traps are things such as checking the weather, putting on safety gear, wearing a helmet, or deciding not to go out into the weather.
Latent layers in software or hardware development may be things such as the original (legacy) code base, development language(s) used, system architecture & design, hardware (types of disk drives, manufacturer), and so forth. They may even include educational requirements for hiring, hiring practices, and company values.
Active layers in software and hardware development may include release processes, User Story writing and acceptance criteria, and development practices like TDD/ATDD, test automation, code reviews, and pair programming.
Separation of Risk and Error Management Concerns
To better focus on dealing with the most appropriate work at the appropriate time in responding to error detection, triage, and risk mitigation, we can separate our risk and error analysis into the following areas:
During development: focus on trapping errors
Prevention – the practices, procedures, and techniques we undertake in engineering disciplines to help ensure we do not release bugs or errors into our code base or hardware products.
Detection – the methods available to us as individual engineers, teams, and the organization as a whole to find and respond to errors in our code base or hardware products (which includes reporting and tracking).
Risk mitigation: steps for errors that have escaped into certification or production environments
Risk Analysis – the steps required to analyze the severity and impact of an error.
Risk Decision-making – the process of ensuring decisions about risk avoidance, acceptance, or mitigation are made at appropriate levels with full transparency.
Continuous Improvement in every case
Improvement – the process of improving workflows and practices through shared knowledge and experience in order to improve engineering practices and further harden our release cycles. This step uses root cause analysis to help close the holes we find in the layers of our Swiss cheese model.
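As one possible illustration of risk analysis feeding risk decision-making, the Python sketch below scores an escaped error and routes the decision to a level of the organization. The 1 to 5 scales, the multiplication of severity by impact, the thresholds, and the level names are all assumptions made for this example; they are not prescribed by the model described here, and a real team would calibrate its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscapedError:
    """An error found in a certification or production environment."""
    description: str
    severity: int  # hypothetical scale: 1 (cosmetic) to 5 (critical)
    impact: int    # hypothetical scale: 1 (few users) to 5 (all users)


def risk_score(error: EscapedError) -> int:
    """Risk analysis: combine severity and impact into one number (assumed formula)."""
    return error.severity * error.impact


def decision_level(error: EscapedError) -> str:
    """Risk decision-making: pick who should decide, using assumed thresholds."""
    score = risk_score(error)
    if score >= 15:
        return "leadership"
    if score >= 6:
        return "product owner"
    return "team"


# Made-up errors showing how the routing behaves.
typo = EscapedError("typo on settings page", severity=1, impact=2)
slow = EscapedError("slow report export", severity=2, impact=3)
loss = EscapedError("data loss on save", severity=5, impact=4)

for err in (typo, slow, loss):
    print(err.description, "->", decision_level(err))
```

Writing the thresholds down, even crudely, supports the transparency goal: anyone can see why a given error was escalated or handled within the team.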
Here is one conceptualization of what a Defense-in-depth Risk Management model might look like. Bear in mind that this is simply one way to conceive of layers at a more macro level, and each layer could easily itself be broken down into a set of layers, or you could conceive of it as one very large model.
Given our model and our new ability to conceive of Risk and Error Management in this more meaningful and purposeful way, our next step is to understand error causality and what we can do to apply our causal analysis to strengthening our software and hardware risk management and error trapping system.
Continue reading in part 2 of this 3-part series.