What does it mean to optimise for resilience? Why is resilience so valuable to an organisation, and how can operability contribute towards it?
In this article Steve Smith explains what optimising for resilience is, and why it is so valuable to IT delivery. This is part of the Resilience As A Continuous Delivery Enabler series.
RESILIENCE IS GRACEFUL EXTENSIBILITY
When an organisation wants to improve the reliability of its IT services it should optimise for resilience. Resilience is the ability to “absorb or avoid damage without suffering complete failure”, and it is achieved by minimising the Mean Time To Repair (MTTR) of services. Some classes of failure should never occur, some failures are more costly than others, and some safety-critical systems should never have failures, but in general organisations should adhere to John Allspaw’s advice that “being able to recover quickly from failure is more important than having failures less often”.
Resilience can be thought of as graceful extensibility. In Four Concepts for Resilience and their Implications for Systems Safety in the Face of Complexity, David Woods describes graceful extensibility as “the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries”. Optimising for resilience means creating a production environment that can gracefully extend to deal with the unpredictable behaviours, unexpected changes, and periods of failure that will inevitably occur when running IT services. This minimises both the cost per unit time and the duration of production failures, reducing the direct revenue costs and indirect opportunity costs created by a failure.
Resilience needs to be built into teams and services throughout an organisation. In Resilience Engineering In Practice, Erik Hollnagel et al. define the cornerstones of Resilience Engineering as:
- Anticipation is knowing what to expect. This is imagining the potential for future failures, and mitigating for those scenarios in advance
- Monitoring is knowing what to look for. This is inspecting past and present operating conditions, and alerting when anomalies occur
- Response is knowing what to do. This is using guidelines, heuristics, improvisation skills, and situational awareness to mitigate a failure
- Learning is knowing what has happened. This is understanding the circumstances of a near-miss or failure, and sharing the observations
These cornerstones are non-linear and complementary. For example, if a team has a major launch in the near future it might invest more time in anticipating failure scenarios, which might in turn improve its monitoring and response capabilities.
CREATING ADAPTIVE CAPACITY WITH OPERABILITY
The graceful extensibility of an organisation is derived from the adaptive capacity of its teams and their services. When an organisation optimises for resilience it can create sources of adaptive capacity by making a long-term investment in the operability of its IT services. Operability is defined as “the ability to keep a system in a safe and reliable functioning condition”, and it is associated with a set of practices.
Each of these operability practices can be linked to a cornerstone of Resilience Engineering. They will produce a more effective incident response, and increase adaptive capacity:
- Anticipation – Automated Infrastructure creates reproducible environments, and a Defensive Architecture limits failure blast radius. Smoke Testing verifies service health, and Chaos Engineering uncovers latent failures in production. Shared On-Call fosters a “You Build It, You Run It” culture and increases situational awareness, and Runbooks are a repository for operational knowledge
- Monitoring – Logging radiates data on traffic, errors, latency, and saturation, and Monitoring visualises service metrics and events in a time series. Anomaly Detection identifies events that breach normal operating conditions, and Alerting notifies operators of abnormalities to act on. User Analytics show success rates for user journeys
- Response – Feature Toggles allow features to be limited, tested in isolation, or turned off on failure, and Self-Healing automatically restores failed service instances
- Learning – Blameless Post-Mortems uncover the multiple contributors to a near-miss or failure and suggest future preventative measures, while respecting the best efforts of individuals and the dangers of hindsight bias ¹
For example, incident response at Fruits-U-Like would be much improved if the organisation were optimised for resilience. Assume its third-party registration service starts to struggle under load, new customers cannot check out their purchases, and the failure cost per unit time is £80K per day. The checkout team would receive an automated alert for the failure, and their logging and monitoring dashboards would show a correlation between checkout and registration failures. The team would be able to triage a third-party registration error within 5 minutes, and self-deploy an improvement to connection handling within a day. The failure would have a 1-day repair cost of £80K, with a detection sunk cost of £278 and a remediation opportunity cost of £79,722.
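The figures above follow from multiplying the failure cost per unit time by the time elapsed. The sketch below reproduces that arithmetic under the assumption that cost accrues evenly across the day and that the repair duration includes the detection time; the names `failure_cost` and `FailureCost` are illustrative and do not come from the series.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class FailureCost:
    sunk: float         # cost accrued before the failure is detected
    opportunity: float  # cost accrued while the fix is being prepared

    @property
    def total(self) -> float:
        return self.sunk + self.opportunity


def failure_cost(cost_per_day: float, detection_minutes: float,
                 repair_minutes: float) -> FailureCost:
    """Split a failure cost into detection and remediation parts.

    Assumes a constant cost rate; repair_minutes covers the whole
    failure, including the time taken to detect it.
    """
    per_minute = cost_per_day / MINUTES_PER_DAY
    sunk = per_minute * detection_minutes
    total = per_minute * repair_minutes
    return FailureCost(sunk=sunk, opportunity=total - sunk)


# Fruits-U-Like: £80K per day, triaged in 5 minutes, fixed within a day.
cost = failure_cost(80_000, detection_minutes=5, repair_minutes=MINUTES_PER_DAY)
print(round(cost.sunk), round(cost.opportunity), round(cost.total))
# 278 79722 80000
```

Shortening either the detection time or the overall repair time lowers the total, which is why MTTR is the lever that resilience work pulls on.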
If the checkout team adopted Defensive Architecture techniques they could combine a Circuit Breaker, a Bulkhead, and a Feature Toggle in anticipation of registration errors. If the registration service struggled under load, the Circuit Breaker would regulate registration requests to allow a percentage to succeed, and the Bulkhead would warn the checkout frontend to skip registration for some customers. This approach would reduce the failure cost per unit time to a marketing opportunity cost of £5K a day. The checkout team would not receive an alert, but within minutes their dashboards would highlight registration errors and they could use a Feature Toggle to enable anonymous checkouts for new customers. This would allow them to deploy their connection handling fix within 3 hours with no customer impact. The result would be a 3-hour repair cost of £625, with a sunk cost of £18 and an opportunity cost of £607.
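As a rough illustration of how a Circuit Breaker and a Feature Toggle might fit together in the checkout path, the sketch below uses a conventional count-based breaker rather than the percentage-based regulation described above, and it does not model the Bulkhead. Every name in it (`CircuitBreaker`, `checkout`, the `anonymous_checkout` toggle, the `register` callable) is hypothetical.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has passed."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def checkout(customer, register, breaker, toggles):
    """Complete a checkout, attempting registration only when it is safe to."""
    if toggles.get("anonymous_checkout") or not breaker.allow_request():
        return "anonymous"
    try:
        register(customer)
    except ConnectionError:
        breaker.record_failure()
        return "anonymous"  # degrade rather than block the purchase
    breaker.record_success()
    return "registered"


def overloaded_register(customer):
    raise ConnectionError("registration service overloaded")


breaker = CircuitBreaker(max_failures=2)
toggles = {"anonymous_checkout": False}
for _ in range(3):
    # The third call never reaches the registration service: the breaker is open.
    print(checkout("new-customer", overloaded_register, breaker, toggles))
```

In this sketch the purchase always completes; the breaker only decides whether registration is attempted, and setting the toggle to true lets operators skip registration entirely while a fix is deployed.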
¹ In How Complex Systems Fail, Richard Cook warns that “hindsight bias remains the primary obstacle to accident investigation”. There is no such thing as a root cause in a complex production system, nor a blameworthy individual.
The Resilience As A Continuous Delivery Enabler series:
- The Cost And Theatre Of Optimising For Robustness
- Responding To Failure When Optimising For Robustness
- The Value Of Optimising For Resilience
- Resilience As A Continuous Delivery Enabler – TBA