Resilience As A Continuous Delivery Enabler

You are currently browsing articles tagged Resilience As A Continuous Delivery Enabler.

Why is it wrong to assume failures are preventable in IT? Why does optimising for robustness leave organisations ill-equipped to deal with failure, and what are the usual outcomes?

In this article Steve Smith explains why a production environment is always in a state of near-failure, why optimising for robustness results in a brittle incident response process, and why Dual Value Streams are a common countermeasure to failure.

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience – TBA
  4. Resilience As A Continuous Delivery Enabler – TBA

THE UNDERINVESTMENT IN PRODUCTION

An organisation that optimises for robustness will attempt to maintain a production environment free from failure. This approach is based on the belief that failures in IT services are caused by isolated, faulty changes that are entirely preventable. A production environment is viewed as a set of homogeneous processes, with predictable interactions occurring in repeatable conditions. This matches the Cynefin definition of a complicated system, in which expert knowledge can be used to predict the cause and effect of events.

Optimising for robustness will inevitably lead to an overinvestment in pre-production risk management, and an underinvestment in production risk management. Symptoms of underinvestment include:

  • Stagnant requirements – “non-functional” requirements are deprioritised for weeks or months at a time
  • Snowflake infrastructure – environments are manually created and maintained in an unreproducible state
  • Inadequate telemetry – logs and metrics are scarce, anomaly detection and alerting are manual, and user analytics lack insights
  • Fragile architecture – services are coupled, service instances are stateful, failures are uncontained, and load vulnerabilities exist
  • Insufficient training – operators are not given the necessary coaching, education, or guidance

This underinvestment creates an inoperable production environment, which makes it difficult for operators to keep IT services in a safe and reliable functioning condition. This will often be deemed acceptable, as production failures are expected to be rare.

THE CONSTANCY OF FAILURE

A production environment of running IT services is not a complicated system. It is an intractable mass of heterogeneous processes, with unpredictable interactions occurring in unrepeatable conditions. It is a complex system of emergent behaviours, in which the cause and effect of an event can only be perceived in retrospect.

As Richard Cook explains in How Complex Systems Fail, “the complexity of these systems makes it impossible for them to run without multiple flaws being present“. A production environment always contains partial faults, and is constantly in a state of near-failure.

A failure will occur when unrelated faults unexpectedly coalesce such that one or more functions cannot succeed. Its revenue cost will be a function of cost per unit time and duration, with cost per unit time the economic impact and duration the time between start and end. Its opportunity costs will come from loss of customer confidence, and increased failure demand slowing feature development.

An organisation optimised for robustness will be ill-equipped to deal with a failure when it does occur. The inoperability of the production environment will produce a brittle incident response:

  • Stagnant requirements and insufficient training will make it difficult to anticipate how services might fail
  • Inadequate telemetry will impede the monitoring of normal versus abnormal operating conditions
  • Snowflake infrastructure and a fragile architecture will prevent a rapid response to failure

For example, at Fruits-U-Like a third party registration service begins to suffer under load. The website rejects new customers on checkout, and a failure begins with a static cost per unit time of £80K per day. A lack of telemetry means the operations team cannot triage for 3 days. After triage an incident is assigned to the checkout team, who improve connection handling within a day. The Change Advisory Board agrees the fix can skip End-To-End Testing, and it is deployed the following day. The failure has a 5 day repair cost of £400K, with a detection sunk cost of £240K and a remediation opportunity cost of £160K.

After a failure, the assumption that failures are caused by individuals will lead to a blame culture. There will be an attitude Sidney Dekker calls the Bad Apple Theory, in which production is considered absolutely reliable bar the actions of a few unreliable employees. The combination of the Bad Apple Theory and hindsight bias will create an oppressive culture of naming, blaming, and shaming the individuals involved. This discourages the sharing of operational knowledge and organisational learnings.

THE DUAL VALUE STREAMS COUNTERMEASURE

An organisation optimised for robustness will be in a state of Discontinuous Delivery. Attempting to increase the Mean Time Between Failures (MTBF) with practices such as End-To-End Testing will increase feature lead times to the extent that business demand will be unsatisfiable. However, the rules for deploying a production fix will be very different.

When a production fix for a failure is available, people will share a sense of urgency. Regardless of how cost per unit time is estimated, there will be a recognition that a sunk cost has been incurred and an opportunity cost needs to be minimised. There will be a consensus that a different approach is required to avoid long feature lead times.

Dual Value Streams is a common countermeasure to failure when optimising for robustness. For each technology value stream in situ, there will actually be two different value streams. The feature value stream will retain all the advertised pre-production risk management practices, and will take weeks or months to complete. The fix value stream will strip out most if not all pre-production activities, and will take days to complete.

At Fruits-U-Like, that means a 12 week feature value stream from code to production and a 5 day fix value stream from failure start to end 2.

 

Dual Value Streams signify Discontinuous Delivery, but they also show potential for Continuous Delivery. The fix value stream indicates the lead times that can be accomplished when people have a shared sense of urgency, actively collaborate on releases, and omit the risk management theatre of extensive pre-production risk management.

1 In The DevOps Handbook by Patrick Debois et al telemetry is defined as a logical grouping of logging, monitoring, anomaly detection, alerting, and user analytics

2 Measuring Continuous Delivery details why deployment failure recovery time should include development time and deployment lead time should not. Deployment failure recovery time is measured from failure start to failure end, while deployment lead time is measured from master commit to production deployment

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. Responding To Failure When Optimising For Robustness
  3. The Value Of Optimising For Resilience – TBA
  4. Resilience As A Continuous Delivery Enabler – TBA

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

Tags: , , ,

Why do so many organisations optimise their IT delivery for robustness? What risk management practices are normally involved, and do their capabilities outweigh their costs?

In this article Steve Smith explains what optimising for robustness is, and why it is inadequate for IT delivery. This is part of the Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience – TBA
  4. Resilience As A Continuous Delivery Enabler – TBA

The robustness tradition

As software continues to eat the world, organisations must position IT at the heart of their business strategy. The speed of IT delivery needs to be capable of satisfying customer demand, and at the same time the reliability of IT services must be ensured to protect daily business operations.

In Practical Reliability Engineering, Patrick O’Connor and Andre Kleyner define reliability as “The probability that [a system] will perform a required function without failure under stated conditions for a stated period of time“. When an organisation has unreliable IT services its business operations are left vulnerable to IT outages, and the cost of downtime could prove ruinous if market conditions are unfavourable. Such an organisation will have an ingrained fear of failure, due to the lack of confidence in those IT services. There will also be a simultaneous belief that failures are preventable, based on the assumption that IT services are predictable and failures are caused by isolated changes.

In these circumstances an organisation will traditionally optimise for robustness. It will focus on maximising the ability of its IT services to “resist change without adapting [their] initial stable configuration, by increasing Mean Time Between Failures (MTBF). It will use robustness-centric risk management practices in its technology value streams to reduce the risk of future failures, such as 1:

  • End-To-End Testing to verify the functionality of a new service version against its unowned dependent services
  • Change Advisory Boards to assess, prioritise, and approve the deployment of new service versions
  • Change Freezes to restrict the deployment of new service versions for a period of time derived from market conditions

Consider a fictional Fruits-U-Like organisation, with development teams working to 2 week iterations and a quarterly release cycle. Fruits-U-Like has optimised itself for robustness ever since a 24 hour website outage 5 years ago. Each release goes through 6 weeks of End-To-End Testing with the testing team, a 2 week Change Advisory Board, and 1 week of preparation with the operations team. There are also several 4 week Change Freezes throughout the year, to coincide with marketing campaigns.

Costs and Risk Management Theatre

Robustness is a desirable capability of an IT service, but optimising for robustness invariably means spending too much time for too little risk reduction. The risk management practices used will be far more costly and less valuable than expected:

If the next Fruits-U-Like release was estimated to be worth £50K per day in new revenue, the 12 week lead time would create a total opportunity cost of £4.2 million. This would include the handover delays between the development, testing, and operations teams due to misaligned priorities. If a Change Freeze delayed the deployment by another 4 weeks the opportunity cost would increase to £5.6 million.

These risk management practices are what Jez Humble calls Risk Management Theatre. They are based on the misguided assumption that preventative controls on everyone will prevent anyone from making a mistake. Furthermore, they actually increase risk by ensuring a large batch size and a sizeable amount of requirements/technology changes per service version 2. They impede knowledge sharing, restrict situational awareness, create enormous opportunity costs, and doom organisations to a state of Discontinuous Delivery.

1 Other practices include manual regression testing, segregation of duties, and uptime incentives for operators

2 The Principles of Product Development Flow by Don Reinertsen describes in detail how large batch sizes increase risk

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience – TBA
  4. Resilience As A Continuous Delivery Enabler – TBA

Acknowledgements

This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

Tags: , , ,