You are currently browsing articles tagged Change.

What does it mean to optimise for resilience? Why is resilience so valuable to an organisation, and how can operability contribute towards it?

In this article Steve Smith explains what optimising for resilience is, and why it is so valuable to IT delivery. This is part of the Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler – TBA


When an organisation wants to improve the reliability of its IT services it should optimise for resilience. Resilience is the ability to “absorb or avoid damage without suffering complete failure“, and it is achieved by minimising the Mean Time To Repair (MTTR) of services. Some classes of failure should never occur, some failures are more costly than others, and some safety-critical systems should never have failures, but in general organisations should adhere to John Allspaw’s advice that “being able to recover quickly from failure is more important than having failures less often“.

Resilience can be thought of as graceful extensibility. In Four Concepts for Resilience and their Implications for Systems Safety in the Face of Complexity, David Woods describes graceful extensibility as “the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries“. Optimising for resilience means creating a production environment that can gracefully extend to deal with the unpredictable behaviours, unexpected changes, and periods of failure that will inevitably occur with running IT services. This allows for the cost per unit time and duration of production failures to be minimised, reducing both the direct revenue costs and indirect opportunity costs created by a failure.

Resilience needs to be built into teams and services throughout an organisation. In Resilience Engineering In Practice, Erik Hollnagel et al define the cornerstones of Resilience Engineering as:

  • Anticipation is knowing what to expect. This is imagining the potential for future failures, and mitigating for those scenarios in advance
  • Monitoring is knowing what to look for. This is inspecting past and present operating conditions, and alerting when anomalies occur
  • Response is knowing what to do. This is using guidelines, heuristics, improvisation skills, and situational awareness to mitigate a failure
  • Learning is knowing what has happened. This is understanding the circumstances of a near-miss or failure, and sharing the observations

These cornerstones are non-linear and complementary. For example, if a team has a major launch in its near-future it might invest more time in anticipating failure scenarios, which might result in improved monitoring and response capabilities.


The graceful extensibility of an organisation is derived from the adaptive capacity of its teams and their services. When an organisation optimises for resilience it can create sources of adaptive capacity by making a long-term investment in the operability of its IT services. Operability is defined as “the ability to keep a system in a safe and reliable functioning condition“, and it is associated with a set of practices:

Each of these operability practices can be linked to a cornerstone of Resilience Engineering. They will produce a more effective incident response, and increase adaptive capacity:

  • AnticipationAutomated Infrastructure creates reproducible environments, and a Defensive Architecture limits failure blast radius. Smoke Testing verifies service health, and Chaos Engineering uncovers latent failures in production. Shared On-Call fosters a “You Build It, You Run It” culture and increases situational awareness, and Runbooks are a repository for operational knowledge
  • MonitoringLogging radiates data on traffic, errors, latency, and saturation, and Monitoring visualises service metrics and events in a time series. Anomaly detection identifies events that breach normal operating conditions and Alerting notifies operators of abnormalities to act on. User analytics show success rates for user journeys.
  • ResponseFeature Toggles allow features to be limited, tested in isolation, or turned off on failure, and Self-Healing automatically restores failed service instances
  • LearningBlameless Post-Mortems uncover the multiple contributors to a near-miss or failure and suggest future preventative measures, while respecting the best efforts of individuals and the dangers of hindsight bias 1

For example, incident response at Fruits-U-Like would be much improved if the organisation was optimised for resilience. Assume its third party registration service starts to struggle under load, new customers cannot checkout their purchases, and the failure cost per unit time is £80K per day. The checkout team would receive an automated alert for the failure, and their logging and monitoring dashboards would show a correlation between checkout and registration failures. The team would be able to triage a third party registration error within 5 minutes, and self-deploy an improvement to connection handling within a day. The failure would have a 1 day repair cost of £80K, with a detection sunk cost of £278 and a remediation opportunity cost of £79,722.

If the checkout team adopted Defensive Architecture techniques they could combine a Circuit Breaker, a Bulkhead, and a Feature Toggle in anticipation of registration errors. If the registration service struggled under load the Circuit Breaker would regulate registration requests to allow a percentage to succeed, and the Bulkhead would warn the checkout frontend to skip registration for some customers. This approach would reduce the failure cost per unit time to a marketing opportunity cost of £5K a day. The checkout team would not receive an alert, but within minutes their dashboards would highlight registration errors and they could use a Feature Toggle to enable anonymous checkouts for new customers. This would allow them to deploy their connection handling fix within 3 hours with no customer impact. The result would be a 3 hour repair cost of £625, with a sunk cost of £18 and an opportunity cost of £607.

1 In How Complex Systems Fail, Richard Cook warns that “hindsight bias remains the primary obstacle to accident investigation. There is no such thing as a root cause in a complex production system, nor a blameworthy individual

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. Responding To Failure When Optimising For Robustness
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler – TBA


This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

Tags: , , ,

Why is it wrong to assume failures are preventable in IT? Why does optimising for robustness leave organisations ill-equipped to deal with failure, and what are the usual outcomes?

In this article Steve Smith explains why a production environment is always in a state of near-failure, why optimising for robustness results in a brittle incident response process, and why Dual Value Streams are a common countermeasure to failure.

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler – TBA


An organisation that optimises for robustness will attempt to maintain a production environment free from failure. This approach is based on the belief that failures in IT services are caused by isolated, faulty changes that are entirely preventable. A production environment is viewed as a set of homogeneous processes, with predictable interactions occurring in repeatable conditions. This matches the Cynefin definition of a complicated system, in which expert knowledge can be used to predict the cause and effect of events.

Optimising for robustness will inevitably lead to an overinvestment in pre-production risk management, and an underinvestment in production risk management. Symptoms of underinvestment include:

  • Stagnant requirements – “non-functional” requirements are deprioritised for weeks or months at a time
  • Snowflake infrastructure – environments are manually created and maintained in an unreproducible state
  • Inadequate telemetry – logs and metrics are scarce, anomaly detection and alerting are manual, and user analytics lack insights
  • Fragile architecture – services are coupled, service instances are stateful, failures are uncontained, and load vulnerabilities exist
  • Insufficient training – operators are not given the necessary coaching, education, or guidance

This underinvestment creates an inoperable production environment, which makes it difficult for operators to keep IT services in a safe and reliable functioning condition. This will often be deemed acceptable, as production failures are expected to be rare.


A production environment of running IT services is not a complicated system. It is an intractable mass of heterogeneous processes, with unpredictable interactions occurring in unrepeatable conditions. It is a complex system of emergent behaviours, in which the cause and effect of an event can only be perceived in retrospect.

As Richard Cook explains in How Complex Systems Fail, “the complexity of these systems makes it impossible for them to run without multiple flaws being present“. A production environment always contains partial faults, and is constantly in a state of near-failure.

A failure will occur when unrelated faults unexpectedly coalesce such that one or more functions cannot succeed. Its revenue cost will be a function of cost per unit time and duration, with cost per unit time the economic impact and duration the time between start and end. Its opportunity costs will come from loss of customer confidence, and increased failure demand slowing feature development.

An organisation optimised for robustness will be ill-equipped to deal with a failure when it does occur. The inoperability of the production environment will produce a brittle incident response:

  • Stagnant requirements and insufficient training will make it difficult to anticipate how services might fail
  • Inadequate telemetry will impede the monitoring of normal versus abnormal operating conditions
  • Snowflake infrastructure and a fragile architecture will prevent a rapid response to failure

For example, at Fruits-U-Like a third party registration service begins to suffer under load. The website rejects new customers on checkout, and a failure begins with a static cost per unit time of £80K per day. A lack of telemetry means the operations team cannot triage for 3 days. After triage an incident is assigned to the checkout team, who improve connection handling within a day. The Change Advisory Board agrees the fix can skip End-To-End Testing, and it is deployed the following day. The failure has a 5 day repair cost of £400K, with a detection sunk cost of £240K and a remediation opportunity cost of £160K.

After a failure, the assumption that failures are caused by individuals will lead to a blame culture. There will be an attitude Sidney Dekker calls the Bad Apple Theory, in which production is considered absolutely reliable bar the actions of a few unreliable employees. The combination of the Bad Apple Theory and hindsight bias will create an oppressive culture of naming, blaming, and shaming the individuals involved. This discourages the sharing of operational knowledge and organisational learnings.


An organisation optimised for robustness will be in a state of Discontinuous Delivery. Attempting to increase the Mean Time Between Failures (MTBF) with practices such as End-To-End Testing will increase feature lead times to the extent that business demand will be unsatisfiable. However, the rules for deploying a production fix will be very different.

When a production fix for a failure is available, people will share a sense of urgency. Regardless of how cost per unit time is estimated, there will be a recognition that a sunk cost has been incurred and an opportunity cost needs to be minimised. There will be a consensus that a different approach is required to avoid long feature lead times.

Dual Value Streams is a common countermeasure to failure when optimising for robustness. For each technology value stream in situ, there will actually be two different value streams. The feature value stream will retain all the advertised pre-production risk management practices, and will take weeks or months to complete. The fix value stream will strip out most if not all pre-production activities, and will take days to complete.

At Fruits-U-Like, that means a 12 week feature value stream from code to production and a 5 day fix value stream from failure start to end 2.


Dual Value Streams signify Discontinuous Delivery, but they also show potential for Continuous Delivery. The fix value stream indicates the lead times that can be accomplished when people have a shared sense of urgency, actively collaborate on releases, and omit the risk management theatre.

1 In The DevOps Handbook by Patrick Debois et al telemetry is defined as a logical grouping of logging, monitoring, anomaly detection, alerting, and user analytics

2 Measuring Continuous Delivery details why deployment failure recovery time should include development time and deployment lead time should not. Deployment failure recovery time is measured from failure start to failure end, while deployment lead time is measured from master commit to production deployment

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. Responding To Failure When Optimising For Robustness
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler – TBA


This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

Tags: , , ,

Why do so many organisations optimise their IT delivery for robustness? What risk management practices are normally involved, and do their capabilities outweigh their costs?

In this article Steve Smith explains what optimising for robustness is, and why it is inadequate for IT delivery. This is part of the Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler – TBA

The robustness tradition

As software continues to eat the world, organisations must position IT at the heart of their business strategy. The speed of IT delivery needs to be capable of satisfying customer demand, and at the same time the reliability of IT services must be ensured to protect daily business operations.

In Practical Reliability Engineering, Patrick O’Connor and Andre Kleyner define reliability as “The probability that [a system] will perform a required function without failure under stated conditions for a stated period of time“. When an organisation has unreliable IT services its business operations are left vulnerable to IT outages, and the cost of downtime could prove ruinous if market conditions are unfavourable. Such an organisation will have an ingrained fear of failure, due to the lack of confidence in those IT services. There will also be a simultaneous belief that failures are preventable, based on the assumption that IT services are predictable and failures are caused by isolated changes.

In these circumstances an organisation will traditionally optimise for robustness. It will focus on maximising the ability of its IT services to “resist change without adapting [their] initial stable configuration, by increasing Mean Time Between Failures (MTBF). It will use robustness-centric risk management practices in its technology value streams to reduce the risk of future failures, such as 1:

  • End-To-End Testing to verify the functionality of a new service version against its unowned dependent services
  • Change Advisory Boards to assess, prioritise, and approve the deployment of new service versions
  • Change Freezes to restrict the deployment of new service versions for a period of time derived from market conditions

Consider a fictional Fruits-U-Like organisation, with development teams working to 2 week iterations and a quarterly release cycle. Fruits-U-Like has optimised itself for robustness ever since a 24 hour website outage 5 years ago. Each release goes through 6 weeks of End-To-End Testing with the testing team, a 2 week Change Advisory Board, and 1 week of preparation with the operations team. There are also several 4 week Change Freezes throughout the year, to coincide with marketing campaigns.

Costs and Risk Management Theatre

Robustness is a desirable capability of an IT service, but optimising for robustness invariably means spending too much time for too little risk reduction. The risk management practices used will be far more costly and less valuable than expected:

If the next Fruits-U-Like release was estimated to be worth £50K per day in new revenue, the 12 week lead time would create a total opportunity cost of £4.2 million. This would include the handover delays between the development, testing, and operations teams due to misaligned priorities. If a Change Freeze delayed the deployment by another 4 weeks the opportunity cost would increase to £5.6 million.

These risk management practices are what Jez Humble calls Risk Management Theatre. They are based on the misguided assumption that preventative controls on everyone will prevent anyone from making a mistake. Furthermore, they actually increase risk by ensuring a large batch size and a sizeable amount of requirements/technology changes per service version 2. They impede knowledge sharing, restrict situational awareness, create enormous opportunity costs, and doom organisations to a state of Discontinuous Delivery.

1 Other practices include manual regression testing, segregation of duties, and uptime incentives for operators

2 The Principles of Product Development Flow by Don Reinertsen describes in detail how large batch sizes increase risk

The Resilience As A Continuous Delivery Enabler series:

  1. The Cost And Theatre Of Optimising For Robustness
  2. When Optimising For Robustness Fails
  3. The Value Of Optimising For Resilience
  4. Resilience As A Continuous Delivery Enabler – TBA


This series is indebted to John Allspaw and Dave Snowden for their respective work on Resilience Engineering and Cynefin.

Thanks to Beccy Stafford, Charles Kubicek, Chris O’Dell, Edd Grant, Daniel Mitchell, Martin Jackson, and Thierry de Pauw for their feedback on this series.

Tags: , , ,

Projects kill teams and flow

Given the No Projects definition of a project as “a fixed amount of time and money assigned to deliver a large batch of value add“, it is not surprising that for many organisations a new project heralds the creation of a Project Team:

A project team is a temporary organisational unit responsible for the implementation and delivery of a project

When a new project is assigned a higher priority than business as usual and the Iron Triangle is in full effect, there can be intense pressure to deliver on time and on budget. As a result a Project Team appears to be an attractive option, as costs and progress can be monitored in isolation, and additional personnel can be diverted to the project when necessary. Unfortunately, in addition to managing the increased risk, variability, and overheads associated with a large batch of value-add, a Project Team is fatally compromised by its coupling to the project lifecycle.

The process of forming a team of complementary personnel that establish a shared culture and become highly productive is denied to Project Teams from start to finish. At the start of project implementation, the presence of a budget and a deadline means a Project Team is formed via:

  1. Cannibalisation – impairs productivity as entering team members incur a context switching overhead
  2. Recruitment – devalues cultural fit and required skills as hiring practices are compromised

Furthermore, at the end of project delivery the absence of a budget or a deadline means a Project Team is disbanded via:

  1. Cannibalisation – impairs productivity as exiting team members incur a context switching overhead
  2. Termination – devalues cultural fit and acquired skills as people are undervalued

This maximisation of resource efficiency clearly has a detrimental effect upon flow efficiency. Cannibalising a team member objectifies them as a fungible resource, and devalues their mastery of a particular domain. Project-driven recruitment of a team member ignores Johanna Rothman’s advice that “when you settle for second best, you often get third or fourth best” and “if a candidate’s cultural preferences do not match your organisation, that person will not fit“. Terminating a team member denigrates their accumulated domain knowledge and skills, and can significantly impact staff morale. Overall this strategy is predicated upon the notion that there will be no further business change, and as Allan Kelly warns that “the same people are unlikely to work together again“, it is an extremely dangerous assumption.

The inherent flaws in the Project Team model can be validated by an examination of any professional sports team that has enjoyed a period of sustained success. For example, when Sir Alex Ferguson was interviewed about his management style at Manchester United he described his initial desire to create a “continuity of supply to the first team… the players all grow up together, producing a bond“. This approach fostered a winning culture that valued long-term goals over short-term gains, and led to 20 years of unrivalled dominance. It is unlikely that Manchester United would have experienced the same amount of success had their focus been upon a particular season at the expense of others.

Therefore, the alternative to building a Project Team is to grow a Product Team:

A product team is a permanent organisational unit responsible for the continuous improvement of a product

Following Johanna’s advice to “keep teams of people together and flow the projects through cross-functional teams“, Product Teams are decoupled from project lifecycles and are empowered to pull in work as required. This enables a team to form a shared culture that reduces variability and improves stability, which as observed by Tobias Mayer “leads to enhanced focus and high performance“. Over a period of time a Product Team will master the relevant business and technical domains, which will fuel product innovation and produce a return on investment that rewards us for making the correct strategic decision of favouring products over projects.

Tags: , ,

No Projects

Projects kill flow and teams. Focus on products, not projects

Since the Dawn of Computer Time, enormous sums of money and embarrassing amounts of time have been squandered upon software projects that have delivered little or no return on investment, with projects floundering between segregated Business and IT divisions squabbling over overestimated value-add and underestimated delivery dates. Given Grant Rule’s assertion that “studies too numerous to mention show that software projects are challenged or fail“, why are software projects so prone to failure and why do they persist?

To answer these questions, we must understand what constitutes a software project and why its delivery model is incongruent with product development. If we start with the PRINCE 2 project definition of “a temporary organization that is needed to produce a unique and predefined outcome or result at a pre-specified time using predetermined resources“, we can offer a concise definition as follows:

A project is a fixed amount of time and money assigned to deliver value-add

The key characteristic of a software project appears to be its fixed end date, which as a delivery model has been repeatedly debunked by IT practitioners such as Allan Kelly denouncing “endless, pointless discussions about when it will be done… successful software doesn’t have a pre-specified end date” and Marc Lankhorst arguing that “over 80% of IT spending in large organisations is on maintenance“. However, the fixed end date of a software project is invariably a consequence of its requirement for a collection of value-adding features to be simultaneously delivered, suggesting an augmented definition of:

A project is a fixed amount of time and money assigned to deliver a large batch of value-add

Once we view software projects as large batches of value-add, we can apply The Principles Of Product Development Flow by Don Reinertsen and better understand why so many projects fail:

  1. Increased cycle time – a project might not be deliverable on a particular date unless either demand is throttled or capacity is increased, e.g. artifically reduce user demand or increase staffing levels
  2. Increased variability – a project might be delayed due to unpredictable blockages in the value stream, e.g. testing of features B and C blocked while testing of feature A takes longer than expected
  3. Increased feedback delays – a project might incur significant costs due to slow feedback on bad design decisions and/or defects increasing rework, e.g. failures in feature C not detected until features A and B have passed testing
  4. Increased risk – a project might have an increased probability and cost of failure due to increased requirements/technology change, increased variation, and increased feedback delays
  5. Increased overheads  – a project might endure development inefficiencies due to increased requirements/technology change, e.g. feature C development time increased by need to understand complexity of features A and B
  6. Increased inefficiencies – a project might encounter increased transaction costs due to increased requirements/technology change e.g. feature A slow to release as features B and C also required for release
  7. Increased irresponsibility – a project might suffer from diluted responsibilities, e.g. staff member has responsibility for delivery of feature A but is unincentivised to participate in delivery of features B or C

Don also provides a compelling explanation as to why the project delivery model remains prevalent, by explaining how large batches can become institutionalised as they “appear to have scale economies that increase efficiency [and] appear to reduce variability“. Software projects might indeed appear to be efficient due to perceived value stream inefficiencies and the counter-intuitiveness of batch size reduction, but from a product development standpoint it is an inefficient, ineffective delivery model that impedes value, quality, and flow.

There is a compelling alternative to the project delivery model – product development flow, in which we apply economic theory to Lean product development practices in order to flow product designs through our organisation. Product development flow emphasises the benefits of batch size reduction and encourages a one piece continuous flow delivery model, in order to reduce costs and improve return on investment.

Discarding the project delivery model in favour of product development flow requires an entirely different mindset, as epitomised by Grant urging us to “accommodate the ideas of flow production and lean systems thinking” and Allan affirming that “BAU isn’t a dirty word… enhancing products is Business As Usual, we should be proud of that“. On that basis the No Projects movement was conceived by Joshua Arnold to promote the valuation of products over projects, and anointed as:

Projects kill flow and teams. Focus on products, not projects

Tags: , , , ,

« Older entries