
TOAD: RRRAR! Rollbacks, Replays, Reverts, and Regressions

TOAD is

Testing
Observability
And
DevOps

My last blog post was about why TOAD ideas are important and how TOAD ideas interact with each other. But even the most excellent TOAD systems occasionally release bugs to production. This post describes how to understand these problems and how to address them when they happen. And there is an experience report at the end!

Most of the time we learn about problems in the production system because the system is observable. We see a performance problem, or a data problem, or a space problem, and we can use observability tools to narrow down and ultimately fix the cause(s) of the observed problem. But there is also an entire class of problems that cannot be discovered with typical observability tools, only by testing the system. Given that we find such problems in production, what can we do about them?

Rollback and Replay

The most drastic response to a production problem is to "roll back" the system to the last known good state by restoring a backup or backups of some sort. Organizations that adopt a rollback strategy to mitigate production problems will typically also have the ability to replay all the transactions that occurred after the rollback point against the repaired system, in order to bring the state of the system back to the current time. This strategy is expensive, and not all that common. (If the organization has to roll back but is not capable of replaying all the system transactions, users are forced to re-do everything they already did once, which is not very nice.)
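
To make the idea concrete, here is a minimal sketch of rollback-and-replay in Python. It is not a production design; Snapshot and Transaction are hypothetical stand-ins for whatever backup and transaction-log mechanism a real system actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Transaction:
    timestamp: float
    apply: Callable[[Dict], None]  # mutates system state in place

@dataclass
class Snapshot:
    taken_at: float
    state: Dict = field(default_factory=dict)

def rollback_and_replay(snapshot: Snapshot, log: List[Transaction]) -> Dict:
    # 1. Roll back: restore the last known good state from the backup.
    state = dict(snapshot.state)
    # 2. Replay: re-run every transaction recorded after the snapshot,
    #    against the repaired system, to bring it back to the current time.
    for txn in sorted(log, key=lambda t: t.timestamp):
        if txn.timestamp > snapshot.taken_at:
            txn.apply(state)
    return state
```

The expensive part is not this loop; it is building and operating a system in which every transaction is durably logged and safely replayable in the first place.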

Revert

A more common strategy for repairing a problem in production is to identify the particular commit that introduced the problem code, and then revert that commit. This restores the system to a state where we can attempt to achieve whatever the problem code was attempting to achieve, but in a way that does not disrupt production. It is also possible to skip the revert step entirely and simply push new code that corrects the problem; this is typically called a "hot fix", and the net result is the same as a revert plus a new push.
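
As a sketch of the mechanics, the snippet below wraps a plain `git revert` in Python. The commit SHA is a placeholder, and a real workflow would of course send the reverting commit through code review and CI rather than straight to production.

```python
import subprocess

# Placeholder SHA for the commit identified as the source of the problem.
BAD_COMMIT = "abc1234"

def revert_commit(sha: str) -> None:
    # `git revert` creates a *new* commit that undoes the changes in `sha`,
    # leaving history intact; nothing is rolled back or thrown away.
    subprocess.run(["git", "revert", "--no-edit", sha], check=True)

if __name__ == "__main__":
    revert_commit(BAD_COMMIT)
```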

Regression

When a feature that used to work no longer works, we tend to call this a "regression" problem, but "regression" is really a misnomer. It is rare in a TOAD system to break a feature by introducing older, broken code, as the word "regression" implies. It is also rare that a feature that used to work but no longer does can be fixed by putting older code in place of newer; much more likely, we will have to fix the feature by adding even more code. In a TOAD system, a "regression" bug is almost certainly a "progression" bug, caused by adding new code that seemed reasonable but whose implications in production we failed to understand properly.

Regression Testing with TOAD

As I noted in my previous post, DevOps describes the nature of the development process from idea to production. Observability tells us what has happened on the system. Only the testing part of TOAD describes what the system *should* do. Put another way, the T in TOAD is our model of how the system should work: it is the part of TOAD that asserts things about the behavior of the system before the system runs, so that we can then check that our assertions are correct. There is a class of production problems that cannot be identified by observability but only by testing. When we release one of these problems to production, it is a sign that our model of the system as embodied in our tests is either incomplete or not understood correctly. It follows that every behavior problem released to production that cannot be identified with observability tools should prompt a change to our testing. While we may call this change a "regression test", it actually represents an update to our model of the system's behavior.
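
In code, that update is usually nothing more exotic than a new test encoding the corrected understanding. The sketch below uses pytest conventions (a plain test_ function with an assert); `render_heading` is a hypothetical stand-in for whatever code produced the bad behavior.

```python
# A minimal sketch: turning a production incident into an executable
# statement of how the system should behave. `render_heading` is a
# placeholder implementation for illustration only.

def render_heading(text: str, level: int) -> str:
    return f"<h{level}>{text}</h{level}>"

def test_subheadings_render_at_requested_level():
    # Less a "regression test" than a refinement of our model:
    # a level-2 sub-heading must come out as a level-2 heading.
    assert render_heading("History", 2) == "<h2>History</h2>"
```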

An Experience Report

In early 2012 I was hired by the Wikimedia Foundation to create a QA and testing practice. Part of this work was to create a shared test environment called "beta labs". (Although it has evolved wildly since 2012, the beta cluster still exists!) Production Wikipedia is a fiendishly complex environment, and emulating it in the beta labs test environment was a hard problem. In late 2012 beta labs had a lot of problems and glitches, and improving it as a model for Wikipedia was one of my top priorities.

The Ops team for Wikipedia has always been exceptionally talented and effective. Wikipedia has had observability baked in from its earliest days. (Testing came along much later, but is a big part of the culture today.) Wikipedia ran for its first decade with essentially no formal testing at all. This was possible because, from the very first, they had an architecture and a culture that made reverting easy. Any commit that caused an observable problem could be reverted instantly. (During my tenure there it was possible to earn a t-shirt saying "I broke Wikipedia/But I fixed it".)

But some problems are not amenable to observability tools in the Ops sense. In late 2012 we started hearing reports from Wikipedia editors that sub-headings in Wikipedia articles were being rendered in wildly inappropriate sizes: very large, very small, and everything in between.

I immediately recognized these reports. I and a number of other responsible people had seen these corrupt article headers in beta labs in late November 2012. That allowed us to narrow the time window to find the offending commit. Unfortunately, even though several people had seen the problem in the test environment, none of us *believed* that it was a real problem, because we did not yet trust beta labs to be an accurate model of Wikipedia. (That bug was my first real sign that we were on the right track with beta labs, and I *always* paid attention to bugs there afterward!)

The story behind how that bug made it to production is instructive. The source of the bug was a pull request for the Wikipedia editor from a contributor/developer in Europe. They were lobbying hard to get the PR merged, but the PR had not passed code review after several revisions, and a number of responsible developers were not willing to merge it, because they suspected it might not be completely sound. Late in the evening on Thanksgiving Day, while the US Wikimedia Foundation staff were away for the holiday, the author of the PR persuaded a European developer to approve the PR and merge it. No one noticed. Merging that PR caused the changed code to be deployed to beta labs automatically, where I and my colleagues saw the bug but did not believe the test environment. From beta labs the code went to production automatically.

This was a bug that could not be identified with normal observability tools, but only by testing, and the testing failed. The bug was active for a number of days before we were able to revert it, and reverting the code did not repair the corrupt sub-headers the bug had already created. One of our developers had to painstakingly find and fix every one of a couple of hundred bad headers on Wikipedia articles.

RRRAR!

Testing, Observability, and DevOps are interlocking practices that support each other in order to get good code to production quickly. But even so, things can go wrong. Before things go wrong, we should have a plan in place, and know our options: Rollback, Replay, Revert, and Regression are all strategies to consider.