Posts

Showing posts from June, 2019

TOAD Goes To 11

TOAD is Testing, Observability And Devops. We think these three things are related. What would happen if we took each of the three aspects of TOAD and put strong emphasis on each in turn? What happens if we take our testing effort as far as it can go, and "dial it to 11"? Observability to 11? Devops to 11? The meaning of "dial to 11" will be different for different organizations: it might mean hiring staff, or investing in tools, or even just emphasizing the mission more than before.

If we dial testing to 11, I think two things will result. First, the number of observable incidents in production will likely go down, because the emphasis on testing means that more problems will be found and fixed before deploying to prod. Second, the rate of deployments (Devops) could increase, because fewer observable incidents make deployments safer. So: increase the T in TOAD to decrease the O and increase the D.

With testing at 11, now…

TOAD: RRRAR! Rollbacks, Replays, Reverts, and Regressions

TOAD is

Testing
Observability
And
DevOps

My last blog post was about why TOAD ideas are important and how they interact with each other. But even the most excellent TOAD systems occasionally release bugs to production. This post describes how to understand these problems and how to address them when they happen. And there is an experience report at the end!

Most of the time we learn about problems in the production system because the system is observable. We see a performance problem, or a data problem, or a space problem, and we can use observability tools to narrow down and ultimately fix the cause(s) of the observed problem. There is also an entire class of problems that cannot be discovered with typical observability tools, but only by testing the system. Given that we find such problems in production, what can we do about them?

Rollback and Replay

The most drastic response to a production problem is to "roll back" the system to the last known good state by means …