Ansible, Simple, and Anti-Fragility

Once upon a time, Red Hat acquired a company called Ansible. Right before the transaction closed, The Boss called and said, “Erich, I want you to make Ansible part of Red Hat. Don’t @#$% it up.” That’s when my real adventure with automation began.

Okay, the tech is nearly as amazing as the people, but how did it fit into our business? To really get this right, I spent a lot of time learning the whys of automation: what it's really good for, and how it changes work.

Four goals of automation

  • Do more of the things
  • Do the things with greater consistency
  • Do the things cheaper
  • Do the things with less training

I suspect there aren't many surprises in that list. Once it's written down or said aloud, it seems pretty intuitive. What I've found interesting about, and unique to, Ansible is how much our design principle of "simple" improves all four. Simple lets you get started on new things faster (less training), and lets you do more things in less time. Simple means fewer mistakes (consistency), and all of these together add up to less expensive operations.

But what about how automation changes work? At first pass, those four goals are big changes to work in and of themselves. But the biggest impact comes when things change.

Systems are Complex

[Figure: snippet of the Apache Gora project, as represented by Chong and Lee (2015a)]

Here's an unpopular truth: if you work with systems of any size, you don't know exactly how they work. If you're diligent, you may have a good idea of how they work. If you've worked with them long enough, you have great intuition about how they work. You might know how to look up how a particular subsystem works, or whom to call when another isn't performing as desired. But you don't, with precision, know.

That matters because it also means you don't know all of the assumptions that allow your processes to work. Automation can break in at least three cases:

  • Inputs change
  • Conditions change
  • Things break
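As an illustrative sketch (the play, host group, and variable names here are my own, not from the original), Ansible's block/rescue structure is one simple way to make a play acknowledge all three cases instead of failing blind:

```yaml
- name: Update web tier (illustrative example)
  hosts: webservers
  tasks:
    - block:
        # Inputs change: validate what the play was handed before acting on it.
        - name: Fail fast on an unexpected input
          ansible.builtin.assert:
            that:
              - app_version is defined
            fail_msg: "app_version was not provided"

        # Conditions change: only act when the precondition still holds.
        - name: Install the web server package
          ansible.builtin.package:
            name: httpd
            state: present
          when: ansible_facts['os_family'] == 'RedHat'

      rescue:
        # Things break: land in a known state and say so plainly.
        - name: Report the failure in a simple, shared language
          ansible.builtin.debug:
            msg: "Update failed on {{ inventory_hostname }}; service left untouched."
```

The point isn't the specific tasks; it's that the failure path is as short and readable as the happy path, which is what "simple" buys you when one of these three cases hits.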

A robust automation system, one which holds up in light of these expected changes, requires both the users and the technology to absorb them.

  • We can still do more things, but the volume of things can overwhelm when something breaks. Operator workload becomes lumpy and unevenly distributed, especially at peak times, and new metrics and other changes to the work itself introduce cognitive overload.
  • The things we automate become more precise, but new types of errors emerge: system errors, unmet requirements and edge cases, and more complex behaviors to manage.
  • Things are less expensive at the unit cost, but pure replacement of people with automation doesn't happen; this is known as the "substitution myth." Automation changes work.
  • Things are easier to do and require less training to start. But there's an increased need for ongoing training, and a need to know the system as well as the components, with more emphasis on the system.

There are two approaches to creating more robust systems: handle as many edge cases as possible, which adds complexity and makes the edge cases you missed harder to fix, or embrace simplicity. We chose simplicity.

We chose simple to help the teams understand the systems they’re using. When something breaks, they know where to look, and have a shared language to work with others impacted by the systems.

We chose simple to lower the expertise and effort to get started, creating more opportunity to automate little things in a learning, incremental approach to building hyper-scale automation systems.

We chose simple, because it’s more robust.

We chose simple, because it’s better.

Additional Reading on Automation and Resiliency