5 traits of teams that make on-call less terrible for developers
I understand the frustration of the anti on-call party. If you go to school to be a doctor, you know that being on-call is likely in your future. You didn't know that being on-call would be a part of your job when you started writing code.
Like the anti on-call party, I find no immediate joy in being on-call. But today, you're swimming upstream if you are a developer and don't want to be on call. In Who owns on-call?, Increment surveyed over thirty industry leaders about their on-call rotations. All but one (Slack) had developers on call, and even that is changing:
Crowley says that they’ve recently started to see scalability problems with the old way of operations, however, which led Slack to create a secondary on-call rotation full of developers; software and performance bugs, he says, are becoming much more common than low-level infrastructure problems—bugs that only the development teams know how to fix.
To me, it's a no-brainer: if the root cause of most incidents is hardware and network partition failures, then it makes sense for operations to be on-call. They are the ones familiar with those systems. However, if the majority of problems lie within code, a developer needs to write the fix. The underlying infrastructure our apps sit on top of is becoming remarkably more reliable, which means developers are gaining more responsibility (that's a positive spin, right?).
Over the past decade, I've primarily worked on small, fast-moving development teams. I've always valued my time away from the office and believe our developers should too. Developers have always been a part of these on-call rotations and I haven't hated this experience. Below are five traits I've seen from teams that have a healthy relationship with being on-call:
1. Move fast with stable infrastructure
Your hip startup office walls may be covered with posters of the famous Facebook "Move fast and break things" slogan. What you might not know is that Facebook changed that motto to "Move fast with stable infrastructure".
That's a lot less of a catchy, poster-worthy slogan, isn't it?
On my teams, we've prioritized a calm system over everything else. You can't have everything, so this means we'll prioritize things like clear rollback instructions in PRs, logging, error handling, and database query edge cases over code syntax issues.
2. Let your team own their uptime
In a conversation with my dad when our kids were in the baby stage, he remarked that the hardest part was changing diapers. I was shocked. To my wife and me, it was clearly the lack of sleep.
Our kids are just past that stage, but I've already forgotten the feeling of being constantly tired. I can see how that feeling would fade for my father.
Performance incidents on fast-moving, small teams are like this: when an incident occurs, it's important to implement a fix while the pain is still top-of-mind. Otherwise, you'll forget the pain and will be far less motivated to address the problem. Ensure developers know they can - and should - bump resolving future stability woes to the top of the list.
3. Make something boring to drive progress
It cites the examples of the aviation industry’s approach to process, which enables remarkable creativity under stressful conditions by mental automation of routine operations.
My teams have all seen a similar pattern: you can't push forward on the product when it feels like things are constantly falling down around you and there isn't an established checklist for handling incidents. Systematizing one thing frees up the mind for creative work.
4. Have an informal error budget
The Google SRE Book advocates for an error budget:
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
Few services really require five nines of uptime. Chasing very high availability stifles forward progress, since many issues are triggered by changes to code. For smaller teams or new services, establishing a hard error budget number may be less important than establishing a general team-wide compass on uptime expectations and how it changes the release process. If there's been a burst of customer-facing errors, we'll be more conservative on our changes for a period of time.
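To make the arithmetic concrete, here is a minimal Python sketch of what an error budget works out to and how it might nudge release behavior. The function names, the 90-day quarter, and the 25% "be careful" threshold are my own assumptions for illustration; they don't come from the SRE book or from any particular team's policy.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Downtime that a given availability target permits within the window."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return window * (1.0 - availability_target)


def budget_remaining(availability_target: float, window: timedelta,
                     downtime_so_far: timedelta) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = allowed_downtime(availability_target, window)
    if budget == timedelta(0):
        # A 100% target leaves no budget to spend.
        return 0.0
    return 1.0 - downtime_so_far / budget


def release_posture(remaining_fraction: float, cautious_below: float = 0.25) -> str:
    """Informal compass: hypothetical thresholds, tune to taste."""
    if remaining_fraction <= 0:
        return "freeze risky changes"
    if remaining_fraction < cautious_below:
        return "conservative"
    return "normal"


if __name__ == "__main__":
    quarter = timedelta(days=90)  # assumed quarter length
    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.3%} target: {allowed_downtime(target, quarter)} of downtime per quarter")
    remaining = budget_remaining(0.999, quarter, timedelta(hours=2))
    print(f"{remaining:.0%} of budget left: {release_posture(remaining)}")
```

Under these assumptions, three nines leaves a little over two hours of downtime per quarter, while five nines leaves under a minute and a half. That gap is a large part of why few services need the latter.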
5. Empathy, man
All of the above falls down if developers don't have empathy for their fellow team members. At some point, everyone makes a change that requires a colleague to clean up the mess. How the mess-maker reacts in those scenarios is an incredible indicator of how smooth hitting uptime targets will feel. Do they apologize? Do they work to quickly resolve the error, update a runbook, etc.? When these are handled well, it's a far better team-building experience than trust falls.
What about tooling?
If engineering teams are prioritizing stable systems, it's very likely that repeat incidents aren't occurring. Instead, you'll be paged on new and slightly different versions of problems. These issues can be difficult to identify in typical dashboard-driven monitoring systems as the root cause gets lost on overview charts. Identifying these outliers - and the conditions that trigger them - is a focus of our APM product at Scout. I'm also a fan of Honeycomb, which is designed for solving high-cardinality problems.
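Here is a toy Python sketch of why an overview chart can hide the real problem. It is not how Scout or Honeycomb work internally. The sample data, the customer names, and the "3x the median" rule are all made up for illustration. The point is only that one aggregate number blurs the story, while breaking the same data down by a high-cardinality field (here, a customer ID) makes the outlier obvious.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical request samples: (customer_id, duration_ms).
REQUESTS = [
    ("acme", 120), ("acme", 130),
    ("globex", 110), ("globex", 125),
    ("initech", 115), ("initech", 4200),
]


def mean_by_key(samples):
    """Average duration per key (e.g. per customer)."""
    grouped = defaultdict(list)
    for key, duration in samples:
        grouped[key].append(duration)
    return {key: mean(durations) for key, durations in grouped.items()}


def outlier_keys(samples, factor=3.0):
    """Keys whose mean duration exceeds `factor` times the median per-key mean.

    The factor is an arbitrary illustrative threshold, not a recommendation.
    """
    per_key = mean_by_key(samples)
    threshold = factor * median(per_key.values())
    return sorted(key for key, value in per_key.items() if value > threshold)


if __name__ == "__main__":
    overall = mean(duration for _, duration in REQUESTS)
    print(f"overall mean: {overall} ms")  # one number, no hint of who is affected
    print(f"per-customer means: {mean_by_key(REQUESTS)}")
    print(f"outliers: {outlier_keys(REQUESTS)}")
```

The overall mean looks merely "slow", while the per-customer breakdown shows that a single customer accounts for the entire problem.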
It's a culture thing
I've heard on-call horror stories at teams large and small. If you're looking at joining a new team, you should ask about how on-call is handled, if stability issues outweigh feature progress, how frequently you should expect to be paged, and the general flow of work during and after an incident. It's a litmus test of how they value their own personal time, which extends far beyond an on-call rotation.