In many organizations the on-call is among a wider circle of people to make sure...

beebmam · on March 14, 2022

That's a recipe for disaster in my experience. Specifically: on-call burnout.

dgritsko · on March 14, 2022

Interesting, I've had the opposite experience - an on-call rotation that's too wide makes it difficult to know how to respond effectively to an alert, since it's likely to involve a service that you may know nothing about.

Macha · on March 14, 2022

Yeah, someone asking you every 15 minutes for an update while you're trying to read the beginner's guide to service X and connect it to their question isn't a fun time.

beebmam · on March 14, 2022

Ah, I can see how my statement could be misinterpreted. I'm actually in agreement with you here: teams should only be on call for the services they own. Being on call for services you don't own is a total nightmare and leads to an exodus of competent engineers.

xboxnolifes · on March 15, 2022

I think there is still a misunderstanding here. I'm not the person you originally replied to, but I read their message as focusing on the small-team aspect. If there is a team of, say, 2-3 devs, their rotation cycle would essentially be every 2-3 weeks. Even if it's only on their service, they'd still be on rotation every 2-3 weeks. Alternatively, if there is a wider group of people, the people on smaller teams wouldn't need to be on rotation constantly.

I'm not sure if one is better than the other, but the assumption of only being on-call for services you own assumes some minimum team size, or an acceptance of being on-call every other week.
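
To make the rotation arithmetic concrete, here is a minimal sketch. It assumes an even round-robin with one-week shifts and nobody swapping or covering; those assumptions and the function names are illustrative, not anything from a real scheduling tool.

```python
from fractions import Fraction


def cycle_length_weeks(team_size: int, shift_weeks: int = 1) -> int:
    """Weeks from the start of one of your shifts to the start of your next.

    Assumes a plain round-robin where everyone takes equal, back-to-back shifts.
    """
    if team_size < 1 or shift_weeks < 1:
        raise ValueError("team_size and shift_weeks must be at least 1")
    return team_size * shift_weeks


def on_call_share(team_size: int) -> Fraction:
    """Fraction of all time each person spends on call in an even round-robin."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return Fraction(1, team_size)


if __name__ == "__main__":
    # Compare small owning teams with a wider shared rotation.
    for size in (2, 3, 8, 12):
        share = on_call_share(size)
        print(
            f"team of {size}: on call every {cycle_length_weeks(size)} weeks, "
            f"{float(share):.0%} of the time"
        )
```

Under those assumptions a 2-person team is on call half the time and a 3-person team a third of the time, which is the trade-off against a wider rotation covering services people may not know.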

dredmorbius · on March 15, 2022

Opportunity.

1. Learn about X.

1.1 See that awareness of X is distributed through the team.

2. Document that X is not documented. Have X dev / support team document X. Configuration, monitoring, troubleshooting, recovery, escalation.

3. Scale expectations. If X is not documented and awareness is limited, then issues involving X are not expected to be resolved by on-call staff and will be addressed when X dev team are available.

4. Scale opportunity/risk of X based on the actual experience and downtime it affords.

SRE is fundamentally a risk-management exercise. Manage the risks. Assess them, mitigate them, reduce them. And measure the damned costs of them.