Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Diary of a first-time on-call engineer (thenewstack.io)
130 points by kiyanwang on March 14, 2022 | hide | past | favorite | 161 comments


Just a reminder that on-call rotations are abusive and unnecessary and should be one of the first targets of a developers union.

If an enterprise thinks their applications should be up 24/7 they should staff for 24/7 support.

On-call is a way for bad businesses to steal money from their workers by expecting them to work 24/7 without paying them 4.2x their salary.

And even if they did pay a full wage per hour on call it wouldn’t be worth it because nothing truly compensates you for the hours of your life you’re losing while tethered to your work phone and computer.


Long ago, I worked in a shop that I feel did it the right way: allow techs to sign up for pager duty a week at a time, and each week would pay a bonus of a few hundred bucks. It was always voluntary, and people were eager to take as many shifts as they could sign up for. We'd round-robin it between all the people who wanted it. If your plans changed, there was always someone happy to pick up an extra bonus on short notice.

Let's go over that again: People liked being on pager duty, and all it took was a few hundred bucks a week.

Of course, this was in a pretty competent company. The pager would only go off a couple times a week, it was usually easy to fix, and there was a well-documented escalation path if you couldn't handle it on your own. So we weren't losing sleep.

If the pager started being painful, fewer people would sign up, giving a signal to management that there was a problem. They could then either increase the bonus, or fix the root problems.

I have no idea why other places don't handle it this way. It worked well for everyone involved.


Germany has laws around this. A German coworker of mine gets almost $800/week that he is on call.


Do you have paragraphs at hand to share? I would like to see if company I work for is compliant with those laws


Courts ruled that when you are on-call, they have to pay at least minimum wage for every hour you are on-call. Additionally, you get tax benefits for working night and Sunday, which your employer has to declare.

That's why you earn so much more during on-call in Germany


That depends on the exact arrangements of your oncall duties. Typical developer oncall where you can be wherever you want as long as you can reach your laptop with an internet connection within x minutes does not fall under these laws.


No, it does. Because whenever you have to be "ready" for work, it counts as work.

>Bereitschaftsdienst Beschäftigte im Bereitschaftsdienst halten sich in der Regel im Unternehmen oder in unmittelbarer Nähe auf, damit sie die Arbeit bei Bedarf sofort bzw. zumindest zeitnah aufnehmen können. Ein typisches Beispiel sind Feuerwehrangestellte, die auf der Wache die Nacht verbringen, aber dort schlafen können. Bereitschaftsdienst gilt in vollem Umfang als Arbeitszeit und muss daher bei der täglichen und wöchentlichen Arbeitszeit voll berücksichtigt werden. Die Vergütung des Bereitschaftsdienstes richtet sich nach dem jeweiligen Arbeitsvertrag, beziehungsweise gültigem Tarifvertrag oder Betriebs/Dienstvereinbarung.

>Arbeitsbereitschaft Bei der Arbeitsbereitschaft sind die betreffenden Beschäftigten im Zustand „wacher Achtsamkeit im Zustande der Entspannung“ am Arbeitsplatz anwesend. Sie sind jederzeit einsatzbereit. Ein Beispiel könnte eine Beschäftigte in einem Elektrizitätswerk sein, die beim Piepsen eines Überwachungsmonitors sofort aktiv werden muss, über längere Phasen aber keine Aufgaben zu erledigen hat. Arbeitsbereitschaft ist Arbeitszeit und muss im vollen Umfang auf die tägliche und wöchentliche Arbeitszeit angerechnet werden.

>Die Vergütung richtet sich auch hier nach dem jeweiligen Arbeitsvertrag, beziehungsweise gültigem Tarifvertrag oder Betriebs/Dienstvereinbarung.

To add to this: If your boss calls you when you are on vacation, you get that full vacation day back. If your boss asks you to be ready in case he might call during your vacation, you also get those vacation days back.


The law distinguishes between Bereitschaftsdienst and Rufbereitschaft. The latter does not count as working hours and is defined by being able to freely choose where you are and what you do during that time (until you get paged of course). That fits the typical developer oncall where you don't have to be on site or close by during your oncall shift.


> is defined by being able to freely choose where you are and what you do during that time

So… not being able to: go on a long run without the phone, go watch a movie, have a big nap, go on a hike without a laptop on you.

Doesn’t sound free to me. My last on call wasn’t bad. But I still didn’t like being bound to my laptop and phone.


If your employer chooses Rufbereitschaft as their on-call mode, then that means it all adds to your working time, and you can't exceed 10hours a day. Having a 2hour incident means no more overtime allowed. Additionally, they need to clock your work during each incident. Lastly, if you have an incident outside of working hours, you are not allowed to pick up work again for another 11 hours, unless it's another incident.


I think if most shops after some incremental bonus for overnight on-call, just nobody would sign up. Salaries are high enough without it.


On-call is just fine as long as:

1. There is a rotation.

2. There are few pages, preferably the median should be 0 per week.

3. Spurious / non-actionable alerts get fixed right away (with very high priority)

4. You're not up more than 1 week per 1-1.5 month.

5. You subtract middle of the night pages from your next working day, with bad nights resulting in a day off. Being on-call doesn't mean working overtime.


6. It's either opt-in or stated as a requirement on interview.

Nobody should be coerced into on-call within a role that did not explicitly require it on joining.


I've upvoted you, but I disagree. Extra work should mean extra pay. Even just having to be ready to work should result in extra pay. You are providing a benefit to the company, and they should compensate you appropriately.

Rolling it into the contract isn't appropriate because anything that is "free" will get abused, eventually. Even if it's fine right now, it won't be eventually. And there's no way to stipulate that the "median should be 0 per week" in the contract, so that'll go out the window as soon as the company needs it to.

I upvoted you because it's obvious you've thought about a lot of the problems with this and I feel you're on the right track. You just didn't go far enough.


> On-call is a way for bad businesses to steal money from their workers by expecting them to work 24/7 without paying them 4.2x their salary.

This is not entirely true. I think the fact that companies don't have to hire separate support/SRE staff is priced into the exorbitant salaries that devs are paid these days. When I started my career around a decade back, there used to be at least 2-3 levels of escalation engineers/support staff between the customer and the developer. Today it's usually zero, and part of those salaries paid to dev instead.

I may be in the minority, but frankly, I'll easily take a pay cut not to have to go through the on-call sh*t every few weeks.


How many developers actually have exorbitant salaries?

Is 54k for 60 hours per week considered exorbitant 10 years ago? Because that’s how much I was paid while being expected to be available 24/7. My cheap apartment was a 3rd of my pay. I was fortunate to have no debt and get a car from my parents.

I was up to 90k about five years later. Which is decent in the Midwest but I wouldn’t consider it exorbitant.

I doubt I’ll see anything near $200k for a long time. By then, it’ll be less than 90k considering inflation.


Show me where that's spelled out in anyone's employment posting or negotiated by any sort of employee rep. It's clearly a grab back by employers trying to save a dime and being stupid about the long term effects.


Uh? My data points are in operations, not dev, and are from working in Europe, so that may be an entirely different reality, but I've always been paid for being on call, plus for every time I was actually called and had to work - either remotely or on-prem, depending on what broke and what access was possible.

My manager once realized that I was sometimes even called by the other oncall engineers if they were stumped, and retroactively compensated me for a "2nd level oncall".

Not even once have I ever felt exploited - it was always sufficiently flexible for my personal life, usually 1 week/month, and always reasonably compensated.


On-call is a stop-gap for when you don't yet have a fully staffed 24/7 SRE team yet, and you need to be pretty big before you can afford something like that. Plus your software needs a big round of professionalization, because it shouldn't break overnight.

I believe the SRE book mentions this; if you want your software to go from on-call engineers to being managed by SRE's, your software needs to be stable, have certain logging requirements, documentation, etc. And, if it breaks more than X times in a row, or causes an SRE to be paged too many times, the pager will be handed back - fix your shit.


That may be the idea, but in practice, companies don't even hire SREs and just have engineers do on-call rotations instead.


My previous employer tried to offer me a "promotion" that among much more accountability for systems also included being on-call. There were no clearly defined parameters, and the real kicker was that this glorious promotion included a $0 pay rise.

They were flabbergasted when I declined, and tried for weeks to get me to accept with "It will be good for your career", etc.

That is a VERY hard no from me.


This take that on-call must require you to be compensated at 100% of your normal pay is rediculous. If I got paid 50% of my salary to be on call (noting that I get paid OT if I actually work) I would consider myself to be robbing the company blind.

The idea there is no middle ground, with


This varies a lot by pages per shift. If you get paged less than once per shift on average, it would be a worse outcome for everyone if you shifted to follow the sun. If you get paged 70 times per shift, yes, that needs some rethinking. The trick is how to move from 70 per shift to less than 1 per shift. I am not sure your approach would lead to that outcome.


Being on call -at-all- just ruins your ability to go socialize. 70 is that same as 1.


How so?

Want to meet with somebody: take your laptop with you in a backpack. Just make sure you have mobile connectivity wherever you're going.

Want to party: just ask somebody to take over the pager for the evening.

If you have <=1 page per rotation, then it's really not annoying.


Where I work it seems quite decent. There's 24/7 operations staff and other overnight staff to handle minor issues. If there's anything that is affecting the site that would get us in legal trouble without a fix or is losing a lot of money, then people will get called out.

The rotation is like a week every 2 months. You get paid a substantial amount extra for that week, and you get paid on top of that if you get a call out. Even if it takes you 10 minutes to resolve or there's no issue, you get paid a minimum of a few hours at 2x pay.

It's insane to me that there's places with multiple call outs a day and people don't get paid extra.

For me, being on call is pay rise for the week. It just means I can't venture too far away from home without my laptop. It's so rare to get called out.


Thats amazing. During my corporate career, the places I worked had no additional pay for getting call outs. Would have loved some extra cash back then (was unmarried - happy to do extra hours)


on-call is brutal, i basically agree with you except on the last point; i think people should be allowed to decide for themselves if they want to do it assuming they are paid handsomely


That doesn’t work. It results in emotional blackmail “don’t be the person letting the team down by refusing to be on call”, “share the burden with your team”, “everyone else has no problems going on call”, “I got a call out on my kids birthday in the park at a party, here’s a photo of me sat in the corner on a laptop, it wasn’t so bad”, “take one for the team, we give you free softdrinks and deli bar”


Y’all need to reevaluate who you’re giving your time to, these reactions are clearly the result of some serious trauma that’s got nothing to do with the concept of being “on call”…


orgs that have on-call engineers that aren't part of the operations and infrastructure come with all the other associated management and strategy problems; seeing how effective orgs work, how they test, how they release, how they manage faults, how they grow and what instruments and strategy are successful is so simple and straightforward it's mind blowing the general pattern of on-call exists; until you see the people making those decisions then the strangeness becomes how did that person end up in their decision making role given how inconceivably misguided all their strategic comments becomes. I once asked the head of engineering how to get teams to work together and he didn't have an answer--and he also had on-call. It's really strange working in technology, you'd think there were actually intelligent people leading this space. Instead it's more a simile to religion, politics.


Wow what? If I'm building a system with tight uptime requirements, I make sure I build things that stay up. Product is the front line for pages with whoever is on the on-call rotation. If a mission critical system I built fails in the middle of the night, I expect to be paged and to hop on and fix it.

I'm very well compensated for that, and have historically been more than happy to fire the contract "support" teams in India or wherever that are trained to follow runbooks to manually fix problems that should self-heal if systems were engineered correctly.

On-call is just a way for me, an overpaid engineer who cares about craftsmanship, to ensure that problems serious enough to warrant a page (infrequent) are dealt with swiftly, and followed up on with an in depth post mortem to ensure they don't happen again.

To be clear I consider getting paged more than two or three times a year unacceptable. Any compamy with on-call should track life interruptions as a KPI and actively work to drive their frequency to zero.


Unless you own the service in isolation, chances are something will break at some point beyond your control. If you own the service in isolation, then you may be on-call 24/7 indefinitely.

I build high scale systems for a living, but I also hike, ski, and have a family. I can't guarantee 20 minute response times 24/7 365 days per year. I generally won't join a team with an on-call more frequent than once per 6 weeks.

Anecdotally, companies fall into a trap where individual engineer's own services that they must be paged for. These engineers don't get the time to make the service reliable, and they also don't want to admit that the service is a tire fire. This works for the company as long as they can continuously pull in new engineers to feed the on-call fire.


> Anecdotally, companies fall into a trap where individual engineer's own services that they must be paged for. These engineers don't get the time to make the service reliable

I think that’s where I’d draw the line if I was in such a situation. If I’m on the hook for dealing with any outages, then I won’t be working on anything else until the service is reliable.


There really is no end to this though, a service can always be made more robust. The reputational damage of missing a page is extremely high. The risk to the company of missing a page can be high (although hopefully not to high if its a solo engineer project).

People need breaks.


> a service can always be made more robust

It can, but I don't think it's too hard to make it robust enough that pages become a rare occurrence. At my company (a small startup with <5 developers - so we're not exactly flush with development capacity) we've managed to reduce outages in a previously quite buggy app to almost nothing over about 18 months (while also developing new features). I don't think we've had an outage in the last year which wasn't due to either:

- An outage at our underling hosting provider (which we can't do anything about anyway)

- A code deploy (which we can plan and control the timing of)


I get it, trying to make on-call not sound so bad. Fair enough.

But then she goes into the "example week" including a weekend page. It seems like there's not much going on, not many pages, not much out of regular hours. And then this prime example of why you do not want to have engineers on-call:

    At 5 a.m., I got paged. I jumped out of bed and ran over to my computer.
    The page self-resolved at 5:01 a.m.
She puts a smiley but this can really drive one mad. Second time this happens (and let's face it, it happens, even if you tweak and adjust those settings over and over) I'm out of there.

I get it, all the talk about "you will build better services because otherwise you will be paged". In the end I think it's just about saving money for dedicated 24/7 network operations if that is what your business requires. If it doesn't have customers 24/7, then don't set up any pagers and support staff, if it does, pay for it!

I've been on call in a SaaS environment before. In my entire time (counted in years) doing that I have had about the same amount of pages in total (!) as are contained in her example week. I carried my laptop and the phone to tether dutifully but I have never been paged by the 24/7 support staff for something like the 5:01 self-resolving alert.


If your alert is going off at 5:00 am and self-resolving at 5:01, and it happens more than once, the prudent course would be to fix the underlying problem or modify the alarming configuration. Nobody benefits from alarms that don’t require action.


Hire in complimentary timezones so no one has to be on call at night


Unless you're hiring 3-4x the engineers, then this doesn't scale. I would much rather occasionally be paged at night than have to be on call all or even most of the time.

Even if that "on call" was only 9-5, getting distracted and losing half a day to tracking down the latest flaky test or owner for a failing dependency really adds up and makes it significantly more difficult to do my day job. When I'm oncall, I'm oncall 24/7 for a week (30 minute SLO) and the expectation is that the amount of non-oncall tasks I'm able to get done that week is significantly impacted. I'm fortunate to work on a team that acknowledges that reality; I know many people, even at large software companies who should know better, whose management is far less reasonable.


It's true but most people do not have the luxury of being able to focus on a single task each day until done. If that is a requirement for you I'm not gonna knock it but I feel it is rare to get.


We’ve been trying to staff a 3x8 follow the sun model for over 6 months and have been unable to fill the roles in EU and especially APAC


We’ve been trying to staff a 3x8 follow the sun model for over 6 months and have been unable to fill the roles in EU and especially APAC.

Pay more. It literally solves all recruitment problems.


Gee, I wish I could control that.


Hmm, living in APAC, it seems like companies in the US are either never hiring for remote positions overseas, or offer laughably low salaries.


Yeah, I wish I had more power here.


Plenty of good devs open to competitive offers down here in APAC.

I joined a US company as part of a follow the sun model. Although ironically I seem to support people who are up at midnight in Europe more than in my area of the world.

Happy to have a chat, see if I can offer any useful feedback :)


Salaries in the EU are not what they were and have improved dramatically. I've found most companies to be wildly off the mark when making offers. Perhaps you need to recalibrate.


It’s true, and it’s sad to see. A long time high performer on our team based in Germany makes a pittance compared to more Junior US based folks. Alas, I cannot do anything about it.


My current on call setup that page would count as one hour worked of my weekly 40.

Ideally I'd be paid extra as well (I get about an hours pay a week as an on call allowance). What makes it work is that I'm the senior dev on the project that I'm on call for - if it hits the fan it really is my problem. And more accurately, in the last two years I've been called out twice. So that extra hour a week of pay... it's just free money.


> it's just free money.

No. The downside of being on-call is not the number of times you get paged at 3am. The real downside is that if you are on call 24/7 and you have to react to an alert within the next 15min that means:

- you cannot go out for dinner

- you cannot go out for a run

- you cannot go and pick up your kids (if the ride takes more than 15 min)

- etc.

The only thing you can do is stay at home with your laptop next to you, or wherever you go, take your laptop with you. That's being on call, and it's NOT free money.


That's not how my on call setup works. I need my phone and my laptop, and I need to be able to respond within half an hour or so. The changes I have to make are in code that then goes through CI and a load test before hitting production, so there's no "you have 10 seconds" stuff.

We don't have anyone on call in the sense you mean, and I've never seen it in the wild. If they want someone right there all the time they've paid them to be in the office where they have proper access to critical systems. When I worked in a medical environment that was what we did, and those people got paid very well for doing overnight "tech support" ... they were bored enough that answering random IT questions seemed worth while for them.


That’s a strange setup. When I was the only developer on a service the only guarantee I could give was that I’d try to make my way to a computer as soon as possible if I got a call, but I didn’t bring my laptop with me all the time.

A lot of stuff can be resolved from your phone though.

After reading a bunch of the other messages here, I think the reason it was ok for me was that I was always in a position to fix the root cause. I wouldn’t ever get the same alert twice if I did my job correctly.


If actually getting paged is rare them just bring your laptop to dinner, that's what I always used to do. Sure, it might be that time you are unlucky and have to work through dinner which sucks a lot. But it shouldn't happen much.

Running is an issue, unless you don't mind a sweaty back to carry your laptop, or just run loops nearish to your house with your phone on you.

What I usually did when picking up kids (close by, about 30 min round trip, YMMV) was to ack the page from my phone and log in when I got back home.

Yes it sucks to be tethered to a laptop. But if after-hours pages are rare, and you aren't on call more than ~25% of the time, for me personally it's acceptable


Reminds me of my first job: on call didn't change much at night because the system was so twitchy (as in, SMS every 15 minutes that then autoresolved) that I just put my phone on silent and ignored it over night.

Of course we also had a night shift guy who would actually call you if he got stuck, so hell if I knew why the SMS system existed. (That role lasted about a year at most - night shift for tech is stupid).


Someone once had the bright idea of forwarding several international toll-free numbers directly to a developers phone on a rotational basis. All to save the internal costs of the in-house operations team. We only ever received spam calls and wrong numbers at odd hours. It was a nightmare and total productivity killer.


I don't do this anymore. Very draining to be an SRE. I remember one Christmas I worked until 2am because of SRE pages on software that, in retrospect, didn't need to even be up at the time. Google used to give us SRE bomber jackets, and while those were super cool in a nerdy way (I never got one because I wasn't an SRE there), it was clear the org had to really go out there with incentives to get people to sign up for SRE. I love the practice of reliability and building fault tolerant systems, but it's just so draining / energizing at the time, then you realize all your efforts (or at least mine and colleagues' at the large companies for which we worked) went into keeping stupid, unimpactful things from falling over briefly. Unless you're SRE for a real, not startup-tongue-in-cheek "mission critical" system it is absolutely not worth it. In my very cynical opinion, I'd need to be an SRE keeping a NASA rocket working, or keeping the ISS up, or working on some SDV system, before I would consider a system to be really mission critical.


It’s only mission critical if someone is going to possibly die (as you mentioned life critical systems). Otherwise, you’re just burning up quality of life for someone else’s profits and arguably weak comp (compared to the family and health harm of SRE roles, and more broadly speaking, on call in general with absurdly short response and RTO SLAs). This isn’t doctor pay except perhaps at FAANG.

(general observations over a longish career)


Well said, that's largely what I was getting at when discussing NASA and self driving vehicle type of applications.


Well the alarm would have been 20 minutes after when it actually happened on Mars.. and any solution won't be there before a same amount of time.

Especially for their helicopter, rover at least stays on the ground throughout all this.


Being on-call I have developed insomnia to the point I'm considering taking sleeping medication nightly...except, of course, the week-nights I'm on call. On-call is just not healthy.


At one point the same applied to our oncall shifts, I'm so glad our managers were able to set apart at least 6 engineer months to tackle oncall issues. Gotten a lot better since then.


I'm curious what sort of response SLO your team had. My understanding was that the 5m and 30m response tiers always came with a multi-site "follow-the-sun" rotation.


Yeah, we’re too small to have that kind of rotation, and consequently our SLO’s are business hours only! If the system is actually on fire / down outside that time then we’ll deal with it, but that only happens maybe once a year at most.


I was a dev turned SRE in one point in my life. Man, the anxiety basically ruined my life. I could see I became very hyper person checking my phone all the time, to the point that it impacted my relationship at that point. If you are inheriting a bad project or service without lot of automated remediations, or processes, paired with a toxic work environment, just don't. No one prepares you for the anxiety the role brings itself with. In the end you would realise the work and 2x-3x rate per hour is probably not worth it. I have seen people getting burned out. You have to have tools and certain processes in place for an SRE/on-call duties. Also, slack notification on phone are actually the worst as brain always thinks that got a message for some reason when you keep your phone in your pocket. Just my opinion.


Yeah if it gets too full on it fries your nerves, it also ruined my concentration. When I went back to pure dev I couldn't sit and work because I always was waiting for some alert to go off.


I find that I can only work on any sort of focus work in the evenings because I know I'll get less alerts.


Wait there's no rotations and have to do dev work while oncall?


Meditating helped me regain my focus so far.


I never had it too bad with being on-call, but my previous team did a piss-poor job of onboarding engineers, writing TSGs, and paying down tech debt. One rotation, I got paged at like 3am and again a few hours later. Bizarrely, it totally messed up my ability to be in the moment with my family for the whole weekend.

And I developed this sort of 'anxiety' every time my phone rang. My stock ringtone never felt harsh before, but the instant my phone started ringing, I could feel my heart pounding. Would this be the call that would finally be more than I could handle? Even after I was done with my rotation, I just got so agitated any time my phone rang.

I switched my ringtone to the Mr. Roger's Neighborhood theme, and I left that team (for other reasons). I'm a much happier person at work and at home now.


Sorry, but not. I keep my 9-5 strict. I'm not into the game of getting more money in exchange for my scarce free time. If the company needs people to work on Sunday mornings, they can hire them (i.e., SREs). I can't understand why it is becoming so normal for regular (senior) developers to become slaves of our companies by working more than 40h/week; the general excuse is "you build it, you run it, you fix it". If we are into writing robust software, I'm all in, but that's a totally different thing.

Edit: I'm not in my 20s anymore and I have a family.


The idea of "you build it you fix it" is to make sure things really are fixed because you value your time. And it works.

When there's an ops team responsible for on-call nobody gives a shit about fixing the code issues that wake them up.


The current development team didn't always build it, either. And the pressures and incentives are often biased away from improving things, no matter how much individual developers might want to.

"You build it, you fix it" has to apply to the powers that be (i.e. the product managers and such), not just the individual contributors.


This is great in principle, BUT.... it only works when the dev team are given the time and automony to prioritise the work.

While many organisations pay lip service to uptime and reliability, the reality is that those are not sexy or exciting and Bob in sales just snared this million dollar client that will sign as soon as we implement frobnication with Othr. So dev team can either work on fixing the reliabilty issue that means they don't get paged at 3am or implement the thing that get the company another $1m. Which do you think they get to do?


While many organisations pay lip service to uptime and reliability

I had a very terse conversation about this, and directly included the words “lip service” to my superiors today about the very topic of on-call and overnight work this morning. I’m fed up, and made it known.

Because IMO you’re 100% right, too many orgs pay lip service to availability and SLAs, yet deliberately choose to underinvest in tooling and resources to help mature the capabilities of incident responders.

Does it bring it revenue? Probably not, but bailing water doesn’t move a ship across open seas either, you still do it because if you don’t surprise, no more fucking ship.

Instead of a well defined monitoring system I get $40/mo for the lowest tier status board that’s generated more false-positives than actual on cable alerts, no matter how I tune it or try to define parameters. Instead of a mature incident response platform I’m demanded to conduct Spreadsheet postmortems. Instead of centralized logging and telemetry I get to troubleshoot exceptions over a bunch of shell windows.

Incident response is still a mess of people slapping keyboards, there’s no triage, there’s no sense of structure or command of the process.

And here I am, an SRE being prioritized to work on backing up service accounts because we might not get a contract with this trucking company if we don’t have that capacity.

Sigh.


Unfortunately, this kind of thing is almost impossible to deal with from a grass roots angle.

Unless you have a culture that values these things from the top down, you're going to give yourself a concussion banging your head on a brick wall.


Mate I’m sure I already have. Been battling a migraine since 10am.

(Joking, it’s really just a plain old sinus headache)


> nobody gives a shit

If you start from the assumption that your employees are not motivated to do a good job and so they need to be punished for doing a bad one… I’m not sure what to tell you.


More like where I worked where that was a thing the ops team were a slave class in a company mainly staffed by devs


Put the product managers who prioritize new stuff over bug fixes on PagerDuty if you really want stuff fixed for real.

It's not like devs choose what to work on & prefer the 70% solution before moving onto the next demand..


If no one gives a shit about the ops team, having a "you build it you fix it" rule is just papering over a much larger problem.

In every job I’ve been in, I’ve been responsible for bad code I didn’t build and had no authority to fix it.

This only gets worse when one team member can’t and never will be on-call. Because everything is focused on “individual responsibility” anyone who isn’t “carrying their weight” gets singled out, instead of actually fixing the systematic problems.

It’s much easier to blame people and make examples out of them than to actually address the real issues.


Sure I’ll fix it. During normal work hours the next day


Which is great, but would also require time given to teams to fix all issues they see in the code and not work on new features.


Its not even a good excuse. Both of my last roles had on-call rotations, and they were really shitty. The code wasn't even mine and I was expected to support it because the original developers had up and left ages ago.


*Manufacturing engineer has entered the chat.

On-call rotations are the worst.

1. If you are also supposed to be available regular business hours, management treats it as an extra that according to them shouldn't take up a significant amount of time/energy. Therefore, you get compensated for it accordingly: a pat on the head and a fractional amount of comp time that you frequently can't use when needed anyway because you still need to close on your normal tasks.

2. Since it doesn't officially take up any time or energy, there is no problem expecting you at work on the dot when you have been on call putting out fires the night before.

3. Getting paged for nonsense is a real problem. If you were asleep, it can take you 10-15 minutes to figure out it was nonsense and another hour to get back to sleep.

4. Following on the last point, more than two or three pages during sleeping hours means you effectively pulled an all-nighter. Pretty crappy when you aren't 20.

5. When everyone is burnt out from the on-call rotation, it is difficult to close major issues, let alone the small things that generate extra pages.


As an SRE Leader (~22 years of IT experience), how does this sound?

1. I'd give you 2x the amount of time off immediately, the next morning. Minimum 2 hours. More than that, I'd ask if you'd prefer to shift your workday later or find some other flexible arrangement.

I have fought management at some places really hard when they said a word about someone's planned work bucket not getting done after being paged as much as OP has in a week. I've made it clear it was management's fault, not yours.

2. Absolutely not. You'd be given the option to get rotated off Pagerduty as soon as your life was disrupted. I'd personally take over your shifts if I couldn't find someone on our team who was willing to take over.

3. This is SRE priority #1 and I felt terrible for OP. We cannot run a marathon if we're always sprinting out of bed for crap. Alert Fatigue is real https://www.pagerduty.com/blog/lets-talk-about-alert-fatigue...

4. Bingo. So, the whole organization has to reduce velocity so we can pay back the debts. If the CTO doesn't totally empathize with this, they're going to trip over their own shoes.


Not OP, but another manufacturing engineer. Manufacturing has been on a slow decline for decades due to outsourcing and automation. The need to offer a competitive wage and working conditions is not what it is in tech, and company culture can get pretty cutthroat depending on the industry you're working in. You could find a job elsewhere, but it likely won't be much different. In fact, a rotation for being on-call would be an improvement for me, as I've never worked anywhere where I'm not on-call all the time. People even try to call me when they know I'm on vacation, which is one of many reasons I keep a personal phone that they don't know the number for.

I make better money than most people I know around my age, but it's nowhere near tech money and it's not enough for the bullshit. Unfortunately, it is good enough that pivoting into another field and taking a pay cut would be painful.


For 1, how are you guranteeing that there's no pages in that time?

My sleep is wrecked for more than a week after a bad on-call - what I actually need is a week or two off to get my life back on track


For 1, Hand off the pager to another engineer for daytime.

I totally get the sleep thing, though!


> I pulled out my laptop, tethered my phone and popped online. Sitting at a playground picnic table, I re-ran the test that had failed and alerted me. It passed.

Boy would I be annoyed if I got paged because someone else's commit made a test fail. This should be dealt with by the dev that pushed that change. The fact that it made a test fail should also mean it's not in prod yet, making me wonder why someone got paged in the first place.


It may have been a monitoring/smoke test type of thing, that performs some customer-like interaction with the deployed service, and makes sure the expected result comes out the other side, rather than a build-time test.


Yeah, paged for a failing test? Nah.


I'm very annoyed lately by how the expectation of any Internet service is to be up all the time. I don't think it had to be this way. I can't go to the grocery store after 10pm, and that's OK, but you're telling me [SaaS app #9,000,000] is down for maintenance for 1 hour at 1 AM, outrageous!

I get that some systems need to have very high uptime, but it seems like we've become far too obsessed with the idea for a bunch of trivial crap.


I agree. We could experiment with HN. Hey HN admins, pick up 1h a day to shut down HN. Do it for a week and then we discuss the outcome.


As an Australian, I ask for actually clear communication so I can adjust my expectations.

Mainly because I expect the 1 hr-a-day shutdown would be in my prime time.


What's even crazier to me is plenty of hugely popular services take plenty of huge downtimes but shiny new SaaS startup is expected to stay up.

StackOverflow seems to go down a couple times a month, USPS had a few hour maintenance window, many banks still have huge maintenance windows.


Your grocery store serves a geo-fenced area. SaaS #9000000 is global. 1am for you is right in the middle of an important deadline for somewhere else in the world.


But yet, SaaS #9000000 has a skeleton support staff located in the same U.S. timezone. If you have a popular global app, pay for a global support team to support it.


Couldn't agree more. Especially if your app is business critical for your clients.


Pinboard rather famously (and of course, tongue-in-cheek) advertises nine fives availability.

Scale expectations.

Recognise that each additional point of reliability increases costs (and time to achieve) by roughly an order of magnitude.


Before you accept any offer from FANG/MANG/hot startup, *make sure* you understand the on-call expectations. Example: at Amazon, *even data scientists* do on-call. Practically all SWEs before Principal do. In AWS, first-line managers do on-call (they typically don't in Devices/Consumer). In places like Alexa, you can expect to have shitty 5-6 week on-call rotations and extra helpings for holidays. AWS at least advertises this in their job postings, other orgs (cough Alexa) do not. So buyer beware.


Are there any books that describe building an on-call practice? If not, I volunteer to write one. It's not complicated, but a lot of teams miss some of the fundamentals that improve the overall results. On-call's purpose is not just to wake people up in the middle of the night, it's to drive continuous reduction of on-call incidents. You should get to the point where you don't even need an on-call (but have it just in case).


+1 the point of on call was to wake up the team who caused the problem in the first place, and presumably have the ability to fix it via code changes etc.

This is in contrast to the model where there's a team that fields issues and tries their best to fix them, but ultimately has to put fix-it tickets to the software owners.

The on call rotations I've seen end up being a hybrid where you get the worst of both worlds. The team has enough responsibilities that no one person on the team knows how to handle everything and are so inundated with pages that it just becomes accepted that one person from the team every X weeks is going to be nothing but keeping the service from burning down.


Google have a set of SRE books, though how much they address SRE at Google Scale rather than Your Actual Situation is a point you may wish to consider.

https://sre.google/books/


I remember being on call my first time ever.

The pager goes off and I return the page.

It was the trauma surgeon.

“Hi there I have a guy who shot himself in the head. There is brain oozing out of the hole in his head. What do you want to do?

Later on, I switched careers from neurosurgery to computer science.

My first time being on call for Google was not nearly as stressful.


well you can't just leave us hanging; what did you do?


I guess you just put a band-aid on it and click 'resolved'


Close ticket - "As designed".


underrated comment.


The example week is quite concerning. She’s called multiple times for things that have no impact or importance. That needs to be fixed immediately.

My company has strict rules on what is allowed to escalate to on-call and the SLAs to handle that alert. It had to be breaking our overnight batch cycle or we’re not getting up. Brittle processes are stamped out immediately and management backs us up.


Thanks for sharing this. We are going through a similar organizational shift at $JOB where a large team has been split into several sub-teams that own services. It was valuable to see how you adapted the on-call rotation to this new structure.

On using a public channel for coordinating incident response, we have struggled in the past with stakeholders joining the meeting and offering well intended but ultimately distracting input during the incident. We’ve found it best to have one responder play the role of “incident commander” and manage external communication / rope in more stakeholders as needed. This helps avoid conditions that make the incident even more stressful, like say a member of upper management demanding frequent updates or spreading FUD.


We had a team of 3 or 4 developers in a small company with a few shrink-wrapped SW products. We took turns being on call for a week at a time. At first it was scary but I learned so much, talking directly to end users and discussing their issues definitely made me a better developer. At first we had a "tech support" engineer but it turned out that 90% of the calls had to be escalated anyway so we just switched to developers handling the calls.

At the next job I was also on call for systems we put on a manufacturing line. That was a lot more exciting and involved driving to the plant and try to get things going when production stops. Losing $10000 per hour when the line is stopped wakes you up real fast even if you just jumped out of bed 20 minutes ago.


I think what makes those work is being part of a small team. It's not so much that if Bob releases bad code you can go over and punch Bob, as that no-one wants to be responsible for waking their teammates up in the night to start with.

Another case of: can be done properly is a good work environment, and can be literal death when done badly.


> After getting more information, we realized it was not affecting customers and could wait until the team that owned the service came online.

This felt like a major red flag to me - why was the team that owned the offending service not the one receiving the page?


We considered that model, but decided against mandatory participation in off-hours on-call. The off-hours rotation is voluntary and paid.

This is a trade-off in order to reduce the number of people who have to be on-call at a given time and maintain a voluntary model. Generally, though, the person on-call will be from a team that owns at least a portion of the services they’re on-call for, and the services are distributed among the two rotations thoughtfully.


In many organizations the on-call is among a wider circle of people to make sure devs in small teams aren't on-call for their team's service every other week.


That's a recipe for disaster in my experience. Specifically: on-call burnout


Interesting, I've had the opposite experience - an on-call rotation that's too wide makes it difficult to know how to respond effectively to an alert, since it's likely to involve a service that you may know nothing about.


Yeah, someone asking you every 15 minutes for an update while you're trying to read the beginner's guide to service X and connect it to their question isn't a fun time.


Ah, I can see how my statement could be misinterpreted. I'm actually in agreement with you here: teams should only be on call for the services they own. Being on call for services you don't own is a total nightmare and leads to an exodus of competent engineers.


I think there is still a misunderstanding here. I'm not the person you originally replied to, but I read their message as having focus on the small team aspect. If there is a team of, say, 2-3 devs, their rotations cycle would essentially be every 2-3 weeks. Even if it's only on their service, they'd still be on rotation every 2-3 weeks. Alternatively, if there is a wider group of people, the people on smaller teams wouldn't need to constantly be on rotation.

I'm not sure if one is better than the other, but the assumption of only being on-call for services you own assumes some minimum team size, or as acceptance of being on-call every other week.


Opportunity.

1. Learn about X.

1.1 See that awareness of X is distributed through the team.

2. Document that X is not documented. Have X dev / support team document X. Configuration, monitoring, troubleshooting, recovery, escallation.

3. Scale expectations. If X is not documented and awareness is limited, then issues involving X are not expected to be resolved by on-call staff and will be addressed when X dev team are available.

4. Scale opportunity/risk of X based on the actual experience and downtime it affords.

SRE is fundamentally a risk-management exercise. Manage the risks. Assess them, mitigate them, reduce them. And measure the damned costs of them.


I've had jobs where there's been on-call and you can have the pager all week and nothing happens. I was very fortunate to be in a team that had a well matured and very stable service, however when it did go down, it went down hard. Recovery was a well oiled and documented procedure and the environment could be brought back up in under an hour (provided nothing else went wrong -- it rarely did).

Today, my job is very different. It's a whole different playbook. 6 months into my role, I was asked to go on the on-call rotation. I learned enough to get through it, thankfully one of my colleagues were available in case I hit a situation I couldn't handle. Almost 2 years into the job and I don't sweat it anymore, I know enough of my service to fix most issues and now appreciate the escalation procedure. I will kick it into gear, even if I need a storage admin to give me a few GB of disk space to get me through the weekend. :)


This is kind of shocking. Most of these are false alerts and negatively impacting this persons day to day life (who takes a laptop to a park!).

If that was me I'd either take myself off the on call until the false positives had been fixed or look for a new job if that wasn't allowed. I suspect that this was forced upon this person.

Being bubbly and nice about being abused like that is terrible and bad for the industry as it signals that you can get away with accidently calling people at all hours. I would say for each false positive a top priority ticket should have been put in such that all work is dropped until it's fixed. The "What I Learned" section is mind blowing. She thinks she got lucky !

Sounds like the company has many problems though like "a piece of system architecture due for retirement and no longer serving customer traffic" I mean wtf !


This is what burned me out on the best project I've ever worked, dev wise.

I still get triggered when I hear the Nokia ringtone.

First, we were all offered good rates for on-call support so we had plenty of people who wanted in so I wouldn't do any support and leave it to the guys who wanted the money.

Then some of the guys left the team and got replaced, they offered a different deal to the new guys, less lucrative, therefore, they didn't want to do it.

As a senior I had to start doing support, I thought the extra money would maybe make me feel ok, then we had less and less people on rota so I was doing support most weekends which started affecting my marriage.

I escalated the issue (again), nothing was done, one month later I gave my notice on the highest paying and best dev job I had.

I was on-call for full month straight during my notice, that's how bad it got. And I steer clear from dev with support roles now.


This person is straight-up brainwashed, lol.


I think that there’s a tendency online to attempt to gaslight certain ideas into reality and it creates a tension because they are wrong but it can be unwise to call them out.


Don't worry, it's not robbing you of the the evenings and weekends, which "maybe you like", it's a fun new challenging opportunity. Have you had a chance to panic yet on Saturday morning at 3 AM? Would you like to? It's such a thrill.


See also: exciting hackathons where you get to work late nights and weekends on "fun ideas" that the leadership would like implemented ASAP.


What's been everyone's experiences/takes/opinions, specifically with not doing Hackathons? My org is moving them to quarterly-we used to have one yearly-I skipped the last one, probably going to skip the next one but I wonder how long that's going to last before someone in management sticks a head up and asks why I'm not participating (a question for which I have an answer ready to go)

(edit: and by "hackathon" because I realize there's a bit of room for ambiguity, I mean internal corporate Hackathons where the company "allows" dev teams to take a "break" from "regular" work in order to ideate and build new features and functionality that--if their idea gets chosen--becomes "regular" work in and of itself. It's those kinds of hackathons I've started removing myself from)


It's not a hackathon if my idea needs to be chosen and ordained by Papal seal. I already have that, it's called work.


Testify.


I'd just pick something from the backlog that I don't think is prioritized highly enough, and do that.


How is gaslighting relevant here?


As soon as I read "volunteer on-call team" my wtf-o-meter got pegged.


Probably the author is 22 years old or so; otherwise I can't understand it.


Although they said it was a "paid volunteer" (which I don't think is what a volunteer is) so maybe it was really good - time and a half at least I'd hope.


I'd say work addict is a better term. The dream of all employers.


I can't help but notice none of these items actually needed to get fixed.


I think you may be correct.

>> We spent about 45 minutes looking at the alert catalog, the runbook for the services and trying to fix the underlying problem. After getting more information, we realized it was not affecting customers and could wait until the team that owned the service came online.

Well, now, that is not a sev 1 alert. If you're on-call, you have the power to adjust alerts! Change it to a sev 2 so it does not alert future on-calls unneccessarily.


> If you're on-call, you have the power to adjust alerts!

Amen.

Sounds like the author went even further: rather than just tweaking an alert, she deleted one, then deleted the whole service. That won't page again!

> Change it to a sev 2 so it does not alert future on-calls unnecessarily.

Or restructure the alert. I prefer SLO-oriented pages like "we've burned 1% of our quarterly error budget in the last hour" rather than cause-based pages like "CPU usage has exceeded 90% for 10 minutes". If you are getting paged based on something that doesn't affect customers, it might be a sign that you have the wrong alerts so you can't tell if they represent something important or not. Manually toggling the severity each time it's wrong might not be good enough.

In terms of overall operational load though this week doesn't look bad at all: four surprise pages in seven days. After her fixes the next week should be lighter (knock on wood).


Spent about a decade on-call...It is an interesting and perhaps even worthwhile experience to do it for a little bit.

IMHO there are two angles to achieve higher reliability:

1. Build systems which go down less on their own (which is quite difficult)

2. Get good at fixing them fast when they go down (which is much easier)

A trivial example of 1 might be moving from no RAID to RAID1, an example of 2 might be getting on-call proficient at quickly responding to a dead disk and restoring from backups.

(But I did say 1. was hard, even in this simple example maybe you used hardware raid, and are finding out just how crappy HW raid can be ;)

Companies attempt both angle 1 & 2 to differing degrees. I worked a lot in 2, and it sucked. We were a proficient fire department in a city of straw houses and gas stoves.

Aside from the general suckage of on-call (24 hour days, many, many nights lost sleep), the work is by it's nature high risk, high urgency, high impact, high risk - but is rarely rewarded sufficiently in pay or promotion opportunities.

Having spent so much time on angle 2, I've decided it is largely shallow work, and to grow intellectually I needed harder more interesting problems, which meant trying angle 1.

Consequently now my title is SWE and I don't do out of hours on-call, and it is glorious. I do notice I see the world differently to my less scarred fellow SWE. It definitely changed how I write software, and view the product priorities.

At heart I still consider myself an SRE. It's just I had to place myself in the most impactful place to achieve it, as a regular product/feature dev.


I hate on-calls. I used to love programming, building systems, making cool things. But this is about 20% of my job as a SWE in a big tech company. Now I need to maintain several fragile systems, deal with angry users, deal with management which pushes me to add features rather than making our systems more reliable...

I'm staying primarily for the money but I'd rather do something else and keep programming as a hobby.


> Now I need to maintain several fragile systems, deal with angry users, deal with management which pushes me to add features rather than making our systems more reliable...

Only one of these is a real problem, the other 2 are essentially derived from that.


Career-long sysadmin/SRE/SRE Team Lead, here. I've worked at large shops (10-30 million end users) and some shops where 99.98% is the SLA to prevent millions of dollars of losses in supply chain.

First of all, I appreciated this diary because Anna took the task with a positive attitude and as a learning experience. Thanks for writing this. To see an old problem through new eyes is inspiring.

I have numerous, "hot-take" criticisms of your current organization's practices, but I'm not sure I have all the context yet. The one suggestion I will make is: if you're not already using it - clone https://github.com/pagerduty/incident-response-docs/ and modify it to meet your organization's needs. Then, have it blessed as policy by management and train SREs and Devs on it.

To the other comments: I see there's a lot of people here who say they'd never do the SRE job, or return to doing it. I'm not discounting your fear or feelings of burnout. Been there. But, hear me out:

DevOps is not just about CI/CD pipelines and monitoring and PagerDuty. It's about having a culture where developers don't throw operational or security poop over a wall of confusion at sysadmin types, or at their peers. This kind of organizational dysfunction can be devastating to a business.

DevOps at its best is about empathy. One of the best places I ever worked was filled with developers who had true empathy. They realized that an error or omission in their work could wake up their ops team at stupid o'clock in the morning, repeatedly - leading to all the things that drive SREs and on-call folks literally insane. They practised strict TDD.

These developers volunteered to be second-tier on-call after the ops team did triage, out of kindness to their coworkers. Management also led a culture of defending time to find permanent solutions that drove measured improvements in SLIs.

SRE isn't about waking up at stupid o'clock every night to press buttons. It's about having a culture of driving permanent fixes, supported by cost-effective and appropriate cloud architectures. It's also about leading the working agreements with engineering teams to do blameless post-incident retros together, making the work bring your teams closer instead of pushing them apart.

I can't help but take away that a lot of you feel like On-Call heroics are what SRE is about. It's more difficult than that, but also less stressful, and simultaneously more rewarding when you get it right.


Being on call is awful.


>Perhaps I got exceptionally lucky this week, but overall, it was enthralling, and I’m glad I volunteered.

You were very, very lucky. Most of my on-call has been the same issues again and again, and I have stopped caring. Server A for customer B fails to update because customer B refuses to pay for extra disk space, time and time again. Corporate president C cannot send email because the idiot doesn't respect the maintenance windows he himself agreed to - yet he refuses to pay for more server resources...


So, false positives, including something that 'self-resolved', and a flaky test? Time was wasted here.

I like to promote ownership by making each engineer responsible for their own area. They must understand the business well enough to judge whether a fix can wait.

We're fortunate to have a 24/7 customer support team who both attempts to identify quick workarounds and aggregates reports before notifying the head of the relevant team.


The worst are the attempts to sneak in on-call responsibilities. A while ago I interviewed for a "DevOps" role, and it turned out that in addition to DevOps they also wanted you to troubleshoot user (business) issues, be on-call, and ideally do some on-demand coding too. Sure, why not throw in project management and marketing as well, for good measure...


This is great. Having the engineers who built the thing be the same people who respond to pages is a great way to incentivize robust systems. If you built it, you likely know how to fix it better than anyone else, and if you don't want to be disturbed in the evening, you will think about how to better deal with faults during the early stages of development.


I'm sorry, I'm going to have to chew you out here.

Firstly: when was the last time your on-call rotation consisted exclusively of programs you had written yourself? How many systems does your team look after that it "inherited" from elsewhere?

Secondly, how many organisations allow developers to prioritise solving pages over other forms of revenue generating work? How many teams demand negotiation with a "product owner" before a piece of work can be added to the board?

Thirdly, how many pages are truly easy or tractable to fix? I have worked on teams where we were constantly paged due to external API failures, but intra-team disagreements meant we could neither punt the responsibility elsewhere nor resolve the problem. The problem wasn't technical, it was social - but there's no PagerDuty for absentee engineering managers.

Fourthly, how often is the Real Problem™ for the reliability and uptime of a given system actually at the program level, rather than the architecture or system level? How many programs are hamstrung by their internals alone? Once you get past the basics, most of the _really_ significant decisions affecting reliability are made at a system-design level that Johnny or Jane Developer isn't empowered to fix in a Scrum "sprint".

Really, this "shitty on-call incentivises robust systems" argument is facile. It's paper-thin. Put it out in the sunlight for a second and it crumbles. It only makes sense, tentatively, under idealised conditions where developers alone are responsible for non-functional requirements and the usual employer/employee relationship is suspended.

Think about it for a second. It's just rot.


I've been part of an on-call rotation at both my current and my previous job. Both had the same rules, because I and other engineers insisted on them.

* You are only on-call for systems you can directly change.

* The person who is on call has full flexibility to work on anything that improves the life of the on-call person for that time.

* Anything that wakes people up in the middle of the night gets prioritized above new feature work.

* There is a pager set of monitors and a message-only set of monitors. If a page goes off and there is nothing the on-call engineer can do about it, it gets moved to the message-only channel or removed, because it is a bad monitor (a rough sketch of this split follows below).
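
A rough sketch of that last rule in Python (all names are hypothetical); the only question a pager monitor is allowed to answer is "can the on-call engineer act on this right now?":

    # Sketch: route alerts by whether on-call can actually act on them.
    from dataclasses import dataclass

    @dataclass
    class Monitor:
        name: str
        actionable_by_oncall: bool

    def route(monitor: Monitor) -> str:
        """Return where a firing monitor's alert should go."""
        if monitor.actionable_by_oncall:
            return "pager"            # worth waking someone up
        return "message-channel"      # review in working hours, or delete

    print(route(Monitor("disk-almost-full", True)))       # -> pager
    print(route(Monitor("vendor-api-flapping", False)))   # -> message-channel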

I discussed this list of rules when I interviewed, and the job description included being on-call. It wasn't a negotiation. If the rules aren't followed, I remove the monitors. I'm sorry you had a terrible work environment, but I encourage everyone to hold professional standards. I hope you find a place where you can.


> Firstly: when was the last time your on-call rotation consisted exclusively of programs you had written yourself? How many systems does your team look after that it "inherited" from elsewhere?

Agree it's not fair to be on-call for things you can't fix. That's silly.

But surely you're not saying that when you push a change to a 24/7 mission-critical system and you're in the "git blame", you're not interested in taking responsibility if the consequences occur after 5pm? That an SRE now has to learn everything you know in order to fix it, instead of asking for your help?

Also, think about it a different way: you personally are not the one I'd ask to participate in on-call. It's your whole team, maybe even your whole department. You and your team/department should work out how to field that together, because you have more context on each other's work/life balance, projects, skills, and risky commits. Your team would manage its own rotation.

> Secondly, how many organisations allow developers to prioritise solving pages over other forms of revenue generating work? How many teams demand negotiation with a "product owner" before a piece of work can be added to the board?

Tell me about it. Totally agree. This is where I focus much of my energy as part of leadership teams: making the rest of leadership realize they're over-promising on feature delivery without concern for their poor operations teams.

Frankly, fuck sales-driven development for this. I even work with marketing departments on value-stream mapping, so that good ops and security become part of our core value proposition to customers and investors. If they don't get it? Well, I guess the fish is already rotting from the head, as they say.

> Thirdly, how many pages are truly easy or tractable to fix?

If they're repeated ad nauseam? Then those alerts need to go. Now. And if they can't go, because management doesn't see the pain they're causing? See my response to #2. Culture change comes first.

> Fourthly, how often is the Real Problem™ for the reliability and uptime of a given system, actually at the program level, and not at architecture or system level?

Depends on your industry/platform. Good SRE sysadmin troubleshooter types will do everything they can before they push the escalation button to the engineering tier. At the very least they'll gather all the data and metrics they can and document the situation at hand before the engineer comes online.

Sometimes you don't really know what caused a problem until you all get together and retro it. Better to stay blameless, even after you have consensus on the technical factors.

One time, it was a leap-second messing up some 3rd party vendor software, with consequences for our code base that nobody ever expected.

Another time, it was a tired SRE combined with a lack of proper procedure combined with a code mistake with no test coverage.

Another time, it was a binder that fell on a customer's space bar effectively DoS'ing a system that had no rate limiting.

Some of those got fixed pretty quickly by an SRE.

BUT - whatever the cause... the times our MTTR got truly trashed (we're talking DAYS)? It was because the person who wrote the code and had all the context had quit long ago, leaving us with no proper logging and no one to help us. The organization had never prioritized replacing their code or hiring someone who could understand it.

> Really, this "shitty on-call incentivises robust systems" argument is facile.

You're right in the sense that, considering all of the above, I would need your buy-in. To get it, I'd seek to eliminate the "rot". I'm truly sorry you got burned in the past; I have been there too. When it was too much for me, I quit and found somewhere better. Now I just keep working with people I trust as they move around the industry, to avoid all the org B.S. you've rightly brought up.


Sorry, but no. I keep my 9-to-5 strict. I'm not into the game of trading my scarce free time for more money. If the company needs people to work on Sunday mornings, it can hire them (i.e., SREs). I can't understand why it has become so normal for regular (senior) developers to become slaves to their companies by working more than 40 hours a week; the usual excuse is "you build it, you run it, you fix it". If we're talking about writing robust software, I'm all in, but that's a totally different thing.


In a world where people leave every 1-2 years to stay current with the market, more often than not the failure will be related to something someone else did a long time ago.


That sounds amazing until you get a complicated system where the part you work on relies on an unstable dependency you have no control over. Oh no, my service isn't responding to 99.9% of requests within 250ms!? Boom, page at 2am. The service actually responsible either doesn't have monitoring set up correctly (or at all), or it aggregates its metrics differently: averaged across all callers its latency looks like 5ms, but the calls coming from you take 2-3 seconds.
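
To see how an average hides that kind of tail, consider a made-up sample (hypothetical numbers, not from this thread):

    # Sketch: a mean looks healthy while the tail is what pages you.
    import statistics

    latencies_ms = [5] * 990 + [2500] * 10   # 99% fast calls, 1% very slow

    print(statistics.mean(latencies_ms))                    # ~30 ms, "fine"
    print(statistics.quantiles(latencies_ms, n=1000)[-1])   # p99.9 = 2500 ms

The dependency's dashboard reports the mean and looks healthy; your latency SLO sees the tail, and you get the page.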

It's a nightmare that I escaped after three years. It turned me from a happy person into someone who hated his life, and it took me a while to realize the job was the cause. The worst part: my management told us the on-call pay was already "baked into your salary." I switched jobs internally. No more on-call. Strangely, I kept my pay without having to wake up in the middle of the night anymore.

Oh yeah, and you aren't the only one who wakes up. The person sleeping in your bed may be an insomniac who has only just fallen asleep; your alert wakes them, and now they don't sleep for the rest of the night. Now you're both sleep-deprived and irritable. It obliterates personal relationships. Maybe your kids overhear you, in the next room, explaining to someone what is going on. Whenever anyone asks you about your job, the PTSD triggers and you spend ten minutes venting.

I am so, so, so glad I'm no longer on call. I don't know what they'd have to pay me to get back on it, but it's at least 2x, and even then I couldn't last more than a year.

You don't get to go to movies. You don't get to join your friends when they go biking or hiking or camping. You get interrupted in the middle of a parent-teacher conference. You have a laptop next to you at the 4th of July party. You get interrupted in the middle of a shower. You don't get to host Thanksgiving or Christmas because you may have to work in the middle of making dinner. You're about to roll the dice to take the first turn of a game you've been promising your kids you'd play with them - your first attempt in a long time to be a decent parent - and your alarm goes off, you work for the next three hours, and they play without you.

There is no 40-hour work week for software engineers. There is no union. You work your normal days, and then you also get to work normal nights. Then you work a normal day again, either fixing the problem that caused the alert or spending six hours in post-mortems explaining what happened, when instead you should be sleeping.


Multiple holidays ruined by being on-call (New Year's Eve more than once due to year-end bugs; sitting in another room on a laptop and phone while everyone else has Thanksgiving), countless weekends, disturbed sleep patterns. "Daring" to try to enjoy your weekend, like this person said. I for one am tired of it.


WTF, you're on call while on holiday? That seems awful. I'd say "do teams really do that?", but obviously they do. That's just terrible management IMO, and a real incentive to take up hiking or anything else out of cellphone coverage.


I always love taking holiday on-call. Code freezes mean it's usually quiet, and we get paid for the full 12 hours (about 3x the on-call pay we'd get on a normal weekday).


Super glad that our app only goes down when third parties go down (so there's nothing for me to do). It's nice to have a stable app on infra you don't need to maintain.


Article is complete drivel tbh....



