> Anecdotally, companies fall into a trap where individual engineers own services that they must be paged for. These engineers don't get the time to make the service reliable.
I think that’s where I’d draw the line if I was in such a situation. If I’m on the hook for dealing with any outages, then I won’t be working on anything else until the service is reliable.
There really is no end to this, though: a service can always be made more robust. The reputational damage of missing a page is extremely high. The risk to the company of missing a page can be high (although hopefully not too high if it's a solo engineer project).
It can, but I don't think it's too hard to make it robust enough that pages become a rare occurrence. At my company (a small startup with <5 developers - so we're not exactly flush with development capacity) we've managed to reduce outages in a previously quite buggy app to almost nothing over about 18 months (while also developing new features). I don't think we've had an outage in the last year which wasn't due to either:
- An outage at our underlying hosting provider (which we can't do anything about anyway)
- A code deploy (which we can plan and control the timing of)