By what turned out to be a complete coincidence, the incident occurred at precisely the same time that we were performing an infrastructure upgrade, adding RAM to our VMs.
I can assure you that these "coincidences" happen all the time, and will cause you to question your very existence when you are troubleshooting them. And if you panic while questioning your very existence, you'll invariably push a hotfix that breaks something else and then you are in a world of hurt.
Murphy's law is a cruel thing to sysadmins and developers.
Completely agree. I’ve triaged many outages with varying degrees of severity in my career so far, and the worst ones were always caused by someone panic-jumping onto some red herring rather than coming up with a sensible explanation as to why that would be a fix other than “it happened at the same time.”
I have a saying I really like to throw around, which is “if you don’t know why/how you fixed it, you may not have”
> I have a saying I really like to throw around, which is “if you don’t know why/how you fixed it, you may not have”
There is coincidental function, and planned function.
If you just throw numbers at a configuration parameter until it works and then call it "fixed", it may be fixed, but that's just coincidental. If the system load changes, you can't really tell how to modify the value afterwards.
On the other hand, if you can explain why a certain setting will fix it, that's planned function, and if something changes, you most likely know how to proceed from there.
At my workplace we were made aware of some instability issues, and the first place to look was of course the logs. It had apparently been happening for a while, but the most important parts of the system were not noticeably affected. This was early morning and we didn't have much traffic.
We noticed a lot of network connection errors for SQS, so one of the team members started blaming Amazon.
My response was that if SQS had these instabilities, we would have noticed posts on HN or downtime on other sites. The developer ignored this and jumped into the code to add toggles and debug log messages.
I spent 2 more minutes and found other cases where we had connection issues to different services. I concluded that the reason why we saw so many SQS errors was because it is called much more frequently.
The other developer pushed changes to prod which caused even more problems, because now the service could not even connect to the database.
He wrongly assumed that the older version of the service, which was still running, was holding on to the connections, and killed them to make room for his new version. We went from a system with instabilities to a system that was down.
The reason was a network error that was fixed by someone in the ops team.
I hate it when button mashers are in charge of incident response.
Things like when traffic ramps up in the morning and the site falls over and someone in charge blames the deployment the night before and screams "rollback", and then it eventually turns out to have nothing at all to do with the deployment.
To be fair, if your rollbacks are cheap (1-3 button pushes), safe, and automated, it doesn't really hurt to try a rollback while you look for other causes.
Most of my production experience precedes CI/CD as a concept by a number of years. You still don't know though if the rollback fixed it or if just bouncing the software alone fixed some kind of brownout, and you may not be able to replicate what happened and get stuck unsure of what the bug was and be in a state where you don't know if you can roll forward again or not. Then a developer throws a bit of spackle on the code to address what the problem "might" be -- then when you roll forward people assume that fixed it, which it might not, which is how cargo cults get formed.
Oh man, last week we had a small outage. Database queries took much longer than normal. I just so happened to be doing ad-hoc querying on the same table at the same time.
“Luckily”, the problem was unrelated to my querying but two coincidences are proper scary.
Think about the optics of your funny joke. You’re saying you’ll advertise that the interns are replaceable fodder. Even if that’s true, it’s quite a dick move.
I’ve seen enough brainless execs and startup founders around here that you have to take stupid ideas like that seriously. I hope it was a joke, but it’s not very funny for an intern.
Our last batch named -themselves- "minions" and there's still little yellow dudes everywhere in the images they posted to internal comms platforms.
I agree that some care would need to be taken to ensure that the redshirts felt like active participants in the joke rather than the butt of it, but at least in a reasonably informal environment it could, in fact, be very funny -to- the interns themselves.
(though as tech lead, I always make sure that there are a decent number of jokes flying around directed squarely at me, which probably helps set a baseline of how seriously to take such things - and if in doubt about any particular joke, I just aim it at myself ;)
A little more credit to people for making a funny, and fewer assumptions that people are that boneheaded, would make the world a better place, to be sure. I, for one, rather enjoy being able to see the humor and have a laugh rather than getting all enraged and venting on the internet to show some sort of self-perceived superiority.
Nah the ones spewing self perceived superiority are the folks who would call interns red shirts or make other off color jokes that end up hurting others and don’t even know it.
Yeah, the "coincidence" leads you to jump to conclusions about your change being the cause.
This is very human, we do it all the time.
However having been through these enough times I have picked up a habit of questioning more assumptions and not flagging something as verified data before it is.
I'm still nowhere near perfect at removing these biases and early conclusions but it has helped.
Oh yes, the number of times I rolled back a change during an outage that had nothing to do with the outage...
It's a critical skill for engineers: being able to critically reason, debug and "test in isolation" any changes that address outages. Much harder than it seems and typically a "senior" skill to have.
That's essentially on-call 101: you try to mitigate as fast as possible and root-cause later, when you are not half asleep. Surprised to see some of the comments here.
I was one of the users that went and reported this issue on Discord. I love Kagi but I was a bit disappointed to see that their status page showed everything was up and running. I think that made me a bit uneasy and it shows their status pages are not given priority during incidents that are affecting real users. I hope in the future the status page is accurately updated.
In the past, services I heavily rely on (e.g. GitHub) have updated their status pages immediately, and this allows me to rest assured that people are aware of the issue and it's not a problem with my devices. When this happened with Kagi, I was looking up the nearest open grocery stores since we were getting snow later that day, so it was almost like I got let down b/c I had to go to Google for this.
I will continue using Kagi b/c 99.9% of the other time I've used it, it has been better than Google but I hope the authors of the post-mortem do mean it when they say they'll be moving their status page code to a different service/platform.
And thanks again Zac for being transparent and writing this up. This is part of good engineering!
As an engineer on call, I have been in this conversation so many times:
"Hey, should we go red?"
"I don't know, are we sure it's an outage, or just a metrics issue?"
"How many users are affected again?"
"I can check, but I'm trying to read stack traces right now."
"Look, can we just report the issue?"
"Not sure which services to list in the outage"
...and so on. Basically, putting anything up on the status page is a conversation, and the conversation consumes engineer time and attention, and that's more time before the incident is resolved. You have to balance communication and actually fixing the damn thing, and it's not always clear what the right balance is.
If you have enough people, you can have a Technical Incident Manager handle the comms and you can throw additional engineers at the communications side of it, but that's not always possible. (Some systems are niche, underdocumented, underinstrumented, etc.)
My personal preference? Throw up a big vague "we're investigating a possible problem" at the first sign of trouble, and then fill in details (or retract it) at leisure. But none of the companies I've worked at like that idea, so... [shrug]
This is exactly why those status pages are almost always a lie. Either they need to be fully automated without some middle manager hemming and hawing, or they shouldn’t be there at all. From a customer’s perspective, I’ve been burned so many times on those status pages that I ignore them completely. I just assume they’re a lie. So I’ll contact support straight away - the very thing these status pages were intended to mitigate.
The simple fix is to have a “last update: date time”
Or you can build a team to automate everything, force everyone and everything into a rigid update frequency, and watch that frequency become a metric that applies to everyone and the bane of the existence of your whole engineering organization.
I think your bit at the end is the most important.
ANY communication is better than no communication. "Everything is fine, it must be you" is the worst feeling in these cases, especially if your business is reliant on said service and you can't figure out why you are borked (e.g. the GitHub ones).
Your point highlights thinking about what's being designed.
"Everything is fine" is different from "nothing has been reported". A green light is misleading when the real state is unknown: there should be no green at all, just a note that nothing has been reported, and that's not the same as a green light.
Once an ISP support person insisted that I drive down to the shop and buy a phone handset so I could confirm presence of a dial tone on a line that my vdsl modem had line sync on before they’d tell me their upstream provider had an outage. I was… unimpressed.
IMHO, any significant growth in 500s (that's what I was getting during the outage) warrants mention on status page. I've seen a lot of stuff, so if I see an acknowledged outage, I'll just wait for people to do their jobs. Stuff happens. If I see unacknowledged one, I get worried that people who need to know don't and that undermines my confidence in the whole setup. I'd never complain if status page says maybe there's a problem but I don't see one. I will complain in the opposite case.
Not necessarily. The situation can be genuinely unclear to the point where it is a judgement call, and then it becomes a matter of how to weigh the consequences.
Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.
Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.
Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.
Problems: Any issue with the system that performs the "up or not?" source of truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages and latency still sucks.
Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.
Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.
Eventually, the monitoring service gets axed because we can just manually update the status page after all.
Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.
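For what it's worth, the consensus logic at the heart of Stage 3 is only a few lines; the expensive part is building and running the probe fleet around it. A minimal sketch in Python (all names illustrative, not any monitoring vendor's API):

```python
# Sketch of Stage 3: flip the public status only when a majority of
# independent probe locations agree. Probe results are assumed to arrive
# as simple "up"/"down" strings, one per vantage point.
from collections import Counter

def consensus_status(probe_results, quorum=0.6):
    """Return 'down' only if at least `quorum` of probes agree."""
    if not probe_results:
        return "unknown"  # no data is not the same as green
    counts = Counter(probe_results)
    if counts["down"] / len(probe_results) >= quorum:
        return "down"
    return "up"

print(consensus_status(["up", "down", "up", "up", "up"]))  # up (one flaky probe)
print(consensus_status(["down", "down", "down", "up"]))    # down (real outage)
```

A quorum above 50% means a single flaky vantage point can no longer flip the status page on its own, which is exactly the Stage 2 failure mode.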
Well, in the real world it might. It should trigger a bug creation and a fix to the code, but not an incident.
Now all of a sudden to decide this you need more complex and/or specific queries in your monitoring system (or a good ML-based alert system), so complexity is already going up.
If your service is returning 5xx, that is the definition of a server error; of course that is degraded service. Instead we have pointless dashboards that are still green an hour after everything is broken.
Returning 4xx on a client error isn't hard and is usually handled largely by your framework of choice.
> Returning 4xx on a client error isn't hard and is usually handled largely by your framework of choice.
> Your argument is a strawman
That's... super not true. Malformed requests with gibberish (or, more likely, hacker/pentest-generated) headers will cause e.g. Django to return 5xx easily.
That's just the example I'm familiar with, but cursory searching indicates reports of similar failures emitted by core framework or standard middleware code for Rails, Next.js, and Spring.
If input validation is not present in your framework of choice then the framework clearly has problems.
If you do not validate your inputs properly I am not sure what you are doing when you have a user facing applications of this size. Validating inputs is the lowest hanging fruit for preventing hacking threats.
It's usually handled by the framework, but you may have to write some code, and I'd expect my SaaS provider to write that code so that I know whether their service is available or not.
I'm only replying to the praise here - I too, although I haven't fully switched, had a very enticing moment with Kagi when it returned a result that couldn't even be found on Google at any page in the results. This really sold me on Kagi and I've been going back and forth with some queries, but I have to say that between LLMs, Perplexity, and Google often answering my queries right on the search page, I just don't have that many queries left for Kagi.
If Kagi would somehow merge with Perplexity, now that would be something.
Kagi does offer AI features in their higher subscription tier, including summary, research assistance, and a couple others. Plus I think they have basically a frontend for GPT-4 that uses their search engine for browsing, and they just added vision support to it today.
I don't subscribe to those features or any AI tool yet; just pointing out there could be a version of Kagi that is able to replace your ChatGPT sub and save you money.
Is it as good as Perplexity though? I use ChatGPT for different purposes, I just thought that if Kagi would ally with Perplexity and benefit from its index (I'm not sure what Perplexity uses), it could get really good. I've only recently tried using Perplexity and I get more use out of it than I would with Kagi, it doesn't just do summarization, but I haven't seen what Kagi does with research assistance.
It's been a while since I've used Perplexity, but I've been finding the Kagi Assistant super useful. I'm on the ultimate plan, so I get access to the `Expert` assistant. It's been pretty great.
I envy your experiences with other services. I've never seen any service's status page show downtime when or even soon after I start experiencing it. Often they simply never show it at all.
Aaaahhh, it's crazy how much this incident resonates with me!
I've personally handled this exact same kind of outage more times than I'd care to admit. And just like the fine folks at Kagi, I've fallen into the same rabbit hole (database connection pool health) and tried all the same mitigations - futilely throwing new instances at the problem, the belief that if I could just "reset" traffic it'd all be fixed, etc...
It doesn't help that the usual saturation metrics (CPU%, IOPS, ...) for databases typically don't move very much during outages like these. You see high query latency, sure, but you go looking and think: "well, it still has CPU and IOPS headroom..." without realizing, as always, lock contention lurks.
In my experience, 98% of the time, any weirdness with DB connection pools is a result of weirdness in the DB itself. Not sure what RDBMS Kagi's running, but I'd highly recommend graphing global I/O wait time (seconds per second) and global lock acquisition time (seconds per second) for the DB. And also query execution time (seconds per second) per (normalized) query. Add a CPU utilization chart and you've got a dashboard that will let you quickly identify most at-scale perf issues.
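To make the "seconds per second" framing concrete: most databases expose wait times as cumulative counters, so each dashboard point is just the delta between two samples divided by the scrape interval. A hedged sketch (the counter semantics are generic; map them to whatever your RDBMS actually exposes):

```python
# Turn a cumulative wait-time counter (e.g. total lock-wait seconds since
# startup) into the "seconds waited per wall-clock second" a dashboard wants.
def wait_rate(prev_total_s, curr_total_s, interval_s):
    """Rate over one scrape interval. A value approaching the number of
    active connections means the database is close to fully stalled."""
    return (curr_total_s - prev_total_s) / interval_s

# e.g. a lock-wait counter moved from 120.0s to 178.0s across a 10s scrape:
print(wait_rate(120.0, 178.0, 10.0))  # 5.8 seconds of waiting per second
```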
Separately: I'm a bit surprised that search queries trigger RDBMS writes. I would've figured the RDBMS would only be used for things like user settings, login management, etc. I wonder if Kagi's doing usage accounting (e.g. incrementing a counter) in the RDBMS. That'd be an absolute classic failure mode at scale.
They would have some writes indirectly due to searches, say if someone chooses to block a search result. They’re also going to have some history and analytics surely.
But yeah, it’s not obvious what should cause per-search write lock contention…
You know, in retrospect, I think Kagi expects O(thousands) searches per month per user, so doing per-user usage accounting in the DB is fine -- thanks to row-level locking.
Well, at least until you get a user who does 60k "in a short time period"... :-)
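If per-user accounting in the DB ever does become the bottleneck, the classic mitigation is to batch increments in the application and flush an aggregate every few seconds, so 60k searches from one scraper become a handful of row updates instead of 60k row-lock acquisitions. A rough sketch (the flush target is deliberately abstract; this is not a claim about what Kagi actually does):

```python
# In-memory usage batching: cheap increments on the hot path, one
# aggregated write per user per flush interval.
from collections import Counter
import threading

class UsageBatcher:
    def __init__(self, flush_fn):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._flush_fn = flush_fn  # e.g. one UPDATE ... searches = searches + n

    def record(self, user_id, n=1):
        with self._lock:           # in-process lock, not a DB row lock
            self._counts[user_id] += n

    def flush(self):               # call from a timer, e.g. every few seconds
        with self._lock:
            pending, self._counts = self._counts, Counter()
        for user_id, n in pending.items():
            self._flush_fn(user_id, n)

writes = []
b = UsageBatcher(lambda uid, n: writes.append((uid, n)))
for _ in range(60000):
    b.record("scraper")        # the abusive account's worth of traffic
b.record("normal_user")
b.flush()
print(writes)  # [('scraper', 60000), ('normal_user', 1)] -- 2 writes, not 60001
```

The trade-off is losing up to one flush interval of counts on a crash, which is usually fine for usage stats but not for billing-grade metering.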
I once had a stock alert product running on a backend I wrote. One person signed up for alerts for every single Nasdaq ticker there was. We didn’t expect that.
This is something that every start-up company ends up going through at some point. I’ve been there and it’s painful!
Sometimes you just don’t have enough time or resource to build the capabilities that would stop an issue like this. Sometimes you didn’t even think a particular issue could even happen and it comes and bites you.
The transparency is important, the learning is important, but also (sometimes) the compensation is important. Kagi should consider giving some search credits for the time that we were unable to use the service. Especially as the real-time response was inadequate (as they freely admit).
An outage for a paid-for service is not the same as an outage of a you’re-the-product service
That speaks volumes about the observability they have of their internal systems. It's easy for me to say they should have seen it sooner, but the right datadog dashboards and splunk queries should have made that clear as day much faster. Hopefully they take it as a learning experience and invest in better monitoring.
Hi there, I'm Zac, Kagi's tech lead / author of the post-mortem etc.
This has 100% been a learning experience for us, but I can provide some finer context re: observability.
Kagi is a small team. The number of staff we have capable of responding to an event like this is essentially 3 people, spread across 3 timezones. For myself and my right-hand dev, this is actually the very first step in our web careers - which is to say that we are not some SV vets who have seen it all already. That we have a lot to learn is a given, but from building Kagi up from nothing, I am proud of how far we've come & where we're going.
Observability is something we started taking more seriously in the past 6 months or so. We have tons of dashboards now, and alerts that go right to our company chat channels and ping relevant people. And as the primary owner of our DB, GCP's query insights are a godsend. During the incident our monitoring went off, and query insights pointed at the "culprit" query - but we could have all the monitoring in the world and still lack the experience to interpret it and understand what the root cause is, or what the most efficient mitigation is.
In other words, we don't have the wisdom yet to not be "gaslit" by our own systems if we're not careful. Only in hindsight can I say that GCP's query insights was 100% on the money, and not some bug in application space.
All said, our growth has enabled us to expand our team quite a bit now. We have had SRE consultations before, and intend to bring on more full or part-time support to help keep things moving forward.
Hi Zac, thank you for chiming in here. Been using Kagi since the private beta, and have been overwhelmingly impressed by the service since I first used it.
Don't worry too much about all the people being harsh in the comments here. There's always a tendency for HN users to pile on with criticism whenever anyone has an outage.
I've always found this bizarre, because I've worked at places with worse issues, and more holes in monitoring or whatever, than a lot of the companies that get skewered here. Perhaps many of us are just insecure about our own infra and project our feelings onto other companies when they have outages.
Y'all are doing fine, and I think it's to your credit that you're able to run Kagi's users table off a single, fairly cheap primary database instance. I've worked at places that haven't much thought to optimization, and "solve" scaling problems by throwing more and bigger hardware at it, and then wonder later on why they're bleeding cash on infrastructure. Of course, by that point, those inefficiencies are much more difficult to fix.
As for monitoring, unfortunately sometimes you don't know everything you need to monitor until something bad happens because your monitoring was missing something critical that you didn't realize was critical. That's fine; seems like y'all are aware and are plugging those holes. I'm sure there will be more of those holes in the future, that's just life.
At any rate, keep doing what you're doing, and I know the next time you get hit with something bad, things will be a bit better.
Agreed. Even though this site is on YC’s domain, I think only a few of the folks in the comments are actually early-adopting startup types. Probably just due to power law statistics, I’d guess most commenters are big company worker bees who’ve never worked on/at a seed stage startup.
If everything at Kagi was FAANG-level bulletproof, with extensive processes around outages/redundancy, then the team absolutely would not be making the best use of their time/resources.
If you’re risk averse and aren’t comfortable encountering bugs/issues like this, don’t try any new software product of moderate complexity for about 7-10 years.
Mh, I work quite a bit on the ops side, and monitoring and observability have been part of my job for some time now.
I'll say: Effective observability, monitoring and alerting of complex systems is a really hard problem.
Like, you look at a graph of a metric, and there are spikes. But... are the spikes even abnormal? Are the spikes caused by the layer below, because our storage array is failing? Are the spikes caused by ... well also the storage layer.. because the application is slamming the database with bullshit queries? Or maybe your data is collected incorrectly. Or you select the wrong data, which is then summarized misleadingly.
Been in most of these situations. The monitoring means everything, and nothing, at the same time.
And in the application case, little common industry wisdom will help you. Yes, your in-house code is slamming the database with crap, and thus all the layers in between are saturating and people are angry. I guess you'd add monitoring and instrumentation... while production is down.
At that point, I think we're at a similar point of "Safety rules are written in blood" - "the most effective monitoring boards are found while prod is down".
And that's just the road to finding the function in the code that's the problem. Then product tells you how critical this is to a business-critical customer.
Running voodoo analysis on graph spikes is indeed a fool’s errand. What you really need is load testing on every component of your system, and alerts for when you approach known, tested limits. Of course this is easier said than done and things will still be missed, but I’ve done both approaches and only one of them had pagers needlessly waking me in the middle of the night enough to go on sleepless swearing rants to coworkers.
Yeh. Or, during a complex investigation, you need to set up a hypothesis explaining these spikes in order to eventually establish causality. And once you have causality, you can start fixing.
For example, I've had a disagreement with another engineer there during a larger outage. We eventually got to the idea: If we click that button in the application, the database dies a violent death. Their first reaction was: So we never click that button again. My reaction was: We put all attention on the DB and click that button a couple of times.
If we can reliably trigger misbehavior of the system, we're back in control. If we're scared, we're not in control.
I really appreciate you sharing these candid insights. Let me tell you (after over a decade of deploying cloud services), some rogue user will always figure out how to throw an unforeseen wrench into your system as the service gets more popular. Even worse than an outage is when someone figures out how to explode your cloud computing costs :)
Don’t host your status page (status.kagi.com) as a subdomain of your main site; DNS issues can take both your main site and your status site offline. Use something like kagistatus.com instead.
And host it with a webhost that doesn’t share any common infrastructure with you.
I work in a giant investment bank with hundreds of people who can answer across all time zones. We still f up, we still don't always know where problems lie and we still sometimes can spend hours on a simple DoS.
You'll only get better at guessing what the issue could be: an exploit by a user is something you'll remember forever and will overly protect against from now on, until you hit some other completely different problem which your metric will be unprepared for, and you'll fumble around and do another post mortem promising to look at that new class of issues etc.
You'll marvel at the diversity of potential issues, especially in human-facing services like yours. But you'll probably have another long loss of service again one day, and you're right to insist on the transparency / speed of signaling to your users: they can forgive everything as long as you give them an early signal, a discount and an apology, in my experience.
Kudos for being so open -- after seeing numerous "incidents" at AWS and GCE I can say that two rules always hold with respect to observability:
- You don't have enough.
- You have too much.
Usually either something will be missing or some red herring will cost you valuable time. You're already doing much better than most people by taking it seriously :)
I figured this was the case when you said “our devops engineer” singular and not “one of our devops engineers.”
I’m glad you’re willing at this stage to invest in SRE. It’s a decision a lot of companies only make when they absolutely have to or have their backs against a wall.
I use Kagi every single day, ever since the beta. I don't remember the last time I used that other search engine, the G one..can't remember the name. Anyway, absolutely love Kagi and the work you guys do. Thank you!
This is easy for me to say in hindsight, but a graph of queries per user, for the top 100 accounts would have lit up the issue like a Christmas tree. But it's the kind of chart you stuff on the full page of charts that are usually not interesting. So many problems have been avoided by "huh, that looks weird" on a dashboard.
In a more targeted search, asking Splunk to show out where the traffic is coming from on a map, and a pie chart of the top 10 accounts they're coming from would also be illuminating.
But again, this is easy for me to say in hindsight, after the issue has been helpfully explained to me in a blog post. I don't claim I could have done any better if I were there; my point is that you need to be able to see inside a system to operate it well.
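As a toy version of that "top accounts" chart, assuming you can extract one account id per search request from the access logs:

```python
# Rank accounts by request volume; one scraper dwarfs everyone instantly.
from collections import Counter

def top_accounts(events, n=10):
    """events: iterable of account ids, one per search request."""
    return Counter(events).most_common(n)

log = ["scraper"] * 60000 + ["alice"] * 40 + ["bob"] * 25
print(top_accounts(log, n=3))
# [('scraper', 60000), ('alice', 40), ('bob', 25)]
```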
If daily volume is ~400k, a sharp 60k spike should show even on a straight rps dashboard. I suspect that didn’t exist because most cloudy SaaS tools don’t seem to put any emphasis on traffic per unit time; in the Cloudflare site dashboard, one must hover over the graph and subtract timestamps to find out whether the number of requests is per hour, per minute, etc. Splunk is similarly bad: the longer the period a query covers, the less granularity you get, and the minimum seems to be per minute.
Drill down by user, locale, etc would just make it even easier to figure out what’s going on once you spot the spike and start to dig.
My unsolicited advice for Zac in case they’re reading is — start thinking about SLIs (I for indicators) sooner rather than later to help you think through cases like this.
One user running a scraper took the service down for seven hours? I know it's easy to sit on the outside and say they should have seen this coming, but how does nobody in testing go "what happens if a ton of searches happen?"
TL;DR - we are a tiny, young team at the center, and everyone has a closet full of hats they wear. No dedicated SRE team yet.
> "what happens if a ton of searches happen?"
In fairness, you can check out https://kagi.com/stats - "a lot of searches" is already happening, approaching 400k per day, and the systems still operate with plenty of capacity day-to-day, in addition to some auto-scaling measures.
The devil is in the details of some users exploiting a pathological case. Our lack of experience (now rightfully gained) was in knowing what organic or pathological traffic we could have predicted and simulated ahead of time.
Load-simulating 20,000 users searching concurrently sounds like it would have been a sound experiment early on, and we did do some things resembling this. But considering this incident, it still would not have caught this issue. We have also had maybe 10 people run security scanners on our production services at this point that generated more traffic than this incident.
It is extremely difficult to balance this kind of development when we also have features to build, and clearly we could do with more of it! As mentioned in my other post, we are looking to expand the team in the near term so that we are not spread so thin on these sorts of efforts.
There is a lot that could be said in hindsight, but I hope that is a bit more transparent WRT how we ended up here.
What does " being such to a degree that is extreme, excessive, or markedly abnormal (with a connotation of it happening on purpose)
" mean in this context?
Their scale is (at least compared to anyone operating "at scale") tiny. 400k searches daily, I don't think it's unreasonable for them to struggle with an unexpected extra 60k over a small number of hours. Especially when it's the first time someone's done that to them.
For comparison, the stuff I work on is definitely not FAANG-scale but it's decidedly larger (at least in request rate) than Kagi. I'm sure they'll learn quickly, but in the meantime I'm almost hoping that they do have more issues like this -- it's a sign that they're moving in the right direction.
I’m a paid user of Kagi, and experiencing downtime made me realize how much I took Google’s reliability for granted. Google has almost never gone down on me, maybe once in the last two decades. Losing access to your search engine is quite crippling. I LOVE Kagi, that’s why I pay for it, but experiencing downtime in my second month was quite off-putting. I love post-mortems, but I hope to never read them. :)
That said, I hope this experience makes Kagi even more resilient and reliable.
As another paying user of Kagi I wonder what prevented you from using another search engine for the six hours that Kagi was unavailable. Search engines are not like your email provider or ISP in that you're locked in.
The thing is, I never thought Kagi was down and assumed it must be a problem with my configuration or connection. That was how much I trusted Kagi. I didn’t spend the whole downtime online, though.
Reminds me of a time I was running a proof-of-concept for a new networking tool at a customer site, and about two minutes after we got it running their entire network went down. We were in a sandboxed area so there was no way our product could have caused a network-wide outage, but in my head I'm thinking: "there's no way, right. . . .RIGHT?!?!".
> We were later in contact with an account that we blocked who claimed they were using their account to perform automated scraping of our results, which is not something our terms allow for.
Set QPS limits for every possible incoming RPC / API / HTTP request, especially public ones!
We had a search function with typeahead abilities. I had intentionally removed the rate limit from that endpoint to support fast typers.
One day around 6AM, someone in Tennessee came into work and put their purse down on their keyboard. The purse depressed a single key and started hitting the API with each keystroke.
Of course after 15 minutes of this the db became very unhappy. Then a web server crashed because the db was lagging too much. Cascading failures until that whole prod cluster crashed.
Needless to say the rate limit was readded that day ;).
This is a reminder that "we want to support bursts" is much more common thing than "we want a higher ratelimit". Often multiple levels of bursts are reasonable (e.g. support 10 requests per minute, but only 100 requests per day; support 10 requests per user, but only 100 requests across all users).
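The layered-bucket idea above can be sketched with stacked token buckets, where a request passes only if every layer still has budget. This is a minimal in-memory sketch, not any particular library's API; the class names and the 10/minute + 100/day numbers are just the example limits from the comment.

```python
import time

class Bucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate` tokens/sec."""
    def __init__(self, capacity, rate):
        self.capacity, self.rate = capacity, rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def refill(self, now):
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

def allow(buckets):
    """A request passes only if every layer has budget. Tokens are consumed
    only when all layers agree, so a request denied by the daily cap doesn't
    eat into the short-term burst budget."""
    now = time.monotonic()
    for b in buckets:
        b.refill(now)
    if all(b.tokens >= 1 for b in buckets):
        for b in buckets:
            b.tokens -= 1
        return True
    return False

# Two layers: a burst budget of 10/minute and a daily cap of 100/day.
layers = [Bucket(10, 10 / 60), Bucket(100, 100 / 86400)]
```

Because each bucket starts full, the burst layer absorbs an initial spike of 10 requests, after which the per-minute refill rate takes over, while the second bucket quietly enforces the long-term ceiling.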
There are several ways to track history with just a couple variables (or, if you do have the history, but only accessing a couple of variables); the key observation is that you usually don't have to be exact, only put a bound on it.
For history approximations in general, one thing I'm generally fond of is using an exponential moving average (often with λ=1/8 so it can be done with shifts `ema -= ema>>3; ema += datum>>3` and it's obvious overflow can't happen). You do have to be careful that you aren't getting hyperbolic behavior though; I'm not sure I would use this for a rate limiter in particular.
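The shift-based EMA update above can be written out as a tiny function; this is just an illustration of the λ=1/8 trick from the comment, using Python integers in place of fixed-width registers.

```python
def ema_update(ema, datum):
    """Exponential moving average with lambda = 1/8 using only shifts:
    equivalent to ema = ema - ema/8 + datum/8 in integer math. The result
    stays bounded by the largest datum seen, so fixed-width registers
    can't overflow."""
    ema -= ema >> 3   # decay: subtract 1/8 of the current average
    ema += datum >> 3 # mix in 1/8 of the new sample
    return ema

# Feeding a constant stream converges toward that value (modulo the
# truncation error from the integer shifts).
ema = 0
for _ in range(100):
    ema = ema_update(ema, 800)
```

One gotcha worth noting: with truncating shifts, the steady-state value can sit slightly below the true average for inputs that aren't multiples of 8, which is part of why the parent hesitates to use this for a rate limiter as-is.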
Interesting. The classic problem. You offer to not meter something and then someone will use it to max capacity. Then you’re forced to place a limit so that one user won’t hose everyone else.
> you’re forced to place a limit so that one user won’t hose everyone else
Soft limits at the tail of usage comport with the term "unlimited" as it's commonly used. For Kagi, a rate limit derived from how quickly a human can type makes sense.
> We were later in contact with an account that we blocked who claimed they were using their account to perform automated scraping of our results, which is not something our terms allow for.
I mean beyond that it was a user that was violating the TOS. This isn’t really a bait and switch scenario (although it could be reasonably construed as such).
I've been using it for a few weeks now and when it didn't load right away last week I was at a loss what to do. I wondered "what is Kagi and why can't my browser search anymore?" It's really well built to get out of your way and I'd all but forgotten about it. Eventually I realized I could use another search engine. The bother!
Before this post-mortem dropped I'd also forgotten the incident. Props to the team that doesn't make me think when I search!
And my sympathy in this incident. It's rough when things coincide like that and cause you to look at the wrong metrics.
> This didn’t exactly come as a surprise to us, as for the entirety of Kagi’s life so far we have actually used the cheapest, single-core database available to us on GCP!
Wow, love that you guys are keeping it lean. Have you considered something like PolyScale to handle sudden spikes in read load and squeeze more performance out of it?
As much as people gush over Kagi on HN, I still have yet to actually try it because I cannot for the life of me get authentication to work. Even after immediately resetting my password, I get an "incorrect email or password" or "try again later" error on the login page. I've tried at least 3 times over the last few months with the same results each time.
If such a fundamental part of a web service company's website is broken, it makes me wary of their competence.
No and I think it's kind of absurd that I would have to reach out to support staff so that I can get something as basic as account log in working correctly.
I'm confident that account login is working for most people. Potentially something weird has happened with your account and it requires manual intervention. This is exactly what support is for: handling weird situations.
My point is that this is a service that I was curious about but have no real need for. I'm mostly satisfied by using Google or DDG. As a customer, it's pretty absurd that the onus is on me to spend a fairly significant amount of time contacting support in order to simply evaluate a niche product. Furthermore, the company is a tech company, so the fact that their authentication is bugged, seems like more than enough of a reason to not spend any more of my time evaluating their service. I literally can't think of any web based products I currently use which have, at any point, had bugged authentication.
It's understandable that such a basic seeming issue would negatively impact your opinion of the service, but it's also worth considering that such a basic issue must surely be some sort of unique edge case if the vast majority of other people are claiming to be happily using the service (which implies being able to log into their accounts).
Of course you don't owe Kagi anything so you don't have to reach out to support, but just something to consider before questioning someone's competence.
You would expect this to work, but you also shouldn't be surprised that a beta project isn't perfect.
If you need the ability to login reliably and a search engine that never goes down, stick to google.
If you want to help a new entrant with a product that reliably outperforms Google search get their product battle-ready, then give them a bit of a chance to make it right.
It wasn't that long ago that I had heard about Kagi the first time. Now I use it every day, and the fact that I can pin cppreference.com to the top is just such a boon.
"This didn’t exactly come as a surprise to us, as for the entirety of Kagi’s life so far we have actually used the cheapest, single-core database available to us on GCP!"
Outages suck, but I love the fact that they are building such a lean product. Been paying for Kagi as a part of de-Google-ifying my use of online services and the experience so far (I wasn't impacted by this outage) has been great.
A few years ago I built a global SaaS (first employee and SWE) in the weather space which was backed by a single DB, and while it had more than just 1 core (8 when I left from memory), I think a lot of developers reach for distributed DBs far too early. Modern hardware can do a lot, products like AWS Aurora are impressive, but they come with their own complexities (and MUCH higher costs).
Very cool! The deluge of cloud solutions are absolutely one of those things that can be a distraction from figuring out "What do I actually need the computer to do?".
Internally I try to promote solutions with the fewest moving cloud-parts, and channel the quiet wisdom of those running services way larger than ours with something like an sqlite3 file that they rsync... I know they're out there. Not to downplay the feats of engineering of huge distributed solutions, but sometimes things can be that simple!
Distracting is right! I watched your CrystalConf video and was happy to see the familiar Postgres + Redis combo :). I remember worrying about running out of legs with Redis (being single threaded), but with a combo of pipelining and changing the data structures used it ended up being the piece of infra that had the most headroom.
Monitoring was probably the biggest value for outsourcing to another SaaS. I used Runscope, AWS dashboards, and our own Elasticsearch, and it was pretty cost effective for an API that was doing ~2M calls a day.
The other risk of cloud solutions is the crazy cost spikes. I remember consulting for a much larger partner weather company on reducing the costs of a similar global weather product. They had chosen AWS DynamoDB as their main datastore, and their bill just for DynamoDB was twice our company's entire cloud bill. All because it was slightly "easier" to not think about compute requirements!
Anyway, thanks for the postmortem, hopefully your channeling of quiet wisdom continues to branch out to others! :)
> While we do offer unlimited searches to our users, such volume was a clear abuse of our platform, against our terms of use. As such, we removed searching capability from the offending accounts.
Don't advertise something as unlimited if it's not actually unlimited.
Just advertise it as 10,000 per day if that’s what it actually is. Any sane user will see a number like that and know they don’t have to worry about the limit.
What difference does it make? So would it be "legitimate use" if I sat at a computer and manually clicked 100 times to do something, while it is "abuse" for me to write a script that automates exactly the same thing? Is it only considered abuse if the action isn't tedious and tiring for me as a user?
> While we do offer unlimited searches to our users, such volume was a clear abuse of our platform, against our terms of use.
Disclose that it's fair use upfront. Leveraging unlimited searches is not the same as abuse. Never bait-and-switch from unlimited to fair use; nobody reads fine print. Customers who pay for unlimited expect the service to scale the infrastructure to handle whatever they throw at it. I think that's a reasonable expectation, since they paid for unlimited but got limited.
Though the lack of alerting and the lack of consideration for bad actors in general (just look for the Kagi feedback thread on suicide results; you can kagi it) seem pretty consistent.
This was fun to see playing out, it made me realize how I didn't even think about how impressive it was that you guys (being a small team) were pushing out updates with very few bugs or visible downtime.
I didn't mind the downtime, just gave me an excuse to take a break and focus on more casual coding.
This incident also shows that no security audit was ever performed on this platform, as a DoS vulnerability would've been one of the first findings there. Makes you wonder what other vulnerabilities are in their infrastructure...
Ouch, a seven hour downtime for a paid search engine service. Maybe “every startup goes through this”, as some comment here stated, but not all startups are created equal. I wonder how much this incident will cost them long term.
What I gather is this was a rate limiting issue. Rate limiting is a standard pattern for API platforms but I wonder how many consumer facing services implement it.
GCP prices are ludicrously high. You're far better off to just get a VPS and self-host Postgres and your webserver. Which is what we did at my last startup; only thing we used GCP for was build servers, and only because we got a ton of free credits. We could've gotten a paid intern for the money we saved.
If you're listening, Kagi, please add an à la carte plan for search. Maybe hide it behind the API options so as not to disrupt your normal plans. I love the search and I'm happy to pay, but I'm cost sensitive now and it's the only way that I'm going to feel comfortable using it long-term.
They used to have a $5/mo option with N searches, and then charge some cents per search after that. For a lot of people, the net amount would still be under $10/mo.
Wouldn't some level of per-account rate-limiting make sense? Say, 1000 searches per hour? It's commendable and impressive that Kagi has apparently been able to get this far and perform this consistently without any account-level automated rate-limiting but the only alternative is an inevitable cat-and-mouse and whack-a-mole of cutting off paying customers who knowingly or not violate the ToS. Returning 429 with actionable messages makes it clear to users what they should expect.
You obviously want the block-interval to be long enough to not cause too much additional churn on the database.
Applying restrictions on IP-level when you can avoid it is just a world of frustration for everyone involved.
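A per-account limit with an actionable 429 could look something like this fixed-window sketch. The 1000/hour figure comes from the comment above; the function and variable names are mine, and a real deployment would keep the counters in a shared store rather than process memory.

```python
import time

WINDOW = 3600   # seconds per window
LIMIT = 1000    # searches allowed per account per window

# account_id -> (window_start, count); illustrative in-memory store only.
windows = {}

def check(account_id, now=None):
    """Returns (allowed, retry_after_seconds). When denied, retry_after
    tells the client exactly when the window resets, suitable for a
    429 response's Retry-After header."""
    now = time.time() if now is None else now
    start, count = windows.get(account_id, (now, 0))
    if now - start >= WINDOW:      # window expired: start a fresh one
        start, count = now, 0
    if count >= LIMIT:
        return False, int(start + WINDOW - now)
    windows[account_id] = (start, count + 1)
    return True, 0
```

Keying on the account rather than the IP sidesteps the shared-NAT frustrations mentioned above, and the returned retry interval doubles as the "actionable message" so users aren't left guessing why requests fail.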
> we need to set some automated limits to help us enforce this. From analyzing our user’s usage, we have picked some limits that no good-faith user of Kagi should reasonably hit.
> These new limits should already be in place by the time of this post, and we will monitor their impact and continue to tune them as needed.