By what turned out to be a complete coincidence, the incident occurred at precisely the same time that we were performing an infrastructure upgrade, adding RAM to our VMs.
I can assure you that these "coincidences" happen all the time, and will cause you to question your very existence when you are troubleshooting them. And if you panic while questioning your very existence, you'll invariably push a hotfix that breaks something else and then you are in a world of hurt.
Murphy's law is a cruel thing to sysadmins and developers.
Completely agree. I’ve triaged many outages with varying degrees of severity in my career so far, and the worst ones were always caused by someone panic-jumping onto some red herring rather than coming up with a sensible explanation as to why that would be a fix other than “it happened at the same time.”
I have a saying I really like to throw around, which is “if you don’t know why/how you fixed it, you may not have”
> I have a saying I really like to throw around, which is “if you don’t know why/how you fixed it, you may not have”
There is coincidental function, and planned function.
If you just throw numbers at a configuration parameter until it works and then call it "fixed", it may be fixed, but that's just coincidental. If the system load changes, you can't really tell how to modify the value afterwards.
On the other hand, if you can explain why a certain setting will fix it, that's planned function, and if something changes, you most likely know how to proceed from there.
At my workplace we were made aware of some instability issues, and the first place to look was of course the logs. It had apparently been happening for a while, but the most important parts of the system were not noticeably affected. This was early morning and we didn't have much traffic.
We noticed a lot of network connection errors for SQS, so one of the team members started blaming Amazon.
My response was that if SQS had these instabilities, we would have noticed posts on HN or downtime on other sites. The developer ignored this and jumped into the code to add toggles and debug log messages.
I spent 2 more minutes and found other cases where we had connection issues to different services. I concluded that the reason why we saw so many SQS errors was because it is called much more frequently.
The other developer pushed changes to prod which caused even more problems, because now the service could not even connect to the database.
He wrongly assumed that the older version of the service, which was still running, was holding on to the connections, and killed them to make room for his new version. We went from a system with instabilities to a system that was down.
The reason was a network error that was fixed by someone in the ops team.
I hate it when button mashers are in charge of incident response.
Things like when traffic ramps up in the morning and the site falls over and someone in charge blames the deployment the night before and screams "rollback", and then it eventually turns out to have nothing at all to do with the deployment.
To be fair, if your rollbacks are cheap (1-3 button pushes), safe, and automated, it doesn't really hurt to try a rollback while you look for other causes.
Most of my production experience precedes CI/CD as a concept by a number of years. You still don't know though if the rollback fixed it or if just bouncing the software alone fixed some kind of brownout, and you may not be able to replicate what happened and get stuck unsure of what the bug was and be in a state where you don't know if you can roll forward again or not. Then a developer throws a bit of spackle on the code to address what the problem "might" be -- then when you roll forward people assume that fixed it, which it might not, which is how cargo cults get formed.
Oh man, last week we had a small outage. Database queries took much longer than normal. I just so happened to be doing ad-hoc querying on the same table at the same time.
“Luckily”, the problem was unrelated to my querying but two coincidences are proper scary.
Think about the optics of your funny joke. You’re saying you’ll advertise that the interns are replaceable fodder. Even if that’s true, it’s quite a dick move.
I’ve seen enough brainless execs and startup founders around here that you have to take stupid ideas like that seriously. I hope it was a joke, but it’s not very funny for an intern.
Our last batch named -themselves- "minions" and there's still little yellow dudes everywhere in the images they posted to internal comms platforms.
I agree that some care would need to be taken to ensure that the redshirts felt like active participants in the joke rather than the butt of it, but at least in a reasonably informal environment it could, in fact, be very funny -to- the interns themselves.
(though as tech lead, I always make sure that there are a decent number of jokes flying around directed squarely at me, which probably helps set a baseline of how seriously to take such things - and if in doubt about any particular joke, I just aim it at myself ;)
A little more credit to people for making a funny, and fewer assumptions that people are that boneheaded, would make the world a better place, to be sure. I, for one, rather enjoy being able to see the humor and have a laugh rather than getting all enraged and venting on the internet to show some sort of self-perceived superiority.
Nah the ones spewing self perceived superiority are the folks who would call interns red shirts or make other off color jokes that end up hurting others and don’t even know it.
Yeah, the "coincidence" leads you to jump to conclusions about your change being the cause.
This is very human, we do it all the time.
However having been through these enough times I have picked up a habit of questioning more assumptions and not flagging something as verified data before it is.
I'm still nowhere near perfect at removing these biases and early conclusions but it has helped.
Oh yes, the number of times I rolled back a change during an outage that had nothing to do with the outage...
It's a critical skill for engineers: being able to critically reason, debug and "test in isolation" any changes that address outages. Much harder than it seems and typically a "senior" skill to have.
That's essentially on-call 101: you try to mitigate as fast as possible and root-cause later, when you are not half asleep. Surprised to see some of the comments here.
I was one of the users that went and reported this issue on Discord. I love Kagi but I was a bit disappointed to see that their status page showed everything was up and running. I think that made me a bit uneasy and it shows their status pages are not given priority during incidents that are affecting real users. I hope in the future the status page is accurately updated.
In the past, services I heavily rely on (e.g. GitHub) have updated their status pages immediately, and this allows me to rest assured that people are aware of the issue and it's not a problem with my devices. When this happened with Kagi, I was looking up the nearest open grocery stores since we were getting snow later that day, so it was almost like I got let down b/c I had to go to Google for this.
I will continue using Kagi b/c 99.9% of the other time I've used it, it has been better than Google but I hope the authors of the post-mortem do mean it when they say they'll be moving their status page code to a different service/platform.
And thanks again Zac for being transparent and writing this up. This is part of good engineering!
As an engineer on call, I have been in this conversation so many times:
"Hey, should we go red?"
"I don't know, are we sure it's an outage, or just a metrics issue?"
"How many users are affected again?"
"I can check, but I'm trying to read stack traces right now."
"Look, can we just report the issue?"
"Not sure which services to list in the outage"
...and so on. Basically, putting anything up on the status page is a conversation, and the conversation consumes engineer time and attention, and that's more time before the incident is resolved. You have to balance communication and actually fixing the damn thing, and it's not always clear what the right balance is.
If you have enough people, you can have a Technical Incident Manager handle the comms and you can throw additional engineers at the communications side of it, but that's not always possible. (Some systems are niche, underdocumented, underinstrumented, etc.)
My personal preference? Throw up a big vague "we're investigating a possible problem" at the first sign of trouble, and then fill in details (or retract it) at leisure. But none of the companies I've worked at like that idea, so... [shrug]
This is exactly why those status pages are almost always a lie. Either they need to be fully automated without some middle manager hemming and hawing, or they shouldn’t be there at all. From a customer’s perspective, I’ve been burned so many times on those status pages that I ignore them completely. I just assume they’re a lie. So I’ll contact support straight away - the very thing these status pages were intended to mitigate.
The simple fix is to have a “last update: date time”
Or you can build a team to automate everything, force everyone and everything into a rigid update frequency, and watch that frequency become a metric that applies to everyone and the bane of the existence of your whole engineering organization.
I think your bit at the end is the most important.
ANY communication is better than no communication. "Everything is fine, it must be you" is the worst feeling in these cases, especially if your business is reliant on said service and you can't figure out why you are borked (e.g. the GitHub ones).
Your point highlights thinking about what's being designed.
"Everything is fine" is different from "nothing has been reported". A green light is misleading when the real state is unknown: there should be no green at all, just a note that nothing has been reported, and that's not the same as a green light.
Once an ISP support person insisted that I drive down to the shop and buy a phone handset so I could confirm presence of a dial tone on a line that my vdsl modem had line sync on before they’d tell me their upstream provider had an outage. I was… unimpressed.
IMHO, any significant growth in 500s (that's what I was getting during the outage) warrants mention on status page. I've seen a lot of stuff, so if I see an acknowledged outage, I'll just wait for people to do their jobs. Stuff happens. If I see unacknowledged one, I get worried that people who need to know don't and that undermines my confidence in the whole setup. I'd never complain if status page says maybe there's a problem but I don't see one. I will complain in the opposite case.
Not necessarily. The situation can be genuinely unclear to the point where it is a judgement call, and then it becomes a matter of how to weigh the consequences.
Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.
Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.
Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.
Problems: Any issue with the system that performs the "up or not?" source of truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages and latency still sucks.
Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.
Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.
Eventually, the monitoring service gets axed because we can just manually update the status page after all.
Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.
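For what it's worth, the consensus logic at the heart of Stage 3 is only a few lines; the expensive part is building and running the probe fleet around it. A minimal sketch in Python (all names illustrative, not any monitoring vendor's API):

```python
# Sketch of Stage 3: flip the public status only when a majority of
# independent probe locations agree. Probe results are assumed to arrive
# as simple "up"/"down" strings, one per vantage point.
from collections import Counter

def consensus_status(probe_results, quorum=0.6):
    """Return 'down' only if at least `quorum` of probes agree."""
    if not probe_results:
        return "unknown"  # no data is not the same as green
    counts = Counter(probe_results)
    if counts["down"] / len(probe_results) >= quorum:
        return "down"
    return "up"

print(consensus_status(["up", "down", "up", "up", "up"]))  # up (one flaky probe)
print(consensus_status(["down", "down", "down", "up"]))    # down (real outage)
```

A quorum above 50% means a single flaky vantage point can no longer flip the status page on its own, which is exactly the Stage 2 failure mode.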
Well, in the real world it might. It should trigger a bug creation and a fix to the code, but not an incident.
Now all of a sudden to decide this you need more complex and/or specific queries in your monitoring system (or a good ML-based alert system), so complexity is already going up.
If your service is returning 5xx, that is the definition of a server error; of course that is degraded service. Instead we have pointless dashboards that are still green an hour after everything is broken.
Returning 4xx on a client error isn't hard and is usually handled largely by your framework of choice.
> Returning 4xx on a client error isn't hard and is usually handled largely by your framework of choice.
> Your argument is a strawman
That's... super not true. Malformed requests with gibberish (or, more likely, hacker/pentest-generated) headers will cause e.g. Django to return 5xx easily.
That's just the example I'm familiar with, but cursory searching indicates reports of similar failures emitted by core framework or standard middleware code for Rails, Next.js, and Spring.
If input validation is not present in your framework of choice then the framework clearly has problems.
If you do not validate your inputs properly I am not sure what you are doing when you have a user facing applications of this size. Validating inputs is the lowest hanging fruit for preventing hacking threats.
It's usually handled by the framework, but you may have to write some code, and I'd expect my SaaS provider to write that code so that I know whether their service is available or not.
I'm only replying to the praise here - I too, although I haven't fully switched, had a very enticing moment with Kagi when it returned a result that couldn't even be found on Google at any page in the results. This really sold me on Kagi and I've been going back and forth with some queries, but I have to say that between LLMs, Perplexity, and Google often answering my queries right on the search page, I just don't have that many queries left for Kagi.
If Kagi would somehow merge with Perplexity, now that would be something.
Kagi does offer AI features in their higher subscription tier, including summary, research assistance, and a couple others. Plus I think they have basically a frontend for GPT-4 that uses their search engine for browsing, and they just added vision support to it today.
I don't subscribe to those features or any AI tool yet; just pointing out there could be a version of Kagi that is able to replace your ChatGPT sub and save you money.
Is it as good as Perplexity though? I use ChatGPT for different purposes, I just thought that if Kagi would ally with Perplexity and benefit from its index (I'm not sure what Perplexity uses), it could get really good. I've only recently tried using Perplexity and I get more use out of it than I would with Kagi, it doesn't just do summarization, but I haven't seen what Kagi does with research assistance.
It's been a while since I've used Perplexity, but I've been finding the Kagi Assistant super useful. I'm on the ultimate plan, so I get access to the `Expert` assistant. It's been pretty great.
I envy your experiences with other services. I've never seen any service's status page show downtime when or even soon after I start experiencing it. Often they simply never show it at all.
Aaaahhh, it's crazy how much this incident resonates with me!
I've personally handled this exact same kind of outage more times than I'd care to admit. And just like the fine folks at Kagi, I've fallen into the same rabbit hole (database connection pool health) and tried all the same mitigations - futilely throwing new instances at the problem, the belief that if I could just "reset" traffic it'd all be fixed, etc...
It doesn't help that the usual saturation metrics (CPU%, IOPS, ...) for databases typically don't move very much during outages like these. You see high query latency, sure, but you go looking and think: "well, it still has CPU and IOPS headroom..." without realizing, as always, lock contention lurks.
In my experience, 98% of the time, any weirdness with DB connection pools is a result of weirdness in the DB itself. Not sure what RDBMS Kagi's running, but I'd highly recommend graphing global I/O wait time (seconds per second) and global lock acquisition time (seconds per second) for the DB. And also query execution time (seconds per second) per (normalized) query. Add a CPU utilization chart and you've got a dashboard that will let you quickly identify most at-scale perf issues.
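To make the "seconds per second" framing concrete: most databases expose wait times as cumulative counters, so each dashboard point is just the delta between two samples divided by the scrape interval. A hedged sketch (the counter semantics are generic; map them to whatever your RDBMS actually exposes):

```python
# Turn a cumulative wait-time counter (e.g. total lock-wait seconds since
# startup) into the "seconds waited per wall-clock second" a dashboard wants.
def wait_rate(prev_total_s, curr_total_s, interval_s):
    """Rate over one scrape interval. A value approaching the number of
    active connections means the database is close to fully stalled."""
    return (curr_total_s - prev_total_s) / interval_s

# e.g. a lock-wait counter moved from 120.0s to 178.0s across a 10s scrape:
print(wait_rate(120.0, 178.0, 10.0))  # 5.8 seconds of waiting per second
```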
Separately: I'm a bit surprised that search queries trigger RDBMS writes. I would've figured the RDBMS would only be used for things like user settings, login management, etc. I wonder if Kagi's doing usage accounting (e.g. incrementing a counter) in the RDBMS. That'd be an absolute classic failure mode at scale.
They would have some writes indirectly due to searches, say if someone chooses to block a search result. They’re also going to have some history and analytics surely.
But yeah, it’s not obvious what should cause per-search write lock contention…
You know, in retrospect, I think Kagi expects O(thousands) searches per month per user, so doing per-user usage accounting in the DB is fine -- thanks to row-level locking.
Well, at least until you get a user who does 60k "in a short time period"... :-)
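If per-user accounting in the DB ever does become the bottleneck, the classic mitigation is to batch increments in the application and flush an aggregate every few seconds, so 60k searches from one scraper become a handful of row updates instead of 60k row-lock acquisitions. A rough sketch (the flush target is deliberately abstract; this is not a claim about what Kagi actually does):

```python
# In-memory usage batching: cheap increments on the hot path, one
# aggregated write per user per flush interval.
from collections import Counter
import threading

class UsageBatcher:
    def __init__(self, flush_fn):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._flush_fn = flush_fn  # e.g. one UPDATE ... searches = searches + n

    def record(self, user_id, n=1):
        with self._lock:           # in-process lock, not a DB row lock
            self._counts[user_id] += n

    def flush(self):               # call from a timer, e.g. every few seconds
        with self._lock:
            pending, self._counts = self._counts, Counter()
        for user_id, n in pending.items():
            self._flush_fn(user_id, n)

writes = []
b = UsageBatcher(lambda uid, n: writes.append((uid, n)))
for _ in range(60000):
    b.record("scraper")        # the abusive account's worth of traffic
b.record("normal_user")
b.flush()
print(writes)  # [('scraper', 60000), ('normal_user', 1)] -- 2 writes, not 60001
```

The trade-off is losing up to one flush interval of counts on a crash, which is usually fine for usage stats but not for billing-grade metering.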
I once had a stock alert product running on a backend I wrote. One person signed up for alerts for every single Nasdaq ticker there was. We didn’t expect that.
This is something that every start-up company ends up going through at some point. I’ve been there and it’s painful!
Sometimes you just don’t have enough time or resource to build the capabilities that would stop an issue like this. Sometimes you didn’t even think a particular issue could even happen and it comes and bites you.
The transparency is important, the learning is important, but also (sometimes) the compensation is important. Kagi should consider giving some search credits for the time that we were unable to use the service. Especially as the real-time response was inadequate (as they freely admit).
An outage for a paid-for service is not the same as an outage of a you’re-the-product service
That speaks volumes about the observability they have of their internal systems. It's easy for me to say they should have seen it sooner, but the right datadog dashboards and splunk queries should have made that clear as day much faster. Hopefully they take it as a learning experience and invest in better monitoring.
Hi there, I'm Zac, Kagi's tech lead / author of the post-mortem etc.
This has 100% been a learning experience for us, but I can provide some finer context re: observability.
Kagi is a small team. The number of staff we have capable of responding to an event like this is essentially 3 people, spread across 3 timezones. For myself and my right-hand dev, this is actually the very first step in our web careers - which is to say that we are not some SV vets who have seen it all already. That we have a lot to learn is a given, but from building Kagi up from nothing, I am proud of how far we've come & where we're going.
Observability is something we started taking more seriously in the past 6 months or so. We have tons of dashboards now, and alerts that go right to our company chat channels and ping relevant people. And as the primary owner of our DB, GCP's query insights are a godsend. During the incident our monitoring went off, and query insights pointed at the "culprit" query - but we could have all the monitoring in the world and still lack the experience to interpret it and understand what the root cause is, or what the most efficient mitigation is.
In other words, we don't have the wisdom yet to not be "gaslit" by our own systems if we're not careful. Only in hindsight can I say that GCP's query insights was 100% on the money, and not some bug in application space.
All said, our growth has enabled us to expand our team quite a bit now. We have had SRE consultations before, and intend to bring on more full or part-time support to help keep things moving forward.
Hi Zac, thank you for chiming in here. Been using Kagi since the private beta, and have been overwhelmingly impressed by the service since I first used it.
Don't worry too much about all the people being harsh in the comments here. There's always a tendency for HN users to pile on with criticism whenever anyone has an outage.
I've always found this bizarre, because I've worked at places with worse issues, and more holes in monitoring or whatever, than a lot of the companies that get skewered here. Perhaps many of us are just insecure about our own infra and project our feelings onto other companies when they have outages.
Y'all are doing fine, and I think it's to your credit that you're able to run Kagi's users table off a single, fairly cheap primary database instance. I've worked at places that haven't much thought to optimization, and "solve" scaling problems by throwing more and bigger hardware at it, and then wonder later on why they're bleeding cash on infrastructure. Of course, by that point, those inefficiencies are much more difficult to fix.
As for monitoring, unfortunately sometimes you don't know everything you need to monitor until something bad happens because your monitoring was missing something critical that you didn't realize was critical. That's fine; seems like y'all are aware and are plugging those holes. I'm sure there will be more of those holes in the future, that's just life.
At any rate, keep doing what you're doing, and I know the next time you get hit with something bad, things will be a bit better.
Agreed. Even though this site is on YC’s domain, I think only a few of the folks in the comments are actually early-adopting startup types. Probably just due to power law statistics, I’d guess most commenters are big company worker bees who’ve never worked on/at a seed stage startup.
If everything at Kagi was FAANG-level bulletproof, with extensive processes around outages/redundancy, then the team absolutely would not be making the best use of their time/resources.
If you’re risk averse and aren’t comfortable encountering bugs/issues like this, don’t try any new software product of moderate complexity for about 7-10 years.
Mh, I work quite a bit on the ops side, and monitoring and observability have been part of my job for some time now.
I'll say: Effective observability, monitoring and alerting of complex systems is a really hard problem.
Like, you look at a graph of a metric, and there are spikes. But... are the spikes even abnormal? Are the spikes caused by the layer below, because our storage array is failing? Are the spikes caused by ... well also the storage layer.. because the application is slamming the database with bullshit queries? Or maybe your data is collected incorrectly. Or you select the wrong data, which is then summarized misleadingly.
Been in most of these situations. The monitoring means everything, and nothing, at the same time.
And in the application case, little common industry wisdom will help you. Yes, your in-house code is slamming the database with crap, and thus all the layers in between are saturating and people are angry. I guess you'd add monitoring and instrumentation... while production is down.
At that point, I think we're at a similar point of "Safety rules are written in blood" - "the most effective monitoring boards are found while prod is down".
And that's just the road to finding the function in the code that's the problem. Then product tells you how critical this is to a business-critical customer.
Running voodoo analysis on graph spikes is indeed a fool’s errand. What you really need is load testing on every component of your system, and alerts for when you approach known, tested limits. Of course this is easier said than done and things will still be missed, but I’ve done both approaches and only one of them had pagers needlessly waking me in the middle of the night enough to go on sleepless swearing rants to coworkers.
Yeh. Or, during a complex investigation, you need to set up a hypothesis explaining these spikes in order to eventually establish causality. And once you have causality, you can start fixing.
For example, I've had a disagreement with another engineer there during a larger outage. We eventually got to the idea: If we click that button in the application, the database dies a violent death. Their first reaction was: So we never click that button again. My reaction was: We put all attention on the DB and click that button a couple of times.
If we can reliably trigger misbehavior of the system, we're back in control. If we're scared, we're not in control.
I really appreciate you sharing these candid insights. Let me tell you (after over a decade of deploying cloud services), some rogue user will always figure out how to throw an unforeseen wrench into your system as the service gets more popular. Even worse than an outage is when someone figures out how to explode your cloud computing costs :)
Don’t host your status page (status.kagi.com) as a subdomain of your main site; DNS issues can take both your main site and your status site offline. Use something like kagistatus.com instead.
And host it with a webhost that doesn’t share any common infrastructure with you.
I work in a giant investment bank with hundreds of people who can answer across all time zones. We still f up, we still don't always know where problems lie and we still sometimes can spend hours on a simple DoS.
You'll only get better at guessing what the issue could be: an exploit by a user is something you'll remember forever and will overly protect against from now on, until you hit some other completely different problem which your metric will be unprepared for, and you'll fumble around and do another post mortem promising to look at that new class of issues etc.
You'll marvel at the diversity of potential issues, especially in human-facing services like yours. But you'll probably have another long loss of service again one day, and you're right to insist on the transparency / speed of signaling to your users: they can forgive everything as long as you give them an early signal, a discount and an apology, in my experience.
Kudos for being so open -- after seeing numerous "incidents" at AWS and GCE I can say that two rules always hold with respect to observability:
- You don't have enough.
- You have too much.
Usually either something will be missing or some red herring will cost you valuable time. You're already doing much better than most people by taking it seriously :)
I figured this was the case when you said “our devops engineer” singular and not “one of our devops engineers.”
I’m glad you’re willing at this stage to invest in SRE. It’s a decision a lot of companies only make when they absolutely have to or have their backs against a wall.
I use Kagi every single day, ever since the beta. I don't remember the last time I used that other search engine, the G one..can't remember the name. Anyway, absolutely love Kagi and the work you guys do. Thank you!
This is easy for me to say in hindsight, but a graph of queries per user, for the top 100 accounts would have lit up the issue like a Christmas tree. But it's the kind of chart you stuff on the full page of charts that are usually not interesting. So many problems have been avoided by "huh, that looks weird" on a dashboard.
In a more targeted search, asking Splunk to show out where the traffic is coming from on a map, and a pie chart of the top 10 accounts they're coming from would also be illuminating.
But again, this is easy for me to say in hindsight, after the issue has been helpfully explained to me in a blog post. I don't claim I could have done any better if I were there; my point is that you need to be able to see inside a system to operate it well.
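As a toy version of that "top accounts" chart, assuming you can extract one account id per search request from the access logs:

```python
# Rank accounts by request volume; one scraper dwarfs everyone instantly.
from collections import Counter

def top_accounts(events, n=10):
    """events: iterable of account ids, one per search request."""
    return Counter(events).most_common(n)

log = ["scraper"] * 60000 + ["alice"] * 40 + ["bob"] * 25
print(top_accounts(log, n=3))
# [('scraper', 60000), ('alice', 40), ('bob', 25)]
```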
If daily volume is ~400k, a sharp 60k spike should show even on a straight rps dashboard. I suspect that didn’t exist because most cloudy SaaS tools don’t seem to put any emphasis on traffic per unit time; in the Cloudflare site dashboard, one must hover over the graph and subtract timestamps to find out whether the number of requests is per hour, per minute, etc. Splunk is similarly bad: the longer the period a query covers, the less granularity you get, and the minimum seems to be per minute.
Drill down by user, locale, etc would just make it even easier to figure out what’s going on once you spot the spike and start to dig.
My unsolicited advice for Zac in case they’re reading is — start thinking about SLIs (I for indicators) sooner rather than later to help you think through cases like this.
One user running a scraper took the service down for seven hours? I know it's easy to sit on the outside and say they should have seen this coming, but how does nobody in testing go "what happens if a ton of searches happen?"
TL;DR - we are a tiny, young team at the center, and everyone has a closet full of hats they wear. No dedicated SRE team yet.
> "what happens if a ton of searches happen?"
In fairness, you can check out https://kagi.com/stats - "a lot of searches" is already happening, approaching 400k per day, and the systems still operate with plenty of capacity day-to-day, in addition to some auto-scaling measures.
The devil is in the details of some users exploiting a pathological case. Our lack of experience (now rightfully gained) was in knowing what organic or pathological traffic we could have predicted and simulated ahead of time.
Load-simulating 20,000 users searching concurrently sounds like it would have been a sound experiment early on, and we did do some things resembling this. But considering this incident, it still would not have caught this issue. We have also had maybe 10 people run security scanners on our production services at this point that generated more traffic than this incident.
It is extremely difficult to balance this kind of development when we also have features to build, and clearly we could do with more of it! As mentioned in my other post, we are looking to expand the team in the near term so that we are not spread so thin on these sorts of efforts.
There is a lot that could be said in hindsight, but I hope that is a bit more transparent WRT how we ended up here.
What does " being such to a degree that is extreme, excessive, or markedly abnormal (with a connotation of it happening on purpose)
" mean in this context?
Their scale is (at least compared to anyone operating "at scale") tiny. 400k searches daily, I don't think it's unreasonable for them to struggle with an unexpected extra 60k over a small number of hours. Especially when it's the first time someone's done that to them.
For comparison, the stuff I work on is definitely not FAANG-scale but it's decidedly larger (at least in request rate) than Kagi. I'm sure they'll learn quickly, but in the meantime I'm almost hoping that they do have more issues like this -- it's a sign that they're moving in the right direction.
I’m a paid user of Kagi, and experiencing downtime made me realize how much I took Google’s reliability for granted. Google has almost never gone down on me, maybe once in the last two decades. Losing access to your search engine is quite crippling. I LOVE Kagi, that’s why I pay for it, but experiencing downtime in my second month was quite off-putting. I love post-mortems, but I hope to never read them. :)
That said, I hope this experience makes Kagi even more resilient and reliable.
As another paying user of Kagi I wonder what prevented you from using another search engine for the six hours that Kagi was unavailable. Search engines are not like your email provider or ISP in that you're locked in.
The thing is, I never thought Kagi was down and assumed it must be a problem with my configuration or connection. That was how much I trusted Kagi. I didn’t spend the whole downtime online, though.
Reminds me of a time I was running a proof-of-concept for a new networking tool at a customer site, and about two minutes after we got it running their entire network went down. We were in a sandboxed area so there was no way our product could have caused a network-wide outage, but in my head I'm thinking: "there's no way, right. . . .RIGHT?!?!".
> We were later in contact with an account that we blocked who claimed they were using their account to perform automated scraping of our results, which is not something our terms allow for.
Set QPS limits for every possible incoming RPC / API / HTTP request, especially public ones!
We had a search function with typeahead abilities. I had intentionally removed the rate limit from that endpoint to support fast typers.
One day around 6AM, someone in Tennessee came into work and put their purse down on their keyboard. The purse depressed a single key and started hitting the API with each keystroke.
Of course after 15 minutes of this the db became very unhappy. Then a web server crashed because the db was lagging too much. Cascading failures until that whole prod cluster crashed.
Needless to say the rate limit was readded that day ;).
This is a reminder that "we want to support bursts" is much more common thing than "we want a higher ratelimit". Often multiple levels of bursts are reasonable (e.g. support 10 requests per minute, but only 100 requests per day; support 10 requests per user, but only 100 requests across all users).
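The layered-bucket idea above can be sketched with stacked token buckets, where a request passes only if every layer still has budget. This is a minimal in-memory sketch, not any particular library's API; the class names and the 10/minute + 100/day numbers are just the example limits from the comment.

```python
import time

class Bucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate` tokens/sec."""
    def __init__(self, capacity, rate):
        self.capacity, self.rate = capacity, rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def refill(self, now):
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

def allow(buckets):
    """A request passes only if every layer has budget. Tokens are consumed
    only when all layers agree, so a request denied by the daily cap doesn't
    eat into the short-term burst budget."""
    now = time.monotonic()
    for b in buckets:
        b.refill(now)
    if all(b.tokens >= 1 for b in buckets):
        for b in buckets:
            b.tokens -= 1
        return True
    return False

# Two layers: a burst budget of 10/minute and a daily cap of 100/day.
layers = [Bucket(10, 10 / 60), Bucket(100, 100 / 86400)]
```

Because each bucket starts full, the burst layer absorbs an initial spike of 10 requests, after which the per-minute refill rate takes over, while the second bucket quietly enforces the long-term ceiling.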
There are several ways to track history with just a couple variables (or, if you do have the history, but only accessing a couple of variables); the key observation is that you usually don't have to be exact, only put a bound on it.
For history approximations in general, one thing I'm generally fond of is using an exponential moving average (often with λ=1/8 so it can be done with shifts `ema -= ema>>3; ema += datum>>3` and it's obvious overflow can't happen). You do have to be careful that you aren't getting hyperbolic behavior though; I'm not sure I would use this for a rate limiter in particular.
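The shift-based EMA update above can be written out as a tiny function; this is just an illustration of the λ=1/8 trick from the comment, using Python integers in place of fixed-width registers.

```python
def ema_update(ema, datum):
    """Exponential moving average with lambda = 1/8 using only shifts:
    equivalent to ema = ema - ema/8 + datum/8 in integer math. The result
    stays bounded by the largest datum seen, so fixed-width registers
    can't overflow."""
    ema -= ema >> 3   # decay: subtract 1/8 of the current average
    ema += datum >> 3 # mix in 1/8 of the new sample
    return ema

# Feeding a constant stream converges toward that value (modulo the
# truncation error from the integer shifts).
ema = 0
for _ in range(100):
    ema = ema_update(ema, 800)
```

One gotcha worth noting: with truncating shifts, the steady-state value can sit slightly below the true average for inputs that aren't multiples of 8, which is part of why the parent hesitates to use this for a rate limiter as-is.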
Interesting. The classic problem. You offer to not meter something and then someone will use it to max capacity. Then you’re forced to place a limit so that one user won’t hose everyone else.
> you’re forced to place a limit so that one user won’t hose everyone else
Soft limits at the tail of usage comport with the term "unlimited" as it's commonly used. For Kagi, a rate limit derived from how quickly a human can type makes sense.
> We were later in contact with an account that we blocked who claimed they were using their account to perform automated scraping of our results, which is not something our terms allow for.
I mean beyond that it was a user that was violating the TOS. This isn’t really a bait and switch scenario (although it could be reasonably construed as such).
I've been using it for a few weeks now and when it didn't load right away last week I was at a loss what to do. I wondered "what is Kagi and why can't my browser search anymore?" It's really well built to get out of your way and I'd all but forgotten about it. Eventually I realized I could use another search engine. The bother!
Before this post-mortem dropped I'd also forgotten the incident. Props to the team that doesn't make me think when I search!
And my sympathy in this incident. It's rough when things coincide like that and cause you to look at the wrong metrics.
> This didn’t exactly come as a surprise to us, as for the entirety of Kagi’s life so far we have actually used the cheapest, single-core database available to us on GCP!
Wow, love that you guys are keeping it lean. Have you considered something like PolyScale to handle sudden spikes in read load and squeeze more performance out of it?
As much as people gush over Kagi on HN, I still have yet to actually try it because I cannot for the life of me get authentication to work. Even after immediately resetting my password, I get an "incorrect email or password" or "try again later" error on the login page. I've tried at least 3 times over the last few months with the same results each time.
If such a fundamental part of a web service company's website is broken, it makes me wary of their competence.
No and I think it's kind of absurd that I would have to reach out to support staff so that I can get something as basic as account log in working correctly.
I'm confident that account login is working for most people. Potentially something weird has happened with your account and it requires manual intervention. This is exactly what support is for: handling weird situations.
My point is that this is a service that I was curious about but have no real need for. I'm mostly satisfied by using Google or DDG. As a customer, it's pretty absurd that the onus is on me to spend a fairly significant amount of time contacting support in order to simply evaluate a niche product. Furthermore, the company is a tech company, so the fact that their authentication is bugged, seems like more than enough of a reason to not spend any more of my time evaluating their service. I literally can't think of any web based products I currently use which have, at any point, had bugged authentication.
It's understandable that such a basic seeming issue would negatively impact your opinion of the service, but it's also worth considering that such a basic issue must surely be some sort of unique edge case if the vast majority of other people are claiming to be happily using the service (which implies being able to log into their accounts).
Of course you don't owe Kagi anything so you don't have to reach out to support, but just something to consider before questioning someone's competence.
You would expect this to work, but you also shouldn't be surprised that a beta project isn't perfect.
If you need the ability to login reliably and a search engine that never goes down, stick to google.
If you want to help a new entrant with a product that reliably outperforms Google search get their product battle-ready, then give them a bit of a chance to make it right.
It wasn't that long ago that I had heard about Kagi the first time. Now I use it every day, and the fact that I can pin cppreference.com to the top is just such a boon.
"This didn’t exactly come as a surprise to us, as for the entirety of Kagi’s life so far we have actually used the cheapest, single-core database available to us on GCP!"
Outages suck, but I love the fact that they are building such a lean product. Been paying for Kagi as a part of de-Google-ifying my use of online services and the experience so far (I wasn't impacted by this outage) has been great.
A few years ago I built a global SaaS (first employee and SWE) in the weather space which was backed by a single DB, and while it had more than just 1 core (8 when I left from memory), I think a lot of developers reach for distributed DBs far too early. Modern hardware can do a lot, products like AWS Aurora are impressive, but they come with their own complexities (and MUCH higher costs).
Very cool! The deluge of cloud solutions are absolutely one of those things that can be a distraction from figuring out "What do I actually need the computer to do?".
Internally I try to promote solutions with the fewest moving cloud-parts, and channel the quiet wisdom of those running services way larger than ours with something like an sqlite3 file that they rsync... I know they're out there. Not to downplay the feats of engineering of huge distributed solutions, but sometimes things can be that simple!
Distracting is right! I watched your CrystalConf video and was happy to see the familiar Postgres + Redis combo :). I remember worrying about running out of legs with Redis (being single threaded), but with a combo of pipelining and changing the data structures used it ended up being the piece of infra that had the most headroom.
Monitoring was probably the biggest value for outsourcing to another SaaS. I used Runscope, AWS dashboards, and our own Elasticsearch, and it was pretty cost effective for an API that was doing ~2M calls a day.
The other risk of cloud solutions is the crazy cost spikes. I remember consulting for a much larger partner weather company on reducing the costs of a similar global weather product. They had chosen AWS DynamoDB as their main datastore, and their bill just for DynamoDB was twice our company's entire cloud bill. All because it was slightly "easier" to not think about compute requirements!
Anyway, thanks for the postmortem, hopefully your channeling of quiet wisdom continues to branch out to others! :)
> While we do offer unlimited searches to our users, such volume was a clear abuse of our platform, against our terms of use. As such, we removed searching capability from the offending accounts.
Don't advertise something as unlimited if it's not actually unlimited.
Just advertise it as 10,000 per day if that’s what it actually is. Any sane user will see a number like that and know they don’t have to worry about the limit.
What difference does it make? So would it be "legitimate use" if I sat at a computer and manually clicked 100 times to do something, while it is "abuse" for me to write a script that automates exactly the same thing? Is it only considered abuse if the action isn't tedious and tiring for me as a user?
> While we do offer unlimited searches to our users, such volume was a clear abuse of our platform, against our terms of use.
Disclose that it's fair use upfront. Leveraging unlimited searches is not the same as abuse. Never bait-and-switch from unlimited to fair use; nobody reads fine print. Customers who pay for unlimited expect the service to scale the infrastructure to handle whatever they throw at it. I think that's a reasonable expectation, since they paid for unlimited but got limited.
Though the lack of alerting and the lack of consideration for bad actors in general (just look for the Kagi feedback thread on suicide results; you can kagi it) seem pretty consistent.
This was fun to see playing out, it made me realize how I didn't even think about how impressive it was that you guys (being a small team) were pushing out updates with very few bugs or visible downtime.
I didn't mind the downtime, just gave me an excuse to take a break and focus on more casual coding.
This incident also shows that no security audit was ever performed on this platform, as a DoS vulnerability would've been one of the first findings there. Makes you wonder what other vulnerabilities are in their infrastructure...
Ouch, a seven hour downtime for a paid search engine service. Maybe “every startup goes through this”, as some comment here stated, but not all startups are created equal. I wonder how much this incident will cost them long term.
What I gather is this was a rate limiting issue. Rate limiting is a standard pattern for API platforms but I wonder how many consumer facing services implement it.
GCP prices are ludicrously high. You're far better off to just get a VPS and self-host Postgres and your webserver. Which is what we did at my last startup; only thing we used GCP for was build servers, and only because we got a ton of free credits. We could've gotten a paid intern for the money we saved.
If you're listening, Kagi, please add an à la carte plan for search. Maybe hide it behind the API options so as not to disrupt your normal plans. I love the search and I'm happy to pay, but I'm cost sensitive now and it's the only way that I'm going to feel comfortable using it long-term.
They used to have a $5/mo option with N searches, and then charge some cents per search after that. For a lot of people, the net amount would still be under $10/mo.
Wouldn't some level of per-account rate-limiting make sense? Say, 1000 searches per hour? It's commendable and impressive that Kagi has apparently been able to get this far and perform this consistently without any account-level automated rate-limiting but the only alternative is an inevitable cat-and-mouse and whack-a-mole of cutting off paying customers who knowingly or not violate the ToS. Returning 429 with actionable messages makes it clear to users what they should expect.
You obviously want the block-interval to be long enough to not cause too much additional churn on the database.
Applying restrictions on IP-level when you can avoid it is just a world of frustration for everyone involved.
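A per-account limit with an actionable 429 could look something like this fixed-window sketch. The 1000/hour figure comes from the comment above; the function and variable names are mine, and a real deployment would keep the counters in a shared store rather than process memory.

```python
import time

WINDOW = 3600   # seconds per window
LIMIT = 1000    # searches allowed per account per window

# account_id -> (window_start, count); illustrative in-memory store only.
windows = {}

def check(account_id, now=None):
    """Returns (allowed, retry_after_seconds). When denied, retry_after
    tells the client exactly when the window resets, suitable for a
    429 response's Retry-After header."""
    now = time.time() if now is None else now
    start, count = windows.get(account_id, (now, 0))
    if now - start >= WINDOW:      # window expired: start a fresh one
        start, count = now, 0
    if count >= LIMIT:
        return False, int(start + WINDOW - now)
    windows[account_id] = (start, count + 1)
    return True, 0
```

Keying on the account rather than the IP sidesteps the shared-NAT frustrations mentioned above, and the returned retry interval doubles as the "actionable message" so users aren't left guessing why requests fail.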
> we need to set some automated limits to help us enforce this. From analyzing our user’s usage, we have picked some limits that no good-faith user of Kagi should reasonably hit.
> These new limits should already be in place by the time of this post, and we will monitor their impact and continue to tune them as needed.