How we trace a KV database with less than 5% performance impact

petethepig · on July 6, 2021

I might be a bit biased [0], but I feel like this is where profiling tools, and particularly continuous profiling tools shine. Sampling profilers look at stack traces X number of times per second resulting in a very low CPU overhead. And so essentially you're getting a very high precision trace. It's not exactly the same (e.g the order in which functions were called is not preserved), but can still be used for the same type of performance analysis.

For rust you can use eBPF to get these traces down to system calls. You can even profiler other people's software with it.

[0] I'm a co-founder at Pyroscope, where we're building an open source continuous profiling platform. https://pyroscope.io/

NightMKoder · on July 6, 2021

I’ll have to disagree with you - the problems are similar but somewhat unrelated. Sampling profilers traditionally only sample threads that are actively doing something - i.e. using CPU time. Tracing is closer to wall time profiling rather than CPU time - and global instrumentation actually sucks at that kind of enumeration. The overhead of iterating through 100k goroutines would be too high at any reasonable frequency when most of them are just blocked on something. And most are probably blocked on something you don’t even care to see (eg a worker waiting for work).

Aside from the sampling issue there is also the problem of non-blocking frameworks/languages. It’s easy to add tracing to JS functions that use async/await but it is definitely a mini research project to recover stacks from in-flight promise state machines. Most languages fall into this bucket unfortunately.

Now all that said, if you can make some automation that preemptively creates spans for you - kind of like a sampling profiler “creates” stack traces - that would be super cool. The worst part about tracing is sprinkling the tracing code in all the places you need it. You inevitably miss a few spots that show up as “missing” bars in the gannt view. Having something that fills in those blanks is invaluable.

MrBuddyCasino · on July 6, 2021

> The worst part about tracing is sprinkling the tracing code in all the places you need it.

On the JVM this is usually done by agents that automatically attach to your process, you don't have to instrument your code at all. I think Datadog open-sourced their implementation?

vips7L · on July 6, 2021

Java Flight Recorder is basically 0 overhead as well and built directly into the VM.

NightMKoder · on July 9, 2021

Both JFR and agent based implementations only work with non async code. As soon as you do async anything, stack traces are no longer representative so automatic instrumentation is useless.

That said, I’m really hoping Project Loom will fix that issue by getting rid of most async callback-style I/O.

MrBuddyCasino · on July 6, 2021

Isn't Flight Recorder profiling and not tracing? Or has this been added?

pkhuong · on July 6, 2021

Pure sampling profiles are less helpful when solving latency rather than throughput problems, especially so for rare (e.g. p99) latency spikes.

zerd · on July 6, 2021

Tracing is at a different level of detail. For instance you can see which user and specific query is missing caches and spending a lot of time waiting for IO. Unless you inspect local variables while walking the stack I don't see how you'd get that just just profiling. But a combination would be preferable (some observability tools have started on this, e.g. the profiler will tell the tracer where add additional instrumentation).

KronisLV · on July 6, 2021

Oh, hey, Pingcap are the people behind TiDB: https://en.pingcap.com/products/tidb

While this is kind of off topic (the tech is related), i really enjoy the idea behind their product - a database that is compatible with existing protocols (MySQL 5.7 connections in this case) but allows for horizontal scaling out of the box.

While a single instance of MySQL/MariaDB/PostgreSQL can go pretty far, being able to mostly delegate scaling to the actual software to get a distributed system going, without having to do lots of manual work is a great idea! Of course, there are also many pitfalls in regards to compatibility, but that's probably to be expected: https://docs.pingcap.com/tidb/stable/mysql-compatibility

That said, this won't work for all circumstances, but i think that exploring such approaches is definitely worthy of the time!

g0xA52A2A · on July 6, 2021

TSC stuff is always fun, the blogs [1] and presentations [2] from Circonus have been my favorite pieces on this. In fact Circonus and Theo Schlossnagle along with alumni Heinrich Hartmann and Riley Berton all have excellent presentations out there for anyone interested in motoring.

[1] https://www.circonus.com/2016/09/time-but-faster/

[2] https://www.youtube.com/watch?v=xqocAHGoBAg (from around 19:20)

staticassertion · on July 6, 2021

The API looks quite a lot like the one provided in `tracing` - what aspect of these optimizations would be incompatible with that library/ is there any reason not to implement them there?

breezewish · on July 6, 2021

Project contributor here :) We want to continuously try with different micro optimizations or even architecture changes that can help us achieve the best performance that we want. Some optimizations result in trace offs in public interfaces or features that are not compatible with the common interface or even the opentracing standard. Therefore, we chose to build our own library. We think users will find it useful when full-features can be sacrificed for better performance. However some optimizations mentioned in this article can be made available in the `tracing` library easily, for example, the timing optimization has been extracted to `minstant` and can be just adopted. We are glad to help if the community is interested!

staticassertion · on July 6, 2021

Got it. Yeah it might be worthwhile to see what you can push without having to worry about the same constraints, though keeping an open channel so that they can find their way into the tracing library would certainly be helpful.

I'm not a tracing contributor myself, just a consumer, for what it's worth.

kderbe · on July 6, 2021

The introduction to various timing techniques and their caveats is really good: I have no experience with Rust or tracing, but the explanations were easy to understand. Time Stamp Counter (TSC) was new to me.

forrestthewoods · on July 6, 2021

Nice. The Rust profiling story is pretty weak. Flamegraph is not nearly enough.

Grabbing traces like this is great. The next step is efficient visualization of trace data.

mamcx · on July 6, 2021

Also: A good visualization(light!) of structured logging. I move to it and suddenly get harder to inspect my logs because all this stuff presume I will deploy a heavy ELK stack or something.

JetAlone · on July 6, 2021

>The next step is efficient visualization of trace data.

Ooh, I'd be really interested to see what could be done in that vein! Do you have any ideas about how we could lay out trace information?

forrestthewoods · on July 6, 2021

There's a kazillion tools out there. Here's a few off the top of my head to get started.

RAD Game Tools Telemetry - http://www.radgametools.com/telemetry.htm Superluminal - https://superluminal.eu/features/ Remotery - https://github.com/Celtoys/Remotery

Visual Studio's built-in profiler is an ok sampling profiler. It doesn't give you a nice multi-thread view which is a huge advantage to a span based profiler.

MTuner is quite nice for debugging memory usage. Which is another gaping hole in the Rust ecosystem. https://github.com/milostosic/mtuner

Lots of tools generate data in a format viewable by the Chrome trace viewer. I think Chrome's tracer viewer is not great. Maybe someday someone will create a viewer for the format that's good. I get cranky when large traces don't render at 60fps. Web-based viewers are almost all very very slow and it makes me sad.