mirror of
https://github.com/bspeice/speice.io
synced 2025-01-10 09:40:06 -05:00
297 lines
19 KiB
Markdown
297 lines
19 KiB
Markdown
|
---
|
||
|
layout: post
|
||
|
title: "On Building High Performance Systems"
|
||
|
description: ""
|
||
|
category:
|
||
|
tags: []
|
||
|
---
|
||
|
|
||
|
**Update 2019-09-21**: Added notes on `isolcpus` and `systemd` affinity.
|
||
|
|
||
|
Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is
|
||
|
made up of people who have access to secret techniques mortal developers could only dream of. There
|
||
|
had to be some secret art that could only be learned if one had an appropriately tragic backstory:
|
||
|
|
||
|
<img src="/assets/images/2019-04-24-kung-fu.webp" alt="kung-fu fight">
|
||
|
> How I assumed HFT people learn their secret techniques
|
||
|
|
||
|
How else do you explain people working on systems that complete the round trip of market data in to
|
||
|
orders out (a.k.a. tick-to-trade) consistently within
|
||
|
[750-800 nanoseconds](https://stackoverflow.com/a/22082528/1454178)? In roughly the time it takes a
|
||
|
computer to access
|
||
|
[main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html),
|
||
|
trading systems are capable of reading the market data packets, deciding what orders to send, doing
|
||
|
risk checks, creating new packets for exchange-specific protocols, and putting those packets on the
|
||
|
wire.
|
||
|
|
||
|
Having now worked in the trading industry, I can confirm the developers aren't super-human; I've
|
||
|
made some simple mistakes at the very least. Instead, what shows up in public discussions is that
|
||
|
philosophy, not technique, separates high-performance systems from everything else.
|
||
|
Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast
|
||
|
(though micro-optimizations have their place); there's a lot more to worry about than just the code
|
||
|
written for the project.
|
||
|
|
||
|
The framework I'd propose is this: **If you want to build high-performance systems, focus first on
|
||
|
reducing performance variance** (reducing the gap between the fastest and slowest runs of the same
|
||
|
code), **and only look at average latency once variance is at an acceptable level**.
|
||
|
|
||
|
Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20
|
||
|
seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes
|
||
|
a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a
|
||
|
week is the same in each situation, but you're painfully aware of that minute when it happens. When
|
||
|
it comes to code, the principal is the same: speeding up a function by an average of 10 milliseconds
|
||
|
doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When
|
||
|
performance matters, you need to respond quickly _every time_, not just in aggregate.
|
||
|
High-performance systems should first optimize for time variance. Once you're consistent at the time
|
||
|
scale you care about, then focus on improving average time.
|
||
|
|
||
|
This focus on variance shows up all the time in industry too (emphasis added in all quotes below):
|
||
|
|
||
|
- In [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for
|
||
|
NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability
|
||
|
is highlighted in addition to instantaneous metrics:
|
||
|
|
||
|
> Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40
|
||
|
> microsecond average latency
|
||
|
|
||
|
- The [Aeron](https://github.com/real-logic/aeron) message bus has this to say about performance:
|
||
|
|
||
|
> Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and
|
||
|
> **most predictable latency possible** of any messaging system
|
||
|
|
||
|
- The company PolySync, which is working on autonomous vehicles,
|
||
|
[mentions why](https://polysync.io/blog/session-types-for-hearty-codecs/) they picked their
|
||
|
specific messaging format:
|
||
|
|
||
|
> In general, high performance is almost always desirable for serialization. But in the world of
|
||
|
> autonomous vehicles, **steady timing performance is even more important** than peak throughput.
|
||
|
> This is because safe operation is sensitive to timing outliers. Nobody wants the system that
|
||
|
> decides when to slam on the brakes to occasionally take 100 times longer than usual to encode
|
||
|
> its commands.
|
||
|
|
||
|
- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out
|
||
|
variance (jitter) as a big concern for
|
||
|
[electronic trading](https://solarflare.com/electronic-trading/):
|
||
|
> The high stakes world of electronic trading, investment banks, market makers, hedge funds and
|
||
|
> exchanges demand the **lowest possible latency and jitter** while utilizing the highest
|
||
|
> bandwidth and return on their investment.
|
||
|
|
||
|
And to further clarify: we're not discussing _total run-time_, but variance of total run-time. There
|
||
|
are situations where it's not reasonably possible to make things faster, and you'd much rather be
|
||
|
consistent. For example, trading firms use
|
||
|
[wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because
|
||
|
the speed of light through air is faster than through fiber-optic cables. There's still at _absolute
|
||
|
minimum_ a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between,
|
||
|
say,
|
||
|
[Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago).
|
||
|
If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a
|
||
|
trade occurs, there's a physical limit to how long that will take. In this situation, the focus is
|
||
|
on keeping variance of _additional processing_ to a minimum, since speed of light is the limiting
|
||
|
factor.
|
||
|
|
||
|
So how does one go about looking for and eliminating performance variance? To tell the truth, I
|
||
|
don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep
|
||
|
understanding of the entire technology stack, and (B) actually measuring system performance (though
|
||
|
(C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos for
|
||
|
inspiration never hurt). Even then, every project cares about performance to a different degree; you
|
||
|
may need to build an entire
|
||
|
[replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to
|
||
|
accurately benchmark at nanosecond precision, or you may be content to simply
|
||
|
[avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in
|
||
|
your Java code.
|
||
|
|
||
|
Even though everyone has different needs, there are still common things to look for when trying to
|
||
|
isolate and eliminate variance. In no particular order, these are my focus areas when thinking about
|
||
|
high-performance systems:
|
||
|
|
||
|
## Language-specific
|
||
|
|
||
|
**Garbage Collection**: How often does garbage collection happen? When is it triggered? What are the
|
||
|
impacts?
|
||
|
|
||
|
- [In Python](https://rushter.com/blog/python-garbage-collector/), individual objects are collected
|
||
|
if the reference count reaches 0, and each generation is collected if
|
||
|
`num_alloc - num_dealloc > gc_threshold` whenever an allocation happens. The GIL is acquired for
|
||
|
the duration of generational collection.
|
||
|
- Java has
|
||
|
[many](https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE)
|
||
|
[different](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573)
|
||
|
[collection](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A)
|
||
|
[algorithms](https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0)
|
||
|
to choose from, each with different characteristics. The default algorithms (Parallel GC in Java
|
||
|
8, G1 in Java 9) freeze the JVM while collecting, while more recent algorithms
|
||
|
([ZGC](https://wiki.openjdk.java.net/display/zgc) and
|
||
|
[Shenandoah](https://wiki.openjdk.java.net/display/shenandoah)) are designed to keep "stop the
|
||
|
world" to a minimum by doing collection work in parallel.
|
||
|
|
||
|
**Allocation**: Every language has a different way of interacting with "heap" memory, but the
|
||
|
principle is the same: running the allocator to allocate/deallocate memory takes time that can often
|
||
|
be put to better use. Understanding when your language interacts with the allocator is crucial, and
|
||
|
not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java
|
||
|
does (meaning potential GC pauses). Take time to understand heap behavior (I made a
|
||
|
[a guide for Rust](/2019/02/understanding-allocations-in-rust.html)), and look into alternative
|
||
|
allocators ([jemalloc](http://jemalloc.net/),
|
||
|
[tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html)) that might run faster than the
|
||
|
operating system default.
|
||
|
|
||
|
**Data Layout**: How your data is arranged in memory matters;
|
||
|
[data-oriented design](https://www.youtube.com/watch?v=yy8jQgmhbAU) and
|
||
|
[cache locality](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185) can have huge
|
||
|
impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have
|
||
|
guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't
|
||
|
make. [Cachegrind](http://valgrind.org/docs/manual/cg-manual.html) and kernel
|
||
|
[perf](https://perf.wiki.kernel.org/index.php/Main_Page) counters are both great for understanding
|
||
|
how performance relates to memory layout.
|
||
|
|
||
|
**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are
|
||
|
great because they optimize your program for how it's actually being used, rather than how a
|
||
|
compiler expects it to be used. However, there's a variance problem if the program stops executing
|
||
|
while waiting for translation from VM bytecode to native code. As a remedy, many languages support
|
||
|
ahead-of-time compilation in addition to the JIT versions
|
||
|
([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java).
|
||
|
On the other hand, LLVM supports
|
||
|
[Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization),
|
||
|
which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing
|
||
|
apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT
|
||
|
compiler kicked in.
|
||
|
|
||
|
**Programming Tricks**: These won't make or break performance, but can be useful in specific
|
||
|
circumstances. For example, C++ can use
|
||
|
[templates instead of branches](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206)
|
||
|
in critical sections.
|
||
|
|
||
|
## Kernel
|
||
|
|
||
|
Code you wrote is almost certainly not the _only_ code running on your hardware. There are many ways
|
||
|
the operating system interacts with your program, from interrupts to system calls, that are
|
||
|
important to watch for. These are written from a Linux perspective, but Windows does typically have
|
||
|
equivalent functionality.
|
||
|
|
||
|
**Scheduling**: The kernel is normally free to schedule any process on any core, so it's important
|
||
|
to reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
|
||
|
limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
|
||
|
scheduling
|
||
|
([`isolcpus`](https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html)
|
||
|
kernel command-line option), or by setting the `init` process CPU affinity
|
||
|
([`systemd` example](https://access.redhat.com/solutions/2884991)). Second, set critical processes
|
||
|
to run on the isolated cores by setting the
|
||
|
[processor affinity](https://en.wikipedia.org/wiki/Processor_affinity) using
|
||
|
[taskset](https://linux.die.net/man/1/taskset). Finally, use
|
||
|
[`NO_HZ`](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt) or
|
||
|
[`chrt`](https://linux.die.net/man/1/chrt) to disable scheduling interrupts. Turning off
|
||
|
hyper-threading is also likely beneficial.
|
||
|
|
||
|
**System calls**: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long
|
||
|
the I/O operation takes, these all trigger expensive
|
||
|
[system calls (syscalls)](https://en.wikipedia.org/wiki/System_call). To handle these, the CPU must
|
||
|
[context switch](https://en.wikipedia.org/wiki/Context_switch) to the kernel, let the kernel
|
||
|
operation complete, then context switch back to your program. We'd rather keep these
|
||
|
[to a minimum](https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript) (see
|
||
|
timestamp 18:20). [Strace](https://linux.die.net/man/1/strace) is your friend for understanding when
|
||
|
and where syscalls happen.
|
||
|
|
||
|
**Signal Handling**: Far less likely to be an issue, but signals do trigger a context switch if your
|
||
|
code has a handler registered. This will be highly dependent on the application, but you can
|
||
|
[block signals](https://www.linuxprogrammingblog.com/all-about-linux-signals?page=show#Blocking_signals)
|
||
|
if it's an issue.
|
||
|
|
||
|
**Interrupts**: System interrupts are how devices connected to your computer notify the CPU that
|
||
|
something has happened. The CPU will then choose a processor core to pause and context switch to the
|
||
|
OS to handle the interrupt. Make sure that
|
||
|
[SMP affinity](http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux) is
|
||
|
set so that interrupts are handled on a CPU core not running the program you care about.
|
||
|
|
||
|
**[NUMA](https://www.kernel.org/doc/html/latest/vm/numa.html)**: While NUMA is good at making
|
||
|
multi-cell systems transparent, there are variance implications; if the kernel moves a process
|
||
|
across nodes, future memory accesses must wait for the controller on the original node. Use
|
||
|
[numactl](https://linux.die.net/man/8/numactl) to handle memory-/cpu-cell pinning so this doesn't
|
||
|
happen.
|
||
|
|
||
|
## Hardware
|
||
|
|
||
|
**CPU Pipelining/Speculation**: Speculative execution in modern processors gave us vulnerabilities
|
||
|
like Spectre, but it also gave us performance improvements like
|
||
|
[branch prediction](https://stackoverflow.com/a/11227902/1454178). And if the CPU mis-speculates
|
||
|
your code, there's variance associated with rewind and replay. While the compiler knows a lot about
|
||
|
how your CPU [pipelines instructions](https://youtu.be/nAbCKa0FzjQ?t=4467), code can be
|
||
|
[structured to help](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755) the branch
|
||
|
predictor.
|
||
|
|
||
|
**Paging**: For most systems, virtual memory is incredible. Applications live in their own worlds,
|
||
|
and the CPU/[MMU](https://en.wikipedia.org/wiki/Memory_management_unit) figures out the details.
|
||
|
However, there's a variance penalty associated with memory paging and caching; if you access more
|
||
|
memory pages than the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) can store,
|
||
|
you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an
|
||
|
issue, but using [huge pages](https://blog.pythian.com/performance-tuning-hugepages-in-linux/) can
|
||
|
reduce TLB burdens. Alternately, running applications in a hypervisor like
|
||
|
[Jailhouse](https://github.com/siemens/jailhouse) allows one to skip virtual memory entirely, but
|
||
|
this is probably more work than the benefits are worth.
|
||
|
|
||
|
**Network Interfaces**: When more than one computer is involved, variance can go up dramatically.
|
||
|
Tuning kernel
|
||
|
[network parameters](https://github.com/leandromoreira/linux-network-performance-parameters) may be
|
||
|
helpful, but modern systems more frequently opt to skip the kernel altogether with a technique
|
||
|
called [kernel bypass](https://blog.cloudflare.com/kernel-bypass/). This typically requires
|
||
|
specialized hardware and [drivers](https://www.openonload.org/), but even industries like
|
||
|
[telecom](https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass) are
|
||
|
finding the benefits.
|
||
|
|
||
|
## Networks
|
||
|
|
||
|
**Routing**: There's a reason financial firms are willing to pay
|
||
|
[millions of euros](https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/)
|
||
|
for rights to a small plot of land - having a straight-line connection from point A to point B means
|
||
|
the path their data takes is the shortest possible. In contrast, there are currently 6 computers in
|
||
|
between me and Google, but that may change at any moment if my ISP realizes a
|
||
|
[more efficient route](https://en.wikipedia.org/wiki/Border_Gateway_Protocol) is available. Whether
|
||
|
it's using
|
||
|
[research-quality equipment](https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-trading-part-i-the-west-chicago-tower-mystery/)
|
||
|
for shortwave radio, or just making sure there's no data inadvertently going between data centers,
|
||
|
routing matters.
|
||
|
|
||
|
**Protocol**: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control,
|
||
|
and congestion control all built in. But these attributes make the most sense when networking
|
||
|
infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the
|
||
|
setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may
|
||
|
make sense in these contexts as it avoids the chatter needed to track connection state, and
|
||
|
[gap-fill](https://iextrading.com/docs/IEX%20Transport%20Specification.pdf)
|
||
|
[strategies](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf)
|
||
|
can handle the rest.
|
||
|
|
||
|
**Switching**: Many routers/switches handle packets using "store-and-forward" behavior: wait for the
|
||
|
whole packet, validate checksums, and then send to the next device. In variance terms, the time
|
||
|
needed to move data between two nodes is proportional to the size of that data; the switch must
|
||
|
"store" all data before it can calculate checksums and "forward" to the next node. With
|
||
|
["cut-through"](https://www.networkworld.com/article/2241573/latency-and-jitter--cut-through-design-pays-off-for-arista--blade.html)
|
||
|
designs, switches will begin forwarding data as soon as they know where the destination is,
|
||
|
checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter
|
||
|
the size.
|
||
|
|
||
|
# Final Thoughts
|
||
|
|
||
|
High-performance systems, regardless of industry, are not magical. They do require extreme precision
|
||
|
and attention to detail, but they're designed, built, and operated by regular people, using a lot of
|
||
|
tools that are publicly available. Interested in seeing how context switching affects performance of
|
||
|
your benchmarks? `taskset` should be installed in all modern Linux distributions, and can be used to
|
||
|
make sure the OS never migrates your process. Curious how often garbage collection triggers during a
|
||
|
crucial operation? Your language of choice will typically expose details of its operations
|
||
|
([Python](https://docs.python.org/3/library/gc.html),
|
||
|
[Java](https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#DebuggingOptions)).
|
||
|
Want to know how hard your program is stressing the TLB? Use `perf record` and look for
|
||
|
`dtlb_load_misses.miss_causes_a_walk`.
|
||
|
|
||
|
Two final guiding questions, then: first, before attempting to apply some of the technology above to
|
||
|
your own systems, can you first identify
|
||
|
[where/when you care](http://wiki.c2.com/?PrematureOptimization) about "high-performance"? As an
|
||
|
example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable
|
||
|
effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are
|
||
|
they being designed in a way that's actually helpful? Tools like
|
||
|
[Criterion](http://www.serpentine.com/criterion/) (also in
|
||
|
[Rust](https://github.com/bheisler/criterion.rs)) and Google's
|
||
|
[Benchmark](https://github.com/google/benchmark) output not only average run time, but variance as
|
||
|
well; your benchmarking environment is subject to the same concerns your production environment is.
|
||
|
|
||
|
Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique.
|
||
|
Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate
|
||
|
it; once that's at an acceptable level, then optimize for speed.
|