mirror of
https://github.com/bspeice/speice.io
synced 2024-12-22 08:38:09 -05:00
2019-06-31-high-performance-systems
This commit is contained in:
parent
d9baae0a5e
commit
8501da2f81
296
blog/2019-06-31-high-performance-systems/_article.md
Normal file
296
blog/2019-06-31-high-performance-systems/_article.md
Normal file
@ -0,0 +1,296 @@
|
||||
---
|
||||
layout: post
|
||||
title: "On Building High Performance Systems"
|
||||
description: ""
|
||||
category:
|
||||
tags: []
|
||||
---
|
||||
|
||||
**Update 2019-09-21**: Added notes on `isolcpus` and `systemd` affinity.
|
||||
|
||||
Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is
|
||||
made up of people who have access to secret techniques mortal developers could only dream of. There
|
||||
had to be some secret art that could only be learned if one had an appropriately tragic backstory:
|
||||
|
||||
<img src="/assets/images/2019-04-24-kung-fu.webp" alt="kung-fu fight">
|
||||
> How I assumed HFT people learn their secret techniques
|
||||
|
||||
How else do you explain people working on systems that complete the round trip of market data in to
|
||||
orders out (a.k.a. tick-to-trade) consistently within
|
||||
[750-800 nanoseconds](https://stackoverflow.com/a/22082528/1454178)? In roughly the time it takes a
|
||||
computer to access
|
||||
[main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html),
|
||||
trading systems are capable of reading the market data packets, deciding what orders to send, doing
|
||||
risk checks, creating new packets for exchange-specific protocols, and putting those packets on the
|
||||
wire.
|
||||
|
||||
Having now worked in the trading industry, I can confirm the developers aren't super-human; I've
|
||||
made some simple mistakes at the very least. Instead, what shows up in public discussions is that
|
||||
philosophy, not technique, separates high-performance systems from everything else.
|
||||
Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast
|
||||
(though micro-optimizations have their place); there's a lot more to worry about than just the code
|
||||
written for the project.
|
||||
|
||||
The framework I'd propose is this: **If you want to build high-performance systems, focus first on
|
||||
reducing performance variance** (reducing the gap between the fastest and slowest runs of the same
|
||||
code), **and only look at average latency once variance is at an acceptable level**.
|
||||
|
||||
Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20
|
||||
seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes
|
||||
a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a
|
||||
week is the same in each situation, but you're painfully aware of that minute when it happens. When
|
||||
it comes to code, the principal is the same: speeding up a function by an average of 10 milliseconds
|
||||
doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When
|
||||
performance matters, you need to respond quickly _every time_, not just in aggregate.
|
||||
High-performance systems should first optimize for time variance. Once you're consistent at the time
|
||||
scale you care about, then focus on improving average time.
|
||||
|
||||
This focus on variance shows up all the time in industry too (emphasis added in all quotes below):
|
||||
|
||||
- In [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for
|
||||
NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability
|
||||
is highlighted in addition to instantaneous metrics:
|
||||
|
||||
> Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40
|
||||
> microsecond average latency
|
||||
|
||||
- The [Aeron](https://github.com/real-logic/aeron) message bus has this to say about performance:
|
||||
|
||||
> Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and
|
||||
> **most predictable latency possible** of any messaging system
|
||||
|
||||
- The company PolySync, which is working on autonomous vehicles,
|
||||
[mentions why](https://polysync.io/blog/session-types-for-hearty-codecs/) they picked their
|
||||
specific messaging format:
|
||||
|
||||
> In general, high performance is almost always desirable for serialization. But in the world of
|
||||
> autonomous vehicles, **steady timing performance is even more important** than peak throughput.
|
||||
> This is because safe operation is sensitive to timing outliers. Nobody wants the system that
|
||||
> decides when to slam on the brakes to occasionally take 100 times longer than usual to encode
|
||||
> its commands.
|
||||
|
||||
- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out
|
||||
variance (jitter) as a big concern for
|
||||
[electronic trading](https://solarflare.com/electronic-trading/):
|
||||
> The high stakes world of electronic trading, investment banks, market makers, hedge funds and
|
||||
> exchanges demand the **lowest possible latency and jitter** while utilizing the highest
|
||||
> bandwidth and return on their investment.
|
||||
|
||||
And to further clarify: we're not discussing _total run-time_, but variance of total run-time. There
|
||||
are situations where it's not reasonably possible to make things faster, and you'd much rather be
|
||||
consistent. For example, trading firms use
|
||||
[wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because
|
||||
the speed of light through air is faster than through fiber-optic cables. There's still at _absolute
|
||||
minimum_ a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between,
|
||||
say,
|
||||
[Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago).
|
||||
If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a
|
||||
trade occurs, there's a physical limit to how long that will take. In this situation, the focus is
|
||||
on keeping variance of _additional processing_ to a minimum, since speed of light is the limiting
|
||||
factor.
|
||||
|
||||
So how does one go about looking for and eliminating performance variance? To tell the truth, I
|
||||
don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep
|
||||
understanding of the entire technology stack, and (B) actually measuring system performance (though
|
||||
(C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos for
|
||||
inspiration never hurt). Even then, every project cares about performance to a different degree; you
|
||||
may need to build an entire
|
||||
[replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to
|
||||
accurately benchmark at nanosecond precision, or you may be content to simply
|
||||
[avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in
|
||||
your Java code.
|
||||
|
||||
Even though everyone has different needs, there are still common things to look for when trying to
|
||||
isolate and eliminate variance. In no particular order, these are my focus areas when thinking about
|
||||
high-performance systems:
|
||||
|
||||
## Language-specific
|
||||
|
||||
**Garbage Collection**: How often does garbage collection happen? When is it triggered? What are the
|
||||
impacts?
|
||||
|
||||
- [In Python](https://rushter.com/blog/python-garbage-collector/), individual objects are collected
|
||||
if the reference count reaches 0, and each generation is collected if
|
||||
`num_alloc - num_dealloc > gc_threshold` whenever an allocation happens. The GIL is acquired for
|
||||
the duration of generational collection.
|
||||
- Java has
|
||||
[many](https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE)
|
||||
[different](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573)
|
||||
[collection](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A)
|
||||
[algorithms](https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0)
|
||||
to choose from, each with different characteristics. The default algorithms (Parallel GC in Java
|
||||
8, G1 in Java 9) freeze the JVM while collecting, while more recent algorithms
|
||||
([ZGC](https://wiki.openjdk.java.net/display/zgc) and
|
||||
[Shenandoah](https://wiki.openjdk.java.net/display/shenandoah)) are designed to keep "stop the
|
||||
world" to a minimum by doing collection work in parallel.
|
||||
|
||||
**Allocation**: Every language has a different way of interacting with "heap" memory, but the
|
||||
principle is the same: running the allocator to allocate/deallocate memory takes time that can often
|
||||
be put to better use. Understanding when your language interacts with the allocator is crucial, and
|
||||
not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java
|
||||
does (meaning potential GC pauses). Take time to understand heap behavior (I made a
|
||||
[a guide for Rust](/2019/02/understanding-allocations-in-rust.html)), and look into alternative
|
||||
allocators ([jemalloc](http://jemalloc.net/),
|
||||
[tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html)) that might run faster than the
|
||||
operating system default.
|
||||
|
||||
**Data Layout**: How your data is arranged in memory matters;
|
||||
[data-oriented design](https://www.youtube.com/watch?v=yy8jQgmhbAU) and
|
||||
[cache locality](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185) can have huge
|
||||
impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have
|
||||
guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't
|
||||
make. [Cachegrind](http://valgrind.org/docs/manual/cg-manual.html) and kernel
|
||||
[perf](https://perf.wiki.kernel.org/index.php/Main_Page) counters are both great for understanding
|
||||
how performance relates to memory layout.
|
||||
|
||||
**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are
|
||||
great because they optimize your program for how it's actually being used, rather than how a
|
||||
compiler expects it to be used. However, there's a variance problem if the program stops executing
|
||||
while waiting for translation from VM bytecode to native code. As a remedy, many languages support
|
||||
ahead-of-time compilation in addition to the JIT versions
|
||||
([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java).
|
||||
On the other hand, LLVM supports
|
||||
[Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization),
|
||||
which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing
|
||||
apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT
|
||||
compiler kicked in.
|
||||
|
||||
**Programming Tricks**: These won't make or break performance, but can be useful in specific
|
||||
circumstances. For example, C++ can use
|
||||
[templates instead of branches](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206)
|
||||
in critical sections.
|
||||
|
||||
## Kernel
|
||||
|
||||
Code you wrote is almost certainly not the _only_ code running on your hardware. There are many ways
|
||||
the operating system interacts with your program, from interrupts to system calls, that are
|
||||
important to watch for. These are written from a Linux perspective, but Windows does typically have
|
||||
equivalent functionality.
|
||||
|
||||
**Scheduling**: The kernel is normally free to schedule any process on any core, so it's important
|
||||
to reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
|
||||
limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
|
||||
scheduling
|
||||
([`isolcpus`](https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html)
|
||||
kernel command-line option), or by setting the `init` process CPU affinity
|
||||
([`systemd` example](https://access.redhat.com/solutions/2884991)). Second, set critical processes
|
||||
to run on the isolated cores by setting the
|
||||
[processor affinity](https://en.wikipedia.org/wiki/Processor_affinity) using
|
||||
[taskset](https://linux.die.net/man/1/taskset). Finally, use
|
||||
[`NO_HZ`](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt) or
|
||||
[`chrt`](https://linux.die.net/man/1/chrt) to disable scheduling interrupts. Turning off
|
||||
hyper-threading is also likely beneficial.
|
||||
|
||||
**System calls**: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long
|
||||
the I/O operation takes, these all trigger expensive
|
||||
[system calls (syscalls)](https://en.wikipedia.org/wiki/System_call). To handle these, the CPU must
|
||||
[context switch](https://en.wikipedia.org/wiki/Context_switch) to the kernel, let the kernel
|
||||
operation complete, then context switch back to your program. We'd rather keep these
|
||||
[to a minimum](https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript) (see
|
||||
timestamp 18:20). [Strace](https://linux.die.net/man/1/strace) is your friend for understanding when
|
||||
and where syscalls happen.
|
||||
|
||||
**Signal Handling**: Far less likely to be an issue, but signals do trigger a context switch if your
|
||||
code has a handler registered. This will be highly dependent on the application, but you can
|
||||
[block signals](https://www.linuxprogrammingblog.com/all-about-linux-signals?page=show#Blocking_signals)
|
||||
if it's an issue.
|
||||
|
||||
**Interrupts**: System interrupts are how devices connected to your computer notify the CPU that
|
||||
something has happened. The CPU will then choose a processor core to pause and context switch to the
|
||||
OS to handle the interrupt. Make sure that
|
||||
[SMP affinity](http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux) is
|
||||
set so that interrupts are handled on a CPU core not running the program you care about.
|
||||
|
||||
**[NUMA](https://www.kernel.org/doc/html/latest/vm/numa.html)**: While NUMA is good at making
|
||||
multi-cell systems transparent, there are variance implications; if the kernel moves a process
|
||||
across nodes, future memory accesses must wait for the controller on the original node. Use
|
||||
[numactl](https://linux.die.net/man/8/numactl) to handle memory-/cpu-cell pinning so this doesn't
|
||||
happen.
|
||||
|
||||
## Hardware
|
||||
|
||||
**CPU Pipelining/Speculation**: Speculative execution in modern processors gave us vulnerabilities
|
||||
like Spectre, but it also gave us performance improvements like
|
||||
[branch prediction](https://stackoverflow.com/a/11227902/1454178). And if the CPU mis-speculates
|
||||
your code, there's variance associated with rewind and replay. While the compiler knows a lot about
|
||||
how your CPU [pipelines instructions](https://youtu.be/nAbCKa0FzjQ?t=4467), code can be
|
||||
[structured to help](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755) the branch
|
||||
predictor.
|
||||
|
||||
**Paging**: For most systems, virtual memory is incredible. Applications live in their own worlds,
|
||||
and the CPU/[MMU](https://en.wikipedia.org/wiki/Memory_management_unit) figures out the details.
|
||||
However, there's a variance penalty associated with memory paging and caching; if you access more
|
||||
memory pages than the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) can store,
|
||||
you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an
|
||||
issue, but using [huge pages](https://blog.pythian.com/performance-tuning-hugepages-in-linux/) can
|
||||
reduce TLB burdens. Alternately, running applications in a hypervisor like
|
||||
[Jailhouse](https://github.com/siemens/jailhouse) allows one to skip virtual memory entirely, but
|
||||
this is probably more work than the benefits are worth.
|
||||
|
||||
**Network Interfaces**: When more than one computer is involved, variance can go up dramatically.
|
||||
Tuning kernel
|
||||
[network parameters](https://github.com/leandromoreira/linux-network-performance-parameters) may be
|
||||
helpful, but modern systems more frequently opt to skip the kernel altogether with a technique
|
||||
called [kernel bypass](https://blog.cloudflare.com/kernel-bypass/). This typically requires
|
||||
specialized hardware and [drivers](https://www.openonload.org/), but even industries like
|
||||
[telecom](https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass) are
|
||||
finding the benefits.
|
||||
|
||||
## Networks
|
||||
|
||||
**Routing**: There's a reason financial firms are willing to pay
|
||||
[millions of euros](https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/)
|
||||
for rights to a small plot of land - having a straight-line connection from point A to point B means
|
||||
the path their data takes is the shortest possible. In contrast, there are currently 6 computers in
|
||||
between me and Google, but that may change at any moment if my ISP realizes a
|
||||
[more efficient route](https://en.wikipedia.org/wiki/Border_Gateway_Protocol) is available. Whether
|
||||
it's using
|
||||
[research-quality equipment](https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-trading-part-i-the-west-chicago-tower-mystery/)
|
||||
for shortwave radio, or just making sure there's no data inadvertently going between data centers,
|
||||
routing matters.
|
||||
|
||||
**Protocol**: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control,
|
||||
and congestion control all built in. But these attributes make the most sense when networking
|
||||
infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the
|
||||
setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may
|
||||
make sense in these contexts as it avoids the chatter needed to track connection state, and
|
||||
[gap-fill](https://iextrading.com/docs/IEX%20Transport%20Specification.pdf)
|
||||
[strategies](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf)
|
||||
can handle the rest.
|
||||
|
||||
**Switching**: Many routers/switches handle packets using "store-and-forward" behavior: wait for the
|
||||
whole packet, validate checksums, and then send to the next device. In variance terms, the time
|
||||
needed to move data between two nodes is proportional to the size of that data; the switch must
|
||||
"store" all data before it can calculate checksums and "forward" to the next node. With
|
||||
["cut-through"](https://www.networkworld.com/article/2241573/latency-and-jitter--cut-through-design-pays-off-for-arista--blade.html)
|
||||
designs, switches will begin forwarding data as soon as they know where the destination is,
|
||||
checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter
|
||||
the size.
|
||||
|
||||
# Final Thoughts
|
||||
|
||||
High-performance systems, regardless of industry, are not magical. They do require extreme precision
|
||||
and attention to detail, but they're designed, built, and operated by regular people, using a lot of
|
||||
tools that are publicly available. Interested in seeing how context switching affects performance of
|
||||
your benchmarks? `taskset` should be installed in all modern Linux distributions, and can be used to
|
||||
make sure the OS never migrates your process. Curious how often garbage collection triggers during a
|
||||
crucial operation? Your language of choice will typically expose details of its operations
|
||||
([Python](https://docs.python.org/3/library/gc.html),
|
||||
[Java](https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#DebuggingOptions)).
|
||||
Want to know how hard your program is stressing the TLB? Use `perf record` and look for
|
||||
`dtlb_load_misses.miss_causes_a_walk`.
|
||||
|
||||
Two final guiding questions, then: first, before attempting to apply some of the technology above to
|
||||
your own systems, can you first identify
|
||||
[where/when you care](http://wiki.c2.com/?PrematureOptimization) about "high-performance"? As an
|
||||
example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable
|
||||
effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are
|
||||
they being designed in a way that's actually helpful? Tools like
|
||||
[Criterion](http://www.serpentine.com/criterion/) (also in
|
||||
[Rust](https://github.com/bheisler/criterion.rs)) and Google's
|
||||
[Benchmark](https://github.com/google/benchmark) output not only average run time, but variance as
|
||||
well; your benchmarking environment is subject to the same concerns your production environment is.
|
||||
|
||||
Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique.
|
||||
Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate
|
||||
it; once that's at an acceptable level, then optimize for speed.
|
302
blog/2019-06-31-high-performance-systems/index.mdx
Normal file
302
blog/2019-06-31-high-performance-systems/index.mdx
Normal file
@ -0,0 +1,302 @@
|
||||
---
|
||||
slug: 2019/06/high-performance-systems
|
||||
title: "On Building High Performance Systems"
|
||||
date: 2019-06-31 12:00:00
|
||||
last_updated:
|
||||
date: 2019-09-21 12:00:00
|
||||
authors: [bspeice]
|
||||
tags: []
|
||||
---
|
||||
|
||||
|
||||
Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is
|
||||
made up of people who have access to secret techniques mortal developers could only dream of. There
|
||||
had to be some secret art that could only be learned if one had an appropriately tragic backstory.
|
||||
|
||||
<!-- truncate -->
|
||||
|
||||
![Kung Fu fight](./kung-fu.webp)
|
||||
|
||||
> How I assumed HFT people learn their secret techniques
|
||||
|
||||
How else do you explain people working on systems that complete the round trip of market data in to
|
||||
orders out (a.k.a. tick-to-trade) consistently within
|
||||
[750-800 nanoseconds](https://stackoverflow.com/a/22082528/1454178)? In roughly the time it takes a
|
||||
computer to access
|
||||
[main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html),
|
||||
trading systems are capable of reading the market data packets, deciding what orders to send, doing
|
||||
risk checks, creating new packets for exchange-specific protocols, and putting those packets on the
|
||||
wire.
|
||||
|
||||
Having now worked in the trading industry, I can confirm the developers aren't super-human; I've
|
||||
made some simple mistakes at the very least. Instead, what shows up in public discussions is that
|
||||
philosophy, not technique, separates high-performance systems from everything else.
|
||||
Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast
|
||||
(though micro-optimizations have their place); there's a lot more to worry about than just the code
|
||||
written for the project.
|
||||
|
||||
The framework I'd propose is this: **If you want to build high-performance systems, focus first on
|
||||
reducing performance variance** (reducing the gap between the fastest and slowest runs of the same
|
||||
code), **and only look at average latency once variance is at an acceptable level**.
|
||||
|
||||
Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20
|
||||
seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes
|
||||
a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a
|
||||
week is the same in each situation, but you're painfully aware of that minute when it happens. When
|
||||
it comes to code, the principal is the same: speeding up a function by an average of 10 milliseconds
|
||||
doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When
|
||||
performance matters, you need to respond quickly _every time_, not just in aggregate.
|
||||
High-performance systems should first optimize for time variance. Once you're consistent at the time
|
||||
scale you care about, then focus on improving average time.
|
||||
|
||||
This focus on variance shows up all the time in industry too (emphasis added in all quotes below):
|
||||
|
||||
- In [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for
|
||||
NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability
|
||||
is highlighted in addition to instantaneous metrics:
|
||||
|
||||
> Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40
|
||||
> microsecond average latency
|
||||
|
||||
- The [Aeron](https://github.com/real-logic/aeron) message bus has this to say about performance:
|
||||
|
||||
> Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and
|
||||
> **most predictable latency possible** of any messaging system
|
||||
|
||||
- The company PolySync, which is working on autonomous vehicles,
|
||||
[mentions why](https://polysync.io/blog/session-types-for-hearty-codecs/) they picked their
|
||||
specific messaging format:
|
||||
|
||||
> In general, high performance is almost always desirable for serialization. But in the world of
|
||||
> autonomous vehicles, **steady timing performance is even more important** than peak throughput.
|
||||
> This is because safe operation is sensitive to timing outliers. Nobody wants the system that
|
||||
> decides when to slam on the brakes to occasionally take 100 times longer than usual to encode
|
||||
> its commands.
|
||||
|
||||
- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out
|
||||
variance (jitter) as a big concern for
|
||||
[electronic trading](https://solarflare.com/electronic-trading/):
|
||||
> The high stakes world of electronic trading, investment banks, market makers, hedge funds and
|
||||
> exchanges demand the **lowest possible latency and jitter** while utilizing the highest
|
||||
> bandwidth and return on their investment.
|
||||
|
||||
And to further clarify: we're not discussing _total run-time_, but variance of total run-time. There
|
||||
are situations where it's not reasonably possible to make things faster, and you'd much rather be
|
||||
consistent. For example, trading firms use
|
||||
[wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because
|
||||
the speed of light through air is faster than through fiber-optic cables. There's still at _absolute
|
||||
minimum_ a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between,
|
||||
say,
|
||||
[Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago).
|
||||
If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a
|
||||
trade occurs, there's a physical limit to how long that will take. In this situation, the focus is
|
||||
on keeping variance of _additional processing_ to a minimum, since speed of light is the limiting
|
||||
factor.
|
||||
|
||||
So how does one go about looking for and eliminating performance variance? To tell the truth, I
|
||||
don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep
|
||||
understanding of the entire technology stack, and (B) actually measuring system performance (though
|
||||
(C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos for
|
||||
inspiration never hurt). Even then, every project cares about performance to a different degree; you
|
||||
may need to build an entire
|
||||
[replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to
|
||||
accurately benchmark at nanosecond precision, or you may be content to simply
|
||||
[avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in
|
||||
your Java code.
|
||||
|
||||
Even though everyone has different needs, there are still common things to look for when trying to
|
||||
isolate and eliminate variance. In no particular order, these are my focus areas when thinking about
|
||||
high-performance systems:
|
||||
|
||||
**Update 2019-09-21**: Added notes on `isolcpus` and `systemd` affinity.
|
||||
|
||||
## Language-specific
|
||||
|
||||
**Garbage Collection**: How often does garbage collection happen? When is it triggered? What are the
|
||||
impacts?
|
||||
|
||||
- [In Python](https://rushter.com/blog/python-garbage-collector/), individual objects are collected
|
||||
if the reference count reaches 0, and each generation is collected if
|
||||
`num_alloc - num_dealloc > gc_threshold` whenever an allocation happens. The GIL is acquired for
|
||||
the duration of generational collection.
|
||||
- Java has
|
||||
[many](https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE)
|
||||
[different](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573)
|
||||
[collection](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A)
|
||||
[algorithms](https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0)
|
||||
to choose from, each with different characteristics. The default algorithms (Parallel GC in Java
|
||||
8, G1 in Java 9) freeze the JVM while collecting, while more recent algorithms
|
||||
([ZGC](https://wiki.openjdk.java.net/display/zgc) and
|
||||
[Shenandoah](https://wiki.openjdk.java.net/display/shenandoah)) are designed to keep "stop the
|
||||
world" to a minimum by doing collection work in parallel.
|
||||
|
||||
**Allocation**: Every language has a different way of interacting with "heap" memory, but the
|
||||
principle is the same: running the allocator to allocate/deallocate memory takes time that can often
|
||||
be put to better use. Understanding when your language interacts with the allocator is crucial, and
|
||||
not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java
|
||||
does (meaning potential GC pauses). Take time to understand heap behavior (I made a
|
||||
[a guide for Rust](/2019/02/understanding-allocations-in-rust.html)), and look into alternative
|
||||
allocators ([jemalloc](http://jemalloc.net/),
|
||||
[tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html)) that might run faster than the
|
||||
operating system default.
|
||||
|
||||
**Data Layout**: How your data is arranged in memory matters;
|
||||
[data-oriented design](https://www.youtube.com/watch?v=yy8jQgmhbAU) and
|
||||
[cache locality](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185) can have huge
|
||||
impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have
|
||||
guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't
|
||||
make. [Cachegrind](http://valgrind.org/docs/manual/cg-manual.html) and kernel
|
||||
[perf](https://perf.wiki.kernel.org/index.php/Main_Page) counters are both great for understanding
|
||||
how performance relates to memory layout.
|
||||
|
||||
**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are
|
||||
great because they optimize your program for how it's actually being used, rather than how a
|
||||
compiler expects it to be used. However, there's a variance problem if the program stops executing
|
||||
while waiting for translation from VM bytecode to native code. As a remedy, many languages support
|
||||
ahead-of-time compilation in addition to the JIT versions
|
||||
([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java).
|
||||
On the other hand, LLVM supports
|
||||
[Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization),
|
||||
which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing
|
||||
apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT
|
||||
compiler kicked in.
|
||||
|
||||
**Programming Tricks**: These won't make or break performance, but can be useful in specific
|
||||
circumstances. For example, C++ can use
|
||||
[templates instead of branches](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206)
|
||||
in critical sections.
|
||||
|
||||
## Kernel
|
||||
|
||||
Code you wrote is almost certainly not the _only_ code running on your hardware. There are many ways
|
||||
the operating system interacts with your program, from interrupts to system calls, that are
|
||||
important to watch for. These are written from a Linux perspective, but Windows does typically have
|
||||
equivalent functionality.
|
||||
|
||||
**Scheduling**: The kernel is normally free to schedule any process on any core, so it's important
|
||||
to reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
|
||||
limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
|
||||
scheduling
|
||||
([`isolcpus`](https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html)
|
||||
kernel command-line option), or by setting the `init` process CPU affinity
|
||||
([`systemd` example](https://access.redhat.com/solutions/2884991)). Second, set critical processes
|
||||
to run on the isolated cores by setting the
|
||||
[processor affinity](https://en.wikipedia.org/wiki/Processor_affinity) using
|
||||
[taskset](https://linux.die.net/man/1/taskset). Finally, use
|
||||
[`NO_HZ`](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt) or
|
||||
[`chrt`](https://linux.die.net/man/1/chrt) to disable scheduling interrupts. Turning off
|
||||
hyper-threading is also likely beneficial.
|
||||
|
||||
**System calls**: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long
|
||||
the I/O operation takes, these all trigger expensive
|
||||
[system calls (syscalls)](https://en.wikipedia.org/wiki/System_call). To handle these, the CPU must
|
||||
[context switch](https://en.wikipedia.org/wiki/Context_switch) to the kernel, let the kernel
|
||||
operation complete, then context switch back to your program. We'd rather keep these
|
||||
[to a minimum](https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript) (see
|
||||
timestamp 18:20). [Strace](https://linux.die.net/man/1/strace) is your friend for understanding when
|
||||
and where syscalls happen.
|
||||
|
||||
**Signal Handling**: Far less likely to be an issue, but signals do trigger a context switch if your
|
||||
code has a handler registered. This will be highly dependent on the application, but you can
|
||||
[block signals](https://www.linuxprogrammingblog.com/all-about-linux-signals?page=show#Blocking_signals)
|
||||
if it's an issue.
|
||||
|
||||
**Interrupts**: System interrupts are how devices connected to your computer notify the CPU that
|
||||
something has happened. The CPU will then choose a processor core to pause and context switch to the
|
||||
OS to handle the interrupt. Make sure that
|
||||
[SMP affinity](http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux) is
|
||||
set so that interrupts are handled on a CPU core not running the program you care about.
|
||||
|
||||
**[NUMA](https://www.kernel.org/doc/html/latest/vm/numa.html)**: While NUMA is good at making
|
||||
multi-cell systems transparent, there are variance implications; if the kernel moves a process
|
||||
across nodes, future memory accesses must wait for the controller on the original node. Use
|
||||
[numactl](https://linux.die.net/man/8/numactl) to handle memory-/cpu-cell pinning so this doesn't
|
||||
happen.
|
||||
|
||||
## Hardware
|
||||
|
||||
**CPU Pipelining/Speculation**: Speculative execution in modern processors gave us vulnerabilities
|
||||
like Spectre, but it also gave us performance improvements like
|
||||
[branch prediction](https://stackoverflow.com/a/11227902/1454178). And if the CPU mis-speculates
|
||||
your code, there's variance associated with rewind and replay. While the compiler knows a lot about
|
||||
how your CPU [pipelines instructions](https://youtu.be/nAbCKa0FzjQ?t=4467), code can be
|
||||
[structured to help](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755) the branch
|
||||
predictor.
|
||||
|
||||
**Paging**: For most systems, virtual memory is incredible. Applications live in their own worlds,
|
||||
and the CPU/[MMU](https://en.wikipedia.org/wiki/Memory_management_unit) figures out the details.
|
||||
However, there's a variance penalty associated with memory paging and caching; if you access more
|
||||
memory pages than the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) can store,
|
||||
you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an
|
||||
issue, but using [huge pages](https://blog.pythian.com/performance-tuning-hugepages-in-linux/) can
|
||||
reduce TLB burdens. Alternately, running applications in a hypervisor like
|
||||
[Jailhouse](https://github.com/siemens/jailhouse) allows one to skip virtual memory entirely, but
|
||||
this is probably more work than the benefits are worth.
|
||||
|
||||
**Network Interfaces**: When more than one computer is involved, variance can go up dramatically.
|
||||
Tuning kernel
|
||||
[network parameters](https://github.com/leandromoreira/linux-network-performance-parameters) may be
|
||||
helpful, but modern systems more frequently opt to skip the kernel altogether with a technique
|
||||
called [kernel bypass](https://blog.cloudflare.com/kernel-bypass/). This typically requires
|
||||
specialized hardware and [drivers](https://www.openonload.org/), but even industries like
|
||||
[telecom](https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass) are
|
||||
finding the benefits.
|
||||
|
||||
## Networks
|
||||
|
||||
**Routing**: There's a reason financial firms are willing to pay
|
||||
[millions of euros](https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/)
|
||||
for rights to a small plot of land - having a straight-line connection from point A to point B means
|
||||
the path their data takes is the shortest possible. In contrast, there are currently 6 computers in
|
||||
between me and Google, but that may change at any moment if my ISP realizes a
|
||||
[more efficient route](https://en.wikipedia.org/wiki/Border_Gateway_Protocol) is available. Whether
|
||||
it's using
|
||||
[research-quality equipment](https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-trading-part-i-the-west-chicago-tower-mystery/)
|
||||
for shortwave radio, or just making sure there's no data inadvertently going between data centers,
|
||||
routing matters.
|
||||
|
||||
**Protocol**: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control,
|
||||
and congestion control all built in. But these attributes make the most sense when networking
|
||||
infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the
|
||||
setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may
|
||||
make sense in these contexts as it avoids the chatter needed to track connection state, and
|
||||
[gap-fill](https://iextrading.com/docs/IEX%20Transport%20Specification.pdf)
|
||||
[strategies](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf)
|
||||
can handle the rest.
|
||||
|
||||
**Switching**: Many routers/switches handle packets using "store-and-forward" behavior: wait for the
|
||||
whole packet, validate checksums, and then send to the next device. In variance terms, the time
|
||||
needed to move data between two nodes is proportional to the size of that data; the switch must
|
||||
"store" all data before it can calculate checksums and "forward" to the next node. With
|
||||
["cut-through"](https://www.networkworld.com/article/2241573/latency-and-jitter--cut-through-design-pays-off-for-arista--blade.html)
|
||||
designs, switches will begin forwarding data as soon as they know where the destination is,
|
||||
checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter
|
||||
the size.
|
||||
|
||||
## Final Thoughts
|
||||
|
||||
High-performance systems, regardless of industry, are not magical. They do require extreme precision
|
||||
and attention to detail, but they're designed, built, and operated by regular people, using a lot of
|
||||
tools that are publicly available. Interested in seeing how context switching affects performance of
|
||||
your benchmarks? `taskset` should be installed in all modern Linux distributions, and can be used to
|
||||
make sure the OS never migrates your process. Curious how often garbage collection triggers during a
|
||||
crucial operation? Your language of choice will typically expose details of its operations
|
||||
([Python](https://docs.python.org/3/library/gc.html),
|
||||
[Java](https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#DebuggingOptions)).
|
||||
Want to know how hard your program is stressing the TLB? Use `perf record` and look for
|
||||
`dtlb_load_misses.miss_causes_a_walk`.
|
||||
|
||||
Two final guiding questions, then: first, before attempting to apply some of the technology above to
|
||||
your own systems, can you first identify
|
||||
[where/when you care](http://wiki.c2.com/?PrematureOptimization) about "high-performance"? As an
|
||||
example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable
|
||||
effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are
|
||||
they being designed in a way that's actually helpful? Tools like
|
||||
[Criterion](http://www.serpentine.com/criterion/) (also in
|
||||
[Rust](https://github.com/bheisler/criterion.rs)) and Google's
|
||||
[Benchmark](https://github.com/google/benchmark) output not only average run time, but variance as
|
||||
well; your benchmarking environment is subject to the same concerns your production environment is.
|
||||
|
||||
Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique.
|
||||
Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate
|
||||
it; once that's at an acceptable level, then optimize for speed.
|
BIN
blog/2019-06-31-high-performance-systems/kung-fu.webp
Normal file
BIN
blog/2019-06-31-high-performance-systems/kung-fu.webp
Normal file
Binary file not shown.
After Width: | Height: | Size: 304 KiB |
Loading…
Reference in New Issue
Block a user