diff --git a/blog/2019-06-31-high-performance-systems/_article.md b/blog/2019-06-31-high-performance-systems/_article.md new file mode 100644 index 0000000..23ef44b --- /dev/null +++ b/blog/2019-06-31-high-performance-systems/_article.md @@ -0,0 +1,296 @@ +--- +layout: post +title: "On Building High Performance Systems" +description: "" +category: +tags: [] +--- + +**Update 2019-09-21**: Added notes on `isolcpus` and `systemd` affinity. + +Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is +made up of people who have access to secret techniques mortal developers could only dream of. There +had to be some secret art that could only be learned if one had an appropriately tragic backstory: + +kung-fu fight +> How I assumed HFT people learn their secret techniques + +How else do you explain people working on systems that complete the round trip of market data in to +orders out (a.k.a. tick-to-trade) consistently within +[750-800 nanoseconds](https://stackoverflow.com/a/22082528/1454178)? In roughly the time it takes a +computer to access +[main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html), +trading systems are capable of reading the market data packets, deciding what orders to send, doing +risk checks, creating new packets for exchange-specific protocols, and putting those packets on the +wire. + +Having now worked in the trading industry, I can confirm the developers aren't super-human; I've +made some simple mistakes at the very least. Instead, what shows up in public discussions is that +philosophy, not technique, separates high-performance systems from everything else. +Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast +(though micro-optimizations have their place); there's a lot more to worry about than just the code +written for the project. + +The framework I'd propose is this: **If you want to build high-performance systems, focus first on +reducing performance variance** (reducing the gap between the fastest and slowest runs of the same +code), **and only look at average latency once variance is at an acceptable level**. + +Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20 +seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes +a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a +week is the same in each situation, but you're painfully aware of that minute when it happens. When +it comes to code, the principal is the same: speeding up a function by an average of 10 milliseconds +doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When +performance matters, you need to respond quickly _every time_, not just in aggregate. +High-performance systems should first optimize for time variance. Once you're consistent at the time +scale you care about, then focus on improving average time. + +This focus on variance shows up all the time in industry too (emphasis added in all quotes below): + +- In [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for + NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability + is highlighted in addition to instantaneous metrics: + + > Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40 + > microsecond average latency + +- The [Aeron](https://github.com/real-logic/aeron) message bus has this to say about performance: + + > Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and + > **most predictable latency possible** of any messaging system + +- The company PolySync, which is working on autonomous vehicles, + [mentions why](https://polysync.io/blog/session-types-for-hearty-codecs/) they picked their + specific messaging format: + + > In general, high performance is almost always desirable for serialization. But in the world of + > autonomous vehicles, **steady timing performance is even more important** than peak throughput. + > This is because safe operation is sensitive to timing outliers. Nobody wants the system that + > decides when to slam on the brakes to occasionally take 100 times longer than usual to encode + > its commands. + +- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out + variance (jitter) as a big concern for + [electronic trading](https://solarflare.com/electronic-trading/): + > The high stakes world of electronic trading, investment banks, market makers, hedge funds and + > exchanges demand the **lowest possible latency and jitter** while utilizing the highest + > bandwidth and return on their investment. + +And to further clarify: we're not discussing _total run-time_, but variance of total run-time. There +are situations where it's not reasonably possible to make things faster, and you'd much rather be +consistent. For example, trading firms use +[wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because +the speed of light through air is faster than through fiber-optic cables. There's still at _absolute +minimum_ a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between, +say, +[Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago). +If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a +trade occurs, there's a physical limit to how long that will take. In this situation, the focus is +on keeping variance of _additional processing_ to a minimum, since speed of light is the limiting +factor. + +So how does one go about looking for and eliminating performance variance? To tell the truth, I +don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep +understanding of the entire technology stack, and (B) actually measuring system performance (though +(C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos for +inspiration never hurt). Even then, every project cares about performance to a different degree; you +may need to build an entire +[replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to +accurately benchmark at nanosecond precision, or you may be content to simply +[avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in +your Java code. + +Even though everyone has different needs, there are still common things to look for when trying to +isolate and eliminate variance. In no particular order, these are my focus areas when thinking about +high-performance systems: + +## Language-specific + +**Garbage Collection**: How often does garbage collection happen? When is it triggered? What are the +impacts? + +- [In Python](https://rushter.com/blog/python-garbage-collector/), individual objects are collected + if the reference count reaches 0, and each generation is collected if + `num_alloc - num_dealloc > gc_threshold` whenever an allocation happens. The GIL is acquired for + the duration of generational collection. +- Java has + [many](https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE) + [different](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573) + [collection](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A) + [algorithms](https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0) + to choose from, each with different characteristics. The default algorithms (Parallel GC in Java + 8, G1 in Java 9) freeze the JVM while collecting, while more recent algorithms + ([ZGC](https://wiki.openjdk.java.net/display/zgc) and + [Shenandoah](https://wiki.openjdk.java.net/display/shenandoah)) are designed to keep "stop the + world" to a minimum by doing collection work in parallel. + +**Allocation**: Every language has a different way of interacting with "heap" memory, but the +principle is the same: running the allocator to allocate/deallocate memory takes time that can often +be put to better use. Understanding when your language interacts with the allocator is crucial, and +not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java +does (meaning potential GC pauses). Take time to understand heap behavior (I made a +[a guide for Rust](/2019/02/understanding-allocations-in-rust.html)), and look into alternative +allocators ([jemalloc](http://jemalloc.net/), +[tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html)) that might run faster than the +operating system default. + +**Data Layout**: How your data is arranged in memory matters; +[data-oriented design](https://www.youtube.com/watch?v=yy8jQgmhbAU) and +[cache locality](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185) can have huge +impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have +guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't +make. [Cachegrind](http://valgrind.org/docs/manual/cg-manual.html) and kernel +[perf](https://perf.wiki.kernel.org/index.php/Main_Page) counters are both great for understanding +how performance relates to memory layout. + +**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are +great because they optimize your program for how it's actually being used, rather than how a +compiler expects it to be used. However, there's a variance problem if the program stops executing +while waiting for translation from VM bytecode to native code. As a remedy, many languages support +ahead-of-time compilation in addition to the JIT versions +([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java). +On the other hand, LLVM supports +[Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization), +which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing +apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT +compiler kicked in. + +**Programming Tricks**: These won't make or break performance, but can be useful in specific +circumstances. For example, C++ can use +[templates instead of branches](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206) +in critical sections. + +## Kernel + +Code you wrote is almost certainly not the _only_ code running on your hardware. There are many ways +the operating system interacts with your program, from interrupts to system calls, that are +important to watch for. These are written from a Linux perspective, but Windows does typically have +equivalent functionality. + +**Scheduling**: The kernel is normally free to schedule any process on any core, so it's important +to reserve CPU cores exclusively for the important programs. There are a few parts to this: first, +limit the CPU cores that non-critical processes are allowed to run on by excluding cores from +scheduling +([`isolcpus`](https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html) +kernel command-line option), or by setting the `init` process CPU affinity +([`systemd` example](https://access.redhat.com/solutions/2884991)). Second, set critical processes +to run on the isolated cores by setting the +[processor affinity](https://en.wikipedia.org/wiki/Processor_affinity) using +[taskset](https://linux.die.net/man/1/taskset). Finally, use +[`NO_HZ`](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt) or +[`chrt`](https://linux.die.net/man/1/chrt) to disable scheduling interrupts. Turning off +hyper-threading is also likely beneficial. + +**System calls**: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long +the I/O operation takes, these all trigger expensive +[system calls (syscalls)](https://en.wikipedia.org/wiki/System_call). To handle these, the CPU must +[context switch](https://en.wikipedia.org/wiki/Context_switch) to the kernel, let the kernel +operation complete, then context switch back to your program. We'd rather keep these +[to a minimum](https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript) (see +timestamp 18:20). [Strace](https://linux.die.net/man/1/strace) is your friend for understanding when +and where syscalls happen. + +**Signal Handling**: Far less likely to be an issue, but signals do trigger a context switch if your +code has a handler registered. This will be highly dependent on the application, but you can +[block signals](https://www.linuxprogrammingblog.com/all-about-linux-signals?page=show#Blocking_signals) +if it's an issue. + +**Interrupts**: System interrupts are how devices connected to your computer notify the CPU that +something has happened. The CPU will then choose a processor core to pause and context switch to the +OS to handle the interrupt. Make sure that +[SMP affinity](http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux) is +set so that interrupts are handled on a CPU core not running the program you care about. + +**[NUMA](https://www.kernel.org/doc/html/latest/vm/numa.html)**: While NUMA is good at making +multi-cell systems transparent, there are variance implications; if the kernel moves a process +across nodes, future memory accesses must wait for the controller on the original node. Use +[numactl](https://linux.die.net/man/8/numactl) to handle memory-/cpu-cell pinning so this doesn't +happen. + +## Hardware + +**CPU Pipelining/Speculation**: Speculative execution in modern processors gave us vulnerabilities +like Spectre, but it also gave us performance improvements like +[branch prediction](https://stackoverflow.com/a/11227902/1454178). And if the CPU mis-speculates +your code, there's variance associated with rewind and replay. While the compiler knows a lot about +how your CPU [pipelines instructions](https://youtu.be/nAbCKa0FzjQ?t=4467), code can be +[structured to help](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755) the branch +predictor. + +**Paging**: For most systems, virtual memory is incredible. Applications live in their own worlds, +and the CPU/[MMU](https://en.wikipedia.org/wiki/Memory_management_unit) figures out the details. +However, there's a variance penalty associated with memory paging and caching; if you access more +memory pages than the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) can store, +you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an +issue, but using [huge pages](https://blog.pythian.com/performance-tuning-hugepages-in-linux/) can +reduce TLB burdens. Alternately, running applications in a hypervisor like +[Jailhouse](https://github.com/siemens/jailhouse) allows one to skip virtual memory entirely, but +this is probably more work than the benefits are worth. + +**Network Interfaces**: When more than one computer is involved, variance can go up dramatically. +Tuning kernel +[network parameters](https://github.com/leandromoreira/linux-network-performance-parameters) may be +helpful, but modern systems more frequently opt to skip the kernel altogether with a technique +called [kernel bypass](https://blog.cloudflare.com/kernel-bypass/). This typically requires +specialized hardware and [drivers](https://www.openonload.org/), but even industries like +[telecom](https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass) are +finding the benefits. + +## Networks + +**Routing**: There's a reason financial firms are willing to pay +[millions of euros](https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/) +for rights to a small plot of land - having a straight-line connection from point A to point B means +the path their data takes is the shortest possible. In contrast, there are currently 6 computers in +between me and Google, but that may change at any moment if my ISP realizes a +[more efficient route](https://en.wikipedia.org/wiki/Border_Gateway_Protocol) is available. Whether +it's using +[research-quality equipment](https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-trading-part-i-the-west-chicago-tower-mystery/) +for shortwave radio, or just making sure there's no data inadvertently going between data centers, +routing matters. + +**Protocol**: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control, +and congestion control all built in. But these attributes make the most sense when networking +infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the +setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may +make sense in these contexts as it avoids the chatter needed to track connection state, and +[gap-fill](https://iextrading.com/docs/IEX%20Transport%20Specification.pdf) +[strategies](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf) +can handle the rest. + +**Switching**: Many routers/switches handle packets using "store-and-forward" behavior: wait for the +whole packet, validate checksums, and then send to the next device. In variance terms, the time +needed to move data between two nodes is proportional to the size of that data; the switch must +"store" all data before it can calculate checksums and "forward" to the next node. With +["cut-through"](https://www.networkworld.com/article/2241573/latency-and-jitter--cut-through-design-pays-off-for-arista--blade.html) +designs, switches will begin forwarding data as soon as they know where the destination is, +checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter +the size. + +# Final Thoughts + +High-performance systems, regardless of industry, are not magical. They do require extreme precision +and attention to detail, but they're designed, built, and operated by regular people, using a lot of +tools that are publicly available. Interested in seeing how context switching affects performance of +your benchmarks? `taskset` should be installed in all modern Linux distributions, and can be used to +make sure the OS never migrates your process. Curious how often garbage collection triggers during a +crucial operation? Your language of choice will typically expose details of its operations +([Python](https://docs.python.org/3/library/gc.html), +[Java](https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#DebuggingOptions)). +Want to know how hard your program is stressing the TLB? Use `perf record` and look for +`dtlb_load_misses.miss_causes_a_walk`. + +Two final guiding questions, then: first, before attempting to apply some of the technology above to +your own systems, can you first identify +[where/when you care](http://wiki.c2.com/?PrematureOptimization) about "high-performance"? As an +example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable +effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are +they being designed in a way that's actually helpful? Tools like +[Criterion](http://www.serpentine.com/criterion/) (also in +[Rust](https://github.com/bheisler/criterion.rs)) and Google's +[Benchmark](https://github.com/google/benchmark) output not only average run time, but variance as +well; your benchmarking environment is subject to the same concerns your production environment is. + +Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique. +Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate +it; once that's at an acceptable level, then optimize for speed. diff --git a/blog/2019-06-31-high-performance-systems/index.mdx b/blog/2019-06-31-high-performance-systems/index.mdx new file mode 100644 index 0000000..600d33c --- /dev/null +++ b/blog/2019-06-31-high-performance-systems/index.mdx @@ -0,0 +1,302 @@ +--- +slug: 2019/06/high-performance-systems +title: "On Building High Performance Systems" +date: 2019-06-31 12:00:00 +last_updated: + date: 2019-09-21 12:00:00 +authors: [bspeice] +tags: [] +--- + + +Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is +made up of people who have access to secret techniques mortal developers could only dream of. There +had to be some secret art that could only be learned if one had an appropriately tragic backstory. + + + +![Kung Fu fight](./kung-fu.webp) + +> How I assumed HFT people learn their secret techniques + +How else do you explain people working on systems that complete the round trip of market data in to +orders out (a.k.a. tick-to-trade) consistently within +[750-800 nanoseconds](https://stackoverflow.com/a/22082528/1454178)? In roughly the time it takes a +computer to access +[main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html), +trading systems are capable of reading the market data packets, deciding what orders to send, doing +risk checks, creating new packets for exchange-specific protocols, and putting those packets on the +wire. + +Having now worked in the trading industry, I can confirm the developers aren't super-human; I've +made some simple mistakes at the very least. Instead, what shows up in public discussions is that +philosophy, not technique, separates high-performance systems from everything else. +Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast +(though micro-optimizations have their place); there's a lot more to worry about than just the code +written for the project. + +The framework I'd propose is this: **If you want to build high-performance systems, focus first on +reducing performance variance** (reducing the gap between the fastest and slowest runs of the same +code), **and only look at average latency once variance is at an acceptable level**. + +Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20 +seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes +a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a +week is the same in each situation, but you're painfully aware of that minute when it happens. When +it comes to code, the principal is the same: speeding up a function by an average of 10 milliseconds +doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When +performance matters, you need to respond quickly _every time_, not just in aggregate. +High-performance systems should first optimize for time variance. Once you're consistent at the time +scale you care about, then focus on improving average time. + +This focus on variance shows up all the time in industry too (emphasis added in all quotes below): + +- In [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for + NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability + is highlighted in addition to instantaneous metrics: + + > Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40 + > microsecond average latency + +- The [Aeron](https://github.com/real-logic/aeron) message bus has this to say about performance: + + > Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and + > **most predictable latency possible** of any messaging system + +- The company PolySync, which is working on autonomous vehicles, + [mentions why](https://polysync.io/blog/session-types-for-hearty-codecs/) they picked their + specific messaging format: + + > In general, high performance is almost always desirable for serialization. But in the world of + > autonomous vehicles, **steady timing performance is even more important** than peak throughput. + > This is because safe operation is sensitive to timing outliers. Nobody wants the system that + > decides when to slam on the brakes to occasionally take 100 times longer than usual to encode + > its commands. + +- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out + variance (jitter) as a big concern for + [electronic trading](https://solarflare.com/electronic-trading/): + > The high stakes world of electronic trading, investment banks, market makers, hedge funds and + > exchanges demand the **lowest possible latency and jitter** while utilizing the highest + > bandwidth and return on their investment. + +And to further clarify: we're not discussing _total run-time_, but variance of total run-time. There +are situations where it's not reasonably possible to make things faster, and you'd much rather be +consistent. For example, trading firms use +[wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because +the speed of light through air is faster than through fiber-optic cables. There's still at _absolute +minimum_ a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between, +say, +[Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago). +If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a +trade occurs, there's a physical limit to how long that will take. In this situation, the focus is +on keeping variance of _additional processing_ to a minimum, since speed of light is the limiting +factor. + +So how does one go about looking for and eliminating performance variance? To tell the truth, I +don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep +understanding of the entire technology stack, and (B) actually measuring system performance (though +(C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos for +inspiration never hurt). Even then, every project cares about performance to a different degree; you +may need to build an entire +[replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to +accurately benchmark at nanosecond precision, or you may be content to simply +[avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in +your Java code. + +Even though everyone has different needs, there are still common things to look for when trying to +isolate and eliminate variance. In no particular order, these are my focus areas when thinking about +high-performance systems: + +**Update 2019-09-21**: Added notes on `isolcpus` and `systemd` affinity. + +## Language-specific + +**Garbage Collection**: How often does garbage collection happen? When is it triggered? What are the +impacts? + +- [In Python](https://rushter.com/blog/python-garbage-collector/), individual objects are collected + if the reference count reaches 0, and each generation is collected if + `num_alloc - num_dealloc > gc_threshold` whenever an allocation happens. The GIL is acquired for + the duration of generational collection. +- Java has + [many](https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE) + [different](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573) + [collection](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A) + [algorithms](https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0) + to choose from, each with different characteristics. The default algorithms (Parallel GC in Java + 8, G1 in Java 9) freeze the JVM while collecting, while more recent algorithms + ([ZGC](https://wiki.openjdk.java.net/display/zgc) and + [Shenandoah](https://wiki.openjdk.java.net/display/shenandoah)) are designed to keep "stop the + world" to a minimum by doing collection work in parallel. + +**Allocation**: Every language has a different way of interacting with "heap" memory, but the +principle is the same: running the allocator to allocate/deallocate memory takes time that can often +be put to better use. Understanding when your language interacts with the allocator is crucial, and +not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java +does (meaning potential GC pauses). Take time to understand heap behavior (I made a +[a guide for Rust](/2019/02/understanding-allocations-in-rust.html)), and look into alternative +allocators ([jemalloc](http://jemalloc.net/), +[tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html)) that might run faster than the +operating system default. + +**Data Layout**: How your data is arranged in memory matters; +[data-oriented design](https://www.youtube.com/watch?v=yy8jQgmhbAU) and +[cache locality](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185) can have huge +impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have +guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't +make. [Cachegrind](http://valgrind.org/docs/manual/cg-manual.html) and kernel +[perf](https://perf.wiki.kernel.org/index.php/Main_Page) counters are both great for understanding +how performance relates to memory layout. + +**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are +great because they optimize your program for how it's actually being used, rather than how a +compiler expects it to be used. However, there's a variance problem if the program stops executing +while waiting for translation from VM bytecode to native code. As a remedy, many languages support +ahead-of-time compilation in addition to the JIT versions +([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java). +On the other hand, LLVM supports +[Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization), +which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing +apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT +compiler kicked in. + +**Programming Tricks**: These won't make or break performance, but can be useful in specific +circumstances. For example, C++ can use +[templates instead of branches](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206) +in critical sections. + +## Kernel + +Code you wrote is almost certainly not the _only_ code running on your hardware. There are many ways +the operating system interacts with your program, from interrupts to system calls, that are +important to watch for. These are written from a Linux perspective, but Windows does typically have +equivalent functionality. + +**Scheduling**: The kernel is normally free to schedule any process on any core, so it's important +to reserve CPU cores exclusively for the important programs. There are a few parts to this: first, +limit the CPU cores that non-critical processes are allowed to run on by excluding cores from +scheduling +([`isolcpus`](https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html) +kernel command-line option), or by setting the `init` process CPU affinity +([`systemd` example](https://access.redhat.com/solutions/2884991)). Second, set critical processes +to run on the isolated cores by setting the +[processor affinity](https://en.wikipedia.org/wiki/Processor_affinity) using +[taskset](https://linux.die.net/man/1/taskset). Finally, use +[`NO_HZ`](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt) or +[`chrt`](https://linux.die.net/man/1/chrt) to disable scheduling interrupts. Turning off +hyper-threading is also likely beneficial. + +**System calls**: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long +the I/O operation takes, these all trigger expensive +[system calls (syscalls)](https://en.wikipedia.org/wiki/System_call). To handle these, the CPU must +[context switch](https://en.wikipedia.org/wiki/Context_switch) to the kernel, let the kernel +operation complete, then context switch back to your program. We'd rather keep these +[to a minimum](https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript) (see +timestamp 18:20). [Strace](https://linux.die.net/man/1/strace) is your friend for understanding when +and where syscalls happen. + +**Signal Handling**: Far less likely to be an issue, but signals do trigger a context switch if your +code has a handler registered. This will be highly dependent on the application, but you can +[block signals](https://www.linuxprogrammingblog.com/all-about-linux-signals?page=show#Blocking_signals) +if it's an issue. + +**Interrupts**: System interrupts are how devices connected to your computer notify the CPU that +something has happened. The CPU will then choose a processor core to pause and context switch to the +OS to handle the interrupt. Make sure that +[SMP affinity](http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux) is +set so that interrupts are handled on a CPU core not running the program you care about. + +**[NUMA](https://www.kernel.org/doc/html/latest/vm/numa.html)**: While NUMA is good at making +multi-cell systems transparent, there are variance implications; if the kernel moves a process +across nodes, future memory accesses must wait for the controller on the original node. Use +[numactl](https://linux.die.net/man/8/numactl) to handle memory-/cpu-cell pinning so this doesn't +happen. + +## Hardware + +**CPU Pipelining/Speculation**: Speculative execution in modern processors gave us vulnerabilities +like Spectre, but it also gave us performance improvements like +[branch prediction](https://stackoverflow.com/a/11227902/1454178). And if the CPU mis-speculates +your code, there's variance associated with rewind and replay. While the compiler knows a lot about +how your CPU [pipelines instructions](https://youtu.be/nAbCKa0FzjQ?t=4467), code can be +[structured to help](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755) the branch +predictor. + +**Paging**: For most systems, virtual memory is incredible. Applications live in their own worlds, +and the CPU/[MMU](https://en.wikipedia.org/wiki/Memory_management_unit) figures out the details. +However, there's a variance penalty associated with memory paging and caching; if you access more +memory pages than the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) can store, +you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an +issue, but using [huge pages](https://blog.pythian.com/performance-tuning-hugepages-in-linux/) can +reduce TLB burdens. Alternately, running applications in a hypervisor like +[Jailhouse](https://github.com/siemens/jailhouse) allows one to skip virtual memory entirely, but +this is probably more work than the benefits are worth. + +**Network Interfaces**: When more than one computer is involved, variance can go up dramatically. +Tuning kernel +[network parameters](https://github.com/leandromoreira/linux-network-performance-parameters) may be +helpful, but modern systems more frequently opt to skip the kernel altogether with a technique +called [kernel bypass](https://blog.cloudflare.com/kernel-bypass/). This typically requires +specialized hardware and [drivers](https://www.openonload.org/), but even industries like +[telecom](https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass) are +finding the benefits. + +## Networks + +**Routing**: There's a reason financial firms are willing to pay +[millions of euros](https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/) +for rights to a small plot of land - having a straight-line connection from point A to point B means +the path their data takes is the shortest possible. In contrast, there are currently 6 computers in +between me and Google, but that may change at any moment if my ISP realizes a +[more efficient route](https://en.wikipedia.org/wiki/Border_Gateway_Protocol) is available. Whether +it's using +[research-quality equipment](https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-trading-part-i-the-west-chicago-tower-mystery/) +for shortwave radio, or just making sure there's no data inadvertently going between data centers, +routing matters. + +**Protocol**: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control, +and congestion control all built in. But these attributes make the most sense when networking +infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the +setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may +make sense in these contexts as it avoids the chatter needed to track connection state, and +[gap-fill](https://iextrading.com/docs/IEX%20Transport%20Specification.pdf) +[strategies](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf) +can handle the rest. + +**Switching**: Many routers/switches handle packets using "store-and-forward" behavior: wait for the +whole packet, validate checksums, and then send to the next device. In variance terms, the time +needed to move data between two nodes is proportional to the size of that data; the switch must +"store" all data before it can calculate checksums and "forward" to the next node. With +["cut-through"](https://www.networkworld.com/article/2241573/latency-and-jitter--cut-through-design-pays-off-for-arista--blade.html) +designs, switches will begin forwarding data as soon as they know where the destination is, +checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter +the size. + +## Final Thoughts + +High-performance systems, regardless of industry, are not magical. They do require extreme precision +and attention to detail, but they're designed, built, and operated by regular people, using a lot of +tools that are publicly available. Interested in seeing how context switching affects performance of +your benchmarks? `taskset` should be installed in all modern Linux distributions, and can be used to +make sure the OS never migrates your process. Curious how often garbage collection triggers during a +crucial operation? Your language of choice will typically expose details of its operations +([Python](https://docs.python.org/3/library/gc.html), +[Java](https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#DebuggingOptions)). +Want to know how hard your program is stressing the TLB? Use `perf record` and look for +`dtlb_load_misses.miss_causes_a_walk`. + +Two final guiding questions, then: first, before attempting to apply some of the technology above to +your own systems, can you first identify +[where/when you care](http://wiki.c2.com/?PrematureOptimization) about "high-performance"? As an +example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable +effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are +they being designed in a way that's actually helpful? Tools like +[Criterion](http://www.serpentine.com/criterion/) (also in +[Rust](https://github.com/bheisler/criterion.rs)) and Google's +[Benchmark](https://github.com/google/benchmark) output not only average run time, but variance as +well; your benchmarking environment is subject to the same concerns your production environment is. + +Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique. +Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate +it; once that's at an acceptable level, then optimize for speed. diff --git a/blog/2019-06-31-high-performance-systems/kung-fu.webp b/blog/2019-06-31-high-performance-systems/kung-fu.webp new file mode 100644 index 0000000..cd9dfb0 Binary files /dev/null and b/blog/2019-06-31-high-performance-systems/kung-fu.webp differ