2019-06-31-high-performance-systems

2026-07-21 19:03:00 -04:00 · 2024-11-09 21:54:39 -05:00
parent d9baae0a5e
commit 8501da2f81
3 changed files with 598 additions and 0 deletions
@@ -0,0 +1,296 @@
+---
+layout: post
+title: "On Building High Performance Systems"
+description: ""
+category:
+tags: []
+---
+
+**Update 2019-09-21**: Added notes on `isolcpus` and `systemd` affinity.
+
+Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is
+made up of people who have access to secret techniques mortal developers could only dream of. There
+had to be some secret art that could only be learned if one had an appropriately tragic backstory:
+
+<img src="/assets/images/2019-04-24-kung-fu.webp" alt="kung-fu fight">
+> How I assumed HFT people learn their secret techniques
+
+How else do you explain people working on systems that complete the round trip of market data in to
+orders out (a.k.a. tick-to-trade) consistently within
+[750-800 nanoseconds](https://stackoverflow.com/a/22082528/1454178)? In roughly the time it takes a
+computer to access
+[main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html),
+trading systems are capable of reading the market data packets, deciding what orders to send, doing
+risk checks, creating new packets for exchange-specific protocols, and putting those packets on the
+wire.
+
+Having now worked in the trading industry, I can confirm the developers aren't super-human; I've
+made some simple mistakes at the very least. Instead, what shows up in public discussions is that
+philosophy, not technique, separates high-performance systems from everything else.
+Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast
+(though micro-optimizations have their place); there's a lot more to worry about than just the code
+written for the project.
+
+The framework I'd propose is this: **If you want to build high-performance systems, focus first on
+reducing performance variance** (reducing the gap between the fastest and slowest runs of the same
+code), **and only look at average latency once variance is at an acceptable level**.
+
+Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20
+seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes
+a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a
+week is the same in each situation, but you're painfully aware of that minute when it happens. When
+it comes to code, the principle is the same: speeding up a function by an average of 10 milliseconds
+doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When
+performance matters, you need to respond quickly _every time_, not just in aggregate.
+High-performance systems should first optimize for time variance. Once you're consistent at the time
+scale you care about, then focus on improving average time.
+
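The boot-time comparison above can be checked with a few lines of arithmetic. This is a minimal sketch using made-up samples that mirror the numbers in the paragraph (20 seconds every day versus 10 seconds with a 60-second boot every fifth day); the `summarize` helper is hypothetical and only illustrates that two systems can share an average while differing wildly in spread.

```python
import statistics

# Hypothetical boot times in seconds over ten days (illustrative numbers only).
steady = [20.0] * 10                          # old drive: always 20 s
flaky = [10.0, 10.0, 10.0, 10.0, 60.0] * 2    # SSD: 10 s, but 60 s every fifth day


def summarize(name, samples):
    """Print and return (mean, spread, population standard deviation)."""
    mean = statistics.mean(samples)
    spread = max(samples) - min(samples)      # gap between fastest and slowest run
    stdev = statistics.pstdev(samples)
    print(f"{name}: mean={mean:.1f}s worst={max(samples):.1f}s "
          f"spread={spread:.1f}s stdev={stdev:.1f}s")
    return mean, spread, stdev


steady_stats = summarize("steady", steady)
flaky_stats = summarize("flaky", flaky)
```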
+This focus on variance shows up all the time in industry too (emphasis added in all quotes below):
+
+- In [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for
+  NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability
+  is highlighted in addition to instantaneous metrics:
+
+  > Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40
+  > microsecond average latency
+
+- The [Aeron](https://github.com/real-logic/aeron) message bus has this to say about performance:
+
+  > Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and
+  > **most predictable latency possible** of any messaging system
+
+- The company PolySync, which is working on autonomous vehicles,
+  [mentions why](https://polysync.io/blog/session-types-for-hearty-codecs/) they picked their
+  specific messaging format:
+
+  > In general, high performance is almost always desirable for serialization. But in the world of
+  > autonomous vehicles, **steady timing performance is even more important** than peak throughput.
+  > This is because safe operation is sensitive to timing outliers. Nobody wants the system that
+  > decides when to slam on the brakes to occasionally take 100 times longer than usual to encode
+  > its commands.
+
+- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out
+  variance (jitter) as a big concern for
+  [electronic trading](https://solarflare.com/electronic-trading/):
+  > The high stakes world of electronic trading, investment banks, market makers, hedge funds and
+  > exchanges demand the **lowest possible latency and jitter** while utilizing the highest
+  > bandwidth and return on their investment.
+
+And to further clarify: we're not discussing _total run-time_, but variance of total run-time. There
+are situations where it's not reasonably possible to make things faster, and you'd much rather be
+consistent. For example, trading firms use
+[wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because
+the speed of light through air is faster than through fiber-optic cables. There's still at _absolute
+minimum_ a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between,
+say,
+[Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago).
+If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a
+trade occurs, there's a physical limit to how long that will take. In this situation, the focus is
+on keeping variance of _additional processing_ to a minimum, since speed of light is the limiting
+factor.
+
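For a rough sense of where a lower bound like this comes from, the sketch below divides a great-circle distance by the speed of light. The city-center coordinates, the spherical-Earth radius, and the fiber refractive index of 1.47 are all assumptions chosen for illustration, so the result lands near the quoted figure rather than matching it exactly.

```python
import math

C_VACUUM_KM_S = 299_792.458   # speed of light in vacuum
EARTH_RADIUS_KM = 6371.0      # mean Earth radius (spherical approximation)
FIBER_INDEX = 1.47            # assumed refractive index of typical optical fiber


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-center coordinates (assumed; not actual data center sites).
CHICAGO = (41.88, -87.63)
TOKYO = (35.68, 139.65)

distance_km = great_circle_km(*CHICAGO, *TOKYO)
one_way_ms = distance_km / C_VACUUM_KM_S * 1000   # best case: straight line, vacuum
fiber_ms = one_way_ms * FIBER_INDEX               # same path through glass

print(f"distance: {distance_km:.0f} km")
print(f"one-way lower bound: {one_way_ms:.2f} ms (fiber: {fiber_ms:.2f} ms)")
```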
+So how does one go about looking for and eliminating performance variance? To tell the truth, I
+don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep
+understanding of the entire technology stack, and (B) actually measuring system performance (though
+(C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos for
+inspiration never hurt). Even then, every project cares about performance to a different degree; you
+may need to build an entire
+[replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to
+accurately benchmark at nanosecond precision, or you may be content to simply
+[avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in
+your Java code.
+
+Even though everyone has different needs, there are still common things to look for when trying to
+isolate and eliminate variance. In no particular order, these are my focus areas when thinking about
+high-performance systems:
+
+## Language-specific
+
+**Garbage Collection**: How often does garbage collection happen? When is it triggered? What are the
+impacts?
+
+- [In Python](https://rushter.com/blog/python-garbage-collector/), individual objects are collected
+  if the reference count reaches 0, and each generation is collected if
+  `num_alloc - num_dealloc > gc_threshold` whenever an allocation happens. The GIL is acquired for
+  the duration of generational collection.
+- Java has
+  [many](https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE)
+  [different](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573)
+  [collection](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A)
+  [algorithms](https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0)
+  to choose from, each with different characteristics. The default algorithms (Parallel GC in Java
+  8, G1 since Java 9) pause application threads while collecting, while more recent algorithms
+  ([ZGC](https://wiki.openjdk.java.net/display/zgc) and
+  [Shenandoah](https://wiki.openjdk.java.net/display/shenandoah)) are designed to keep "stop the
+  world" pauses to a minimum by doing most collection work concurrently with the application.
+
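As one illustration of controlling when collection happens, Python's standard `gc` module lets a program inspect the generation thresholds and pause the generational collector around a sensitive region. This is a sketch, not a recommendation: `critical_section` is a hypothetical stand-in for real work, and reference counting still frees objects while the collector is disabled.

```python
import gc


def critical_section():
    # Stand-in for latency-sensitive work (hypothetical).
    return sum(i * i for i in range(1000))


# Allocation thresholds that trigger collection of each generation.
print("thresholds:", gc.get_threshold())

gc.collect()                   # pay the collection cost now, at a time we choose
was_enabled = gc.isenabled()
gc.disable()                   # pauses generational (cyclic) collection only
try:
    result = critical_section()
finally:
    if was_enabled:
        gc.enable()            # restore the previous collector state

print("result:", result)
```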
+**Allocation**: Every language has a different way of interacting with "heap" memory, but the
+principle is the same: running the allocator to allocate/deallocate memory takes time that can often
+be put to better use. Understanding when your language interacts with the allocator is crucial, and
+not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java
+does (meaning potential GC pauses). Take time to understand heap behavior (I wrote
+[a guide for Rust](/2019/02/understanding-allocations-in-rust.html)), and look into alternative
+allocators ([jemalloc](http://jemalloc.net/),
+[tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html)) that might run faster than the
+operating system default.
+
+**Data Layout**: How your data is arranged in memory matters;
+[data-oriented design](https://www.youtube.com/watch?v=yy8jQgmhbAU) and
+[cache locality](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185) can have huge
+impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have
+guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't
+make. [Cachegrind](http://valgrind.org/docs/manual/cg-manual.html) and kernel
+[perf](https://perf.wiki.kernel.org/index.php/Main_Page) counters are both great for understanding
+how performance relates to memory layout.
+
+**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are
+great because they optimize your program for how it's actually being used, rather than how a
+compiler expects it to be used. However, there's a variance problem if the program stops executing
+while waiting for translation from VM bytecode to native code. As a remedy, many languages support
+ahead-of-time compilation in addition to the JIT versions
+([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java).
+On the other hand, LLVM supports
+[Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization),
+which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing
+apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT
+compiler kicked in.
+
+**Programming Tricks**: These won't make or break performance, but can be useful in specific
+circumstances. For example, C++ can use
+[templates instead of branches](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206)
+in critical sections.
+
+## Kernel
+
+Code you wrote is almost certainly not the _only_ code running on your hardware. There are many ways
+the operating system interacts with your program, from interrupts to system calls, that are
+important to watch for. These are written from a Linux perspective, but Windows does typically have
+equivalent functionality.
+
+**Scheduling**: The kernel is normally free to schedule any process on any core, so it's important
+to reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
+limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
+scheduling
+([`isolcpus`](https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html)
+kernel command-line option), or by setting the `init` process CPU affinity
+([`systemd` example](https://access.redhat.com/solutions/2884991)). Second, set critical processes
+to run on the isolated cores by setting the
+[processor affinity](https://en.wikipedia.org/wiki/Processor_affinity) using
+[taskset](https://linux.die.net/man/1/taskset). Finally, use
+[`NO_HZ`](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt) to cut down
+on scheduling-clock interrupts, or [`chrt`](https://linux.die.net/man/1/chrt) to give the process a
+real-time scheduling policy so ordinary tasks can't preempt it. Turning off hyper-threading is also
+likely beneficial.
+
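The affinity step can also be done from inside a program rather than through `taskset`. The sketch below uses Python's `os.sched_setaffinity`, which is only available on Linux, so it is guarded with `hasattr`. Picking the highest-numbered allowed core is an arbitrary choice for the example; in practice the target would be one of the cores isolated at boot.

```python
import os


def pin_to_core(core):
    """Restrict the calling process to a single core (Linux-only API)."""
    if not hasattr(os, "sched_setaffinity"):
        return None                       # not available on this platform
    os.sched_setaffinity(0, {core})       # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)     # cores we are currently allowed to use
    core = max(allowed)                   # arbitrary pick for this sketch
    pinned = pin_to_core(core)
else:
    allowed, core, pinned = None, None, None

print("allowed before:", allowed)
print("pinned to:", pinned)
```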
+**System calls**: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long
+the I/O operation takes, these all trigger expensive
+[system calls (syscalls)](https://en.wikipedia.org/wiki/System_call). To handle these, the CPU must
+[context switch](https://en.wikipedia.org/wiki/Context_switch) to the kernel, let the kernel
+operation complete, then context switch back to your program. We'd rather keep these
+[to a minimum](https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript) (see
+timestamp 18:20). [Strace](https://linux.die.net/man/1/strace) is your friend for understanding when
+and where syscalls happen.
+
+**Signal Handling**: Far less likely to be an issue, but signals do trigger a context switch if your
+code has a handler registered. This will be highly dependent on the application, but you can
+[block signals](https://www.linuxprogrammingblog.com/all-about-linux-signals?page=show#Blocking_signals)
+if it's an issue.
+
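Blocking a signal around a sensitive region might look like the following sketch, which uses `signal.pthread_sigmask` from Python's standard library (Unix only). `SIGUSR1`, the `events` list, and both helper functions are illustrative assumptions. In a single-threaded process the handler should run only once the mask is restored, after the work has finished.

```python
import os
import signal


def run_with_signal_blocked(sig, work):
    """Block `sig` for the calling thread while `work()` runs (Unix only)."""
    previous = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    try:
        return work()
    finally:
        # Restore the old mask; a pending signal is delivered at this point.
        signal.pthread_sigmask(signal.SIG_SETMASK, previous)


events = []
signal.signal(signal.SIGUSR1, lambda signum, frame: events.append("handled"))


def critical_section():
    os.kill(os.getpid(), signal.SIGUSR1)  # signal arrives while it is blocked
    events.append("work done")
    return len(events)


run_with_signal_blocked(signal.SIGUSR1, critical_section)
print(events)
```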
+**Interrupts**: System interrupts are how devices connected to your computer notify the CPU that
+something has happened. The CPU will then choose a processor core to pause and context switch to the
+OS to handle the interrupt. Make sure that
+[SMP affinity](http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux) is
+set so that interrupts are handled on a CPU core not running the program you care about.
+
+**[NUMA](https://www.kernel.org/doc/html/latest/vm/numa.html)**: While NUMA is good at making
+multi-cell systems transparent, there are variance implications; if the kernel moves a process
+across nodes, future memory accesses must wait for the controller on the original node. Use
+[numactl](https://linux.die.net/man/8/numactl) to handle memory-/cpu-cell pinning so this doesn't
+happen.
+
+## Hardware
+
+**CPU Pipelining/Speculation**: Speculative execution in modern processors gave us vulnerabilities
+like Spectre, but it also gave us performance improvements like
+[branch prediction](https://stackoverflow.com/a/11227902/1454178). And if the CPU mis-speculates
+your code, there's variance associated with rewind and replay. While the compiler knows a lot about
+how your CPU [pipelines instructions](https://youtu.be/nAbCKa0FzjQ?t=4467), code can be
+[structured to help](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755) the branch
+predictor.
+
+**Paging**: For most systems, virtual memory is incredible. Applications live in their own worlds,
+and the CPU/[MMU](https://en.wikipedia.org/wiki/Memory_management_unit) figures out the details.
+However, there's a variance penalty associated with memory paging and caching; if you access more
+memory pages than the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) can store,
+you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an
+issue, but using [huge pages](https://blog.pythian.com/performance-tuning-hugepages-in-linux/) can
+reduce TLB burdens. Alternatively, running applications in a hypervisor like
+[Jailhouse](https://github.com/siemens/jailhouse) allows one to skip virtual memory entirely, but
+this is probably more work than the benefits are worth.
+
+**Network Interfaces**: When more than one computer is involved, variance can go up dramatically.
+Tuning kernel
+[network parameters](https://github.com/leandromoreira/linux-network-performance-parameters) may be
+helpful, but modern systems more frequently opt to skip the kernel altogether with a technique
+called [kernel bypass](https://blog.cloudflare.com/kernel-bypass/). This typically requires
+specialized hardware and [drivers](https://www.openonload.org/), but even industries like
+[telecom](https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass) are
+finding the benefits.
+
+## Networks
+
+**Routing**: There's a reason financial firms are willing to pay
+[millions of euros](https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/)
+for rights to a small plot of land - having a straight-line connection from point A to point B means
+the path their data takes is the shortest possible. In contrast, there are currently 6 computers in
+between me and Google, but that may change at any moment if my ISP realizes a
+[more efficient route](https://en.wikipedia.org/wiki/Border_Gateway_Protocol) is available. Whether
+it's using
+[research-quality equipment](https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-trading-part-i-the-west-chicago-tower-mystery/)
+for shortwave radio, or just making sure there's no data inadvertently going between data centers,
+routing matters.
+
+**Protocol**: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control,
+and congestion control all built in. But these attributes make the most sense when networking
+infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the
+setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may
+make sense in these contexts as it avoids the chatter needed to track connection state, and
+[gap-fill](https://iextrading.com/docs/IEX%20Transport%20Specification.pdf)
+[strategies](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf)
+can handle the rest.
+
+**Switching**: Many routers/switches handle packets using "store-and-forward" behavior: wait for the
+whole packet, validate checksums, and then send to the next device. In variance terms, the time
+needed to move data between two nodes is proportional to the size of that data; the switch must
+"store" all data before it can calculate checksums and "forward" to the next node. With
+["cut-through"](https://www.networkworld.com/article/2241573/latency-and-jitter--cut-through-design-pays-off-for-arista--blade.html)
+designs, switches will begin forwarding data as soon as they know where the destination is,
+checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter
+the size.
+
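The size dependence of store-and-forward can be made concrete with serialization-delay arithmetic. This sketch assumes a 10 Gbit/s link and treats the 14-byte Ethernet header as the point where a cut-through switch knows the destination. It ignores the preamble and the switch's own processing time, so the numbers are illustrative rather than a model of any specific device.

```python
def serialization_delay_ns(num_bytes, link_gbps):
    """Time to clock `num_bytes` onto a link running at `link_gbps` Gbit/s."""
    return num_bytes * 8 / link_gbps      # bits / (Gbit/s) == nanoseconds


LINK_GBPS = 10       # assumed link speed
HEADER_BYTES = 14    # Ethernet header, which includes the destination MAC

for frame_bytes in (64, 512, 1500):
    # Store-and-forward waits for the whole frame; cut-through only for the header.
    store_and_forward = serialization_delay_ns(frame_bytes, LINK_GBPS)
    cut_through = serialization_delay_ns(HEADER_BYTES, LINK_GBPS)
    print(f"{frame_bytes:>5} B frame: store-and-forward waits "
          f"{store_and_forward:7.1f} ns, cut-through waits {cut_through:.1f} ns")
```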
+# Final Thoughts
+
+High-performance systems, regardless of industry, are not magical. They do require extreme precision
+and attention to detail, but they're designed, built, and operated by regular people, using a lot of
+tools that are publicly available. Interested in seeing how context switching affects performance of
+your benchmarks? `taskset` should be installed in all modern Linux distributions, and can be used to
+make sure the OS never migrates your process. Curious how often garbage collection triggers during a
+crucial operation? Your language of choice will typically expose details of its operations
+([Python](https://docs.python.org/3/library/gc.html),
+[Java](https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#DebuggingOptions)).
+Want to know how hard your program is stressing the TLB? Use `perf record` and look for
+`dtlb_load_misses.miss_causes_a_walk`.
+
+Two final guiding questions, then: first, before attempting to apply some of the technology above to
+your own systems, can you first identify
+[where/when you care](http://wiki.c2.com/?PrematureOptimization) about "high-performance"? As an
+example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable
+effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are
+they being designed in a way that's actually helpful? Tools like
+[Criterion](http://www.serpentine.com/criterion/) (also in
+[Rust](https://github.com/bheisler/criterion.rs)) and Google's
+[Benchmark](https://github.com/google/benchmark) output not only average run time, but variance as
+well; your benchmarking environment is subject to the same concerns your production environment is.
+
+Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique.
+Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate
+it; once that's at an acceptable level, then optimize for speed.
@@ -0,0 +1,302 @@
+---
+slug: 2019/06/high-performance-systems
+title: "On Building High Performance Systems"
+date: 2019-06-31 12:00:00
+last_updated:
+  date: 2019-09-21 12:00:00
+authors: [bspeice]
+tags: []
+---
+
+
+Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is
+made up of people who have access to secret techniques mortal developers could only dream of. There
+had to be some secret art that could only be learned if one had an appropriately tragic backstory.
+
+<!-- truncate -->
+
+![Kung Fu fight](./kung-fu.webp)
+
+> How I assumed HFT people learn their secret techniques
+
+How else do you explain people working on systems that complete the round trip of market data in to
+orders out (a.k.a. tick-to-trade) consistently within
+[750-800 nanoseconds](https://stackoverflow.com/a/22082528/1454178)? In roughly the time it takes a
+computer to access
+[main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html),
+trading systems are capable of reading the market data packets, deciding what orders to send, doing
+risk checks, creating new packets for exchange-specific protocols, and putting those packets on the
+wire.
+
+Having now worked in the trading industry, I can confirm the developers aren't super-human; I've
+made some simple mistakes at the very least. Instead, what shows up in public discussions is that
+philosophy, not technique, separates high-performance systems from everything else.
+Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast
+(though micro-optimizations have their place); there's a lot more to worry about than just the code
+written for the project.
+
+The framework I'd propose is this: **If you want to build high-performance systems, focus first on
+reducing performance variance** (reducing the gap between the fastest and slowest runs of the same
+code), **and only look at average latency once variance is at an acceptable level**.
+
+Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20
+seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes
+a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a
+week is the same in each situation, but you're painfully aware of that minute when it happens. When
+it comes to code, the principle is the same: speeding up a function by an average of 10 milliseconds
+doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When
+performance matters, you need to respond quickly _every time_, not just in aggregate.
+High-performance systems should first optimize for time variance. Once you're consistent at the time
+scale you care about, then focus on improving average time.
+
+This focus on variance shows up all the time in industry too (emphasis added in all quotes below):
+
+- In [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for
+  NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability
+  is highlighted in addition to instantaneous metrics:
+
+  > Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40
+  > microsecond average latency
+
+- The [Aeron](https://github.com/real-logic/aeron) message bus has this to say about performance:
+
+  > Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and
+  > **most predictable latency possible** of any messaging system
+
+- The company PolySync, which is working on autonomous vehicles,
+  [mentions why](https://polysync.io/blog/session-types-for-hearty-codecs/) they picked their
+  specific messaging format:
+
+  > In general, high performance is almost always desirable for serialization. But in the world of
+  > autonomous vehicles, **steady timing performance is even more important** than peak throughput.
+  > This is because safe operation is sensitive to timing outliers. Nobody wants the system that
+  > decides when to slam on the brakes to occasionally take 100 times longer than usual to encode
+  > its commands.
+
+- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out
+  variance (jitter) as a big concern for
+  [electronic trading](https://solarflare.com/electronic-trading/):
+  > The high stakes world of electronic trading, investment banks, market makers, hedge funds and
+  > exchanges demand the **lowest possible latency and jitter** while utilizing the highest
+  > bandwidth and return on their investment.
+
+And to further clarify: we're not discussing _total run-time_, but variance of total run-time. There
+are situations where it's not reasonably possible to make things faster, and you'd much rather be
+consistent. For example, trading firms use
+[wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because
+the speed of light through air is faster than through fiber-optic cables. There's still at _absolute
+minimum_ a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between,
+say,
+[Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago).
+If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a
+trade occurs, there's a physical limit to how long that will take. In this situation, the focus is
+on keeping variance of _additional processing_ to a minimum, since speed of light is the limiting
+factor.
+
+So how does one go about looking for and eliminating performance variance? To tell the truth, I
+don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep
+understanding of the entire technology stack, and (B) actually measuring system performance (though
+(C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos for
+inspiration never hurt). Even then, every project cares about performance to a different degree; you
+may need to build an entire
+[replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to
+accurately benchmark at nanosecond precision, or you may be content to simply
+[avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in
+your Java code.
+
+Even though everyone has different needs, there are still common things to look for when trying to
+isolate and eliminate variance. In no particular order, these are my focus areas when thinking about
+high-performance systems:
+
+**Update 2019-09-21**: Added notes on `isolcpus` and `systemd` affinity.
+
+## Language-specific
+
+**Garbage Collection**: How often does garbage collection happen? When is it triggered? What are the
+impacts?
+
+- [In Python](https://rushter.com/blog/python-garbage-collector/), individual objects are collected
+  if the reference count reaches 0, and each generation is collected if
+  `num_alloc - num_dealloc > gc_threshold` whenever an allocation happens. The GIL is acquired for
+  the duration of generational collection.
+- Java has
+  [many](https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE)
+  [different](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573)
+  [collection](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A)
+  [algorithms](https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0)
+  to choose from, each with different characteristics. The default algorithms (Parallel GC in Java
+  8, G1 since Java 9) pause application threads while collecting, while more recent algorithms
+  ([ZGC](https://wiki.openjdk.java.net/display/zgc) and
+  [Shenandoah](https://wiki.openjdk.java.net/display/shenandoah)) are designed to keep "stop the
+  world" pauses to a minimum by doing most collection work concurrently with the application.
+
+**Allocation**: Every language has a different way of interacting with "heap" memory, but the
+principle is the same: running the allocator to allocate/deallocate memory takes time that can often
+be put to better use. Understanding when your language interacts with the allocator is crucial, and
+not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java
+does (meaning potential GC pauses). Take time to understand heap behavior (I wrote
+[a guide for Rust](/2019/02/understanding-allocations-in-rust.html)), and look into alternative
+allocators ([jemalloc](http://jemalloc.net/),
+[tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html)) that might run faster than the
+operating system default.
+
+**Data Layout**: How your data is arranged in memory matters;
+[data-oriented design](https://www.youtube.com/watch?v=yy8jQgmhbAU) and
+[cache locality](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185) can have huge
+impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have
+guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't
+make. [Cachegrind](http://valgrind.org/docs/manual/cg-manual.html) and kernel
+[perf](https://perf.wiki.kernel.org/index.php/Main_Page) counters are both great for understanding
+how performance relates to memory layout.
+
+**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are
+great because they optimize your program for how it's actually being used, rather than how a
+compiler expects it to be used. However, there's a variance problem if the program stops executing
+while waiting for translation from VM bytecode to native code. As a remedy, many languages support
+ahead-of-time compilation in addition to the JIT versions
+([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java).
+On the other hand, LLVM supports
+[Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization),
+which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing
+apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT
+compiler kicked in.
+
+**Programming Tricks**: These won't make or break performance, but can be useful in specific
+circumstances. For example, C++ can use
+[templates instead of branches](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206)
+in critical sections.
+
+## Kernel
+
+Code you wrote is almost certainly not the _only_ code running on your hardware. There are many ways
+the operating system interacts with your program, from interrupts to system calls, that are
+important to watch for. These are written from a Linux perspective, but Windows does typically have
+equivalent functionality.
+
+**Scheduling**: The kernel is normally free to schedule any process on any core, so it's important
+to reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
+limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
+scheduling
+([`isolcpus`](https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html)
+kernel command-line option), or by setting the `init` process CPU affinity
+([`systemd` example](https://access.redhat.com/solutions/2884991)). Second, set critical processes
+to run on the isolated cores by setting the
+[processor affinity](https://en.wikipedia.org/wiki/Processor_affinity) using
+[`taskset`](https://linux.die.net/man/1/taskset). Finally, use
+[`NO_HZ`](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt) to cut down
+on scheduler tick interrupts, and [`chrt`](https://linux.die.net/man/1/chrt) to give critical
+processes a real-time scheduling policy so ordinary tasks can't preempt them. Turning off
+hyper-threading is also likely beneficial.
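+
+The affinity step can also be done from inside the program. The sketch below uses Python's
+`os.sched_setaffinity`, which wraps the same Linux system call `taskset` relies on; the
+`pin_to_core` helper is a hypothetical name, and the core chosen here is just one the process may
+already use, not a core that has actually been isolated.
+
+```python
+import os
+
+
+def pin_to_core(core):
+    """Restrict the caller to a single CPU core (Linux only)."""
+    # A pid of 0 means "the caller".
+    os.sched_setaffinity(0, {core})
+    return os.sched_getaffinity(0)
+
+
+if hasattr(os, "sched_setaffinity"):
+    # Pick a core we are already allowed to run on; on a tuned machine this
+    # would be one of the cores reserved with `isolcpus`.
+    target = min(os.sched_getaffinity(0))
+    print(pin_to_core(target))
+```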
+
+**System calls**: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long
+the I/O operation takes, these all trigger expensive
+[system calls (syscalls)](https://en.wikipedia.org/wiki/System_call). To handle these, the CPU must
+[context switch](https://en.wikipedia.org/wiki/Context_switch) to the kernel, let the kernel
+operation complete, then context switch back to your program. We'd rather keep these
+[to a minimum](https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript) (see
+timestamp 18:20). [Strace](https://linux.die.net/man/1/strace) is your friend for understanding when
+and where syscalls happen.
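+
+One common way to keep syscalls down is to batch work in user space. The sketch below doesn't touch
+a real file descriptor; `CountingSink` is a made-up stand-in that counts how often the layer above
+asks it to write, which is roughly where a `write(2)` call would happen on an unbuffered file.
+
+```python
+import io
+
+
+class CountingSink(io.RawIOBase):
+    """Stand-in for a file descriptor that counts write() calls."""
+
+    def __init__(self):
+        super().__init__()
+        self.calls = 0
+
+    def writable(self):
+        return True
+
+    def write(self, data):
+        self.calls += 1
+        return len(data)
+
+
+messages = [b"order\n"] * 1000
+
+unbuffered = CountingSink()
+for msg in messages:
+    unbuffered.write(msg)  # one call per message
+
+sink = CountingSink()
+buffered = io.BufferedWriter(sink, buffer_size=64 * 1024)
+for msg in messages:
+    buffered.write(msg)  # accumulates in user space
+buffered.flush()  # 6000 bytes fit in the buffer, so a single call drains it
+
+print(unbuffered.calls, sink.calls)
+```
+
+The trade-off is that buffered data sits in memory until the flush, so this only helps when the
+extra delay before the data leaves the process is acceptable.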
+
+**Signal Handling**: Far less likely to be an issue, but signals do trigger a context switch if your
+code has a handler registered. This will be highly dependent on the application, but you can
+[block signals](https://www.linuxprogrammingblog.com/all-about-linux-signals?page=show#Blocking_signals)
+if it's an issue.
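+
+As a rough illustration of blocking a signal around a latency-sensitive section, the sketch below
+uses Python's `signal.pthread_sigmask` on a Unix system. `SIGUSR1` is an arbitrary choice for the
+example, and the process sends the signal to itself only to show that delivery is deferred.
+
+```python
+import os
+import signal
+
+received = []
+signal.signal(signal.SIGUSR1, lambda signum, frame: received.append(signum))
+
+# Block SIGUSR1 for the critical section; the kernel holds it as pending.
+signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
+os.kill(os.getpid(), signal.SIGUSR1)
+pending_during = signal.SIGUSR1 in signal.sigpending()
+handled_during = bool(received)
+
+# Unblock once the critical work is done; the pending signal is delivered now.
+signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
+print(pending_during, handled_during, bool(received))
+```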
+
+**Interrupts**: System interrupts are how devices connected to your computer notify the CPU that
+something has happened. The CPU will then choose a processor core to pause and context switch to the
+OS to handle the interrupt. Make sure that
+[SMP affinity](http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux) is
+set so that interrupts are handled on a CPU core not running the program you care about.
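+
+SMP affinity is expressed as a hexadecimal bitmask where bit N stands for CPU N. The helper below
+is a small sketch for computing that mask; `smp_affinity_mask` is a hypothetical name, the core
+layout is invented, and machines with more than 32 CPUs use comma-separated 32-bit groups, which
+this version ignores. Writing the result to `/proc/irq/<irq>/smp_affinity` requires root.
+
+```python
+def smp_affinity_mask(cpus):
+    """Return the hex bitmask for a set of CPU numbers (bit N = CPU N)."""
+    mask = 0
+    for cpu in cpus:
+        mask |= 1 << cpu
+    return format(mask, "x")
+
+
+# Hypothetical layout: cores 2 and 3 run the critical process, so steer
+# interrupts to cores 0 and 1 instead.
+print(smp_affinity_mask({0, 1}))  # prints "3"
+```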
+
+**[NUMA](https://www.kernel.org/doc/html/latest/vm/numa.html)**: While NUMA is good at making
+multi-cell systems transparent, there are variance implications; if the kernel moves a process
+across nodes, future memory accesses must wait for the controller on the original node. Use
+[numactl](https://linux.die.net/man/8/numactl) to handle memory-/cpu-cell pinning so this doesn't
+happen.
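+
+For a sense of what node pinning involves, the sketch below reads the CPU list that Linux exposes
+for a NUMA node and restricts the process to those cores. It assumes node 0 exists at the usual
+sysfs path, and it only covers the CPU side; `numactl` can additionally set the memory policy,
+which Python's standard library can't do.
+
+```python
+import os
+
+
+def parse_cpulist(text):
+    """Parse a kernel cpulist string such as "0-3,8-11" into a set of ints."""
+    cpus = set()
+    for part in text.strip().split(","):
+        if not part:
+            continue
+        first, _, last = part.partition("-")
+        cpus.update(range(int(first), int(last or first) + 1))
+    return cpus
+
+
+NODE0 = "/sys/devices/system/node/node0/cpulist"  # assumes node 0 exists
+if os.path.exists(NODE0) and hasattr(os, "sched_setaffinity"):
+    with open(NODE0) as f:
+        node_cpus = parse_cpulist(f.read())
+    # Only keep cores this process is already permitted to use.
+    allowed = node_cpus & os.sched_getaffinity(0)
+    if allowed:
+        os.sched_setaffinity(0, allowed)
+```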
+
+## Hardware
+
+**CPU Pipelining/Speculation**: Speculative execution in modern processors gave us vulnerabilities
+like Spectre, but it also gave us performance improvements like
+[branch prediction](https://stackoverflow.com/a/11227902/1454178). The catch is that when the
+CPU mis-speculates, there's variance associated with rewind and replay. While the compiler knows a
+lot about
+how your CPU [pipelines instructions](https://youtu.be/nAbCKa0FzjQ?t=4467), code can be
+[structured to help](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755) the branch
+predictor.
+
+**Paging**: For most systems, virtual memory is incredible. Applications live in their own worlds,
+and the CPU/[MMU](https://en.wikipedia.org/wiki/Memory_management_unit) figures out the details.
+However, there's a variance penalty associated with memory paging and caching; if you access more
+memory pages than the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) can store,
+you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an
+issue, but using [huge pages](https://blog.pythian.com/performance-tuning-hugepages-in-linux/) can
+reduce TLB burdens. Alternately, running applications in a hypervisor like
+[Jailhouse](https://github.com/siemens/jailhouse) allows one to skip virtual memory entirely, but
+this is probably more work than the benefits are worth.
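+
+Some quick arithmetic shows why huge pages take pressure off the TLB. The numbers below assume a
+1 GiB working set and the 4 KiB and 2 MiB page sizes common on x86-64; `pages_needed` is a
+throwaway helper for the calculation.
+
+```python
+def pages_needed(working_set_bytes, page_bytes):
+    """Number of pages (and so TLB entries) needed to map a working set."""
+    return -(-working_set_bytes // page_bytes)  # ceiling division
+
+
+GIB = 1024 ** 3
+print(pages_needed(GIB, 4 * 1024))         # 262144 pages at 4 KiB
+print(pages_needed(GIB, 2 * 1024 * 1024))  # 512 pages at 2 MiB
+```
+
+The same memory needs 512 times fewer translations, so far more of the working set stays covered
+by whatever entries the TLB has.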
+
+**Network Interfaces**: When more than one computer is involved, variance can go up dramatically.
+Tuning kernel
+[network parameters](https://github.com/leandromoreira/linux-network-performance-parameters) may be
+helpful, but modern systems more frequently opt to skip the kernel altogether with a technique
+called [kernel bypass](https://blog.cloudflare.com/kernel-bypass/). This typically requires
+specialized hardware and [drivers](https://www.openonload.org/), but even industries like
+[telecom](https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass) are
+finding the benefits.
+
+## Networks
+
+**Routing**: There's a reason financial firms are willing to pay
+[millions of euros](https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/)
+for rights to a small plot of land - having a straight-line connection from point A to point B means
+the path their data takes is the shortest possible. In contrast, there are currently 6 computers in
+between me and Google, but that may change at any moment if my ISP realizes a
+[more efficient route](https://en.wikipedia.org/wiki/Border_Gateway_Protocol) is available. Whether
+it's using
+[research-quality equipment](https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-trading-part-i-the-west-chicago-tower-mystery/)
+for shortwave radio, or just making sure there's no data inadvertently going between data centers,
+routing matters.
+
+**Protocol**: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control,
+and congestion control all built in. But these attributes make the most sense when networking
+infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the
+setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may
+make sense in these contexts as it avoids the chatter needed to track connection state, and
+[gap-fill](https://iextrading.com/docs/IEX%20Transport%20Specification.pdf)
+[strategies](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf)
+can handle the rest.
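+
+The receiving side of a gap-fill scheme boils down to tracking sequence numbers. The sketch below
+is a simplified illustration and not the wire format or logic of any specific protocol; `find_gaps`
+and the assumption that sequence numbers start at 1 are invented for the example.
+
+```python
+def find_gaps(sequence_numbers, expected=1):
+    """Yield (first_missing, last_missing) ranges from a stream of sequence numbers."""
+    for seq in sequence_numbers:
+        if seq > expected:
+            yield (expected, seq - 1)  # ask the gap-fill service for these
+        if seq >= expected:
+            expected = seq + 1
+        # seq < expected: duplicate or late packet, nothing new to request
+
+
+print(list(find_gaps([1, 2, 5, 6, 4, 9])))  # [(3, 4), (7, 8)]
+```
+
+Note that packet 4 arriving late doesn't cancel the earlier request in this version; a real
+receiver would also need to reconcile late arrivals with outstanding gap requests.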
+
+**Switching**: Many routers/switches handle packets using "store-and-forward" behavior: wait for the
+whole packet, validate checksums, and then send to the next device. In variance terms, the time
+needed to move data between two nodes is proportional to the size of that data; the switch must
+"store" all data before it can calculate checksums and "forward" to the next node. With
+["cut-through"](https://www.networkworld.com/article/2241573/latency-and-jitter--cut-through-design-pays-off-for-arista--blade.html)
+designs, switches will begin forwarding data as soon as they know where the destination is,
+checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter
+the size.
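+
+A back-of-the-envelope calculation makes the difference concrete. The sketch assumes a 10 Gb/s
+link and that a cut-through switch needs only the 14-byte Ethernet header before it starts
+forwarding; real switches add their own processing time on top of these figures.
+
+```python
+def serialization_delay_ns(num_bytes, link_bits_per_sec):
+    """Time to clock num_bytes onto or off a link, in nanoseconds."""
+    return num_bytes * 8 * 1e9 / link_bits_per_sec
+
+
+TEN_GBPS = 10e9
+
+# Store-and-forward: the whole frame must arrive before forwarding starts,
+# so the wait grows with frame size.
+large_frame = serialization_delay_ns(1500, TEN_GBPS)
+small_frame = serialization_delay_ns(64, TEN_GBPS)
+
+# Cut-through: only the header is needed, whatever the frame size.
+header_only = serialization_delay_ns(14, TEN_GBPS)
+
+print(large_frame, small_frame, header_only)
+```
+
+Under these assumptions a 1500-byte frame waits about 1200 ns per store-and-forward hop, a 64-byte
+frame about 51 ns, and a cut-through hop about 11 ns for either.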
+
+## Final Thoughts
+
+High-performance systems, regardless of industry, are not magical. They do require extreme precision
+and attention to detail, but they're designed, built, and operated by regular people, using a lot of
+tools that are publicly available. Interested in seeing how context switching affects performance of
+your benchmarks? `taskset` should be installed in all modern Linux distributions, and can be used to
+make sure the OS never migrates your process. Curious how often garbage collection triggers during a
+crucial operation? Your language of choice will typically expose details of its operations
+([Python](https://docs.python.org/3/library/gc.html),
+[Java](https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#DebuggingOptions)).
+Want to know how hard your program is stressing the TLB? Use `perf record` and look for
+`dtlb_load_misses.miss_causes_a_walk` (an Intel event name; other CPUs name the counter differently).
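+
+For the garbage collection question, here is a minimal Python sketch built on the `gc.callbacks`
+hook from the module linked above. The `count_collections` context manager is a hypothetical
+helper, and the explicit `gc.collect()` only stands in for whatever operation is being examined.
+
+```python
+import gc
+from contextlib import contextmanager
+
+
+@contextmanager
+def count_collections():
+    """Count garbage collections that start while the block is running."""
+    counts = {"collections": 0}
+
+    def on_gc(phase, info):
+        if phase == "start":
+            counts["collections"] += 1
+
+    gc.callbacks.append(on_gc)
+    try:
+        yield counts
+    finally:
+        gc.callbacks.remove(on_gc)
+
+
+with count_collections() as counts:
+    gc.collect()  # stand-in for the crucial operation; forces at least one collection
+print(counts["collections"])
+```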
+
+Two final guiding questions, then: first, before attempting to apply some of the technology above to
+your own systems, can you identify
+[where/when you care](http://wiki.c2.com/?PrematureOptimization) about "high-performance"? As an
+example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable
+effect; humans are far too slow for it to matter. Second, if you're using benchmarks, are
+they being designed in a way that's actually helpful? Tools like
+[Criterion](http://www.serpentine.com/criterion/) (also in
+[Rust](https://github.com/bheisler/criterion.rs)) and Google's
+[Benchmark](https://github.com/google/benchmark) output not only average run time, but variance as
+well; your benchmarking environment is subject to the same concerns your production environment is.
+
+To close, I believe high-performance systems are a matter of philosophy, not necessarily technique.
+Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate
+it; once that's at an acceptable level, then optimize for speed.