Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) was
made up of people with access to secret techniques mortal developers could only dream of. There had
to be some secret art that could only be learned if one had an appropriately tragic backstory.

*How I assumed HFT people learn their secret techniques*

How else do you explain people working on systems that complete the round trip of market data in to
orders out (a.k.a. tick-to-trade) consistently within 750-800 nanoseconds? In roughly the time it
takes a computer to access main memory 8 times, trading systems are capable of reading the market
data packets, deciding what orders to send, doing risk checks, creating new packets for
exchange-specific protocols, and putting those packets on the wire.

Having now worked in the trading industry, I can confirm the developers aren't super-human; I've
made some simple mistakes at the very least. Instead, what shows up in public discussions is that
philosophy, not technique, separates high-performance systems from everything else.
Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast
(though micro-optimizations have their place); there's a lot more to worry about than just the code
written for the project.

The framework I'd propose is this: If you want to build high-performance systems, focus first on
reducing performance variance (reducing the gap between the fastest and slowest runs of the same
code), and only look at average latency once variance is at an acceptable level.

Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20
seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes
a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a
week is the same in each situation, but you're painfully aware of that minute when it happens. When
it comes to code, the principle is the same: speeding up a function by an average of 10 milliseconds
doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When
performance matters, you need to respond quickly every time, not just in aggregate.
High-performance systems should first optimize for time variance. Once you're consistent at the time
scale you care about, then focus on improving average time.

This focus on variance shows up all the time in industry too (emphasis added in all quotes below):

- In marketing materials for NASDAQ's matching engine, the most performance-sensitive component of
  the exchange, dependability is highlighted in addition to instantaneous metrics:

  > Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40
  > microsecond average latency

- The Aeron message bus has this to say about performance:

  > Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and
  > **most predictable latency possible** of any messaging system

- The company PolySync, which is working on autonomous vehicles, mentions why they picked their
  specific messaging format:

  > In general, high performance is almost always desirable for serialization. But in the world of
  > autonomous vehicles, **steady timing performance is even more important than peak throughput**.
  > This is because safe operation is sensitive to timing outliers. Nobody wants the system that
  > decides when to slam on the brakes to occasionally take 100 times longer than usual to encode
  > its commands.

- Solarflare, which makes highly-specialized network hardware, points out variance (jitter) as a big
  concern for electronic trading:

  > The high stakes world of electronic trading, investment banks, market makers, hedge funds and
  > exchanges demand the lowest possible **latency and jitter** while utilizing the highest
  > bandwidth and return on their investment.

And to further clarify: we're not discussing total run-time, but variance of total run-time. There
are situations where it's not reasonably possible to make things faster, and you'd much rather be
consistent. For example, trading firms use wireless networks because the speed of light through air
is faster than through fiber-optic cables. There's still, at absolute minimum, a ~33.76 millisecond
delay required to send data between, say, Chicago and Tokyo. If a trading system in Chicago calls
the function for "send order to Tokyo" and waits to see if a trade occurs, there's a physical limit
to how long that will take. In this situation, the focus is on keeping variance of additional
processing to a minimum, since the speed of light is the limiting factor.

So how does one go about looking for and eliminating performance variance? To tell the truth, I
don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep
understanding of the entire technology stack, and (B) actually measuring system performance (though
(C) watching a lot of CppCon videos for inspiration never hurt). Even then, every project cares
about performance to a different degree; you may need to build an entire replica production system
to accurately benchmark at nanosecond precision, or you may be content to simply avoid garbage
collection in your Java code.

Even though everyone has different needs, there are still common things to look for when trying to
isolate and eliminate variance. In no particular order, these are my focus areas when thinking about
high-performance systems:

Update 2019-09-21: Added notes on isolcpus and systemd affinity.

Language-specific

Garbage Collection: How often does garbage collection happen? When is it triggered? What are the
impacts?

- In Python, individual objects are collected when their reference count reaches 0, and each
  generation is collected if `num_alloc - num_dealloc > gc_threshold` whenever an allocation
  happens. The GIL is acquired for the duration of generational collection.
- Java has many different collection algorithms to choose from, each with different
  characteristics. The default algorithms (Parallel GC in Java 8, G1 in Java 9) freeze the JVM while
  collecting, while more recent algorithms (ZGC and Shenandoah) are designed to keep "stop the
  world" pauses to a minimum by doing collection work in parallel.

Allocation: Every language has a different way of interacting with "heap" memory, but the principle
is the same: running the allocator to allocate/deallocate memory takes time that can often be put to
better use. Understanding when your language interacts with the allocator is crucial, and not always
obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java does (meaning
potential GC pauses). Take time to understand heap behavior (I made a guide for Rust), and look into
alternative allocators (jemalloc, tcmalloc) that might run faster than the operating system default.

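As a quick way to make allocator traffic visible, here's a minimal C++ sketch (my own illustration,
not from the original post): replacing the global `operator new` with a counting version shows how
many times a growing `std::vector` touches the heap, and how `reserve` collapses that to a single
allocation.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <vector>

// Count every C++ heap allocation so hidden allocator traffic shows up.
static std::size_t alloc_count = 0;

void* operator new(std::size_t size) {
    ++alloc_count;
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }

int main() {
    std::size_t before = alloc_count;
    std::vector<int> grown;
    for (int i = 0; i < 1000; ++i) grown.push_back(i);  // reallocates as it grows
    std::printf("without reserve: %zu allocations\n", alloc_count - before);

    before = alloc_count;
    std::vector<int> reserved;
    reserved.reserve(1000);  // one up-front allocation, none during the loop
    for (int i = 0; i < 1000; ++i) reserved.push_back(i);
    std::printf("with reserve:    %zu allocations\n", alloc_count - before);
}
```
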
Data Layout: How your data is arranged in memory matters; data-oriented design and cache locality
can have huge impacts on performance. The C family of languages (C, value types in C#, C++) and Rust
all have guarantees about the shape every object takes in memory that others (e.g. Java and Python)
can't make. Cachegrind and kernel perf counters are both great for understanding how performance
relates to memory layout.

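To make the cache-locality point concrete, a minimal sketch (struct names and field sizes are
invented for illustration) contrasting array-of-structs with struct-of-arrays; scanning one field in
the SoA layout pulls 8 useful values per 64-byte cache line instead of 1:

```cpp
#include <array>
#include <cstdio>
#include <vector>

// Array-of-structs: each price is interleaved with 56 bytes of other data,
// so scanning prices drags entire 64-byte structs through the cache.
struct OrderAoS {
    double price;
    long quantity;
    std::array<char, 48> symbol;
};

// Struct-of-arrays: prices are contiguous; a 64-byte cache line holds 8 of
// them, and the hardware prefetcher can stream them in.
struct OrdersSoA {
    std::vector<double> price;
    std::vector<long> quantity;
    std::vector<std::array<char, 48>> symbol;
};

double sum_prices_aos(const std::vector<OrderAoS>& orders) {
    double total = 0.0;
    for (const auto& o : orders) total += o.price;  // 1 useful value per struct
    return total;
}

double sum_prices_soa(const OrdersSoA& orders) {
    double total = 0.0;
    for (double p : orders.price) total += p;  // 8 useful values per cache line
    return total;
}

int main() {
    std::vector<OrderAoS> aos(1'000'000, OrderAoS{1.0, 1, {}});
    OrdersSoA soa{std::vector<double>(1'000'000, 1.0),
                  std::vector<long>(1'000'000, 1),
                  std::vector<std::array<char, 48>>(1'000'000)};
    std::printf("%f %f\n", sum_prices_aos(aos), sum_prices_soa(soa));
}
```
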
Just-In-Time Compilation: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are great
because they optimize your program for how it's actually being used, rather than how a compiler
expects it to be used. However, there's a variance problem if the program stops executing while
waiting for translation from VM bytecode to native code. As a remedy, many languages support
ahead-of-time compilation in addition to the JIT versions (CoreRT in C# and GraalVM in Java). On the
other hand, LLVM supports Profile Guided Optimization, which theoretically brings JIT benefits to
non-JIT languages. Finally, be careful to avoid comparing apples and oranges during benchmarks; you
don't want your code to suddenly speed up because the JIT compiler kicked in.

Programming Tricks: These won't make or break performance, but can be useful in specific
circumstances. For example, C++ can use templates instead of branches in critical sections.

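A minimal sketch of the templates-instead-of-branches idea (function and parameter names are
illustrative): the decision becomes a template parameter, so each instantiation is compiled with the
branch already resolved and the hot path never tests it.

```cpp
#include <cstdio>

// The Verbose decision is a template parameter: each instantiation is
// compiled with the branch already resolved, so the hot path never tests it.
template <bool Verbose>
long process(long value) {
    if constexpr (Verbose) {
        std::printf("processing %ld\n", value);  // only exists in the <true> version
    }
    return value * 2;  // the actual hot-path work
}

int main() {
    long debug = process<true>(21);   // instrumented instantiation
    long fast  = process<false>(21);  // branch-free instantiation
    return static_cast<int>(debug - fast);
}
```
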
Kernel

Code you wrote is almost certainly not the only code running on your hardware. There are many ways
the operating system interacts with your program, from interrupts to system calls, that are
important to watch for. These are written from a Linux perspective, but Windows typically has
equivalent functionality.

Scheduling: The kernel is normally free to schedule any process on any core, so it's important to
reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
scheduling (the isolcpus kernel command-line option), or by setting the init process CPU affinity
(systemd example). Second, set critical processes to run on the isolated cores by setting the
processor affinity using taskset. Finally, use NO_HZ or chrt to disable scheduling interrupts.
Turning off hyper-threading is also likely beneficial.

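`taskset` works from the shell; to pin from inside the program itself, here's a minimal Linux-only
sketch using `sched_setaffinity(2)` (core 3 is an assumption, and should be one of the cores
isolated via isolcpus):

```cpp
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);  // core 3: assumed to be excluded from scheduling via isolcpus

    // pid 0 means "the calling thread"; after this the kernel won't migrate it
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("running on core %d\n", sched_getcpu());
}
```
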
System calls: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long the
I/O operation takes, these all trigger expensive system calls (syscalls). To handle these, the CPU
must context switch to the kernel, let the kernel operation complete, then context switch back to
your program. We'd rather keep these to a minimum (see timestamp 18:20). `strace` is your friend for
understanding when and where syscalls happen.

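To see the difference for yourself, a deliberately contrived sketch that emits the same bytes first
as one `write(2)` per byte and then as a single call; running each variant under `strace -c` makes
the syscall counts obvious:

```cpp
#include <unistd.h>
#include <cstddef>

int main() {
    const char msg[] = "hello, syscalls\n";

    // One write(2) per byte: a kernel context switch for every character.
    for (std::size_t i = 0; i < sizeof(msg) - 1; ++i)
        write(STDOUT_FILENO, &msg[i], 1);

    // The same bytes in a single syscall.
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}
```
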
Signal Handling: Far less likely to be an issue, but signals do trigger a context switch if your
code has a handler registered. This will be highly dependent on the application, but you can block
signals if it's an issue.

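A minimal sketch of the blocking approach using `pthread_sigmask(2)` (SIGUSR1 stands in for whatever
signal your application actually handles):

```cpp
#include <csignal>

int main() {
    sigset_t block_set, old_set;
    sigemptyset(&block_set);
    sigaddset(&block_set, SIGUSR1);  // whichever signal has a registered handler

    // Hold SIGUSR1 back; instances delivered now stay pending instead of
    // context-switching into a handler mid-measurement.
    pthread_sigmask(SIG_BLOCK, &block_set, &old_set);

    // ... latency-critical section runs uninterrupted ...

    // Restore the old mask; any pending SIGUSR1 is delivered here instead.
    pthread_sigmask(SIG_SETMASK, &old_set, nullptr);
}
```
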
Interrupts: System interrupts are how devices connected to your computer notify the CPU that
something has happened. The CPU will then choose a processor core to pause and context switch to the
OS to handle the interrupt. Make sure that SMP affinity is set so that interrupts are handled on a
CPU core not running the program you care about.

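As a sketch of what setting SMP affinity looks like programmatically (IRQ 24 and the mask are
placeholders; check /proc/interrupts for your device's real IRQ), equivalent to the usual shell
echo:

```cpp
#include <fstream>

int main() {
    // CPU mask "1" = core 0 only: steer IRQ 24 (a placeholder number) away
    // from the isolated cores running the latency-critical threads.
    std::ofstream irq("/proc/irq/24/smp_affinity");
    if (!irq) return 1;  // no such IRQ, or not running as root
    irq << "1\n";
}
```
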
NUMA: While NUMA is good at making multi-cell systems transparent, there are variance implications;
if the kernel moves a process across nodes, future memory accesses must wait for the controller on
the original node. Use numactl to handle memory-/cpu-cell pinning so this doesn't happen.

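`numactl` handles this from the shell; for in-process control, a minimal sketch with libnuma (link
with `-lnuma`; node 0 is an assumed placement):

```cpp
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not available here\n");
        return 1;
    }
    numa_run_on_node(0);  // keep this thread's CPUs on node 0

    // Allocate from node 0's memory bank so no access crosses the interconnect.
    void* buf = numa_alloc_onnode(1 << 20, 0);
    // ... hot-path data lives in buf, always node-local ...
    numa_free(buf, 1 << 20);
}
```
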
Hardware

CPU Pipelining/Speculation: Speculative execution in modern processors gave us vulnerabilities like
Spectre, but it also gave us performance improvements like branch prediction. And when the CPU
mispredicts a branch in your code, there's variance associated with the rewind and replay. While the
compiler knows a lot about how your CPU pipelines instructions, code can be structured to help the
branch predictor.

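The classic demonstration of this effect, sketched minimally below (array size, seed, and threshold
are arbitrary): the same branch over the same values becomes dramatically cheaper once sorting makes
its outcome predictable.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Sum only elements over a threshold; the `if` is the branch under test.
long sum_over_threshold(const std::vector<int>& data) {
    long total = 0;
    for (int v : data)
        if (v >= 128) total += v;
    return total;
}

int main() {
    std::vector<int> data(1 << 20);
    std::mt19937 rng(42);
    for (int& v : data) v = static_cast<int>(rng() % 256);

    long shuffled = sum_over_threshold(data);  // outcome ~random: frequent mispredicts

    std::sort(data.begin(), data.end());       // same values, predictable pattern:
    long sorted = sum_over_threshold(data);    // all-false, then all-true

    std::printf("%ld %ld\n", shuffled, sorted);  // equal sums, very different timings
}
```
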
Paging: For most systems, virtual memory is incredible. Applications live in their own worlds, and
the CPU/MMU figures out the details. However, there's a variance penalty associated with memory
paging and caching; if you access more memory pages than the TLB can store, you'll have to wait for
the page walk. Kernel perf tools are necessary to figure out if this is an issue, but using huge
pages can reduce TLB burdens. Alternately, running applications in a hypervisor like Jailhouse
allows one to skip virtual memory entirely, but this is probably more work than the benefits are
worth.

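As one concrete option, a minimal Linux-only sketch that requests transparent huge pages for a hot
buffer via `madvise(2)` (the 64 MiB size is illustrative):

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t len = 64UL << 20;  // 64 MiB hot working set
    void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Ask for 2 MiB huge pages where possible: a full scan then needs 32 TLB
    // entries instead of 16,384 with 4 KiB pages.
    if (madvise(buf, len, MADV_HUGEPAGE) != 0) std::perror("madvise");

    munmap(buf, len);
}
```
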
Network Interfaces: When more than one computer is involved, variance can go up dramatically. Tuning
kernel network parameters may be helpful, but modern systems more frequently opt to skip the kernel
altogether with a technique called kernel bypass. This typically requires specialized hardware and
drivers, but even industries like telecom are finding the benefits.

Networks

Routing: There's a reason financial firms are willing to pay millions of euros for rights to a small
plot of land - having a straight-line connection from point A to point B means the path their data
takes is the shortest possible. In contrast, there are currently 6 computers in between me and
Google, but that may change at any moment if my ISP realizes a more efficient route is available.
Whether it's using research-quality equipment for shortwave radio, or just making sure there's no
data inadvertently going between data centers, routing matters.

Protocol: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control, and
congestion control all built in. But these attributes make the most sense when networking
infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the
setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may
make sense in these contexts as it avoids the chatter needed to track connection state, and
gap-fill strategies can handle the rest.

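For flavor, a minimal sketch of the UDP sender side (address, port, and payload are placeholders): a
single `sendto(2)`, no handshake and no acknowledgment round-trips, with sequence-number gap-fill
left to the receiving application:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);  // UDP: no connection state at all
    if (fd < 0) { std::perror("socket"); return 1; }

    sockaddr_in dest{};
    dest.sin_family = AF_INET;
    dest.sin_port = htons(9000);                      // placeholder port
    inet_pton(AF_INET, "239.1.1.1", &dest.sin_addr);  // placeholder multicast group

    const char payload[] = "seq=1 tick";  // receivers gap-fill on missing seq numbers
    // One syscall, no handshake, no acknowledgment round-trips.
    sendto(fd, payload, sizeof(payload) - 1, 0,
           reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));
    close(fd);
}
```
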
Switching: Many routers/switches handle packets using "store-and-forward" behavior: wait for the
whole packet, validate checksums, and then send to the next device. In variance terms, the time
needed to move data between two nodes is proportional to the size of that data; the switch must
"store" all data before it can calculate checksums and "forward" to the next node. With
"cut-through" designs, switches will begin forwarding data as soon as they know where the
destination is, checksums be damned. This means there's a fixed cost (at the switch) for network
traffic, no matter the size.

Final Thoughts

High-performance systems, regardless of industry, are not magical. They do require extreme precision
and attention to detail, but they're designed, built, and operated by regular people, using a lot of
tools that are publicly available. Interested in seeing how context switching affects performance of
your benchmarks? `taskset` should be installed in all modern Linux distributions, and can be used to
make sure the OS never migrates your process. Curious how often garbage collection triggers during a
crucial operation? Your language of choice will typically expose details of its operations (Python,
Java). Want to know how hard your program is stressing the TLB? Use `perf record` and look for
`dtlb_load_misses.miss_causes_a_walk`.

Two final guiding questions, then: first, before attempting to apply some of the technology above to
your own systems, can you first identify where/when you care about "high-performance"? As an
example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable
effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are
they being designed in a way that's actually helpful? Tools like Criterion (also in Rust) and
Google's Benchmark output not only average run time, but variance as well; your benchmarking
environment is subject to the same concerns your production environment is.

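As a minimal sketch of variance-aware benchmarking with Google Benchmark (the benchmarked body is a
stand-in): asking for repetitions makes it report mean, median, and stddev rows rather than a single
average.

```cpp
#include <benchmark/benchmark.h>
#include <vector>

static void BM_PushBack(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v;
        for (int i = 0; i < 1024; ++i) v.push_back(i);
        benchmark::DoNotOptimize(v.data());  // keep the work from being elided
    }
}
// 20 repetitions -> mean/median/stddev aggregate rows; stddev is the
// variance signal this whole post is about.
BENCHMARK(BM_PushBack)->Repetitions(20)->ReportAggregatesOnly(true);

BENCHMARK_MAIN();
```
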
Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique.
Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate
it; once that's at an acceptable level, then optimize for speed.