Allocation rewording

timing
Bradlee Speice 2019-06-05 18:12:12 -04:00
parent 53d7b03a14
commit 3f561c959c
1 changed files with 13 additions and 11 deletions

View File

@ -12,16 +12,18 @@ Prior to working in the trading industry, my assumption was that High Frequency
> How I assumed HFT people learn their secret techniques
How else do you explain people working on systems that complete the round trip of market data in to orders out (a.k.a. tick-to-trade) consistently within [750-800 nanoseconds](https://stackoverflow.com/a/22082528/1454178)?
In roughly the time it takes other computers to access [main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html), trading systems are capable of reading the market data packets, deciding what orders to send, (presumably) doing risk checks, creating new packets for exchange-specific protocols, and putting those packets on the wire.
In roughly the time it takes a computer to access [main memory 8 times](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html), trading systems are capable of reading the market data packets, deciding what orders to send, doing risk checks, creating new packets for exchange-specific protocols, and putting those packets on the wire.
Having now worked in the trading industry, I can confirm the developers are mortal; I've made some simple mistakes at the very least. Instead, what shows up in public discussions is that philosophy, not technique, separates high-performance systems from everything else. Performance-critical systems don't rely on C++ optimization tricks to make code fast (though they have their place); there's a lot more to worry about than just the code written for the project. What I focus on now is **variance**, and reducing the gap between the fastest and slowest runs of the same code.
Having now worked in the trading industry, I can confirm the developers aren't super-human; I've made some simple mistakes at the very least. Instead, what shows up in public discussions is that philosophy, not technique, separates high-performance systems from everything else. Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast (though micro-optimizations have their place); there's a lot more to worry about than just the code written for the project.
Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20 seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes a full minute to boot up because of corrupted sectors? Not so great. Average speed is the same in each situation, but you're painfully aware of that minute when it happens. When it comes to code, the principal is the same: speeding up a function by an average 10 milliseconds doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When performance matters, you need to respond quickly *every time*, not just in aggregate. **High-performance systems should first optimize for time variance**. Once you're consistent at the time scale you care about, then focus on improving overall time.
The framework I'd propose is this: **If you want to build high-performance systems, focus first on reducing performance variance** (reducing the gap between the fastest and slowest runs of the same code), **and only look at average latency once variance is at an acceptable level**.
Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20 seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a week is the same in each situation, but you're painfully aware of that minute when it happens. When it comes to code, the principal is the same: speeding up a function by an average 10 milliseconds doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When performance matters, you need to respond quickly *every time*, not just in aggregate. High-performance systems should first optimize for time variance. Once you're consistent at the time scale you care about, then focus on improving overall time.
This focus on variance shows up all the time in public discussions (emphasis added in all quotes below):
- In [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for NASDAQ's matching engine, they mention consistency alongside average latency:
> Able to **consistently sustain an order rate** of over 100,000 orders per second at sub-40 microsecond average latency
- Consistency shows up in [marketing materials](https://business.nasdaq.com/market-tech/marketplaces/trading) for NASDAQ's matching engine, the most performance-sensitive component of the exchange:
> Able to **consistently sustain** an order rate of over 100,000 orders per second at sub-40 microsecond average latency
- The [Aeron](https://github.com/real-logic/aeron) message bus has this to say about performance:
> Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and **most predictable latency possible** of any messaging system
@ -29,14 +31,14 @@ This focus on variance shows up all the time in public discussions (emphasis add
- The company PolySync, which is working on autonomous vehicles, [mentions why](https://polysync.io/blog/session-types-for-hearty-codecs/) they picked their specific messaging format:
> In general, high performance is almost always desirable for serialization. But in the world of autonomous vehicles, **steady timing performance is even more important** than peak throughput. This is because safe operation is sensitive to timing outliers. Nobody wants the system that decides when to slam on the brakes to occasionally take 100 times longer than usual to encode its commands.
- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out variance as a big concern for [electronic trading](https://solarflare.com/electronic-trading/):
- [Solarflare](https://solarflare.com/), which makes highly-specialized network hardware, points out variance (jitter) as a big concern for [electronic trading](https://solarflare.com/electronic-trading/):
> The high stakes world of electronic trading, investment banks, market makers, hedge funds and exchanges demand the **lowest possible latency and jitter** while utilizing the highest bandwidth and return on their investment.
One more important thing to note: we're not discussing *total runtime*, but variance of that total runtime. For example, trading firms use [wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because the speed of light through air is faster than through fiber-optic cables. There's still at *absolute minimum* a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between, say, [Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago) (unless [neutrinos](https://arxiv.org/pdf/1203.2847.pdf) become [a thing](https://news.ycombinator.com/item?id=10979340)). If a trading system in Chicago calls the function for "send order to Tokyo" and waits for a result, there's a physical limit to how long that will take. In this situation, the focus is on keeping variance of additional processing to a minimum.
And one more clarification: we're not discussing how to reduce *total run-time*, but its variance. For example, trading firms use [wireless networks](https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/) because the speed of light through air is faster than through fiber-optic cables. There's still at *absolute minimum* a [~33.76 millisecond](http://tinyurl.com/y2vd7tn8) delay required to send data between, say, [Chicago and Tokyo](https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago) (unless [neutrinos](https://arxiv.org/pdf/1203.2847.pdf) become [a thing](https://news.ycombinator.com/item?id=10979340)). If a trading system in Chicago calls the function for "send order to Tokyo" and waits for a result, there's a physical limit to how long that will take. In this situation, the focus is on keeping variance of additional processing to a minimum.
So how exactly does one go about looking for and eliminating performance variance? To tell the truth, I don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep understanding of the entire technology stack, and (B) actually measuring system performance (though (C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos never hurt). Even then, each project cares about performance to a different degree; you may need to build an entire [replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to accurately benchmark at nanosecond precision. Alternately, you may be content to simply [avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in your Java code.
So how exactly does one go about looking for and eliminating performance variance? To tell the truth, I don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep understanding of the entire technology stack, and (B) actually measuring system performance (though (C) watching a lot of [CppCon](https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg) videos never hurt). Even then, every project cares about performance to a different degree; you may need to build an entire [replica production system](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015) to accurately benchmark at nanosecond precision, or you may be content to simply [avoid garbage collection](https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335) in your Java code.
Even though everyone has different needs, there are still common things to look for when trying to isolate variance. In no particular order, these are my focus areas when thinking about high-performance systems:
Even though everyone has different needs, there are still common things to look for when trying to isolate and eliminate variance. In no particular order, these are my focus areas when thinking about high-performance systems:
## Language-specific
@ -44,11 +46,11 @@ Even though everyone has different needs, there are still common things to look
- [In Python](https://rushter.com/blog/python-garbage-collector/), individual objects are collected if the reference count reaches 0, and each generation is collected if `num_alloc - num_dealloc > gc_threshold` whenever an allocation happens. The GIL is acquired for the duration of generational collection.
- Java has [many](https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE) [different](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573) [collection](https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A) [algorithms](https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0) to choose from, each with different characteristics. The default algorithms (Parallel GC in Java 8, G1 in Java 9) freeze the JVM while collecting, while more recent algorithms ([ZGC](https://wiki.openjdk.java.net/display/zgc) and [Shenandoah](https://wiki.openjdk.java.net/display/shenandoah)) are designed to keep "stop the world" to a minimum by doing collection work in parallel.
**Allocation**: Every language has a different way of interacting with "heap" memory, but the principle is the same; figuring out what chunks of memory are available to give to a program is complex. Understanding when your language interacts with the allocator is crucial (and I wrote [a guide for Rust](https://speice.io/2019/02/understanding-allocations-in-rust.html)). Maybe look into alternative allocators like [jemalloc](http://jemalloc.net/) and [tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html).
**Allocation**: Every language has a different way of interacting with "heap" memory, but the principle is the same: running the allocator to allocate/deallocate memory takes time that can often be put to better use. Understanding when your language interacts with the allocator is crucial, and not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java does (meaning potential GC pauses). Take time to understand heap behavior (I made a [a guide for Rust](https://speice.io/2019/02/understanding-allocations-in-rust.html)), and look into alternative allocators ([jemalloc](http://jemalloc.net/), [tcmalloc](https://gperftools.github.io/gperftools/tcmalloc.html)) that might run faster than the operating system default.
**Data Layout**: How your data is arranged in memory matters; [data-oriented design](https://www.youtube.com/watch?v=yy8jQgmhbAU) and [cache locality](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185) can have huge impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't make. [Cachegrind](http://valgrind.org/docs/manual/cg-manual.html) and kernel [perf](https://perf.wiki.kernel.org/index.php/Main_Page) counters are both great for understanding how performance relates to memory layout.
**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are great because they optimize your program for how it's actually being used, rather than how the compiler expects it to be used. However, there's a variance cost associated with this because the virtual machine may stop executing while it waits for translation from VM bytecode to native code. As a remedy, many of these languages now support ahead-of-time compilation ([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java). On the other hand, LLVM supports [Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization), which should bring JIT-like benefits to non-JIT languages. Be careful when benchmarking to avoid comparing apples and oranges; you don't want a benchmark to suddenly speed up because the JIT compiler kicked in.
**Just-In-Time Compilation**: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are great because they optimize your program for how it's actually being used, rather than how a compiler expects it to be used. However, there's a variance problem if the program stops executing while waiting for translation from VM bytecode to native code. As a remedy, many languages support ahead-of-time compilation in addition to the JIT versions ([CoreRT](https://github.com/dotnet/corert) in C# and [GraalVM](https://www.graalvm.org/) in Java). On the other hand, LLVM supports [Profile Guided Optimization](https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization), which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT compiler kicked in.
**Programming Tricks**: These won't make or break performance, but can be useful in specific circumstances. For example, C++ can use [templates instead of branches](https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206) in critical sections.