speice.io/2019/06/high-performance-systems/index.html

<!doctype html><html lang=en dir=ltr class="blog-wrapper blog-post-page plugin-blog plugin-id-default" data-has-hydrated=false><meta charset=UTF-8><meta name=generator content="Docusaurus v3.7.0"><title data-rh=true>On building high performance systems | The Old Speice Guy</title><meta data-rh=true name=viewport content="width=device-width, initial-scale=1.0"><meta data-rh=true name=twitter:card content=summary_large_image><meta data-rh=true property=og:url content=https://speice.io/2019/06/high-performance-systems/><meta data-rh=true property=og:locale content=en><meta data-rh=true name=docusaurus_locale content=en><meta data-rh=true name=docusaurus_tag content=default><meta data-rh=true name=docsearch:language content=en><meta data-rh=true name=docsearch:docusaurus_tag content=default><meta data-rh=true property=og:title content="On building high performance systems | The Old Speice Guy"><meta data-rh=true name=description content="Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is"><meta data-rh=true property=og:description content="Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is"><meta data-rh=true property=og:type content=article><meta data-rh=true property=article:published_time content=2019-07-01T12:00:00.000Z><link data-rh=true rel=icon href=/img/favicon.ico><link data-rh=true rel=canonical href=https://speice.io/2019/06/high-performance-systems/><link data-rh=true rel=alternate href=https://speice.io/2019/06/high-performance-systems/ hreflang=en><link data-rh=true rel=alternate href=https://speice.io/2019/06/high-performance-systems/ hreflang=x-default><script data-rh=true type=application/ld+json>{"@context":"https://schema.org","@id":"https://speice.io/2019/06/high-performance-systems","@type":"BlogPosting","author":{"@type":"Person","name":"Bradlee Speice"},"dateModified":"2024-11-10T21:43:14.000Z","datePublished":"2019-07-01T12:00:00.000Z","description":"Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is","headline":"On building high performance systems","isPartOf":{"@id":"https://speice.io/","@type":"Blog","name":"Blog"},"keywords":[],"mainEntityOfPage":"https://speice.io/2019/06/high-performance-systems","name":"On building high performance systems","url":"https://speice.io/2019/06/high-performance-systems"}</script><link rel=alternate type=application/rss+xml href=/rss.xml title="The Old Speice Guy RSS Feed"><link rel=alternate type=application/atom+xml href=/atom.xml title="The Old Speice Guy Atom Feed"><link rel=stylesheet href=/katex/katex.min.css type=text/css><link rel=stylesheet href=/assets/css/styles.24ac2c37.css><script src=/assets/js/runtime~main.75ada3c5.js defer></script><script src=/assets/js/main.d0bb06d2.js defer></script><body class=navigation-with-keyboard><script>!function(){var t,e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return window.localStorage.getItem("theme")}catch(t){}}();t=null!==e?e:"light",document.documentElement.setAttribute("data-theme",t)}(),function(){try{for(var[t,e]of new URLSearchParams(window.location.search).entries())if(t.startsWith("docusaurus-data-")){var a=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}()</script><div id=__docusaurus><div role=region aria-label="Skip to main content"><a class=skipToContent_fXgn href=#__docusaurus_skipToContent_fallback>Skip to main content</a></div><nav aria-label=Main class="navbar navbar--fixed-top"><div class=navbar__inner><div class=navbar__items><button aria-label="Toggle navigation bar" aria-expanded=false class="navbar__toggle clean-btn" type=button><svg width=30 height=30 viewBox="0 0 30 30" aria-hidden=true><path stroke=currentColor stroke-linecap=round stroke-miterlimit=10 stroke-width=2 d="M4 7h22M4 15h22M4 23h22"/></svg></button><a class=navbar__brand href=/><div class=navbar__logo><img src=/img/logo.svg alt="Sierpinski Gasket" class="themedComponent_mlkZ 
made up of people who have access to secret techniques mortal developers could only dream of. There
had to be some secret art that could only be learned if one had an appropriately tragic backstory.</p>
<p><img decoding=async loading=lazy alt="Kung Fu fight" src=/assets/images/kung-fu-5715f30eef7bf3aaa26770b1247024dc.webp width=426 height=240 class=img_ev3q></p>
<blockquote>
<p>How I assumed HFT people learn their secret techniques</p>
</blockquote>
<p>How else do you explain people working on systems that complete the round trip of market data in to
orders out (a.k.a. tick-to-trade) consistently within
<a href=https://stackoverflow.com/a/22082528/1454178 target=_blank rel="noopener noreferrer">750-800 nanoseconds</a>? In roughly the time it takes a
computer to access
<a href=https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html target=_blank rel="noopener noreferrer">main memory 8 times</a>,
trading systems are capable of reading the market data packets, deciding what orders to send, doing
risk checks, creating new packets for exchange-specific protocols, and putting those packets on the
wire.</p>
<p>Having now worked in the trading industry, I can confirm the developers aren't super-human; I've
made some simple mistakes at the very least. Instead, what shows up in public discussions is that
philosophy, not technique, separates high-performance systems from everything else.
Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast
(though micro-optimizations have their place); there's a lot more to worry about than just the code
written for the project.</p>
<p>The framework I'd propose is this: <strong>If you want to build high-performance systems, focus first on
reducing performance variance</strong> (reducing the gap between the fastest and slowest runs of the same
code), <strong>and only look at average latency once variance is at an acceptable level</strong>.</p>
<p>Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20
seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes
a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a
week is the same in each situation, but you're painfully aware of that minute when it happens. When
it comes to code, the principal is the same: speeding up a function by an average of 10 milliseconds
doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When
performance matters, you need to respond quickly <em>every time</em>, not just in aggregate.
High-performance systems should first optimize for time variance. Once you're consistent at the time
scale you care about, then focus on improving average time.</p>
<p>This focus on variance shows up all the time in industry too (emphasis added in all quotes below):</p>
<ul>
<li>
<p>In <a href=https://business.nasdaq.com/market-tech/marketplaces/trading target=_blank rel="noopener noreferrer">marketing materials</a> for
NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability
is highlighted in addition to instantaneous metrics:</p>
<blockquote>
<p>Able to <strong>consistently sustain</strong> an order rate of over 100,000 orders per second at sub-40
microsecond average latency</p>
</blockquote>
</li>
<li>
<p>The <a href=https://github.com/real-logic/aeron target=_blank rel="noopener noreferrer">Aeron</a> message bus has this to say about performance:</p>
<blockquote>
<p>Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and
<strong>most predictable latency possible</strong> of any messaging system</p>
</blockquote>
</li>
<li>
<p>The company PolySync, which is working on autonomous vehicles,
<a href=https://polysync.io/blog/session-types-for-hearty-codecs/ target=_blank rel="noopener noreferrer">mentions why</a> they picked their
specific messaging format:</p>
<blockquote>
<p>In general, high performance is almost always desirable for serialization. But in the world of
autonomous vehicles, <strong>steady timing performance is even more important</strong> than peak throughput.
This is because safe operation is sensitive to timing outliers. Nobody wants the system that
decides when to slam on the brakes to occasionally take 100 times longer than usual to encode
its commands.</p>
</blockquote>
</li>
<li>
<p><a href=https://solarflare.com/ target=_blank rel="noopener noreferrer">Solarflare</a>, which makes highly-specialized network hardware, points out
variance (jitter) as a big concern for
<a href=https://solarflare.com/electronic-trading/ target=_blank rel="noopener noreferrer">electronic trading</a>:</p>
<blockquote>
<p>The high stakes world of electronic trading, investment banks, market makers, hedge funds and
exchanges demand the <strong>lowest possible latency and jitter</strong> while utilizing the highest
bandwidth and return on their investment.</p>
</blockquote>
</li>
</ul>
<p>And to further clarify: we're not discussing <em>total run-time</em>, but variance of total run-time. There
are situations where it's not reasonably possible to make things faster, and you'd much rather be
consistent. For example, trading firms use
<a href=https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/ target=_blank rel="noopener noreferrer">wireless networks</a> because
the speed of light through air is faster than through fiber-optic cables. There's still at <em>absolute
minimum</em> a <a href=http://tinyurl.com/y2vd7tn8 target=_blank rel="noopener noreferrer">~33.76 millisecond</a> delay required to send data between,
say,
<a href=https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicago target=_blank rel="noopener noreferrer">Chicago and Tokyo</a>.
If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a
trade occurs, there's a physical limit to how long that will take. In this situation, the focus is
on keeping variance of <em>additional processing</em> to a minimum, since speed of light is the limiting
factor.</p>
<p>So how does one go about looking for and eliminating performance variance? To tell the truth, I
don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep
understanding of the entire technology stack, and (B) actually measuring system performance (though
(C) watching a lot of <a href=https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXg target=_blank rel="noopener noreferrer">CppCon</a> videos for
inspiration never hurt). Even then, every project cares about performance to a different degree; you
may need to build an entire
<a href="https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015" target=_blank rel="noopener noreferrer">replica production system</a> to
accurately benchmark at nanosecond precision, or you may be content to simply
<a href="https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335" target=_blank rel="noopener noreferrer">avoid garbage collection</a> in
your Java code.</p>
<p>Even though everyone has different needs, there are still common things to look for when trying to
isolate and eliminate variance. In no particular order, these are my focus areas when thinking about
high-performance systems:</p>
<p><strong>Update 2019-09-21</strong>: Added notes on <code>isolcpus</code> and <code>systemd</code> affinity.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=language-specific>Language-specific<a href=#language-specific class=hash-link aria-label="Direct link to Language-specific" title="Direct link to Language-specific"></a></h2>
<p><strong>Garbage Collection</strong>: How often does garbage collection happen? When is it triggered? What are the
impacts?</p>
<ul>
<li><a href=https://rushter.com/blog/python-garbage-collector/ target=_blank rel="noopener noreferrer">In Python</a>, individual objects are collected
if the reference count reaches 0, and each generation is collected if
<code>num_alloc - num_dealloc > gc_threshold</code> whenever an allocation happens. The GIL is acquired for
the duration of generational collection.</li>
<li>Java has
<a href=https://docs.oracle.com/en/java/javase/12/gctuning/parallel-collector1.html#GUID-DCDD6E46-0406-41D1-AB49-FB96A50EB9CE target=_blank rel="noopener noreferrer">many</a>
<a href=https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573 target=_blank rel="noopener noreferrer">different</a>
<a href=https://docs.oracle.com/en/java/javase/12/gctuning/garbage-first-garbage-collector-tuning.html#GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A target=_blank rel="noopener noreferrer">collection</a>
<a href=https://docs.oracle.com/en/java/javase/12/gctuning/z-garbage-collector1.html#GUID-A5A42691-095E-47BA-B6DC-FB4E5FAA43D0 target=_blank rel="noopener noreferrer">algorithms</a>
to choose from, each with different characteristics. The default algorithms (Parallel GC in Java
8, G1 in Java 9) freeze the JVM while collecting, while more recent algorithms
(<a href=https://wiki.openjdk.java.net/display/zgc target=_blank rel="noopener noreferrer">ZGC</a> and
<a href=https://wiki.openjdk.java.net/display/shenandoah target=_blank rel="noopener noreferrer">Shenandoah</a>) are designed to keep "stop the
world" to a minimum by doing collection work in parallel.</li>
</ul>
<p><strong>Allocation</strong>: Every language has a different way of interacting with "heap" memory, but the
principle is the same: running the allocator to allocate/deallocate memory takes time that can often
be put to better use. Understanding when your language interacts with the allocator is crucial, and
not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java
does (meaning potential GC pauses). Take time to understand heap behavior (I made a
<a href=/2019/02/understanding-allocations-in-rust/>a guide for Rust</a>), and look into alternative
allocators (<a href=http://jemalloc.net/ target=_blank rel="noopener noreferrer">jemalloc</a>,
<a href=https://gperftools.github.io/gperftools/tcmalloc.html target=_blank rel="noopener noreferrer">tcmalloc</a>) that might run faster than the
operating system default.</p>
<p><strong>Data Layout</strong>: How your data is arranged in memory matters;
<a href="https://www.youtube.com/watch?v=yy8jQgmhbAU" target=_blank rel="noopener noreferrer">data-oriented design</a> and
<a href="https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185" target=_blank rel="noopener noreferrer">cache locality</a> can have huge
impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have
guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't
make. <a href=http://valgrind.org/docs/manual/cg-manual.html target=_blank rel="noopener noreferrer">Cachegrind</a> and kernel
<a href=https://perf.wiki.kernel.org/index.php/Main_Page target=_blank rel="noopener noreferrer">perf</a> counters are both great for understanding
how performance relates to memory layout.</p>
<p><strong>Just-In-Time Compilation</strong>: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are
great because they optimize your program for how it's actually being used, rather than how a
compiler expects it to be used. However, there's a variance problem if the program stops executing
while waiting for translation from VM bytecode to native code. As a remedy, many languages support
ahead-of-time compilation in addition to the JIT versions
(<a href=https://github.com/dotnet/corert target=_blank rel="noopener noreferrer">CoreRT</a> in C# and <a href=https://www.graalvm.org/ target=_blank rel="noopener noreferrer">GraalVM</a> in Java).
On the other hand, LLVM supports
<a href=https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization target=_blank rel="noopener noreferrer">Profile Guided Optimization</a>,
which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing
apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT
compiler kicked in.</p>
<p><strong>Programming Tricks</strong>: These won't make or break performance, but can be useful in specific
circumstances. For example, C++ can use
<a href="https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206" target=_blank rel="noopener noreferrer">templates instead of branches</a>
in critical sections.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=kernel>Kernel<a href=#kernel class=hash-link aria-label="Direct link to Kernel" title="Direct link to Kernel"></a></h2>
<p>Code you wrote is almost certainly not the <em>only</em> code running on your hardware. There are many ways
the operating system interacts with your program, from interrupts to system calls, that are
important to watch for. These are written from a Linux perspective, but Windows does typically have
equivalent functionality.</p>
<p><strong>Scheduling</strong>: The kernel is normally free to schedule any process on any core, so it's important
to reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
scheduling
(<a href=https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html target=_blank rel="noopener noreferrer"><code>isolcpus</code></a>
kernel command-line option), or by setting the <code>init</code> process CPU affinity
(<a href=https://access.redhat.com/solutions/2884991 target=_blank rel="noopener noreferrer"><code>systemd</code> example</a>). Second, set critical processes
to run on the isolated cores by setting the
<a href=https://en.wikipedia.org/wiki/Processor_affinity target=_blank rel="noopener noreferrer">processor affinity</a> using
<a href=https://linux.die.net/man/1/taskset target=_blank rel="noopener noreferrer">taskset</a>. Finally, use
<a href=https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt target=_blank rel="noopener noreferrer"><code>NO_HZ</code></a> or
<a href=https://linux.die.net/man/1/chrt target=_blank rel="noopener noreferrer"><code>chrt</code></a> to disable scheduling interrupts. Turning off
hyper-threading is also likely beneficial.</p>
<p><strong>System calls</strong>: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long
the I/O operation takes, these all trigger expensive
<a href=https://en.wikipedia.org/wiki/System_call target=_blank rel="noopener noreferrer">system calls (syscalls)</a>. To handle these, the CPU must
<a href=https://en.wikipedia.org/wiki/Context_switch target=_blank rel="noopener noreferrer">context switch</a> to the kernel, let the kernel
operation complete, then context switch back to your program. We'd rather keep these
<a href=https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript target=_blank rel="noopener noreferrer">to a minimum</a> (see
timestamp 18:20). <a href=https://linux.die.net/man/1/strace target=_blank rel="noopener noreferrer">Strace</a> is your friend for understanding when
and where syscalls happen.</p>
<p><strong>Signal Handling</strong>: Far less likely to be an issue, but signals do trigger a context switch if your
code has a handler registered. This will be highly dependent on the application, but you can
<a href="https://www.linuxprogrammingblog.com/all-about-linux-signals?page=show#Blocking_signals" target=_blank rel="noopener noreferrer">block signals</a>
if it's an issue.</p>
<p><strong>Interrupts</strong>: System interrupts are how devices connected to your computer notify the CPU that
something has happened. The CPU will then choose a processor core to pause and context switch to the
OS to handle the interrupt. Make sure that
<a href=http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux target=_blank rel="noopener noreferrer">SMP affinity</a> is
set so that interrupts are handled on a CPU core not running the program you care about.</p>
<p><strong><a href=https://www.kernel.org/doc/html/latest/vm/numa.html target=_blank rel="noopener noreferrer">NUMA</a></strong>: While NUMA is good at making
multi-cell systems transparent, there are variance implications; if the kernel moves a process
across nodes, future memory accesses must wait for the controller on the original node. Use
<a href=https://linux.die.net/man/8/numactl target=_blank rel="noopener noreferrer">numactl</a> to handle memory-/cpu-cell pinning so this doesn't
happen.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=hardware>Hardware<a href=#hardware class=hash-link aria-label="Direct link to Hardware" title="Direct link to Hardware"></a></h2>
<p><strong>CPU Pipelining/Speculation</strong>: Speculative execution in modern processors gave us vulnerabilities
like Spectre, but it also gave us performance improvements like
<a href=https://stackoverflow.com/a/11227902/1454178 target=_blank rel="noopener noreferrer">branch prediction</a>. And if the CPU mis-speculates
your code, there's variance associated with rewind and replay. While the compiler knows a lot about
how your CPU <a href="https://youtu.be/nAbCKa0FzjQ?t=4467" target=_blank rel="noopener noreferrer">pipelines instructions</a>, code can be
<a href="https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755" target=_blank rel="noopener noreferrer">structured to help</a> the branch
predictor.</p>
<p><strong>Paging</strong>: For most systems, virtual memory is incredible. Applications live in their own worlds,
and the CPU/<a href=https://en.wikipedia.org/wiki/Memory_management_unit target=_blank rel="noopener noreferrer">MMU</a> figures out the details.
However, there's a variance penalty associated with memory paging and caching; if you access more
memory pages than the <a href=https://en.wikipedia.org/wiki/Translation_lookaside_buffer target=_blank rel="noopener noreferrer">TLB</a> can store,
you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an
issue, but using <a href=https://blog.pythian.com/performance-tuning-hugepages-in-linux/ target=_blank rel="noopener noreferrer">huge pages</a> can
reduce TLB burdens. Alternately, running applications in a hypervisor like
<a href=https://github.com/siemens/jailhouse target=_blank rel="noopener noreferrer">Jailhouse</a> allows one to skip virtual memory entirely, but
this is probably more work than the benefits are worth.</p>
<p><strong>Network Interfaces</strong>: When more than one computer is involved, variance can go up dramatically.
Tuning kernel
<a href=https://github.com/leandromoreira/linux-network-performance-parameters target=_blank rel="noopener noreferrer">network parameters</a> may be
helpful, but modern systems more frequently opt to skip the kernel altogether with a technique
called <a href=https://blog.cloudflare.com/kernel-bypass/ target=_blank rel="noopener noreferrer">kernel bypass</a>. This typically requires
specialized hardware and <a href=https://www.openonload.org/ target=_blank rel="noopener noreferrer">drivers</a>, but even industries like
<a href=https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass target=_blank rel="noopener noreferrer">telecom</a> are
finding the benefits.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=networks>Networks<a href=#networks class=hash-link aria-label="Direct link to Networks" title="Direct link to Networks"></a></h2>
<p><strong>Routing</strong>: There's a reason financial firms are willing to pay
<a href=https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/ target=_blank rel="noopener noreferrer">millions of euros</a>
for rights to a small plot of land - having a straight-line connection from point A to point B means
the path their data takes is the shortest possible. In contrast, there are currently 6 computers in
between me and Google, but that may change at any moment if my ISP realizes a
<a href=https://en.wikipedia.org/wiki/Border_Gateway_Protocol target=_blank rel="noopener noreferrer">more efficient route</a> is available. Whether
it's using
<a href=https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-trading-part-i-the-west-chicago-tower-mystery/ target=_blank rel="noopener noreferrer">research-quality equipment</a>
for shortwave radio, or just making sure there's no data inadvertently going between data centers,
routing matters.</p>
<p><strong>Protocol</strong>: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control,
and congestion control all built in. But these attributes make the most sense when networking
infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the
setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may
make sense in these contexts as it avoids the chatter needed to track connection state, and
<a href=https://iextrading.com/docs/IEX%20Transport%20Specification.pdf target=_blank rel="noopener noreferrer">gap-fill</a>
<a href=http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf target=_blank rel="noopener noreferrer">strategies</a>
can handle the rest.</p>
<p><strong>Switching</strong>: Many routers/switches handle packets using "store-and-forward" behavior: wait for the
whole packet, validate checksums, and then send to the next device. In variance terms, the time
needed to move data between two nodes is proportional to the size of that data; the switch must
"store" all data before it can calculate checksums and "forward" to the next node. With
<a href=https://www.networkworld.com/article/2241573/latency-and-jitter--cut-through-design-pays-off-for-arista--blade.html target=_blank rel="noopener noreferrer">"cut-through"</a>
designs, switches will begin forwarding data as soon as they know where the destination is,
checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter
the size.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=final-thoughts>Final Thoughts<a href=#final-thoughts class=hash-link aria-label="Direct link to Final Thoughts" title="Direct link to Final Thoughts"></a></h2>
<p>High-performance systems, regardless of industry, are not magical. They do require extreme precision
and attention to detail, but they're designed, built, and operated by regular people, using a lot of
tools that are publicly available. Interested in seeing how context switching affects performance of
your benchmarks? <code>taskset</code> should be installed in all modern Linux distributions, and can be used to
make sure the OS never migrates your process. Curious how often garbage collection triggers during a
crucial operation? Your language of choice will typically expose details of its operations
(<a href=https://docs.python.org/3/library/gc.html target=_blank rel="noopener noreferrer">Python</a>,
<a href=https://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#DebuggingOptions target=_blank rel="noopener noreferrer">Java</a>).
Want to know how hard your program is stressing the TLB? Use <code>perf record</code> and look for
<code>dtlb_load_misses.miss_causes_a_walk</code>.</p>
<p>Two final guiding questions, then: first, before attempting to apply some of the technology above to
your own systems, can you first identify
<a href=http://wiki.c2.com/?PrematureOptimization target=_blank rel="noopener noreferrer">where/when you care</a> about "high-performance"? As an
example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable
effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are
they being designed in a way that's actually helpful? Tools like
<a href=http://www.serpentine.com/criterion/ target=_blank rel="noopener noreferrer">Criterion</a> (also in
<a href=https://github.com/bheisler/criterion.rs target=_blank rel="noopener noreferrer">Rust</a>) and Google's
<a href=https://github.com/google/benchmark target=_blank rel="noopener noreferrer">Benchmark</a> output not only average run time, but variance as
well; your benchmarking environment is subject to the same concerns your production environment is.</p>
<p>Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique.
Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate
it; once that's at an acceptable level, then optimize for speed.</div></article><nav class="pagination-nav docusaurus-mt-lg" aria-label="Blog post page navigation"><a class="pagination-nav__link pagination-nav__link--prev" href=/2019/05/making-bread/><div class=pagination-nav__sublabel>Older post</div><div class=pagination-nav__label>Making bread</div></a><a class="pagination-nav__link pagination-nav__link--next" href=/2019/09/binary-format-shootout/><div class=pagination-nav__sublabel>Newer post</div><div class=pagination-nav__label>Binary format shootout</div></a></nav></main><div class="col col--2"><div class="tableOfContents_bqdL thin-scrollbar"><ul class="table-of-contents table-of-contents__left-border"><li><a href=#language-specific class="table-of-contents__link toc-highlight">Language-specific</a><li><a href=#kernel class="table-of-contents__link toc-highlight">Kernel</a><li><a href=#hardware class="table-of-contents__link toc-highlight">Hardware</a><li><a href=#networks class="table-of-contents__link toc-highlight">Networks</a><li><a href=#final-thoughts class="table-of-contents__link toc-highlight">Final Thoughts</a></ul></div></div></div></div></div><footer class=footer><div class="container container-fluid"><div class="footer__bottom text--center"><div class=footer__copyright>Copyright © 2025 Bradlee Speice</div></div></div></footer></div>