<!doctype html><htmllang=endir=ltrclass="blog-wrapper blog-post-page plugin-blog plugin-id-default"data-has-hydrated=false><metacharset=UTF-8><metaname=generatorcontent="Docusaurus v3.7.0"><titledata-rh=true>On building high performance systems | The Old Speice Guy</title><metadata-rh=truename=viewportcontent="width=device-width, initial-scale=1.0"><metadata-rh=truename=twitter:cardcontent=summary_large_image><metadata-rh=trueproperty=og:urlcontent=https://speice.io/2019/06/high-performance-systems/><metadata-rh=trueproperty=og:localecontent=en><metadata-rh=truename=docusaurus_localecontent=en><metadata-rh=truename=docusaurus_tagcontent=default><metadata-rh=truename=docsearch:languagecontent=en><metadata-rh=truename=docsearch:docusaurus_tagcontent=default><metadata-rh=trueproperty=og:titlecontent="On building high performance systems | The Old Speice Guy"><metadata-rh=truename=descriptioncontent="Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is"><metadata-rh=trueproperty=og:descriptioncontent="Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is"><metadata-rh=trueproperty=og:typecontent=article><metadata-rh=trueproperty=article:published_timecontent=2019-07-01T12:00:00.000Z><linkdata-rh=truerel=iconhref=/img/favicon.ico><linkdata-rh=truerel=canonicalhref=https://speice.io/2019/06/high-performance-systems/><linkdata-rh=truerel=alternatehref=https://speice.io/2019/06/high-performance-systems/hreflang=en><linkdata-rh=truerel=alternatehref=https://speice.io/2019/06/high-performance-systems/hreflang=x-default><scriptdata-rh=truetype=application/ld+json>{"@context":"https://schema.org","@id":"https://speice.io/2019/06/high-performance-systems","@type":"BlogPosting","author":{"@type":"Person","name":"Bradlee Speice"},"dateModified":"2024-11-10T21:43:14.000Z","datePublished":"2019-07-01T12:00:00.000Z","description":"Prior to working in the trading industry, my assumption was that 
High Frequency Trading (HFT) is","headline":"On building high performance systems","isPartOf":{"@id":"https://speice.io/","@type":"Blog","name":"Blog"},"keywords":[],"mainEntityOfPage":"https://speice.io/2019/06/high-performance-systems","name":"On building high performance systems","url":"https://speice.io/2019/06/high-performance-systems"}</script><linkrel=alternatetype=application/rss+xmlhref=/rss.xmltitle="The Old Speice Guy RSS Feed"><linkrel=alternatetype=application/atom+xmlhref=/atom.xmltitle="The Old Speice Guy Atom Feed"><linkrel=stylesheethref=/katex/katex.min.csstype=text/css><linkrel=stylesheethref=/assets/css/styles.24ac2c37.css><scriptsrc=/assets/js/runtime~main.75ada3c5.jsdefer></script><scriptsrc=/assets/js/main.d0bb06d2.jsdefer></script><bodyclass=navigation-with-keyboard><script>!function(){vart,e=function(){try{returnnewURLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{returnwindow.localStorage.getItem("theme")}catch(t){}}();t=null!==e?e:"light",document.documentElement.setAttribute("data-theme",t)}(),function(){try{for(var[t,e]ofnewURLSearchParams(window.location.search).entries())if(t.startsWith("docusaurus-data-")){vara=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}()</script><divid=__docusaurus><divrole=regionaria-label="Skip to main content"><aclass=skipToContent_fXgnhref=#__docusaurus_skipToContent_fallback>Skip to main content</a></div><navaria-label=Mainclass="navbar navbar--fixed-top"><divclass=navbar__inner><divclass=navbar__items><buttonaria-label="Toggle navigation bar"aria-expanded=falseclass="navbar__toggle clean-btn"type=button><svgwidth=30height=30viewBox="0 0 30 30"aria-hidden=true><pathstroke=currentColorstroke-linecap=roundstroke-miterlimit=10stroke-width=2d="M4 7h22M4 15h22M4 23h22"/></svg></button><aclass=navbar__brandhref=/><divclass=navbar__logo><imgsrc=/img/logo.svgalt="Sierpinski Gasket"class="themedComponent_mlkZ
<p>Prior to working in the trading industry, my assumption was that High Frequency Trading (HFT) is
made up of people who have access to secret techniques mortal developers could only dream of. There
had to be some secret art that could only be learned if one had an appropriately tragic backstory.</p>
<p><imgdecoding=asyncloading=lazyalt="Kung Fu fight"src=/assets/images/kung-fu-5715f30eef7bf3aaa26770b1247024dc.webpwidth=426height=240class=img_ev3q></p>
<blockquote>
<p>How I assumed HFT people learn their secret techniques</p>
</blockquote>
<p>How else do you explain people working on systems that complete the round trip of market data in to
orders out (a.k.a. tick-to-trade) consistently within
<ahref=https://stackoverflow.com/a/22082528/1454178target=_blankrel="noopener noreferrer">750-800 nanoseconds</a>? In roughly the time it takes a
<p>The high stakes world of electronic trading, investment banks, market makers, hedge funds and
exchanges demand the <strong>lowest possible latency and jitter</strong> while utilizing the highest
bandwidth and return on their investment.</p>
</blockquote>
</li>
</ul>
<p>And to further clarify: we're not discussing <em>total run-time</em>, but variance of total run-time. There
are situations where it's not reasonably possible to make things faster, and you'd much rather be
consistent. For example, trading firms use
<ahref=https://sniperinmahwah.wordpress.com/2017/06/07/network-effects-part-i/target=_blankrel="noopener noreferrer">wireless networks</a> because
the speed of light through air is faster than through fiber-optic cables. There's still at <em>absolute
minimum</em> a <ahref=http://tinyurl.com/y2vd7tn8target=_blankrel="noopener noreferrer">~33.76 millisecond</a> delay required to send data between,
say,
<ahref=https://www.theice.com/market-data/connectivity-and-feeds/wireless/tokyo-chicagotarget=_blankrel="noopener noreferrer">Chicago and Tokyo</a>.
If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a
trade occurs, there's a physical limit to how long that will take. In this situation, the focus is
on keeping variance of <em>additional processing</em> to a minimum, since speed of light is the limiting
factor.</p>
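<p>As a rough sanity check on that physical limit, the sketch below estimates the minimum one-way delay from
great-circle distance and the speed of light in a vacuum. The city coordinates are approximate values I'm assuming for
illustration, and the Earth is treated as a perfect sphere, so expect a result in the neighborhood of the figure
quoted above rather than an exact match.</p>

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
EARTH_RADIUS_KM = 6_371.0          # mean Earth radius (spherical approximation)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_one_way_latency_ms(distance_km):
    """Lower bound on one-way latency: nothing travels faster than light."""
    return distance_km / SPEED_OF_LIGHT_KM_S * 1000.0


# Assumed approximate city-center coordinates (latitude, longitude).
CHICAGO = (41.88, -87.63)
TOKYO = (35.68, 139.69)

distance = great_circle_km(*CHICAGO, *TOKYO)
latency_ms = min_one_way_latency_ms(distance)
print(f"{distance:.0f} km, at least {latency_ms:.2f} ms one way")
```

<p>Any real link adds to this floor: routes are never perfectly straight, and every repeater or switch adds its own
processing time.</p>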
<p>So how does one go about looking for and eliminating performance variance? To tell the truth, I
don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep
understanding of the entire technology stack, and (B) actually measuring system performance (though
(C) watching a lot of <ahref=https://www.youtube.com/channel/UCMlGfpWw-RUdWX_JbLCukXgtarget=_blankrel="noopener noreferrer">CppCon</a> videos for
inspiration never hurt). Even then, every project cares about performance to a different degree; you
may need to build an entire
<ahref="https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=3015"target=_blankrel="noopener noreferrer">replica production system</a> to
accurately benchmark at nanosecond precision, or you may be content to simply
<ahref="https://www.youtube.com/watch?v=BD9cRbxWQx8&feature=youtu.be&t=1335"target=_blankrel="noopener noreferrer">avoid garbage collection</a> in
your Java code.</p>
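<p>For the "actually measuring" part, a minimal sketch of what measuring variance (rather than average speed) might
look like is below. The <code>workload</code> function is a hypothetical stand-in for whatever operation you care
about, and the iteration counts are arbitrary; this is a starting point, not a rigorous benchmarking harness.</p>

```python
import statistics
import time


def measure_latency_ns(fn, iterations=10_000, warmup=1_000):
    """Time repeated calls to fn and summarize the spread, not just the average."""
    for _ in range(warmup):
        fn()  # let caches, allocators, and any JIT settle first

    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)

    samples.sort()
    return {
        "min": samples[0],
        "median": samples[len(samples) // 2],
        "p99": samples[int(len(samples) * 0.99)],
        "max": samples[-1],
        "stdev": statistics.pstdev(samples),
    }


def workload():
    # Hypothetical stand-in for the operation you actually care about.
    return sum(range(100))


summary = measure_latency_ns(workload)
print(summary)
```

<p>The gap between the median and the 99th percentile (or the maximum) is usually more telling than the median
itself: a large gap means something is occasionally interrupting the work.</p>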
<p>Even though everyone has different needs, there are still common things to look for when trying to
isolate and eliminate variance. In no particular order, these are my focus areas when thinking about
high-performance systems:</p>
<p><strong>Update 2019-09-21</strong>: Added notes on <code>isolcpus</code> and <code>systemd</code> affinity.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=language-specific>Language-specific<ahref=#language-specificclass=hash-linkaria-label="Direct link to Language-specific"title="Direct link to Language-specific"></a></h2>
<p><strong>Garbage Collection</strong>: How often does garbage collection happen? When is it triggered? What are the
impacts?</p>
<ul>
<li><ahref=https://rushter.com/blog/python-garbage-collector/target=_blankrel="noopener noreferrer">In Python</a>, individual objects are collected
if the reference count reaches 0, and each generation is collected if
<code>num_alloc - num_dealloc > gc_threshold</code> whenever an allocation happens. The GIL is acquired for
<ahref=https://gperftools.github.io/gperftools/tcmalloc.htmltarget=_blankrel="noopener noreferrer">tcmalloc</a>) that might run faster than the
operating system default.</p>
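<p>The garbage collection questions at the top of this section (how often, when, and with what impact) can be checked
empirically. The sketch below uses CPython's <code>gc.callbacks</code> hook to count collections that start while a
function runs; <code>crucial_operation</code> is a hypothetical workload, and other Python implementations may not
provide the same hook.</p>

```python
import gc


def count_collections(fn):
    """Run fn and return how many garbage collections started per generation."""
    counts = {}

    def on_gc(phase, info):
        # CPython invokes each callback with phase "start" or "stop".
        if phase == "start":
            gen = info["generation"]
            counts[gen] = counts.get(gen, 0) + 1

    gc.callbacks.append(on_gc)
    try:
        fn()
    finally:
        gc.callbacks.remove(on_gc)
    return counts


def crucial_operation():
    # Hypothetical workload: allocating many container objects tends to
    # trip the collector's allocation threshold.
    return [[] for _ in range(100_000)]


print("thresholds:", gc.get_threshold())
print("collections by generation:", count_collections(crucial_operation))
```

<p>If the counts are non-zero inside a latency-sensitive path, options include reducing allocations there or
adjusting the collector with <code>gc.set_threshold</code> or <code>gc.disable</code>.</p>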
<p><strong>Data Layout</strong>: How your data is arranged in memory matters;
<ahref="https://www.youtube.com/watch?v=yy8jQgmhbAU"target=_blankrel="noopener noreferrer">data-oriented design</a> and
<ahref="https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=1185"target=_blankrel="noopener noreferrer">cache locality</a> can have huge
impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have
guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't
make. <ahref=http://valgrind.org/docs/manual/cg-manual.htmltarget=_blankrel="noopener noreferrer">Cachegrind</a> and kernel
<ahref=https://perf.wiki.kernel.org/index.php/Main_Pagetarget=_blankrel="noopener noreferrer">perf</a> counters are both great for understanding
how performance relates to memory layout.</p>
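<p>To make "the shape every object takes in memory" concrete, here is a small sketch that borrows C layout rules from
inside Python via the standard <code>ctypes</code> module. The <code>Tick</code> record and its fields are
hypothetical; the point is only that offsets and sizes are fixed by the declaration, and that an array of such
records is a single contiguous block.</p>

```python
import ctypes


class Tick(ctypes.Structure):
    """Hypothetical market-data record laid out using the platform's C rules."""
    _fields_ = [
        ("price", ctypes.c_double),     # 8 bytes
        ("quantity", ctypes.c_uint32),  # 4 bytes
        ("side", ctypes.c_uint8),       # 1 byte, followed by tail padding
    ]


# Field offsets and total size are fixed by the declaration.
for name, _ in Tick._fields_:
    field = getattr(Tick, name)
    print(f"{name}: offset={field.offset} size={field.size}")
print("sizeof(Tick) =", ctypes.sizeof(Tick))

# An array of structs is one contiguous block, so a linear scan walks
# adjacent memory instead of chasing a pointer per element.
TickArray = Tick * 1_000
ticks = TickArray()
print("sizeof(array) =", ctypes.sizeof(ticks))
```

<p>Compare this with a plain Python list of objects, where the list holds pointers and each object lives wherever the
allocator happened to place it.</p>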
<p><strong>Just-In-Time Compilation</strong>: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are
great because they optimize your program for how it's actually being used, rather than how a
compiler expects it to be used. However, there's a variance problem if the program stops executing
while waiting for translation from VM bytecode to native code. As a remedy, many languages support
ahead-of-time compilation in addition to the JIT versions
(<ahref=https://github.com/dotnet/corerttarget=_blankrel="noopener noreferrer">CoreRT</a> in C# and <ahref=https://www.graalvm.org/target=_blankrel="noopener noreferrer">GraalVM</a> in Java).
On the other hand, LLVM supports profile-guided optimization, which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing
apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT
compiler kicked in.</p>
<p><strong>Programming Tricks</strong>: These won't make or break performance, but can be useful in specific
circumstances. For example, C++ can use
<ahref="https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=1206"target=_blankrel="noopener noreferrer">templates instead of branches</a>
in critical sections.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=kernel>Kernel<ahref=#kernelclass=hash-linkaria-label="Direct link to Kernel"title="Direct link to Kernel"></a></h2>
<p>Code you wrote is almost certainly not the <em>only</em> code running on your hardware. There are many ways
the operating system interacts with your program, from interrupts to system calls, that are
important to watch for. These are written from a Linux perspective, but Windows does typically have
equivalent functionality.</p>
<p><strong>Scheduling</strong>: The kernel is normally free to schedule any process on any core, so it's important
to reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
<p><strong>Interrupts</strong>: System interrupts are how devices connected to your computer notify the CPU that
something has happened. The CPU will then choose a processor core to pause and context switch to the
OS to handle the interrupt. Make sure that
<ahref=http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linuxtarget=_blankrel="noopener noreferrer">SMP affinity</a> is
set so that interrupts are handled on a CPU core not running the program you care about.</p>
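<p>On Linux, SMP affinity for an interrupt is expressed as a hexadecimal CPU bitmask in
<code>/proc/irq/&lt;N&gt;/smp_affinity</code>. The sketch below only computes and decodes such masks; it doesn't
touch the system. The 8-core machine and the choice of reserved cores are assumptions for illustration.</p>

```python
def cpus_to_mask(cpus):
    """Encode CPU indices as the hex bitmask format used by smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(mask):
    """Decode a hex bitmask (commas between 32-bit groups allowed) into CPU indices."""
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if value >> bit & 1]


# Hypothetical 8-core machine where cores 6 and 7 are reserved for the
# latency-critical program, so interrupts should stay on cores 0-5.
all_cpus = set(range(8))
reserved = {6, 7}
irq_mask = cpus_to_mask(all_cpus - reserved)

print(irq_mask)                # 3f
print(mask_to_cpus(irq_mask))  # [0, 1, 2, 3, 4, 5]
```

<p>Applying a mask means writing it to the relevant <code>smp_affinity</code> file as root, and it's worth reading
the file back afterwards, since some interrupts can't be moved.</p>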
<p><strong><ahref=https://www.kernel.org/doc/html/latest/vm/numa.htmltarget=_blankrel="noopener noreferrer">NUMA</a></strong>: While NUMA is good at making
multi-cell systems transparent, there are variance implications; if the kernel moves a process
across nodes, future memory accesses must wait for the controller on the original node. Use
<ahref=https://linux.die.net/man/8/numactltarget=_blankrel="noopener noreferrer">numactl</a> to handle memory-/cpu-cell pinning so this doesn't
happen.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=hardware>Hardware<ahref=#hardwareclass=hash-linkaria-label="Direct link to Hardware"title="Direct link to Hardware"></a></h2>
<p><strong>CPU Pipelining/Speculation</strong>: Speculative execution in modern processors gave us vulnerabilities
like Spectre, but it also gave us performance improvements like
<ahref=https://stackoverflow.com/a/11227902/1454178target=_blankrel="noopener noreferrer">branch prediction</a>. And if the CPU mis-speculates
your code, there's variance associated with rewind and replay. While the compiler knows a lot about
how your CPU <ahref="https://youtu.be/nAbCKa0FzjQ?t=4467"target=_blankrel="noopener noreferrer">pipelines instructions</a>, code can be
<ahref="https://www.youtube.com/watch?v=NH1Tta7purM&feature=youtu.be&t=755"target=_blankrel="noopener noreferrer">structured to help</a> the branch
predictor.</p>
<p><strong>Paging</strong>: For most systems, virtual memory is incredible. Applications live in their own worlds,
and the CPU/<ahref=https://en.wikipedia.org/wiki/Memory_management_unittarget=_blankrel="noopener noreferrer">MMU</a> figures out the details.
However, there's a variance penalty associated with memory paging and caching; if you access more
memory pages than the <ahref=https://en.wikipedia.org/wiki/Translation_lookaside_buffertarget=_blankrel="noopener noreferrer">TLB</a> can store,
you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an
issue, but using <ahref=https://blog.pythian.com/performance-tuning-hugepages-in-linux/target=_blankrel="noopener noreferrer">huge pages</a> can
reduce TLB pressure. Alternatively, running applications in a hypervisor like
<ahref=https://github.com/siemens/jailhousetarget=_blankrel="noopener noreferrer">Jailhouse</a> allows one to skip virtual memory entirely, but
this is probably more work than the benefits are worth.</p>
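<p>A bit of arithmetic shows why huge pages help: the amount of memory the TLB can cover ("reach") is the number of
entries multiplied by the page size. The entry count below is an assumed round figure for illustration, not a
measurement of any particular CPU; 4 KiB and 2 MiB are the standard and huge page sizes on x86-64.</p>

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries, page_size):
    """Memory addressable without a page walk: one page per TLB entry."""
    return entries * page_size


# Assumed entry count for illustration only; real TLBs are split into
# levels and often hold different numbers of entries per page size.
ENTRIES = 1_536

standard = tlb_reach_bytes(ENTRIES, 4 * KIB)
huge = tlb_reach_bytes(ENTRIES, 2 * MIB)

print(f"4 KiB pages: {standard / MIB:.0f} MiB of reach")
print(f"2 MiB pages: {huge / MIB:.0f} MiB of reach")
```

<p>With the same number of entries, the larger page size covers 512 times as much memory, so a working set that
thrashes the TLB with standard pages may fit comfortably with huge pages.</p>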
<p><strong>Network Interfaces</strong>: When more than one computer is involved, variance can go up dramatically.
Tuning kernel
<ahref=https://github.com/leandromoreira/linux-network-performance-parameterstarget=_blankrel="noopener noreferrer">network parameters</a> may be
helpful, but modern systems more frequently opt to skip the kernel altogether with a technique
called <ahref=https://blog.cloudflare.com/kernel-bypass/target=_blankrel="noopener noreferrer">kernel bypass</a>. This typically requires
specialized hardware and <ahref=https://www.openonload.org/target=_blankrel="noopener noreferrer">drivers</a>, but even industries like
<ahref=https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypasstarget=_blankrel="noopener noreferrer">telecom</a> are
finding the benefits.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=networks>Networks<ahref=#networksclass=hash-linkaria-label="Direct link to Networks"title="Direct link to Networks"></a></h2>
<p><strong>Routing</strong>: There's a reason financial firms are willing to pay
<ahref=https://sniperinmahwah.wordpress.com/2019/03/26/4-les-moeres-english-version/target=_blankrel="noopener noreferrer">millions of euros</a>
for rights to a small plot of land - having a straight-line connection from point A to point B means
the path their data takes is the shortest possible. In contrast, there are currently 6 computers in
between me and Google, but that may change at any moment if my ISP realizes a
<ahref=https://en.wikipedia.org/wiki/Border_Gateway_Protocoltarget=_blankrel="noopener noreferrer">more efficient route</a> is available. Whether
designs, switches will begin forwarding data as soon as they know where the destination is,
checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter
the size.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=final-thoughts>Final Thoughts<ahref=#final-thoughtsclass=hash-linkaria-label="Direct link to Final Thoughts"title="Direct link to Final Thoughts"></a></h2>
<p>High-performance systems, regardless of industry, are not magical. They do require extreme precision
and attention to detail, but they're designed, built, and operated by regular people, using a lot of
tools that are publicly available. Interested in seeing how context switching affects performance of
your benchmarks? <code>taskset</code> should be installed in all modern Linux distributions, and can be used to
make sure the OS never migrates your process. Curious how often garbage collection triggers during a
crucial operation? Your language of choice will typically expose details of its operations