tags: [rust]
---

I've found that in many personal projects, [analysis paralysis](https://en.wikipedia.org/wiki/Analysis_paralysis)
is particularly deadly. Making good decisions at the start avoids pain and suffering down the line;
if doing extra research avoids problems in the future, I'm happy to continue researching indefinitely.

So let's say you're in need of a binary serialization schema for a project you're working on. Data will be going
over the network, not just in memory, so having a schema document and code generation is a must. Performance is important;
there's no reason to use Protocol Buffers when other projects support similar features at faster speed.
And it must be polyglot; Rust support is a minimum, but we can't predict what other languages this will
interact with.

Given these requirements, the candidates I could find were:

1. [Cap'n Proto](https://capnproto.org/) has been around the longest, and integrates well with all the build tools
2. [Flatbuffers](https://google.github.io/flatbuffers/) is the newest, and claims to have a simpler encoding
3. [Simple Binary Encoding](https://github.com/real-logic/simple-binary-encoding) has the simplest encoding,
   but the Rust implementation is essentially unmaintained

Any one of these will satisfy the project requirements: easy to transmit over a network, reasonably fast,
and support multiple languages. But how do you actually pick one? It's impossible to know what issues that
choice will lead to, so you avoid commitment until the last possible moment.

Still, a choice must be made. Instead of worrying about which is "the best," I decided to build a small
proof-of-concept system in each format and pit them against each other. All code can be found in the
[repository](https://github.com/bspeice/speice.io-md_shootout) for this project.

We'll discuss more in detail, but a quick preview of the results:

- Cap'n Proto can theoretically perform incredibly well, but the implementation had performance issues
- Flatbuffers had poor serialization performance, but more than made up for it during deserialization
- SBE has the best median and worst-case performance, but the message structure doesn't support some
  features that both Cap'n Proto and Flatbuffers do

# Prologue: Reading the Data

Our benchmark system will be a simple market data processor; given messages from
[IEX](https://iextrading.com/trading/market-data/#deep), serialize each message into the schema format,
then read back the message to do some basic aggregation. This test isn't complex, but it is representative
of the project I need a binary format for.

But before we make it to that point, we have to actually read in the market data. To do so, I'm using a library
called [`nom`](https://github.com/Geal/nom). Version 5.0 was recently released and brought some big changes,
so this was an opportunity to build a non-trivial program and get familiar with it again.

If you don't already know about `nom`, it's a parser combinator library: by combining different
mini-parsers, you can parse more complex structures without writing all the tedious code by hand.
For example, when parsing [PCAP files](https://www.winpcap.org/ntar/draft/PCAP-DumpFileFormat.html#rfc.section.3.3):

```
   0                   1                   2                   3
   |                              ...                              |
```

...you can build a parser in `nom` that looks like
[this](https://github.com/bspeice/speice.io-md_shootout/blob/369613843d39cfdc728e1003123bf87f79422497/src/parsers.rs#L59-L93):

```rust
pub fn enhanced_packet_block(input: &[u8]) -> IResult<&[u8], &[u8]> {
    // ...
}
```

This example isn't too interesting, but when more complex formats need to be parsed (like IEX market data),
[`nom` really shines](https://github.com/bspeice/speice.io-md_shootout/blob/369613843d39cfdc728e1003123bf87f79422497/src/iex.rs).

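If the combinator idea itself is unfamiliar, it can be sketched without any dependencies. The toy parsers below are my own simplification of the pattern, not `nom`'s actual API (though the return shape mimics `nom`'s `IResult`): each mini-parser consumes some input and hands the rest to the next one.

```rust
use std::convert::TryInto;

// The result type: remaining input plus the parsed value, the same
// shape as nom's `IResult` (simplified: no error details).
type PResult<'a, T> = Result<(&'a [u8], T), ()>;

// Mini-parser: split off exactly `n` bytes.
fn take(input: &[u8], n: usize) -> PResult<&[u8]> {
    if input.len() < n {
        return Err(());
    }
    Ok((&input[n..], &input[..n]))
}

// Mini-parser: read a little-endian u32, built on top of `take`.
fn le_u32(input: &[u8]) -> PResult<u32> {
    let (rest, bytes) = take(input, 4)?;
    Ok((rest, u32::from_le_bytes(bytes.try_into().unwrap())))
}

// Combining them: a length-prefixed block is "a u32 length,
// then that many bytes of body".
fn length_prefixed(input: &[u8]) -> PResult<&[u8]> {
    let (rest, len) = le_u32(input)?;
    take(rest, len as usize)
}

fn main() {
    let data = [3u8, 0, 0, 0, b'a', b'b', b'c', 0xFF];
    let (rest, body) = length_prefixed(&data).unwrap();
    assert_eq!(body, &b"abc"[..]);
    assert_eq!(rest, &[0xFFu8][..]);
}
```

Real `nom` adds error handling, streaming input, and a large library of ready-made combinators on top of this basic shape.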
Ultimately, because the `nom` code in this shootout was used for all formats, we're not too interested in its performance.
Still, building the market data parser was actually fun because I didn't have to write all the boring code by hand.

# Part 1: Cap'n Proto

a new buffer for every single message. I was able to work around this and re-use buffers,
but it required reading through Cap'n Proto's [benchmarks](https://github.com/capnproto/capnproto-rust/blob/master/benchmark/benchmark.rs#L124-L156)
to find an example and using `transmute` to bypass Rust's borrow checker.

The process of reading messages was better, but still had issues. Cap'n Proto has two message encodings: a ["packed"](https://capnproto.org/encoding.html#packing)
representation, and an "unpacked" version. When reading "packed" messages, we need a buffer to unpack the message into before we can use it;
Cap'n Proto allocates a new buffer for each message we unpack, and I wasn't able to figure out a way around that.
In contrast, the unpacked message format should be where Cap'n Proto shines; its main selling point is that there's [no decoding step](https://capnproto.org/).
However, accomplishing zero-copy deserialization required copying code from the private API ([since fixed](https://github.com/capnproto/capnproto-rust/issues/148)),
and we still allocate a vector on every read for the segment table.

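To make the packed-versus-unpacked tradeoff concrete: packing compresses the long runs of zero bytes that fixed-layout messages tend to contain, at the cost of needing somewhere to expand them again on read. Cap'n Proto's real scheme works on 8-byte words with tag bytes; the toy zero-run encoding below is my own simplification, not the actual wire format, but it shows why unpacking wants a scratch buffer and why reusing that buffer matters.

```rust
// Toy compression: a 0x00 byte is followed by a count of how many zeros
// it stands for; any other byte passes through unchanged. This is NOT
// Cap'n Proto's real packing algorithm, just the same underlying idea.
fn pack(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        if input[i] == 0 {
            let mut run = 0u8;
            while i < input.len() && input[i] == 0 && run < u8::MAX {
                run += 1;
                i += 1;
            }
            out.push(0);
            out.push(run);
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    out
}

// Unpacking has to materialize the zeros somewhere; passing the scratch
// buffer in (and clearing instead of reallocating) avoids the
// per-message allocation described above.
fn unpack(packed: &[u8], scratch: &mut Vec<u8>) {
    scratch.clear();
    let mut i = 0;
    while i < packed.len() {
        if packed[i] == 0 {
            let run = packed[i + 1] as usize;
            scratch.extend(std::iter::repeat(0u8).take(run));
            i += 2;
        } else {
            scratch.push(packed[i]);
            i += 1;
        }
    }
}

fn main() {
    // Sparse messages (lots of zero padding) compress well...
    let msg = [7u8, 0, 0, 0, 0, 42, 0, 0, 9];
    let packed = pack(&msg);
    assert!(packed.len() < msg.len());

    // ...but every read pays for a round trip through the scratch buffer.
    let mut scratch = Vec::new();
    unpack(&packed, &mut scratch);
    assert_eq!(scratch, msg);
}
```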
In the end, I put in significant work to make Cap'n Proto as fast as possible in the tests, but there were too many issues
for me to feel comfortable using it long-term.

# Part 2: Flatbuffers

This is the new kid on the block. After a [first attempt](https://github.com/google/flatbuffers/pull/3894) didn't pan out,
official support was [recently added](https://github.com/google/flatbuffers/pull/4898). Flatbuffers is intended to address
the same problems as Cap'n Proto: high-performance, polyglot, binary messaging. The difference is that Flatbuffers claims
to have a simpler wire format and [more flexibility](https://google.github.io/flatbuffers/flatbuffers_benchmarks.html).

On the whole, I enjoyed using Flatbuffers; the [tooling](https://crates.io/crates/flatc-rust) is nice enough, and unlike
Cap'n Proto, parsing messages was actually zero-copy and zero-allocation. There were some issues though.

First, Flatbuffers (at least in Rust) can't handle nested vectors. This is a problem for formats like the following:

```flatbuffers
table Message {
  symbol: string;
}

// ...plus a MultiMessage table holding a vector of Message
```

Second, streaming support in Flatbuffers seems to be something of an [afterthought](https://github.com/google/flatbuffers/issues/3898).
Where Cap'n Proto in Rust handles reading messages from a stream as part of the API, Flatbuffers just puts a `u32` at the front of each
message to indicate the size. Not specifically a problem, but calculating message size without that size tag at the front is nigh on impossible.

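The `u32`-at-the-front scheme is at least easy to handle by hand. A std-only sketch of the framing (my own code, not the Flatbuffers API) for writing and reading a stream of size-prefixed messages:

```rust
use std::io::{self, Cursor, Read, Write};

// Write one message with a little-endian u32 size prefix in front of it.
fn write_frame<W: Write>(writer: &mut W, msg: &[u8]) -> io::Result<()> {
    writer.write_all(&(msg.len() as u32).to_le_bytes())?;
    writer.write_all(msg)
}

// Read the next message; returns None at a clean end-of-stream.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_bytes = [0u8; 4];
    // EOF while reading the prefix means "no more messages".
    match reader.read_exact(&mut len_bytes) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut msg = vec![0u8; u32::from_le_bytes(len_bytes) as usize];
    reader.read_exact(&mut msg)?;
    Ok(Some(msg))
}

fn main() -> io::Result<()> {
    let mut stream = Vec::new();
    write_frame(&mut stream, b"first")?;
    write_frame(&mut stream, b"second")?;

    let mut cursor = Cursor::new(stream);
    while let Some(msg) = read_frame(&mut cursor)? {
        println!("{}", String::from_utf8_lossy(&msg));
    }
    Ok(())
}
```

One upside of the external prefix: a reader can skip past a message it doesn't care about without parsing it at all.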
Ultimately, I enjoyed using Flatbuffers, and had to do significantly less work to make it perform well.

# Part 3: Simple Binary Encoding

the simplest binary format, but it does make some tradeoffs.

Both Cap'n Proto and Flatbuffers use [pointers in their messages](https://capnproto.org/encoding.html#structs) to handle
variable-length data, [unions](https://capnproto.org/language.html#unions), and various other features. In contrast,
messages in SBE are essentially [just structs](https://github.com/real-logic/simple-binary-encoding/blob/master/sbe-samples/src/main/resources/example-schema.xml);
variable-length data is supported, but there's no union type.

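The "just structs" point is what makes allocation-free reads possible: every field lives at a fixed offset, so decoding is offset arithmetic into the receive buffer rather than pointer chasing. A minimal std-only sketch of the idea, using a hypothetical field layout of my own invention (not a real SBE schema):

```rust
use std::convert::TryInto;

// Hypothetical fixed-layout market data message:
//   bytes 0..8   -> timestamp (u64, little-endian)
//   bytes 8..12  -> price     (u32, little-endian)
//   bytes 12..20 -> symbol    (fixed 8-byte field, space-padded)
const MSG_LEN: usize = 20;

fn encode(buf: &mut [u8; MSG_LEN], timestamp: u64, price: u32, symbol: &[u8; 8]) {
    buf[0..8].copy_from_slice(&timestamp.to_le_bytes());
    buf[8..12].copy_from_slice(&price.to_le_bytes());
    buf[12..20].copy_from_slice(symbol);
}

// Each accessor reads straight out of the buffer at a known offset;
// nothing is copied into an owned structure, and nothing allocates.
fn timestamp(buf: &[u8; MSG_LEN]) -> u64 {
    u64::from_le_bytes(buf[0..8].try_into().unwrap())
}
fn price(buf: &[u8; MSG_LEN]) -> u32 {
    u32::from_le_bytes(buf[8..12].try_into().unwrap())
}
fn symbol(buf: &[u8; MSG_LEN]) -> &[u8] {
    &buf[12..20]
}

fn main() {
    let mut buf = [0u8; MSG_LEN];
    encode(&mut buf, 1_564_000_000, 1999, b"AAPL    ");
    assert_eq!(timestamp(&buf), 1_564_000_000);
    assert_eq!(price(&buf), 1999);
    assert_eq!(symbol(&buf), &b"AAPL    "[..]);
}
```

The flip side of fixed offsets is exactly the limitation above: there's no natural place for a pointer-based feature like a union type to live.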
As mentioned in the beginning, the Rust port of SBE works well, but is essentially unmaintained. However, if you
don't need union types, and can accept that schemas are XML documents, it's still worth using. The Rust SBE implementation
had the best streaming support of any format I used, and doesn't trigger allocation during de/serialization.

# Results

Building a benchmark turned out to be incredibly helpful in making a decision; because a
"union" type isn't important to me, I'll be using SBE for my personal projects.

While SBE was the fastest in terms of both median and worst-case performance, its worst case
performance was proportionately far higher than any other format. It seems to be that deserialization
time scales with message size, but I'll need to do some more research to understand what exactly
is going on.