Prologue: Reading the Data

Our benchmark will be a simple market data processor: given messages from IEX, serialize each message into the schema format, then read each message back to do some basic aggregation.

But before we make it to that point, we have to read in the market data. To do so, I'm using a library called nom. Version 5.0 was recently released and brought some big changes, so this was an opportunity to build a non-trivial program and see how it fared.

If you're not familiar with nom, the idea is to build a binary data parser by combining different mini-parsers. For example, if your data looks like this:

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------------------------------------------------------+
 0 |                    Block Type = 0x00000006                    |
   +---------------------------------------------------------------+
 4 |                      Block Total Length                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 8 |                         Interface ID                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12 |                        Timestamp (High)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |                        Timestamp (Low)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
20 |                         Captured Len                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
24 |                          Packet Len                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Packet Data                          |
   |                              ...                              |

...you can build a parser in nom like this:

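// Imports assume nom 5's function-based parsers (the `complete` variants).
use nom::bytes::complete::{tag, take};
use nom::number::complete::le_u32;
use nom::sequence::tuple;
use nom::IResult;

// Block Type 0x00000006 as it appears on the wire (little-endian).
const ENHANCED_PACKET: [u8; 4] = [0x06, 0x00, 0x00, 0x00];

// Parse the fixed header fields in order, then return the captured packet data.
pub fn enhanced_packet_block(input: &[u8]) -> IResult<&[u8], &[u8]> {
    let (
        remaining,
        (
            // Fields prefixed with `_` are parsed but not used here.
            _block_type,
            _block_len,
            _interface_id,
            _timestamp_high,
            _timestamp_low,
            captured_len,
            _packet_len,
        ),
    ) = tuple((
        tag(ENHANCED_PACKET),
        le_u32,
        le_u32,
        le_u32,
        le_u32,
        le_u32,
        le_u32,
    ))(input)?;

    // The packet data is `captured_len` bytes long and follows the header.
    let (remaining, packet_data) = take(captured_len)(remaining)?;
    Ok((remaining, packet_data))
}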

This demonstration isn't too interesting, but when more complex formats need to be parsed (like IEX market data), nom really shines.

Ultimately, because nom was used to parse the IEX-format market data before serialization, we're not too interested in its performance. However, it's worth mentioning how much easier this project was because I didn't have to write all the boring code by hand.

Part 1: Cap'n Proto

Now it's time to get into the meaty part of the story. Cap'n Proto was the first format I tried because of how long it has supported Rust (thanks to David Renshaw for maintaining the Rust port since 2014!). However, I had a ton of performance concerns about actually using Cap'n Proto.

To serialize new messages, Cap'n Proto uses a "builder" object. This builder allocates memory on the heap to hold the message content, but because builders can't be re-used, we have to allocate a new buffer for every single message. I was able to work around this and re-use memory with a special builder, but it required reading through Cap'n Proto's benchmarks to find an example and using transmute to bypass Rust's borrow checker.

Reading messages was better, but still had issues. Cap'n Proto has two message encodings: a "packed" version, and an "unpacked" version. When reading "packed" messages, we need a buffer to unpack the message into before we can use it; Cap'n Proto allocates a new buffer to unpack the message every time, and I wasn't able to figure out a way around that. In contrast, the unpacked message format should be where Cap'n Proto shines; its main selling point is that there's no decoding step. However, accomplishing zero-copy deserialization required copying code from the private API (since fixed), and we still allocate a vector on every read for the segment table (not fixed at time of writing).
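To make the "unpack into a buffer" step concrete, here is a minimal sketch of unpacking, based on my reading of the packing rules in Cap'n Proto's encoding documentation: each 8-byte word becomes a tag byte plus its non-zero bytes, with special run-length handling for tags 0x00 and 0xff. This is not the capnp crate's implementation, and the name `unpack_into` is hypothetical. The point is only that the caller could, in principle, hand in a scratch buffer that gets reused between messages.

```rust
/// Unpacks `packed` into `out`, reusing `out`'s existing allocation.
/// Returns None if the input ends in the middle of a word or run.
fn unpack_into(packed: &[u8], out: &mut Vec<u8>) -> Option<()> {
    out.clear(); // keeps capacity, so repeated calls can avoid reallocating
    let mut i = 0;
    while i < packed.len() {
        let tag = packed[i];
        i += 1;
        // Bit N of the tag says whether byte N of the word is non-zero.
        for bit in 0..8 {
            if tag & (1 << bit) != 0 {
                out.push(*packed.get(i)?);
                i += 1;
            } else {
                out.push(0);
            }
        }
        match tag {
            // All-zero word: the next byte counts additional zero words.
            0x00 => {
                let extra = *packed.get(i)? as usize;
                i += 1;
                out.resize(out.len() + extra * 8, 0);
            }
            // Fully non-zero word: the next byte counts words copied verbatim.
            0xff => {
                let raw = *packed.get(i)? as usize;
                i += 1;
                let end = i + raw * 8;
                out.extend_from_slice(packed.get(i..end)?);
                i = end;
            }
            _ => {}
        }
    }
    Some(())
}
```

Calling `unpack_into` repeatedly with the same `Vec` only allocates when a message is larger than any seen before, which is the behavior I was hoping to get from the library.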

In the end, I put in significant work to make Cap'n Proto as fast as possible in the tests, but there were too many issues for me to feel comfortable using it long-term.

Part 2: Flatbuffers

This is the new kid on the block. After a first attempt didn't work out, official support was recently added. Flatbuffers is intended to address the same problems as Cap'n Proto: having a binary schema that describes the format and can be used from many languages. The difference is that Flatbuffers claims to have a simpler wire format and more flexibility.

On the whole, I enjoyed using Flatbuffers; the tooling is nice enough, and unlike Cap'n Proto, parsing messages was actually zero-copy and zero-allocation. There were some issues though.

First, Flatbuffers (at least in Rust) can't handle nested vectors. This is a problem for formats like the following:

table Message {
  symbol: string;
}
table MultiMessage {
  messages:[Message];
}

We want to create a MultiMessage that contains a vector of Message, but each Message has a vector (the string type). I was able to work around this by caching Message elements in a SmallVec before building the final MultiMessage, but it was a painful process.

Second, streaming support in Flatbuffers seems to be something of an afterthought. Where Cap'n Proto in Rust handles reading messages from a stream as part of the API, Flatbuffers just puts a u32 at the front of each message to indicate the size. Not a problem in itself, but I would rather have seen message size integrated into the underlying format.
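The size-prefix approach is simple enough to sketch with the standard library alone. This assumes each message is preceded by a little-endian u32 holding its length in bytes; the function name `split_size_prefixed` is hypothetical and not part of the Flatbuffers crate.

```rust
/// Splits a stream of size-prefixed messages into borrowed message slices.
/// Returns None if the stream ends in the middle of a prefix or a message.
fn split_size_prefixed(mut stream: &[u8]) -> Option<Vec<&[u8]>> {
    let mut messages = Vec::new();
    while !stream.is_empty() {
        if stream.len() < 4 {
            return None; // truncated size prefix
        }
        let (prefix, rest) = stream.split_at(4);
        let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        if rest.len() < len {
            return None; // truncated message body
        }
        let (message, tail) = rest.split_at(len);
        messages.push(message);
        stream = tail;
    }
    Some(messages)
}
```

Each returned slice borrows from the input, so no message bytes are copied; only the framing has to be handled by hand.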

Ultimately, I enjoyed using Flatbuffers, and had to do significantly less work to make it perform well.

Part 3: Simple Binary Encoding

Support for SBE was added by the author of one of my favorite Rust blog posts. I've [talked previously]({% post_url 2019-06-31-high-performance-systems %}) about how important variance is in high-performance systems, so it was encouraging to read about a format that directly addressed my concerns. SBE has by far the simplest binary format, but it does make some tradeoffs.

Both Cap'n Proto and Flatbuffers use pointers in their messages to handle variable-length data, unions, and a couple of other features. In contrast, messages in SBE are essentially primitive structs; variable-length data is supported, but there's no union type.
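As a rough illustration of what "essentially primitive structs" means, here is a sketch of a fixed-layout message where every field lives at a known offset, so reading a field is just a bounded copy with no pointers to follow. The `Trade` type, its fields, and its offsets are hypothetical; they are not taken from the IEX schema or from the code SBE generates, and little-endian byte order is an assumption.

```rust
/// Hypothetical fixed-layout message: every field sits at a known offset.
#[derive(Debug, PartialEq)]
struct Trade {
    price: u64,      // bytes 0..8
    size: u32,       // bytes 8..12
    symbol: [u8; 8], // bytes 12..20, space-padded
}

const TRADE_LEN: usize = 20;

impl Trade {
    /// Writes each field at its fixed offset in a pre-allocated buffer.
    fn encode(&self, buf: &mut [u8; TRADE_LEN]) {
        buf[0..8].copy_from_slice(&self.price.to_le_bytes());
        buf[8..12].copy_from_slice(&self.size.to_le_bytes());
        buf[12..20].copy_from_slice(&self.symbol);
    }

    /// Reads each field back from its fixed offset.
    fn decode(buf: &[u8; TRADE_LEN]) -> Trade {
        let mut price = [0u8; 8];
        price.copy_from_slice(&buf[0..8]);
        let mut size = [0u8; 4];
        size.copy_from_slice(&buf[8..12]);
        let mut symbol = [0u8; 8];
        symbol.copy_from_slice(&buf[12..20]);
        Trade {
            price: u64::from_le_bytes(price),
            size: u32::from_le_bytes(size),
            symbol,
        }
    }
}
```

With a layout like this, there is nothing to allocate and no offsets to resolve at read time, which is where the simplicity comes from.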

As mentioned in the beginning, the Rust port of SBE is certainly usable, but is essentially unmaintained. However, if you don't need union types, and can accept that schemas are XML documents, it's still worth using.

Results

After building a test harness for each protocol, it was time to actually take them for a spin. I used this script to manage the test process, and the raw results are here. All data reported below is the average of 10 runs over a single day of IEX data. Data checks were implemented to make sure that each format achieved the same results.

Serialization

Serialization measures on a per-message basis how long it takes to convert the pre-parsed IEX message into the desired format and write to a pre-allocated buffer.
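For readers who want to reproduce this kind of measurement, here is a minimal sketch of per-message timing and percentile extraction. It is not the actual harness: the names `time_each` and `percentile` are hypothetical, and the nearest-rank percentile method is an assumption about how one might compute the columns below.

```rust
use std::time::{Duration, Instant};

/// Times `work` once per message and returns the latencies in ascending order.
fn time_each<T>(messages: &[T], mut work: impl FnMut(&T)) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(messages.len());
    for message in messages {
        let start = Instant::now();
        work(message);
        samples.push(start.elapsed());
    }
    samples.sort();
    samples
}

/// Nearest-rank percentile over an ascending slice, expressed per mille
/// (500 = median, 990 = 99th, 999 = 99.9th) to avoid floating-point rounding.
fn percentile(sorted: &[Duration], per_mille: usize) -> Duration {
    assert!(!sorted.is_empty() && (1..=1000).contains(&per_mille));
    // Integer ceiling of len * per_mille / 1000; always within 1..=len.
    let rank = (sorted.len() * per_mille + 999) / 1000;
    sorted[rank - 1]
}
```

Timing every message individually adds the cost of two clock reads per sample, so the absolute numbers include some measurement overhead; the comparison between formats is what matters.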

Schema	Median	99th Pctl	99.9th Pctl	Total
Cap'n Proto Packed	413ns	1751ns	2943ns	14.80s
Cap'n Proto Unpacked	273ns	1828ns	2836ns	10.65s
Flatbuffers	355ns	2185ns	3497ns	14.31s
SBE	91ns	1535ns	2423ns	3.91s

Deserialization

Deserialization measures on a per-message basis how long it takes to read the message encoded during serialization and perform some basic aggregation. The aggregation code is the same for each format, so any performance differences are due solely to the format implementation.

Schema	Median	99th Pctl	99.9th Pctl	Total
Cap'n Proto Packed	539ns	1216ns	2599ns	18.92s
Cap'n Proto Unpacked	366ns	737ns	1583ns	12.32s
Flatbuffers	173ns	421ns	1007ns	6.00s
SBE	116ns	286ns	659ns	4.05s

Conclusion

Building a benchmark turned out to be incredibly helpful in making a decision; because a "union" type isn't important to me, I'll be using SBE for my personal projects.

And while SBE was the fastest in terms of both median and worst-case performance, its worst-case serialization performance was proportionately far higher than that of any other format: its 99.9th percentile was roughly 27 times its median, compared to roughly 7 to 10 times for the others. Further research is necessary to figure out why this is the case. But that's for another time.
