layout | title | description | category | tags |
---|---|---|---|---|
post | Binary Format Shootout | Making sense of binary streams | | |
I've found that in many personal projects, analysis paralysis is particularly deadly. There's nothing like having other options available to make you question your decisions. There's a particular scenario that scares me: I'm a couple months into a project, only to realize that if I had made a different choice at an earlier juncture, weeks of work could have been saved. If only an extra hour or two of research had been conducted, everything would've turned out differently.
Let's say you're in need of a binary serialization schema for a project you're working on. Data will be going over the network, not just in memory, so having a schema document is a must. Performance is important; there's no reason to use Protocol Buffers when other projects support similar features at faster speed. And it must be polyglot; Rust support needs to be there, but we can't predict what other languages this will interact with.
Given these requirements, the formats I could find were:
- Cap'n Proto has been around the longest, and integrates well with all the build tools
- Flatbuffers is the newest, and claims to have a simpler encoding
- Simple Binary Encoding is being adopted by the high-performance financial community, but the Rust implementation is essentially unmaintained
Any one of these will satisfy the project requirements: easy to transmit over a network, reasonably fast, and support multiple languages. But actually picking one to build a system on is intimidating; it's impossible to know what issues that choice will lead to.
Still, a choice must be made. It's not particularly groundbreaking, but I decided to build a test system to help understand how they all behave.
Prologue: Reading the Data
Our benchmark will be a simple market data processor; given messages from IEX, serialize each message into the schema format, then read back each message to do some basic aggregation.
But before we make it to that point, we have to read in the market data. To do so, I'm using a library called nom. Version 5.0 was recently released and brought some big changes, so this was an opportunity to build a non-trivial program and see how it fared.
If you're not familiar with nom, the idea is to build a binary data parser by combining different mini-parsers. For example, if your data looks like this:
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------------------------------------------------------+
 0 |                    Block Type = 0x00000006                    |
   +---------------------------------------------------------------+
 4 |                      Block Total Length                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 8 |                         Interface ID                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12 |                       Timestamp (High)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |                        Timestamp (Low)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
20 |                         Captured Len                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
24 |                          Packet Len                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Packet Data                          |
   |                              ...                              |
...you can build a parser in nom like this:
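// Imports assume nom 5's "complete" parsers; the streaming variants share these names.
use nom::bytes::complete::{tag, take};
use nom::number::complete::le_u32;
use nom::sequence::tuple;
use nom::IResult;

// Block type 0x00000006 as it appears on the wire (little-endian).
const ENHANCED_PACKET: [u8; 4] = [0x06, 0x00, 0x00, 0x00];

pub fn enhanced_packet_block(input: &[u8]) -> IResult<&[u8], &[u8]> {
    // Parse the seven fixed-width header fields in order.
    let (
        remaining,
        (
            block_type,
            block_len,
            interface_id,
            timestamp_high,
            timestamp_low,
            captured_len,
            packet_len,
        ),
    ) = tuple((
        tag(ENHANCED_PACKET),
        le_u32,
        le_u32,
        le_u32,
        le_u32,
        le_u32,
        le_u32,
    ))(input)?;

    // The captured length tells us how many bytes of packet data follow.
    let (remaining, packet_data) = take(captured_len)(remaining)?;
    Ok((remaining, packet_data))
}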
This demonstration isn't too interesting, but when more complex formats need to be parsed (like IEX market data), nom really shines.
Ultimately, because nom was used to parse the IEX-format market data before serialization, we're not too interested in its performance. However, it's worth mentioning how much easier this project was because I didn't have to write all the boring code by hand.
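For a sense of what "by hand" looks like, here is a minimal std-only sketch of the same header parse without nom. The names `read_le_u32` and `enhanced_packet_block_manual` are made up for illustration and are not part of the project. Like the nom version, it stops once the packet data has been sliced out.

```rust
/// Block type of the Enhanced Packet Block from the diagram above.
const ENHANCED_PACKET_TYPE: u32 = 0x0000_0006;

/// Read a little-endian u32 at `offset`, or None if the input is too short.
fn read_le_u32(input: &[u8], offset: usize) -> Option<u32> {
    let bytes = input.get(offset..offset + 4)?;
    let mut raw = [0u8; 4];
    raw.copy_from_slice(bytes);
    Some(u32::from_le_bytes(raw))
}

/// Hand-written counterpart to the nom parser (hypothetical helper).
/// Returns `(remaining, packet_data)` on success.
pub fn enhanced_packet_block_manual(input: &[u8]) -> Option<(&[u8], &[u8])> {
    // Every field needs its own offset and bounds check.
    if read_le_u32(input, 0)? != ENHANCED_PACKET_TYPE {
        return None;
    }
    let _block_len = read_le_u32(input, 4)?;
    let _interface_id = read_le_u32(input, 8)?;
    let _timestamp_high = read_le_u32(input, 12)?;
    let _timestamp_low = read_le_u32(input, 16)?;
    let captured_len = read_le_u32(input, 20)? as usize;
    let _packet_len = read_le_u32(input, 24)?;

    // The fixed header is 28 bytes; the packet data follows it.
    let body = input.get(28..)?;
    if body.len() < captured_len {
        return None;
    }
    let (packet_data, remaining) = body.split_at(captured_len);
    Some((remaining, packet_data))
}
```

Every offset here has to be tracked manually, which is exactly the bookkeeping the combinator style removes.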
Part 1: Cap'n Proto
Now it's time to get into the meaty part of the story. Cap'n Proto was the first format I tried because of how long it has supported Rust. It was a bit tricky to get the compiler installed, but once that was done, the schema document wasn't hard to create.
In practice, I had a ton of issues with Cap'n Proto.
To serialize new messages, Cap'n Proto uses a "builder" object. This builder allocates memory on the heap to hold the message content, but because builders can't be re-used, we have to allocate a new buffer for every single message. I was able to work around this and re-use memory with a special builder, but it required reading through Cap'n Proto's benchmarks to find an example and using transmute to bypass Rust's borrow checker.
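The allocation pattern at issue is not specific to Cap'n Proto. As a rough illustration only, the std-only sketch below contrasts allocating a fresh buffer per message with writing into a caller-owned buffer. The functions `encode_fresh` and `encode_into` and the toy encoding are hypothetical stand-ins, not Cap'n Proto's API.

```rust
/// Toy encoder that allocates a new buffer for every message (hypothetical).
fn encode_fresh(symbol: &str, price: u64) -> Vec<u8> {
    let mut buf = Vec::new();
    encode_into(&mut buf, symbol, price);
    buf
}

/// Same toy encoding, but written into a buffer the caller keeps around.
fn encode_into(buf: &mut Vec<u8>, symbol: &str, price: u64) {
    // `clear` drops the old contents but keeps the allocated capacity,
    // so a warmed-up buffer does not touch the allocator again.
    buf.clear();
    buf.extend_from_slice(&(symbol.len() as u32).to_le_bytes());
    buf.extend_from_slice(symbol.as_bytes());
    buf.extend_from_slice(&price.to_le_bytes());
}
```

In a hot loop, the second form pays for one allocation up front rather than one per message, which is the effect the special builder was after.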
Reading messages is better, but still had issues. Cap'n Proto has two message encodings: a "packed" version, and an unpacked version. When reading "packed" messages, we need a buffer to unpack the message into before we can use it; Cap'n Proto allocates a new buffer to unpack the message every time, and I wasn't able to figure out a way around that. In contrast, the unpacked message format should be where Cap'n Proto shines; its main selling point is that there's no decoding step. However, accomplishing this required copying code from the private API (since fixed), and we still allocate a vector on every read for the segment table.
In the end, I put in significant work to make Cap'n Proto as fast as possible in the tests, but there were too many issues for me to feel comfortable using it long-term.
Part 2: Flatbuffers
This is the new kid on the block. After a first attempt didn't work out, official Rust support was recently added. Flatbuffers is intended to address the same problems as Cap'n Proto: provide a binary schema that describes the format and can be used from many languages. The difference is that Flatbuffers claims to have a simpler wire format and more flexibility.
On the whole, I enjoyed using Flatbuffers; the tooling is nice enough, and unlike Cap'n Proto, parsing messages was actually zero-copy and zero-allocation. There were some issues though.
First, Flatbuffers (at least in Rust) can't handle nested vectors. This is a problem for formats like the following:
table Message {
    symbol: string;
}
table MultiMessage {
    messages:[Message];
}
We want to create a MultiMessage that contains a vector of Message, but each Message has a vector (the string type). I was able to work around this by caching Message elements in a SmallVec before building the final MultiMessage, but it was a painful process.
Second, streaming support in Flatbuffers seems to be something of an afterthought. Where Cap'n Proto in Rust handles reading messages from a stream as part of the API, Flatbuffers just puts a u32 at the front of each message to indicate the size. Not specifically a problem, but I would've rather seen message size integrated into the underlying format.
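Size-prefixed framing of this kind is simple to sketch with the standard library alone. The example below assumes a little-endian u32 prefix and uses hypothetical helper names (`write_size_prefixed`, `read_size_prefixed`). It treats each payload as opaque bytes and does not use the Flatbuffers API.

```rust
use std::io::{self, Read, Write};

/// Write one message as a little-endian u32 length followed by the payload.
fn write_size_prefixed<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    writer.write_all(&(payload.len() as u32).to_le_bytes())?;
    writer.write_all(payload)
}

/// Read the next message into `buf`, re-using its allocation.
/// Returns Ok(false) when the stream ends before a length prefix.
/// Simplification: a truncated prefix is also reported as end of stream.
fn read_size_prefixed<R: Read>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<bool> {
    let mut len_bytes = [0u8; 4];
    match reader.read_exact(&mut len_bytes) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(false),
        Err(e) => return Err(e),
    }
    let len = u32::from_le_bytes(len_bytes) as usize;

    // Resize the caller's buffer to exactly one message and fill it.
    buf.clear();
    buf.resize(len, 0);
    reader.read_exact(&mut buf[..])?;
    Ok(true)
}
```

A reader loop then calls `read_size_prefixed` until it returns `Ok(false)`, handing each filled buffer to the message accessor.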
Ultimately, I enjoyed using Flatbuffers, and had to do significantly less work to make it fast.
Final Results
NOTE: Need to expand on this, but the numbers reported below are from IEX's 2019-09-03 data, averaged over 10 runs.
Serialization
Schema | Median | 99th Pctl | 99.9th Pctl | Total |
---|---|---|---|---|
Cap'n Proto Packed | 413ns | 1751ns | 2943ns | 14.80s |
Cap'n Proto Unpacked | 273ns | 1828ns | 2836ns | 10.65s |
Flatbuffers | 355ns | 2185ns | 3497ns | 14.31s |
SBE | 91ns | 1535ns | 2423ns | 3.91s |
Deserialization
Schema | Median | 99th Pctl | 99.9th Pctl | Total |
---|---|---|---|---|
Cap'n Proto Packed | 539ns | 1216ns | 2599ns | 18.92s |
Cap'n Proto Unpacked | 366ns | 737ns | 1583ns | 12.32s |
Flatbuffers | 173ns | 421ns | 1007ns | 6.00s |
SBE | 116ns | 286ns | 659ns | 4.05s |
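The post doesn't say how the percentile columns were computed. As one plausible reading, the sketch below derives the same four columns from raw per-message timings using the nearest-rank method. The `Summary` type and function names are hypothetical, and percentiles are passed in tenths of a percent so the arithmetic stays in integers.

```rust
/// Summary columns matching the tables above (all values in nanoseconds).
struct Summary {
    median: u64,
    p99: u64,
    p999: u64,
    total: u64,
}

/// Nearest-rank percentile over sorted samples.
/// `permille` is the percentile in tenths of a percent (999 = 99.9th).
fn percentile(sorted: &[u64], permille: usize) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    // rank = ceil(permille / 1000 * n), clamped to 1..=n
    let rank = (permille * sorted.len() + 999) / 1000;
    let idx = rank.max(1).min(sorted.len()) - 1;
    Some(sorted[idx])
}

/// Sort the samples in place and compute the summary columns.
fn summarize(samples: &mut [u64]) -> Option<Summary> {
    samples.sort_unstable();
    Some(Summary {
        median: percentile(samples, 500)?,
        p99: percentile(samples, 990)?,
        p999: percentile(samples, 999)?,
        total: samples.iter().sum(),
    })
}
```

A histogram-based recorder would give similar figures without keeping every sample in memory. The sort-based version is just the easiest to follow.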