Unlocking greater performance in the MongoDB Rust Driver via raw BSON and zero copy deserialization
The 2.2.0 release of the Rust BSON library (the bson crate) introduced a "raw" BSON API, which enabled us to achieve some internal performance improvements in the Rust MongoDB driver (the mongodb crate) and, in some cases, can be leveraged by users to dramatically improve the performance of their queries, including via serde's zero-copy deserialization functionality. In this post, I'll demonstrate how to use this new API and provide some examples of where it can help speed up your reads.
Index
- What is "raw" BSON?
- Overview of the raw BSON API
- Speeding up your queries using raw BSON
- Notable differences between RawDocumentBuf and Document
- Conclusion
- Acknowledgments
What is "raw" BSON?
Before we jump into any examples, I first want to clarify what I mean by "raw" BSON. To take a step even further back, I want to describe what BSON is generally. BSON stands for "Binary JSON" (kind of), and it's a binary format describing ordered maps of strings to various types, notably more types than JSON supports (for example, BSON has datetimes). It's used by MongoDB to store data, to communicate with drivers, and in its query language. The bson crate prior to v2.2.0 already had the Bson and Document types to model BSON values and maps (called documents) respectively, but they aren't "raw" in the sense that their in-memory representation isn't actual BSON bytes. For example, a Document actually contains an indexmap::IndexMap<String, Bson>, which itself contains a hash table and other Rust data structures, and while a Document can be serialized to and deserialized from its raw BSON equivalent, it doesn't contain those bytes itself. The new "raw" BSON document type, which was introduced in 2.2.0 as RawDocumentBuf, instead just contains a Vec<u8> of bytes that correspond to actual BSON. The main benefit of this type is that when reading a document from the database, the bytes can simply be copied as-is instead of being deserialized key-by-key into a Rust data structure like a Document, which can be very time consuming.
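To make "bytes that correspond to actual BSON" concrete, here is a minimal sketch that hand-encodes a one-field document using only the standard library. The helper encode_single_string_doc is hypothetical and is not part of the bson crate; it only handles a single string field, following the layout in the BSON specification (a little-endian length prefix, the elements, and a trailing null byte).

```rust
/// Encodes a document with a single string field into BSON bytes.
/// Hypothetical helper for illustration; not part of the `bson` crate.
fn encode_single_string_doc(key: &str, value: &str) -> Vec<u8> {
    let mut element = Vec::new();
    element.push(0x02); // element type tag: UTF-8 string
    element.extend_from_slice(key.as_bytes());
    element.push(0x00); // keys are null-terminated
    // The string length is a little-endian i32 that counts the trailing null byte.
    element.extend_from_slice(&(value.len() as i32 + 1).to_le_bytes());
    element.extend_from_slice(value.as_bytes());
    element.push(0x00);

    // The total length covers the 4-byte prefix, the elements, and the final null.
    let total = (4 + element.len() + 1) as i32;
    let mut doc = Vec::with_capacity(total as usize);
    doc.extend_from_slice(&total.to_le_bytes());
    doc.extend_from_slice(&element);
    doc.push(0x00);
    doc
}

fn main() {
    let bytes = encode_single_string_doc("hi", "y'all");
    // 4 (prefix) + 1 (tag) + 3 (key) + 4 (string length) + 6 (value) + 1 (terminator)
    assert_eq!(bytes.len(), 19);
    println!("{:02x?}", bytes);
}
```

A raw document type stores exactly this kind of buffer, which is why receiving one from the server can be a plain byte copy rather than a key-by-key parse.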
Overview of the raw BSON API
As mentioned above, v2.2.0 of bson introduces the RawDocumentBuf type, which is an owned raw document type. Because it owns the BSON bytes that back it, it can be mutated via the append method, which adds a new key-value pair. RawDocumentBufs can also be created via the rawdoc! macro, which behaves similarly to the existing doc! macro.
let mut doc = RawDocumentBuf::new();
doc.append("a key", "a value");
doc.append("an integer", 12i32);
println!("{:?}", doc.get("a key")); // prints "Ok(Some(String("a value")))"
let other = rawdoc! {
    "a key": "a value",
    "an integer": 12i32
};
assert_eq!(doc, other);
To allow for accessing borrowed documents, including subdocuments borrowed from other documents, there's also the unsized RawDocument type, which is only used via a wide pointer, &RawDocument. This type includes all the same methods that RawDocumentBuf does minus the mutating ones, since it is only ever used as an immutable reference. Instances can be created via RawDocument::from_bytes or via the get / get_document methods on other documents.
let doc = RawDocument::from_bytes(b"\x13\x00\x00\x00\x02hi\x00\x06\x00\x00\x00y'all\x00\x00")?;
assert_eq!(doc.get("hi")?, Some(RawBsonRef::String("y'all")));
let top = rawdoc! {
    "top_key": {
        "some_key": 12,
        "other": true
    }
};
// no clones are performed when retrieving the subdocument nor when iterating over it
let subdoc = top.get_document("top_key")?;
for kvp in subdoc {
    let (k, v) = kvp?;
    // the raw value is printed with its Debug representation
    println!("{} = {:?}", k, v);
}
Individual values are modeled via the RawBson enum, which is similar to the existing Bson enum except that all the variants are "raw" (e.g. RawBson::Document contains a RawDocumentBuf instead of a Document). The reference version of this type is RawBsonRef<'a>, which only contains references into raw BSON data that is owned elsewhere. Instances of RawBson can be created via the rawbson! macro, which behaves similarly to the existing bson! macro.
let mut doc = RawDocumentBuf::new();
doc.append("key", RawBson::String("a".to_string()));
doc.append("other", rawbson!("a"));
let s = doc.get("key")?; // gets a reference, no copy here
println!("{:?}", s); // prints "Some(String("a"))"
The raw BSON API also contains support for borrowed deserialization via serde, which can greatly speed up populating structs from BSON by skipping expensive copies.
let doc = rawdoc! {
    "key": "value",
};
#[derive(Deserialize)]
struct Data<'a> {
    #[serde(borrow)]
    key: &'a str
}
// borrows from the input bson rather than copying
let d: Data = bson::from_slice(doc.as_bytes())?;
assert_eq!(d.key, "value");
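To show what "borrowing from the input rather than copying" means at the memory level, here is a minimal standard-library sketch that does not use serde or the bson crate. The helpers borrow_first_string and points_into are hypothetical, and the parser assumes the first element of the document is a string; the point is only that the returned &str points into the original buffer, while an owned copy does not.

```rust
/// Hypothetical struct that borrows its field from the input buffer.
struct Data<'a> {
    key: &'a str,
}

/// The bytes of the document {"key": "value"}.
const DOC: &[u8] = b"\x14\x00\x00\x00\x02key\x00\x06\x00\x00\x00value\x00\x00";

/// Reads the first element of a document, assuming it is a string (type 0x02).
/// The returned struct borrows from `doc`; nothing is copied.
fn borrow_first_string(doc: &[u8]) -> Option<Data<'_>> {
    if *doc.get(4)? != 0x02 {
        return None;
    }
    let key_len = doc.get(5..)?.iter().position(|&b| b == 0)?;
    let len_start = 5 + key_len + 1;
    let l = doc.get(len_start..len_start + 4)?;
    let len = i32::from_le_bytes([l[0], l[1], l[2], l[3]]);
    if len < 1 {
        return None;
    }
    let start = len_start + 4;
    // The stored length includes the trailing null, which is excluded here.
    let value = doc.get(start..start + len as usize - 1)?;
    Some(Data { key: std::str::from_utf8(value).ok()? })
}

/// Returns true if `s` starts inside `buf`, i.e. it is a view and not a copy.
fn points_into(buf: &[u8], s: &str) -> bool {
    let start = buf.as_ptr() as usize;
    let end = start + buf.len();
    let p = s.as_ptr() as usize;
    p >= start && p < end
}

fn main() {
    let d = borrow_first_string(DOC).expect("well-formed document");
    assert_eq!(d.key, "value");
    assert!(points_into(DOC, d.key)); // borrowed: a view into DOC
    let owned: String = d.key.to_owned();
    assert!(!points_into(DOC, &owned)); // copied: lives in its own allocation
}
```

For a five-byte string the copy is negligible, but the same mechanism avoids copying multi-megabyte strings or binary fields.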
Speeding up your queries using raw BSON
Simply by upgrading to the 2.2.0 release of the driver, your queries will already be quite a bit faster! This is due to some optimizations we were able to make internally in the driver by using the new types. For example, this benchmark, which finds all 10,000 documents in a collection, runs 12% faster after bumping the version of mongodb from 2.1.0 to 2.2.0 without any changes to the code! (For more info on how that benchmark works and profiling Rust code in general, check out my previous blog post).
If you'd like to leverage the raw BSON API to unlock further performance improvements, you can do so by using RawDocumentBuf as the generic type of your collection and the cursors returned from it. This will greatly speed up queries where you don't need to perform lots of key lookups, since in those cases using raw BSON allows the driver to avoid parsing the individual key-value pairs more than necessary (e.g. a situation where you just serialize the results of the query straight to JSON to be served to the frontend).
For example, given the following code:
let coll = db.collection::<Document>("docs");
let docs = coll.find(None, None).await?.try_collect::<Vec<_>>().await?;
serde_json::to_string_pretty(&docs)?
Simply updating Document to RawDocumentBuf results in a 70% speedup while still yielding the exact same result!
let coll = db.collection::<RawDocumentBuf>("docs");
let docs = coll.find(None, None).await?.try_collect::<Vec<_>>().await?;
serde_json::to_string_pretty(&docs)?
See here for benchmarks demonstrating this. Note that RawDocumentBuf works well here because no key lookups are performed on the result documents. If the document needs to be inspected a lot, it's still preferable, both ergonomically and for performance reasons, to use a struct that implements Deserialize and models your data instead.
Also note that the difference is so large in this case because we're returning the entire collection (10k documents). In cases where only a few results need to be returned, using RawDocumentBuf may yield a smaller improvement or none at all.
Further speedups via zero-copy deserialization
The serde framework includes support for borrowing data from the input during deserialization (a.k.a. zero-copy deserialization), which in some cases can be used to avoid large copies and greatly speed things up. These cases are generally when the documents being deserialized are really big and/or include fields that are really large (e.g. big strings or binary values). Starting in 2.2.0, the Cursor type includes methods for borrowing from the underlying result set, enabling users to take advantage of this functionality:
#[derive(Debug, Deserialize)]
struct Stuff<'a> {
    #[serde(borrow)]
    name: &'a str,
    #[serde(borrow)]
    bio: &'a str
}
let coll = db.collection::<Stuff>("stuff");
let mut cursor = coll.find(None, None).await?;
// advance() returns a future, so it must be awaited before `?` is applied
while cursor.advance().await? {
    println!("{:?}", cursor.deserialize_current()?);
}
Note that for a lot of workloads, the time spent server-side processing the query and the network latency dwarf the time spent during deserialization, so borrowing during deserialization won't make a big difference in those cases. However, in cases that involve huge documents, the difference can be quite significant. For a benchmark demonstrating this, see here.
Notable Differences between RawDocumentBuf and Document
When considering whether the raw BSON API is right for your use case, it may be helpful to consider the following differences between RawDocumentBuf and Document.
Validating BSON
In order to construct a Document from BSON, the bytes need to be parsed into Rust types (String and Bson) up front. For RawDocumentBuf, the parsing happens lazily as the document is iterated, meaning invalid BSON bytes could be encountered during iteration. As a result, the various ways of accessing elements in a raw BSON document can fail and thus return Result<RawBsonRef>. This differs from Document, which can return just Bson (and equivalents), since all the parsing was done up front.
// ensures that bytes contains valid BSON, parses it all out
let doc: Document = bson::from_slice(bson_bytes)?;
// only performs simple bounds checking, may have invalid bytes in the middle encountered during iteration
let raw_doc: RawDocumentBuf = bson::from_slice(bson_bytes)?;
for (k, v) in doc {
    println!("{}: {:?}", k, v); // prints <key>: Bson::<type>
}
for kvp in &raw_doc {
    println!("{:?}", kvp); // prints Ok(RawBsonRef) or Err(...)
}
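The lazy behavior described above can be sketched with the standard library alone. This is not the bson crate's implementation: RawIter and raw_iter are hypothetical names, and the sketch only understands string elements. It shows a cheap up-front check passing on a buffer whose second value is not valid UTF-8, with the problem surfacing only once iteration reaches it.

```rust
/// Lazy iterator over the elements of a raw BSON document.
/// Simplified sketch: only string (0x02) elements are understood.
struct RawIter<'a> {
    doc: &'a [u8],
    pos: usize,
}

/// Cheap up-front check: only the length prefix and final null byte are inspected.
fn raw_iter(doc: &[u8]) -> Result<RawIter<'_>, String> {
    if doc.len() < 5 {
        return Err("document too short".to_string());
    }
    let declared = i32::from_le_bytes([doc[0], doc[1], doc[2], doc[3]]);
    if declared < 0 || declared as usize != doc.len() || doc[doc.len() - 1] != 0 {
        return Err("bad length prefix or terminator".to_string());
    }
    Ok(RawIter { doc, pos: 4 })
}

impl<'a> RawIter<'a> {
    fn parse_element(&mut self) -> Result<(&'a str, &'a str), String> {
        let doc = self.doc;
        let tag = doc[self.pos];
        if tag != 0x02 {
            return Err(format!("unsupported element type {:#04x}", tag));
        }
        let key_start = self.pos + 1;
        let key_len = doc[key_start..]
            .iter()
            .position(|&b| b == 0)
            .ok_or("unterminated key")?;
        let key = std::str::from_utf8(&doc[key_start..key_start + key_len])
            .map_err(|e| e.to_string())?;
        let len_start = key_start + key_len + 1;
        let l = doc.get(len_start..len_start + 4).ok_or("truncated length")?;
        let len = i32::from_le_bytes([l[0], l[1], l[2], l[3]]);
        if len < 1 {
            return Err("invalid string length".to_string());
        }
        let val_start = len_start + 4;
        let val_end = val_start + len as usize - 1; // exclude the trailing null
        let val = doc.get(val_start..val_end).ok_or("value out of bounds")?;
        let val = std::str::from_utf8(val).map_err(|e| e.to_string())?;
        self.pos = val_end + 1;
        Ok((key, val))
    }
}

impl<'a> Iterator for RawIter<'a> {
    type Item = Result<(&'a str, &'a str), String>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.pos >= self.doc.len() - 1 {
            return None; // reached the terminator, or stopped after an error
        }
        let item = self.parse_element();
        if item.is_err() {
            self.pos = self.doc.len(); // stop iterating after the first error
        }
        Some(item)
    }
}

/// {"a": "ok", "b": <two bytes that are not valid UTF-8>}
const BAD_DOC: &[u8] =
    b"\x19\x00\x00\x00\x02a\x00\x03\x00\x00\x00ok\x00\x02b\x00\x03\x00\x00\x00\xff\xfe\x00\x00";

fn main() {
    // The cheap check passes even though the second value is invalid.
    let iter = raw_iter(BAD_DOC).expect("length prefix and terminator look fine");
    for item in iter {
        println!("{:?}", item); // the first item is Ok, the second is Err
    }
}
```

This is the trade-off in miniature: construction is nearly free, but every access has to be prepared to handle an error.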
Looking up keys
Because RawDocumentBuf contains BSON instead of a hashmap-like type, lookups by key involve traversing the document from the front (i.e. linear time complexity). This means that RawDocumentBuf::get is potentially a lot slower than Document::get, especially if there are a lot of keys and the key being looked up is at the end of the document.
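A rough standard-library sketch of why the lookup is linear: to find a key, every element before it has to be stepped over, because the position of each element depends on the sizes of the ones in front of it. The function find_str is hypothetical and only understands string and int32 elements; the bson crate handles every element type.

```rust
/// The bytes of the document {"a": 1i32, "hi": "y'all"}.
const DOC: &[u8] =
    b"\x1a\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x02hi\x00\x06\x00\x00\x00y'all\x00\x00";

/// Looks up a string-valued key by scanning raw BSON bytes from the front.
/// Simplified sketch: only string (0x02) and int32 (0x10) elements are understood.
fn find_str<'a>(doc: &'a [u8], wanted: &str) -> Option<&'a str> {
    let mut pos = 4; // skip the 4-byte document length prefix
    while pos < doc.len() && doc[pos] != 0x00 {
        let tag = doc[pos];
        pos += 1;
        // Read the null-terminated key of the current element.
        let key_len = doc[pos..].iter().position(|&b| b == 0)?;
        let key = std::str::from_utf8(&doc[pos..pos + key_len]).ok()?;
        pos += key_len + 1;
        match tag {
            0x02 => {
                let l = doc.get(pos..pos + 4)?;
                let len = i32::from_le_bytes([l[0], l[1], l[2], l[3]]);
                if len < 1 {
                    return None;
                }
                let start = pos + 4;
                let end = start + len as usize - 1; // exclude the trailing null
                if key == wanted {
                    return std::str::from_utf8(doc.get(start..end)?).ok();
                }
                pos = end + 1; // step over the value to reach the next element
            }
            0x10 => pos += 4, // an int32 value is always 4 bytes
            _ => return None, // element types outside this sketch
        }
    }
    None
}

fn main() {
    // Finding "hi" requires stepping over the "a" element first.
    assert_eq!(find_str(DOC, "hi"), Some("y'all"));
    assert_eq!(find_str(DOC, "missing"), None);
}
```

With a hash-table-backed Document, the same lookup does not depend on where the key sits in the document.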
Inserting new elements
Because key lookups can be so expensive, inserting a new element into a RawDocumentBuf would be expensive too if it first checked whether the key already existed (which Document does). To avoid this, instead of an insert method, RawDocumentBuf only has an append method that just tacks the new key-value pair onto the end without checking whether the key exists in the document already. This is super fast, but users have to be careful not to accidentally append the same key twice.
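The append-only design can be illustrated with a small standard-library sketch. The helpers append_str, empty_doc, and count_key are hypothetical and handle only string values; they are not the bson crate's code. Appending touches just the end of the buffer and the length prefix, which is why it is cheap, and also why nothing stops the same key from being written twice.

```rust
/// Returns the bytes of an empty document: a length prefix of 5 plus the terminator.
fn empty_doc() -> Vec<u8> {
    vec![0x05, 0x00, 0x00, 0x00, 0x00]
}

/// Appends a string element without checking whether `key` already exists.
/// Hypothetical helper illustrating the idea; not the `bson` crate's code.
fn append_str(doc: &mut Vec<u8>, key: &str, value: &str) {
    doc.pop(); // drop the document's trailing null terminator
    doc.push(0x02); // element type tag: UTF-8 string
    doc.extend_from_slice(key.as_bytes());
    doc.push(0x00);
    doc.extend_from_slice(&(value.len() as i32 + 1).to_le_bytes());
    doc.extend_from_slice(value.as_bytes());
    doc.push(0x00);
    doc.push(0x00); // restore the document terminator
    // Patch the length prefix to reflect the new total size.
    let total = doc.len() as i32;
    doc[..4].copy_from_slice(&total.to_le_bytes());
}

/// Counts how many elements use `wanted` as their key.
/// Assumes every element is a string, as produced by `append_str`.
fn count_key(doc: &[u8], wanted: &str) -> usize {
    let (mut pos, mut count) = (4, 0);
    while doc[pos] != 0x00 {
        pos += 1; // skip the type tag (always 0x02 in this sketch)
        let key_len = doc[pos..].iter().position(|&b| b == 0).unwrap();
        if &doc[pos..pos + key_len] == wanted.as_bytes() {
            count += 1;
        }
        pos += key_len + 1;
        let len = i32::from_le_bytes([doc[pos], doc[pos + 1], doc[pos + 2], doc[pos + 3]]);
        pos += 4 + len as usize;
    }
    count
}

fn main() {
    let mut doc = empty_doc();
    append_str(&mut doc, "k", "v1");
    append_str(&mut doc, "k", "v2"); // no error: the duplicate is silently written
    assert_eq!(count_key(&doc, "k"), 2);
}
```

Detecting the duplicate would require exactly the kind of front-to-back scan that append is designed to avoid.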
Conclusion
The raw BSON API is included in version 2.2.0 of the bson crate, and support for working with it was introduced in version 2.2.0 of the driver. Check them out and let us know how they're working for you by filing an issue on GitHub or on our Jira project. Thanks!
Acknowledgments
Huge shout out to community member @jcdyer (author of the rawbson crate from which much of the raw BSON code in bson is derived) who has long spearheaded this effort!