Emacsen's Blog

Thoughts on Canonical S-Expressions

Datashards currently uses Canonical S-Expressions as a data format and after using it for a few months, I have some thoughts.

First things first: If you aren't familiar with the format, let me give you a quick rundown. Canonical S-Expressions are a bit like regular S-Expressions, with a twist. If you already know Lisp, none of this will be new, but for the rest of you, there are two items in an S-Expression- a list and an atom. A list is what it sounds like- a sequence of things. And an atom is a thing. An S-Expressions looks like:

(item1 item2 item3 item4)

If you're familiar with Python or Javascript, you can think of that as the same as:

[item1, item2, item3, item4]

In Canonical S-Expressions (csexp), every atom is actually a byte object, and we say the size of the byte object by prepending it with the number of bytes, followed by a colon:

(5:hello5:world)

That's a list of two items, 'hello' and 'world'. I'm putting these in quotes but the values aren't strings, they're bytes. That means it's very efficient to put raw binary data in a csexp. If you put binary data in JSON, you'd have to do something like base64 encode it. No need in csexp!

You can also give a "type hint" in csexp, so if you have a binary object that represents an image, you can stick the mimetype in the csexp, such as:

([image/jpeg]1024:<bytes>)

You can also store other lists inside of a csexp, such as:

(9:groceries(4:milk5:bread))

What I Like

There's a lot to like about Canonical S-Expressions. They're extremely space efficient, very flexible and super easy to parse. Writing a reader for a csexp is fairly trivial. And even if your language doesn't already have a csexp library, you can easily write one in a day, if not an afternoon.

The other thing I like about Canonical S-Expressions is that they do what they claim to do and nothing else. They're a binary format that only does byte strings and lists.

What's Not to Like About Canonical S-Expressions

Working with CSEXP data can be a pain. You're always stuck writing a reader for your data. Your reader will take the resulting abstract parsed data and convert it into something your application will actually consume. In some cases this conversion is easy, 3:100 becomes the integer 100. If you want to store more complex data structures, such as associative arrays, however, then you'll need to think about it.

Since CSEXP doesn't have associative arrays, only lists, you'll end up writing the serialization/deserialization format on your own. You could store them as lists of lists, ((key val) (key val)) or the more compact form of (key val key val) or you could (ab)use the hint system, such as ([key]value [key]value). Whatever choice you make, it will be specific to your application and someone who reads the document will need to think about the choices you made beforehand. Or if you're inheriting data in this format, you may end up having to guess at the meaning of the data structure.

This type of step is necessary for many serialization formats. In some, like Protobufs, it's a requirement. In XML, it was not strictly necessary but almost always done, and in some applications using JSON, it may not be necessary at all.

Canonical S-Expressions occupy a strange middle ground where having a formal schema is not strictly necessary, as it's schema-less, but it's also challenging to work without one.

Flexible (Schema-less) Data Serialization Formats

Flexible data formats are a topic of deep discussion and debate. In the 90s, it seemed that the world had converged on XML as the One Format to Rule Them All. The problem with XML is that even though the format is self-documenting in some ways, ie , the value inside tags needed to be converted during a secondary reader, separate from the parser.

Since this distinction isn't always clear, the parser parses the raw data into a machine readable data structure, while a reader parses the data (usually post-parsed) into application specific data structures.

Canonical S-Expressions have the same problem in regards to needing a reader that XML does, but unlike XML, you don't have the storage or bandwidth issues of the tags.

JSON seems to have won out the generic data format wars by offering some types, making writing a reader trivial (or in some cases, unnecessary) but anyone who has ever had to work with JSON knows that its thin layer of types is misleading. As an example, "How do you store a date in Javascript?"

You could store it as Unix time, seconds after the epoch, or you could store it in an ISO 8601 formatted string, ie "2008-09-15T15:53:00+05:00" or an RFC 822 date format, or something else entirely. Your parser will happily give you a string, but you're stuck needing a reader to do that final conversion, just like you did with XML.

JSON-LD solves some of this by giving your values semantic meaning, but it makes the parser more complex.

And neither XML nor JSON handle binary data well. To store binary data in either format, you must first convert it to Base64, which introduces an enormous amount of storage and transmission overhead.

Canonical S-Expressions offers none of the overhead of XML and doesn't claim to do type conversions. Since you'll need a reader anyway, you can do your type conversions in that step.

Further Thoughts and Alternatives

In practice, having some type data assistance does offer benefits. It makes your reader simpler, and it makes the format more pleasant to work with, and so while I appreciate cxesp's simplicity, I find working with it to be more challenging than it should be.

One thought that I keep having while I'm using csexp is to use the type hints to store information such as the data type. Imagine if instead of:

20:2019-10-02T07:11:07Z

We instead stored:

[iso8601]20:2019-10-02T07:11:07Z

That would give us the data type and we could let the reader take some of the work off of our programming logic. This is similar to JSON-LD's method of storing semantic data.

I personally like this idea, but it requires changes to the readers to recognize a new "Semantic Canonical S-Expression".

A simpler idea would be to store some type information alongside the data, so instead of 3:253, you might store I3:253, with "I" indicating that the value is an integer. This is exactly what the Bencoding format does. Bencoding offers many of the same benefits of CSEXP, but because it also supports types, is a bit easier to work with. The downside, as always, is that this helpfulness comes at the cost of storage and bandwidth.

Other formats exist as well. I previously mentioned Bencoding, but there is also MessagePack, ASN.1, CBOR, and the newest, Preserves. Each of these has a different approach, though they center around the same problem- making it easy to store arbitrary data, especially binary data, on disk and on the network.

It's beyond the scope of this post to delve into each of them. I think Preserves is the most interesting of the formats. It's certainly the most expressive despite being compact, but since I haven't used it I don't know if that expressiveness will be something I need or if I could simply use Bencoding or MessagePack to the same effect.

Conclusion

Canonical S-Expressions are a great, flexible, compact data format. It's very fast and efficient. If you have straightforward needs, it's certainly worth checking out. In my use case, Datashards, it fits our current needs. If we end up wanting to store more complex data structures in the format, such as associative arrays, that will be the time to re-evaluate the format choice to see if something else would be a better fit.

misc