Pointers to Binary RDF?

Has anyone done any evaluation of, or research into, binary RDF using technologies such as Protocol Buffers, Thrift or plain ASN.1?

A new W3C note was submitted recently: Binary RDF Representation for Publication and Exchange (HDT). It is the specification related to the paper. Hopefully it will take off and some open-source parsers will become available.

I have (very) recently written a weblog posting about the Sesame Binary RDF format. This is a home-grown format developed by Aduna. Its main design goals are reducing parsing overhead, allowing fully streaming processing (and so minimizing memory requirements), and of course minimizing the number of bytes sent over the network.

I think Aduna chose to do their own format instead of implementing HDT because they had a specific issue with datasets containing very large literals, which HDT deals with less efficiently than the Sesame format. But this is all hearsay; I have no figures to back this up.

I think there is not much to gain from a dedicated binary format compared with plain compressed text. Take Turtle, for example. How many bytes are used as value separators? One: '\t', ',', '.' or ';'. The prefix used to abbreviate a URI can be a single byte followed by a ':'. Everything else is UTF-8 text. That leaves so little margin that a binary serialization would not beat an optimally zipped Turtle encoding by more than a few percent. Is that really worth the added complexity?

Between RDF/XML and Turtle you can already get quite a bit of savings. For example, ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/citations.rdf.gz in Turtle can easily be 17% smaller than the RDF/XML, and larger savings are easily possible. Gzipping the RDF/XML achieves a compression factor of 3.6; gzipping the Turtle achieves a compression factor of 3.1. But after compression the size difference is only 4%.
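To make the arithmetic behind those figures explicit, here is a minimal Python sketch. The first function only reuses the numbers quoted above (17% raw saving, compression factors 3.6 and 3.1) and comes out at roughly 4% (about 3.6%). The second part measures gzip on toy data: the example.org URIs, the prefixes and the 1000 generated triples are all invented for illustration, N-Triples stands in for the more verbose syntax, and none of it is the UniProt file.

```python
import gzip


def saving_after_compression(raw_saving, factor_verbose, factor_compact):
    """Relative size saving of the compact format once both are compressed.

    raw_saving: fraction by which the compact format is smaller uncompressed.
    factor_*:   compression factor (raw size / compressed size) per format.
    """
    verbose_size = 1.0 / factor_verbose
    compact_size = (1.0 - raw_saving) / factor_compact
    return 1.0 - compact_size / verbose_size


def compression_factor(data: bytes) -> float:
    """Raw size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


# Figures quoted in the post: Turtle 17% smaller, factors 3.6 and 3.1.
saving = saving_after_compression(0.17, 3.6, 3.1)
print(f"saving after gzip: {saving:.1%}")

# Hypothetical toy data: the same 1000 triples in a verbose and a compact form.
BASE = "http://example.org/"
ntriples = "".join(
    f'<{BASE}citation/c{i}> <{BASE}vocab/title> "Title {i}" .\n'
    for i in range(1000)
).encode("utf-8")
turtle = (
    f"@prefix c: <{BASE}citation/> .\n"
    f"@prefix v: <{BASE}vocab/> .\n"
    + "".join(f'c:c{i} v:title "Title {i}" .\n' for i in range(1000))
).encode("utf-8")

for name, data in (("N-Triples", ntriples), ("Turtle", turtle)):
    print(f"{name}: {len(data)} bytes raw, factor {compression_factor(data):.1f}")
```

On real data the factors will of course differ; the point is only that the more verbose syntax compresses better, which narrows the gap.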

A binary serialization only becomes interesting when messages are so short that the overhead of the UTF-8 text and the prefix declarations becomes an important factor. At that point your messages are so tightly specified that you are no longer talking about a generic information interchange system.
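A rough way to see that overhead is the small Python sketch below. It builds a single-triple Turtle message and reports how much of it is prefix declarations and how large it is after gzip. The message, URIs and prefixes are made up for illustration.

```python
import gzip

# Hypothetical single-triple message; URIs and names are invented.
prefixes = (
    "@prefix c: <http://example.org/citation/> .\n"
    "@prefix v: <http://example.org/vocab/> .\n"
)
body = 'c:c1 v:title "Title 1" .\n'
message = (prefixes + body).encode("utf-8")

# Share of the message spent on prefix declarations rather than data.
prefix_share = len(prefixes.encode("utf-8")) / len(message)

# gzip framing alone (header plus trailer) costs at least 18 bytes.
gzipped_size = len(gzip.compress(message))

print(f"message: {len(message)} bytes, prefix share: {prefix_share:.0%}")
print(f"gzipped: {gzipped_size} bytes")
```

For a message this small most of the bytes are declarations, and general-purpose compression has little repetition to work with.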

In conclusion, I believe that a general text compression algorithm over RDF in Turtle will come so close in performance that a binary interchange format does not add enough value.

You may be interested in RDF compression [1]. The paper is just 2 pages long so you won't waste too much time reading it ;)

[1] Javier D. Fernández, Claudio Gutierrez, Miguel A. Martínez-Prieto. RDF Compression: Basic Approaches, In Proc. of WWW Conference 2010. http://www.dcc.uchile.cl/~cgutierr/papers/www2010.pdf

Perhaps we'll get a natural binary serialization of rdf/xml(2), thanks to EXI.

The key point for me would be whether to await a non-serialization-specific rec for RDF and then create a pure binary RDF serialization, or whether to focus on binary versions of each existing serialization. As mentioned previously, any XML variant will have EXI, and you can already get binary JSON serializations, so there are several options.

Previously, I looked into doing it in ASN.1, but soon stopped because I began to feel that ASN.1 perhaps wouldn't be the most accessible thing to expect people to write serializers and parsers for. It's certainly an option though.

Additionally, I stopped all work and thoughts on binary rdf because I felt it would be a wasted effort until after the RDF community was clear on the 'next steps'.

Answering my own question: Talis use Thrift with compressed N-Triples payloads, but I don't count this as a binary serialisation of the RDF model.

There have been plenty of discussions about binary XML formats, so why not look at using some of these technologies but applying them to RDF/XML?

But I'm not sure that binary XML went anywhere, so the same is likely to be true of binary RDF. Unless there are specific use cases to support?