Pointers to Binary RDF?

IanDavis · June 25, 2010, 10:00pm

Has anyone done any evaluation or research into binary RDF using technologies such as Protocol Buffers, Thrift or plain ASN.1?

fellahst · June 25, 2010, 10:00pm

A new W3C note was submitted recently: Binary RDF Representation for Publication and Exchange (HDT). It is the specification related to the paper. Hopefully it will take off and some open-source parsers will become available.

JeenBroekstra · June 25, 2010, 10:00pm

I have (very) recently written a weblog posting about the Sesame Binary RDF format. This is a home-grown format developed by Aduna. Its main design characteristics are reducing parsing overhead, allowing full streaming processing/minimizing memory requirements and of course minimizing the number of bytes sent over the network.

I think Aduna chose to develop their own format instead of implementing HDT because they had a specific issue with datasets containing very large literals, which HDT deals with less efficiently than the Sesame format. But this is all hearsay; I have no figures to back this up.

Jerven · June 25, 2010, 10:00pm

I think there is not much to gain from a dedicated binary format compared with plain compressed text. Take Turtle as an example. How many bytes are used as value separators? Just one of '\t', ',', '.' or ';'. The prefix used to abbreviate a URI can be a single byte followed by a ':'. The rest is UTF-8 text. That leaves so little margin that a binary serialization would not beat an optimally zipped Turtle encoding by more than a few percent. Is that really worth the added complexity?
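
One way to test this claim on your own data is to gzip a text serialization and a binary one and compare the sizes. The following is only a rough sketch under loud assumptions: the triples are synthetic, the text form is N-Triples-style rather than Turtle (to keep it short), and `as_naive_binary` is a made-up dictionary-plus-IDs layout, not HDT or the Sesame format. It prints the sizes and leaves the conclusion to whatever your data shows.

```python
import gzip
import struct

def make_triples(n):
    """Generate synthetic (subject, predicate, object) triples; illustrative data only."""
    preds = [
        "http://example.org/p/name",
        "http://example.org/p/knows",
        "http://example.org/p/age",
    ]
    return [
        (f"http://example.org/s/{i}", preds[i % 3], f"http://example.org/o/{i % 50}")
        for i in range(n)
    ]

def as_ntriples(triples):
    """Plain-text encoding: one '<s> <p> <o> .' line per triple."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples).encode("utf-8")

def as_naive_binary(triples):
    """Hypothetical binary layout: a term dictionary followed by fixed-width ID triples."""
    ids = {}
    for triple in triples:
        for term in triple:
            ids.setdefault(term, len(ids))
    out = bytearray(struct.pack("<I", len(ids)))      # number of dictionary entries
    for term in ids:                                  # insertion order equals ID order
        raw = term.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw      # length-prefixed UTF-8 term
    for s, p, o in triples:
        out += struct.pack("<III", ids[s], ids[p], ids[o])
    return bytes(out)

triples = make_triples(10_000)
text = as_ntriples(triples)
binary = as_naive_binary(triples)

for label, data in (("text", text), ("naive binary", binary)):
    gz = gzip.compress(data)
    print(f"{label:>12}: {len(data):>8} bytes raw, {len(gz):>8} bytes gzipped")
```

Swapping in a real dataset and a real Turtle writer would give a fairer picture than this toy input.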

Switching from RDF/XML to Turtle already gives quite a bit of savings. For example, ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/citations.rdf.gz in Turtle can easily be 17% smaller than the RDF/XML, and higher numbers are easily possible. Gzipping the RDF/XML achieves a compression factor of 3.6; for Turtle the factor is 3.1. But the size difference between the two gzipped files is now only 4%.
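
The arithmetic behind that last figure can be checked quickly. This sketch just plugs in the numbers quoted above (17% smaller, factors of 3.6 and 3.1) with the uncompressed RDF/XML size normalised to 1.0; the variable names are only for illustration.

```python
# Relative sizes, taking the uncompressed RDF/XML file as 1.0.
rdfxml = 1.0
turtle = rdfxml * (1 - 0.17)   # Turtle is about 17% smaller before compression

rdfxml_gz = rdfxml / 3.6       # gzip compression factor quoted for RDF/XML
turtle_gz = turtle / 3.1       # gzip compression factor quoted for Turtle

# How much smaller the gzipped Turtle is than the gzipped RDF/XML.
saving = 1 - turtle_gz / rdfxml_gz

print(f"gzipped RDF/XML: {rdfxml_gz:.3f}")
print(f"gzipped Turtle:  {turtle_gz:.3f}")
print(f"Turtle advantage after gzip: {saving:.1%}")   # about 3.6%, i.e. roughly 4%
```

So most of the 17% head start disappears once both files go through gzip, because the verbose RDF/XML compresses better.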

A binary serialization only becomes interesting when messages are so short that the overhead of UTF-8 and the prefix declarations becomes an important factor. At that point your messages are so strongly defined that you are no longer talking about a generic information interchange system.

In conclusion, I believe that a general text compression algorithm applied to RDF in Turtle will be so close in performance that a binary interchange format does not add enough value.

AntoineZimmermann · June 25, 2010, 10:00pm

You may be interested in RDF compression [1]. The paper is just 2 pages long so you won't waste too much time reading it ;)

[1] Javier D. Fernández, Claudio Gutierrez, Miguel A. Martínez-Prieto. RDF Compression: Basic Approaches, In Proc. of WWW Conference 2010. http://www.dcc.uchile.cl/~cgutierr/papers/www2010.pdf

Nathan · June 25, 2010, 10:00pm

Perhaps we'll get a natural binary serialization of rdf/xml(2), thanks to EXI.

The key point for me is whether to wait for a Rec for RDF that is not specific to any serialization and then create a pure binary RDF serialization, or to focus on binary versions of each existing serialization. As mentioned previously, any XML variant will have EXI, and you can already get binary JSON serializations, so there are several options.

Previously, I looked into doing it in ASN.1, but soon stopped because I began to feel that ASN.1 perhaps wouldn't be the most accessible thing to expect people to write serializers and parsers for. It's certainly an option, though.

Additionally, I stopped all work and thoughts on binary rdf because I felt it would be a wasted effort until after the RDF community was clear on the 'next steps'.

IanDavis · June 25, 2010, 10:00pm

Answering my own question: Talis use Thrift with compressed N-Triples payloads, but I don't count this as a binary serialisation of the RDF model.
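
To illustrate the distinction, here is a minimal sketch of what a compressed N-Triples payload amounts to. It uses only `zlib` from the standard library; the sample triples are made up, the Thrift message itself is left out, and the idea that the bytes go into a binary field of that message is an assumption, not a description of the actual Talis schema.

```python
import zlib

# Two made-up triples in N-Triples syntax.
ntriples = (
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .\n'
    '<http://example.org/a> <http://example.org/name> "Alice" .\n'
)

# Bytes that would travel in a binary field of the RPC message (field layout is hypothetical).
payload = zlib.compress(ntriples.encode("utf-8"))

# The receiver still has to decompress and then run an ordinary N-Triples parser.
restored = zlib.decompress(payload).decode("utf-8")
print(restored == ntriples)
```

The payload is opaque bytes on the wire, but what comes out the other end is still text that needs parsing, which is why it is compression of a text serialisation rather than a binary encoding of the model.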

ldodds · June 25, 2010, 10:00pm

There have been plenty of discussions about binary XML formats, so why not look at some of these technologies and apply them to RDF/XML?

But I'm not sure that binary XML went anywhere, so the same is likely to be true of binary RDF. Unless there are specific use cases to support?