Storing RDF data into HBase?

What would be the best way to store and query RDF data (triples or quads) using HBase?


I've just found this:

Unfortunately, the .tar.gz has been removed from the website.

I can't comment specifically on HBase, but I have implemented RDF storage for Cassandra which has a very similar BigTable-inspired data model.

You basically have two options in how to store RDF data in wide-column databases like HBase and Cassandra: the resource-centric approach and the statement-centric approach.

In the statement-centric approach, each RDF statement gets its own row, keyed by (for instance) a UUID, with subject, predicate and object columns. In Cassandra, each of these would be a supercolumn containing subcolumns such as type and value, to differentiate between RDF literals, blank nodes and URIs. If you needed to support named graphs, each row could also have a context column listing the named graphs that the statement is part of.
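To make the row layout concrete, here is a minimal sketch using plain Python dicts as a stand-in for the table. It is not the Cassandra or HBase API; the helper names `term` and `statement_row` and the example URIs are made up for illustration.

```python
import uuid

def term(kind, value):
    """Describe an RDF term; kind is 'uri', 'bnode' or 'literal'."""
    return {"type": kind, "value": value}

def statement_row(subject, predicate, obj, graphs=()):
    """Build one (row_key, columns) pair for a single RDF statement."""
    row_key = str(uuid.uuid4())  # opaque key, unrelated to the statement content
    columns = {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "context": list(graphs),  # named graphs that contain the statement
    }
    return row_key, columns

# A dict plays the role of the table: row key -> columns.
table = {}
key, cols = statement_row(
    term("uri", "http://example.org/alice"),
    term("uri", "http://xmlns.com/foaf/0.1/name"),
    term("literal", "Alice"),
    graphs=["http://example.org/graph1"],
)
table[key] = cols
```

Because the key is a random UUID, building the same statement twice gives two different row keys, so nothing in the layout itself stops duplicates.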

The above is a relatively simple mapping to implement, but it has some problems. Most notably, preventing duplicate statements (an important RDF semantic) requires a read before every write. In Cassandra at least, where writes are much faster than reads, this quickly becomes a performance bottleneck.

There are ways to work around this problem, in particular by using content-addressable statement identifiers (e.g. the SHA-1 of the canonicalized N-Triples representation of each statement) as the row keys, but this in turn introduces other trade-offs such as no longer being able to version statement data: every time statement data changes, the old row needs to be deleted and a new one inserted with the new row key.
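A rough sketch of the content-addressable key idea, again with a dict standing in for the table. The N-Triples serialization here is deliberately simplified (plain literals only, minimal escaping, no blank node canonicalization), so treat `ntriples_term`, `statement_key` and `put` as hypothetical helpers and not as a complete canonicalizer.

```python
import hashlib

def ntriples_term(kind, value):
    """Serialize one term in a simplified N-Triples form."""
    if kind == "uri":
        return f"<{value}>"
    if kind == "bnode":
        return f"_:{value}"
    # Plain literal: escape backslashes, double quotes and newlines only.
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return f'"{escaped}"'

def statement_key(s, p, o):
    """Row key = SHA-1 hex digest of the serialized statement."""
    line = f"{ntriples_term(*s)} {ntriples_term(*p)} {ntriples_term(*o)} ."
    return hashlib.sha1(line.encode("utf-8")).hexdigest()

def put(table, s, p, o):
    """Blind write: the same statement always lands on the same row."""
    key = statement_key(s, p, o)
    table[key] = {"subject": s, "predicate": p, "object": o}
    return key

table = {}
alice = ("uri", "http://example.org/alice")
name = ("uri", "http://xmlns.com/foaf/0.1/name")
put(table, alice, name, ("literal", "Alice"))
put(table, alice, name, ("literal", "Alice"))  # overwrites, no duplicate row
```

The second `put` needs no prior read, since an identical statement hashes to the same key. The flip side is the one described above: changing the object yields a different key, so an update is a delete plus an insert.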

In view of the previous considerations, the resource-centric approach is generally a more natural fit for storing RDF data in wide-column databases. In this approach, each RDF subject/resource corresponds to a row key, and each RDF predicate/property corresponds to a column or supercolumn. Keyspaces can be used to represent RDF repositories, and column families can be used to represent named graphs.
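As an illustration of that mapping, the following sketch nests dicts as row key (subject), then column family (named graph), then column (predicate). Storing the objects of a predicate as a set is my own assumption for handling multi-valued properties; it is not necessarily how RDF::Cassandra or any HBase schema does it.

```python
from collections import defaultdict

def new_table():
    """row key (subject) -> column family (graph) -> column (predicate) -> objects"""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

def add(table, graph, subject, predicate, obj):
    """Insert a statement; re-adding an existing one is a no-op."""
    table[subject][graph][predicate].add(obj)

def describe(table, subject):
    """Return every (graph, s, p, o) stored in the subject's row."""
    if subject not in table:
        return []
    return sorted(
        (graph, subject, predicate, obj)
        for graph, columns in table[subject].items()
        for predicate, objects in columns.items()
        for obj in objects
    )

table = new_table()
add(table, "g1", "ex:alice", "foaf:name", "Alice")
add(table, "g1", "ex:alice", "foaf:knows", "ex:bob")
add(table, "g1", "ex:alice", "foaf:name", "Alice")  # duplicate, absorbed
rows = describe(table, "ex:alice")
```

Everything known about a resource sits in one row, so fetching it is a single row read, and duplicate statements collapse without a read-before-write.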

The main trade-off with the resource-centric approach is that some statement-oriented operations become more complex and/or slower, notably counting the total number of statements, or querying by predicate or object without specifying a subject. To support performant basic graph pattern matching, additional indices (POS, OPS, etc.) may need to be created and maintained.
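A minimal sketch of what maintaining such indices looks like, with nested dicts standing in for three tables. The class and method names are hypothetical; in a real store each index would be its own table or column family, and every write would have to update all of them.

```python
from collections import defaultdict

def _index():
    return defaultdict(lambda: defaultdict(set))

class IndexedStore:
    """Primary SPO layout plus POS and OPS lookup indices."""

    def __init__(self):
        self.spo = _index()  # subject -> predicate -> objects
        self.pos = _index()  # predicate -> object -> subjects
        self.ops = _index()  # object -> predicate -> subjects

    def add(self, s, p, o):
        # Every write touches all three structures.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.ops[o][p].add(s)

    def subjects(self, p, o):
        """Match (?s, p, o) via POS without scanning every subject row."""
        return sorted(self.pos.get(p, {}).get(o, ()))

    def by_object(self, o):
        """Match (?s, ?p, o) via OPS."""
        return sorted(
            (s, p, o)
            for p, subs in self.ops.get(o, {}).items()
            for s in subs
        )

store = IndexedStore()
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:carol", "foaf:knows", "ex:bob")
store.add("ex:alice", "foaf:name", "Alice")
who_knows_bob = store.subjects("foaf:knows", "ex:bob")
```

The cost is write amplification: each statement is written three times here, and the indices have to be kept consistent with the primary layout on deletes as well.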

See RDF::Cassandra, my Cassandra storage adapter for RDF.rb, for a more detailed example of a resource-centric mapping from the RDF data model to a wide-column data model.

FWIW, I'm currently working on mrlin, which at its core addresses the RDF-in-HBase question.

Given that HBase is designed for row-based storage of data, the simplest thing would be to have a GSPO layout, i.e.

Graph | Subject | Predicate | Object
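One possible reading of this layout is a composite row key made of the four terms. Since HBase keeps rows sorted by key, a prefix scan then answers queries that fix the graph, or the graph and the subject. The sketch below simulates that with a sorted Python list; `SortedTable`, `gspo_key` and the NUL separator are assumptions for illustration, not HBase API calls.

```python
import bisect

SEP = "\x00"  # assumed separator; must not occur inside any term

def gspo_key(graph, subject, predicate, obj):
    """Concatenate the four terms into one composite row key."""
    return SEP.join((graph, subject, predicate, obj))

class SortedTable:
    """Keeps row keys in sorted order, like a row-sorted store."""

    def __init__(self):
        self.keys = []

    def put(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            self.keys.insert(i, key)  # identical quad maps to the same row

    def scan_prefix(self, prefix):
        """Yield all quads whose row key starts with the given prefix."""
        i = bisect.bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            yield tuple(self.keys[i].split(SEP))
            i += 1

t = SortedTable()
t.put(gspo_key("g1", "ex:alice", "foaf:name", "Alice"))
t.put(gspo_key("g1", "ex:alice", "foaf:knows", "ex:bob"))
t.put(gspo_key("g1", "ex:bob", "foaf:name", "Bob"))
t.put(gspo_key("g2", "ex:alice", "foaf:name", "Alice"))

# All statements about ex:alice in graph g1:
alice_in_g1 = list(t.scan_prefix(SEP.join(("g1", "ex:alice")) + SEP))
```

A single GSPO ordering only helps when the leading terms are known. A pattern that fixes just the predicate or the object would still need a full scan or a second table with a different key order.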

In terms of actually reading and writing data, you'll most likely need to look at whatever API you want to use to manipulate the RDF, and implement the interfaces/classes that allow your chosen API to get data in and out of HBase.

As for querying, you'll either need to implement a SPARQL engine for your database, which could be rather complex, or use an API that does SPARQL in memory, which has the overhead of reading some or all of the data out of HBase first.

Personally, I don't know enough about HBase to say whether such an approach is viable or sensible. If you have hardware on which you can install and run HBase, then you would most likely be better off installing and running a triple store for your RDF data.