How to store RDF triples in a key/value database?

PolatAlemdar · November 19, 2010, 11:00pm

Is there a way to store RDF triples in a key/value database, and, how?

Signified · November 19, 2010, 11:00pm

As Loranth says, there are numerous ways, and it depends on whether your map (or key/value database if you prefer) is sorted (e.g., based on a B+ tree) or not (e.g., simple hashtable).

To supplement Loranth's answer, I'm going to assume you want a full index: you want to be able to look up triples with any combination of variables or constants.

There are 2^3 = 8 possible patterns:

(???), (s??), (?p?), (??o), (sp?), (?po), (s?o), (spo)

Where, for example, a possible (?po) query pattern would be:

?s rdf:type foaf:Person .

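If your map is sorted and you can do prefix/range lookups, three indexes will suffice, because for every pattern the constants then form a prefix of the key in at least one of them. For example, build one index for each of these three sort orders: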

s-p-o, o-s-p, p-o-s

If you have a lookup of the form (?p?) you can do binary search (or equivalent) on the p-o-s index based on the partial p-?-? key. If you have a lookup of the form (s?o) you can pose o-s-? against the o-s-p index.
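A minimal sketch of the three-index idea, assuming an in-memory sorted list stands in for the sorted key/value store and that terms are plain strings. The names `SortedTripleStore`, `ORDERS`, and `INDEX_FOR` are made up for illustration; a real store would use its own range-scan API in place of `bisect`.

```python
from bisect import bisect_left

# Position permutations of (s, p, o) for each sort order.
ORDERS = {"spo": (0, 1, 2), "osp": (2, 0, 1), "pos": (1, 2, 0)}

# Index chosen per pattern, keyed by which of (s, p, o) are constants.
# In every case the constants form a prefix of the chosen sort order.
INDEX_FOR = {
    (False, False, False): "spo",
    (True, False, False): "spo",
    (True, True, False): "spo",
    (True, True, True): "spo",
    (False, True, False): "pos",
    (False, True, True): "pos",
    (False, False, True): "osp",
    (True, False, True): "osp",
}


class SortedTripleStore:
    def __init__(self):
        # Each index is a sorted list of reordered triples (keys only, no values).
        self.indexes = {name: [] for name in ORDERS}

    def add(self, s, p, o):
        triple = (s, p, o)
        for name, order in ORDERS.items():
            key = tuple(triple[i] for i in order)
            index = self.indexes[name]
            pos = bisect_left(index, key)
            if pos == len(index) or index[pos] != key:
                index.insert(pos, key)  # keep sorted, skip duplicates

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None marks a variable."""
        pattern = (s, p, o)
        name = INDEX_FOR[tuple(t is not None for t in pattern)]
        order = ORDERS[name]
        prefix = tuple(pattern[i] for i in order if pattern[i] is not None)
        index = self.indexes[name]
        results = []
        pos = bisect_left(index, prefix)  # binary search to the first candidate
        while pos < len(index) and index[pos][:len(prefix)] == prefix:
            key = index[pos]
            triple = [None, None, None]
            for k, i in enumerate(order):
                triple[i] = key[k]  # undo the reordering
            results.append(tuple(triple))
            pos += 1
        return results
```

For example, `store.match(p="rdf:type", o="foaf:Person")` is answered from the p-o-s index, and `store.match(s=..., o=...)` from the o-s-p index.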

You can build the map using method (2) in Loranth's answer (stick the triples in as keys and leave the values blank).

If your map is unsorted, one option is to build eight maps, one per pattern, with the constants in each pattern as the key and the variables as the value. For example, for (?po), build a map with the concatenation of p and o as the key and the subject as the value.
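A rough sketch of the one-map-per-pattern approach, assuming Python dicts stand in for the unsorted key/value store and tuples stand in for concatenated keys. `HashTripleStore` is a hypothetical name, and each key maps to a set of value tuples because several triples can share the same constants.

```python
from collections import defaultdict
from itertools import product


class HashTripleStore:
    def __init__(self):
        # One map per bound/unbound mask over (s, p, o): eight in total.
        self.maps = {
            mask: defaultdict(set) for mask in product((False, True), repeat=3)
        }

    def add(self, s, p, o):
        triple = (s, p, o)
        for mask, index in self.maps.items():
            # Constants of the pattern form the key, variables form the value.
            key = tuple(t for t, bound in zip(triple, mask) if bound)
            value = tuple(t for t, bound in zip(triple, mask) if not bound)
            index[key].add(value)

    def lookup(self, s=None, p=None, o=None):
        """Return the set of variable bindings; None marks a variable."""
        pattern = (s, p, o)
        mask = tuple(t is not None for t in pattern)
        key = tuple(t for t in pattern if t is not None)
        return self.maps[mask].get(key, set())
```

Here `store.lookup(p="rdf:type", o="foaf:Person")` hits the (?po) map directly and returns the matching subjects as one-element tuples. The cost is that every triple is written eight times.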

However, there's some obvious redundancy there. You may be able to reduce the number of maps (indexes) required, depending on (i) the morphology of the data and (ii) the characteristics of the underlying map. For example, if you had a map index servicing (s??), you could often service (sp?) or (s?o) queries quite efficiently by filtering its results, typically because s is highly selective. However, with only a (?p?) or (??o) index, (?po) queries might be inefficient.

(You could also consider storing a secondary index structure as the value which supports lookups, like another hashtable or sorted list; e.g.: s→{p1→[o1, o2], p2→[o1, o3]}.)
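As an illustration of the nested structure, here is a small sketch assuming a dict of dicts of lists plays the role of the s→{p→[o, ...]} value. `SubjectIndex` and its method names are invented for this example.

```python
from collections import defaultdict


class SubjectIndex:
    """A single (s??) index whose values are themselves p -> [o, ...] maps."""

    def __init__(self):
        self.index = defaultdict(lambda: defaultdict(list))

    def add(self, s, p, o):
        objects = self.index[s][p]
        if o not in objects:
            objects.append(o)

    def objects_of(self, s, p):
        # (sp?) query: one lookup on s, then one on p inside the value.
        return list(self.index.get(s, {}).get(p, []))

    def predicates_linking(self, s, o):
        # (s?o) query: one lookup on s, then a scan of that subject's
        # predicates, which is usually short because s is selective.
        return [p for p, objs in self.index.get(s, {}).items() if o in objs]
```

This services (s??), (sp?), and (s?o) from one map, at the price of scanning inside the value for (s?o).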

Another (unsorted) option would be to have three positional maps with terms as keys and the triples they appear in as values: a subject map, a predicate map, and an object map. For example, the subject map would map each term to a list of (identifiers for) the triples in which it appears as a subject. To do a (s?o) query, you would retrieve the s list and the o list individually and join them (e.g., using a hash-join, or, if the lists are sorted, a merge-join). Again, however, this might be inefficient for low-selectivity terms such as the rdf:type predicate.
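A sketch of the positional-map variant, assuming triple identifiers are simply positions in an in-memory list, so each posting list is already sorted and a merge-join applies. `PositionalStore` and `merge_join` are illustrative names, and duplicate triples are not filtered here.

```python
from collections import defaultdict


def merge_join(a, b):
    """Intersect two sorted lists of triple ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


class PositionalStore:
    def __init__(self):
        self.triples = []  # triple id = position in this list
        self.by_subject = defaultdict(list)
        self.by_predicate = defaultdict(list)
        self.by_object = defaultdict(list)

    def add(self, s, p, o):
        tid = len(self.triples)
        self.triples.append((s, p, o))
        # Ids only grow, so every posting list stays sorted.
        self.by_subject[s].append(tid)
        self.by_predicate[p].append(tid)
        self.by_object[o].append(tid)

    def match_s_o(self, s, o):
        # (s?o): fetch both posting lists, then join them on triple id.
        ids = merge_join(self.by_subject.get(s, []), self.by_object.get(o, []))
        return [self.triples[t] for t in ids]
```

The join cost grows with the length of the posting lists, which is why a term like rdf:type in the predicate map hurts.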

Which you choose depends heavily on what kind of underlying map implementation is used, what kind of data you're storing, what scale you aim at, whether you want to optimise for reads or for writes, the volume and frequency of updates, etc. One size does not fit all.

Loranth · November 19, 2010, 11:00pm

Yes. There are multiple ways to do that. I will give you a couple of them, but there are plenty more:

1) Serialize the RDF graph as objects, so that the key is the subject and the value is the combination of its predicates and objects in some serialized format, i.e. key(subject) = value( "property: value, property2: value2" ), etc. The value could be, for example, JSON or any other format.
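A small sketch of option 1), assuming a plain dict stands in for the key/value database and JSON is the serialized format. The helper names `put_triple` and `get_subject` are made up, and storing each predicate's objects as a list is one possible layout among many.

```python
import json


def put_triple(db, s, p, o):
    """Add (s, p, o) to the JSON document stored under key s."""
    doc = json.loads(db[s]) if s in db else {}
    values = doc.setdefault(p, [])
    if o not in values:
        values.append(o)
    db[s] = json.dumps(doc, sort_keys=True)  # read-modify-write of the whole value


def get_subject(db, s):
    """Fetch everything known about s with a single key lookup."""
    return json.loads(db[s]) if s in db else {}
```

Every write rewrites the whole value for that subject, so this suits subjects with a modest number of properties.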

2) Serialize a single triple into the key and leave the value empty (or use it for housekeeping information), i.e. key(subject + predicate + object) = value( empty ). There are many ways to concatenate the terms and many orders to put them in. (Alternatively, you can store the object in the value and use just subject + predicate as the key.)
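One possible key encoding for option 2), sketched under the assumption that terms are strings that never contain a NUL character, so NUL can serve as the separator. `SEP`, `encode_key`, and `decode_key` are hypothetical helpers, and a dict again stands in for the database.

```python
SEP = "\x00"  # assumed never to occur inside a term


def encode_key(*terms):
    """Concatenate terms into a single byte-string key."""
    if any(SEP in t for t in terms):
        raise ValueError("term contains the separator")
    return SEP.join(terms).encode("utf-8")


def decode_key(key):
    """Split a key back into its terms."""
    return tuple(key.decode("utf-8").split(SEP))


db = {}
db[encode_key("ex:alice", "rdf:type", "foaf:Person")] = b""  # value left empty
```

An unambiguous separator (or length-prefixed terms) matters because plain concatenation cannot tell where one term ends and the next begins.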

Once you accept that RDF is really a graph, and since everything in computer science can be derived from graphs, you can store RDF almost anywhere. The bigger questions are how you want to access your RDF data and what operations the key-value database supports. If we assume a hash-based key-value database that only supports retrieval by key, with no range operations, then option 1) above is much better, as you can retrieve your full object with just the key. Of course, you then can't traverse the graph in the reverse direction (for which you would need an OPS index).