How do I build a triple store?

If I wanted to do semantic web application development in [some obscure language] and nothing is currently available, where would I go to find out how triple stores work and how to build a production-quality one in my obscure language? What API standards are out there that I ought to adhere to?

Specifically, how do I go about indexing triples efficiently in terms of space and search time? Are B-Tree derivative data structures what I should look at, or is something else better? What optimisations are known for compaction, and for speeding up data retrieval to support reasoners and SPARQL queries?

Some stores index everything; some assume that the predicate is always bound in a triple pattern. With that assumption, only two indexes are needed, POS and PSO, rather than three (e.g. SPO, POS, OSP). The saving is greater for quads.
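To make that concrete, here is a minimal, purely illustrative in-memory sketch of the two-index layout in Python. The names (TwoIndexStore, match) are made up, and a real store would back these maps with disk-based B-tree-style indexes rather than dictionaries:

    from collections import defaultdict

    class TwoIndexStore:
        """Toy store that assumes the predicate is always bound in queries."""

        def __init__(self):
            # POS: predicate -> object -> set of subjects
            self.pos = defaultdict(lambda: defaultdict(set))
            # PSO: predicate -> subject -> set of objects
            self.pso = defaultdict(lambda: defaultdict(set))

        def add(self, s, p, o):
            self.pos[p][o].add(s)
            self.pso[p][s].add(o)

        def match(self, s=None, p=None, o=None):
            """Answer a triple pattern; None marks a wildcard."""
            if p is None:
                raise ValueError("this layout assumes the predicate is bound")
            if o is not None:
                # (?s, p, o) -> walk the POS index
                for subj in self.pos[p].get(o, ()):
                    if s is None or s == subj:
                        yield (subj, p, o)
            else:
                # (s, p, ?o) or (?s, p, ?o) -> walk the PSO index
                subjects = [s] if s is not None else list(self.pso[p])
                for subj in subjects:
                    for obj in self.pso[p].get(subj, ()):
                        yield (subj, p, obj)

    store = TwoIndexStore()
    store.add("alice", "knows", "bob")
    store.add("alice", "knows", "carol")
    print(list(store.match(s="alice", p="knows")))  # (alice, knows, bob/carol)
    print(list(store.match(p="knows", o="bob")))    # who knows bob

On disk, the same two orderings would typically become B-tree (or B+-tree) indexes keyed on the concatenated components, which is where the space/time trade-off in the question comes in.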

Once indexes are created, if they are not used, they just sit on disk out of the way. They don't take up cache space, so they don't affect query execution speed.

Most triple stores and frameworks are open source under liberal licenses (e.g. Virtuoso, Sesame, 4store, Jena, Open Anzo, Redland, SemWeb.NET, Mulgara). I think the best place to learn about building a triple store is to look at those and spend time understanding how they work and what design decisions were made.

A search on Google brought up the paper "Design and implementation of an RDF Triple Store".

If you're more of a source-code reader, you may want to take a look at the sources of TDB, the native storage engine in Jena, or at the sources of Sesame.

Update: Another place where you could learn something about the topic is the BigData project. They write a lot about technical details on their blog, and it's open source, so you can take a look at the sources as well.

You might have a look at a recent paper, dipLODocus.

Here's the introduction:

Despite many recent efforts, the lack of efficient infrastructures to manage RDF data is often cited as one of the key problems hindering the development of the Semantic Web. Last year at ISWC, for instance, the two industrial keynote speakers (from the New York Times and Facebook) pointed out that the lack of an open-source, efficient and scalable alternative to MySql for RDF data was the number one problem of the Semantic Web.

Orri's blog at http://www.openlinksw.com/weblog/oerling is worth reading. I've found it hard to follow at times, but you get the idea that Orri has thought a lot about it.

There have been some academic(-ish) papers here and there. I recently read a good one about a distributed triple store. I think it was about 4store(.org) but I can't remember where I found it. Anyone else know?

Otherwise, you probably have to ping the people who have built them for ideas. For instance, in the SemWeb.NET [1] triple store that I built, I found that a simple MySQL structure [2] worked well enough to scale to 1B triples, though it was very space-hungry with many indexes.

[1] http://razor.occams.info/code/semweb/
[2] http://razor.occams.info/code/repo/?/semweb/src/SQLStore.cs

Intellidimension uses a rather intuitive solution (maybe others do, too), and they say it has a major impact on performance. They maintain two triple tables:

  • Table 1 stores S, P and O as strings, in three columns (or four in the case of quads).
  • Table 2 stores all the same triples (also in 3 or 4 columns), but S, P and O are represented not as strings but each by its MD5 hash value, stored as a GUID.

The big idea is that the second table is not only truly fixed-width (the content of TEXT-type cells is stored outside the table, which hurts performance), but also very narrow. They claim that the width of an SQL table strongly influences performance. Before executing a query, they calculate the MD5 values of the strings in the query and run the query with the hashed values against Table 2. That way they find the matching rows quickly, and then retrieve the real string values from Table 1.
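Here is a rough sketch of that two-table idea using SQLite from Python. The schema, the table names and the shared id column are my own illustration, not Intellidimension's actual design:

    import hashlib
    import sqlite3

    def md5(text):
        """16-byte MD5 digest used as a fixed-width surrogate for a string."""
        return hashlib.md5(text.encode("utf-8")).digest()

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Table 1: the triples with their full string values
        CREATE TABLE triples_text (id INTEGER PRIMARY KEY, s TEXT, p TEXT, o TEXT);
        -- Table 2: the same triples, each component replaced by its MD5 hash
        CREATE TABLE triples_hash (id INTEGER PRIMARY KEY, s BLOB, p BLOB, o BLOB);
        CREATE INDEX idx_hash_pso ON triples_hash (p, s, o);
        CREATE INDEX idx_hash_pos ON triples_hash (p, o, s);
    """)

    next_id = 0

    def add(s, p, o):
        global next_id
        next_id += 1
        conn.execute("INSERT INTO triples_text VALUES (?, ?, ?, ?)",
                     (next_id, s, p, o))
        conn.execute("INSERT INTO triples_hash VALUES (?, ?, ?, ?)",
                     (next_id, md5(s), md5(p), md5(o)))

    def objects_of(s, p):
        # Hash the constants, match against the narrow table, then join back
        # to the string table on the shared id to get the real values.
        return [row[0] for row in conn.execute("""
            SELECT t.o
            FROM triples_hash h JOIN triples_text t ON t.id = h.id
            WHERE h.p = ? AND h.s = ?""", (md5(p), md5(s)))]

    add("alice", "knows", "bob")
    print(objects_of("alice", "knows"))  # ['bob']

A production schema would more likely hash each term once into a separate resources table and keep only fixed-width keys in the triples table, but the sketch shows the query flow they describe: hash first, filter on the narrow table, then recover the strings.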

Get 'Programming the Semantic Web' by Toby Segaran et al.

http://shop.oreilly.com/product/9780596153823.do

The authors build a simple triplestore using Python and explain how it works. Then they use RDFLib and SQLite before moving on to Sesame, Jena, etc.
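For a taste of the RDFLib level they work at, here is a small, self-contained example (not taken from the book) that loads a couple of triples into an in-memory graph and runs a SPARQL query over them:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")

    g = Graph()
    g.add((EX.alice, EX.knows, EX.bob))
    g.add((EX.bob, EX.name, Literal("Bob")))

    query = """
    SELECT ?name WHERE {
        <http://example.org/alice> <http://example.org/knows> ?x .
        ?x <http://example.org/name> ?name .
    }
    """
    for row in g.query(query):
        print(row.name)  # -> "Bob"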

I really wonder whether, once you have created an application over your triplestore and you know the queries you are running, it would be possible to "trim" from the indexes the entries that you will never use in your queries. That would remove the unnecessary bloat; think of it as a "database packing" operation.

If trimming is not doable, what about re-indexing the data while skipping what won't be used, with the list of what to keep generated from a collection of your SPARQL queries?
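As a very rough sketch of that idea (the pattern representation and the pattern-to-index mapping below are simplifications I made up, not an existing tool), you could scan the triple patterns your application's queries actually use and derive the set of index orderings worth building, skipping the rest:

    # Each triple pattern is (s, p, o) with None marking a variable.
    app_patterns = [
        ("alice", "knows", None),   # s and p bound
        (None, "knows", "bob"),     # p and o bound
        (None, "type", None),       # only p bound
    ]

    def index_for(s_bound, p_bound, o_bound):
        """Pick an index ordering whose prefix covers the bound components."""
        if s_bound and p_bound:
            return "SPO"   # also covers fully bound patterns
        if p_bound and o_bound:
            return "POS"
        if o_bound and s_bound:
            return "OSP"
        if s_bound:
            return "SPO"
        if p_bound:
            return "POS"
        if o_bound:
            return "OSP"
        return "SPO"        # no constants at all: a full scan on any ordering

    needed = {index_for(*(term is not None for term in pat))
              for pat in app_patterns}
    print(needed)  # {'SPO', 'POS'} -> the other orderings could be skipped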