Triple stores (RDF) and the CAP theorem

FabianCretton · March 3, 2013, 11:00pm

Hi all,

Reading more about NoSQL solutions these days, I wonder how to think about the CAP theorem when talking about triple stores, and thus: how do triple stores scale out?

In the NoSQL literature I've read, it is said that graph databases don't scale out. Seven Databases in Seven Weeks says about graph databases and Neo4j: "Because of the high degree of interconnectedness between nodes, graph databases are generally not suitable for network partitioning".
What about a triple store, which seems to find its place among NoSQL's graph approaches? Does it scale out depending on the implementation (for instance, Jena doesn't scale out, being all in memory, but maybe OWLIM does)?

I am also asking myself whether we could say that RDF is a good way to scale out, since the goal is not just to scale out for the big servers of big companies (e.g. Google and Amazon, which rely on Hadoop, MongoDB or other NoSQL solutions), but to have tools that can handle the whole web as a web of data. Thinking that way, are SPARQL endpoints and federated queries a way to scale out? This will not be a good answer to the "latency" problem, of course.

Maybe one strength of RDF is its flexibility: you can provide your data in a file or expose it through a SPARQL endpoint. Consumers can either send federated queries, or retrieve the selected information and store it in-house (for further processing, reasoning, or to solve latency problems).
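
As a rough illustration of the federated option, here is a minimal Python sketch that builds a SPARQL 1.1 query sending the same triple pattern to several endpoints through the `SERVICE` keyword. The endpoint URLs and the helper name `build_federated_query` are made up for the example; it only builds the query text and does not contact any server.

```python
def build_federated_query(endpoints, pattern, variables):
    """Build a SPARQL 1.1 query that asks several endpoints for the same pattern."""
    # One SERVICE block per endpoint, each wrapped in its own group.
    blocks = ["  { SERVICE <" + url + "> { " + pattern + " } }" for url in endpoints]
    # UNION merges the solutions returned by the individual endpoints.
    body = "\n  UNION\n".join(blocks)
    return "SELECT " + " ".join(variables) + " WHERE {\n" + body + "\n}"


# Hypothetical endpoints, for illustration only.
ENDPOINTS = ["http://a.example.org/sparql", "http://b.example.org/sparql"]
NAME_PATTERN = "?person <http://xmlns.com/foaf/0.1/name> ?name ."

query = build_federated_query(ENDPOINTS, NAME_PATTERN, ["?person", "?name"])
print(query)
```

Every `SERVICE` block is a remote call made at query time, which is where the latency concern comes from; copying the selected data in-house trades that latency for freshness.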

Another way to think about it could be: "what does a triple store lack from the point of view of the NoSQL approach?" Most pages that talk about NoSQL and RDF show how RDF suits NoSQL, but rarely what it is missing.

Any pointers would be appreciated.
Thank you
Fabian

Jerven · March 3, 2013, 11:00pm

Well, there are definitely scale-out triple stores out there. Bigdata from Systap and 4store/5store are in that category, as are the Virtuoso cluster edition and OWLIM-Enterprise.

In one way I think RDF/triple stores are easier to scale out, as there is just one type of information that needs to be passed to all nodes (the triples). However, performant query answering is the difficult part.

You can argue that uRiKA by YarcData is a scale-out approach. I don't think it is, as it's one very large uniform memory access machine, even if your instance fills up a datacentre. But its internal datastore (a Lustre filesystem) definitely is.

The main problem in discussing what triple stores lack is that there is such a huge variety of them, as SPARQL decouples the query interface from the storage model. So I can't think of anything that a triple store cannot do on a theoretical basis that other storage models can.

I think the CAP theorem gives a set of limits that any datastore is bound by. Just because a datastore is on a single machine does not mean we have C or A (worst-case scenario: /dev/null, a wonderfully efficient datastore diskspace-wise ;) but without C, as reads are not consistent with writes). While many NoSQL stores went for scale-out as a selling point, there is no reason that a SQL-compatible datastore cannot scale out, e.g. Google Spanner, Amazon Redshift or Oracle RAC (100 nodes). That your classical store does not is due to engineering choices.

Many graph stores do not scale out because they are not designed to do so. There is no theoretical reason that they can't. And as RDF stores are graph stores, there is proof that they can be made to do so.

GerritV · March 3, 2013, 11:00pm

Querying RDF data with SPARQL implies (many) joins of triples. Joins are expensive and don't scale that well. Or, said another way: SPARQL is too flexible and fine-grained for huge scalability. SPARQL has the same disadvantage as SQL here: JOINs are expensive. Other NoSQL stores are typically very limited regarding joins.
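
To make the join point concrete, here is a deliberately naive Python sketch of how a basic graph pattern can be evaluated: each additional triple pattern is one more join against the full set of triples. The function names and the toy data are invented for the illustration; real stores use indexes and query planning rather than this nested loop.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant must match exactly.
            return None
    return extended


def evaluate(patterns, triples):
    """Evaluate a list of triple patterns with a nested-loop join."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Toy data, for illustration only.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "name", "Carol"),
]
PATTERNS = [("?a", "knows", "?b"), ("?b", "knows", "?c"), ("?c", "name", "?n")]

results = evaluate(PATTERNS, TRIPLES)
print(results)
```

Even this three-pattern query joins the data with itself twice; once the triples are spread over several machines, each of those joins may have to move intermediate results across the network.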
RDF datastores optimized for SPARQL querying use very specific indexes that are highly granular: for each triple/quad, and thus for each property of a resource, multiple indexes need to be updated, very likely on multiple shards when scaling horizontally. This makes updating/storing huge amounts of data expensive.
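
The write cost can be sketched as follows, assuming three index permutations (SPO, POS, OSP) and a toy hash-based sharding scheme. Both are assumptions chosen for the illustration, not a description of any particular store, and the class name `ShardedTripleIndex` is hypothetical.

```python
from collections import defaultdict

# Assumed index permutations; real stores may keep more or fewer.
INDEX_ORDERS = ("spo", "pos", "osp")


class ShardedTripleIndex:
    """Toy triple index: every insert writes one entry per index order."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        # Per shard: index order -> leading term -> set of triples.
        self.shards = [
            defaultdict(lambda: defaultdict(set)) for _ in range(num_shards)
        ]

    def _shard_for(self, key):
        # Deterministic toy hash (sum of UTF-8 bytes), for illustration only.
        return sum(key.encode("utf-8")) % self.num_shards

    def add(self, s, p, o):
        """Insert a triple and return the set of shard numbers that were written."""
        terms = {"s": s, "p": p, "o": o}
        touched = set()
        for order in INDEX_ORDERS:
            key = terms[order[0]]  # each index is partitioned by its leading term
            shard = self._shard_for(key)
            self.shards[shard][order][key].add((s, p, o))
            touched.add(shard)
        return touched


store = ShardedTripleIndex(num_shards=4)
touched = store.add("alice", "knows", "bob")
print(len(INDEX_ORDERS), "index writes on shards", sorted(touched))
```

A single logical insert becomes three index writes here, and because each index is partitioned by a different term, those writes can land on different machines.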
RDF databases are very good for denormalizing data and reasoning over data. Scalable RDF stores typically make sacrifices on both of these.