RDF store comparison

jchan · February 22, 2011, 11:00pm

Is there a feature comparison available for RDF stores like Sesame, Mulgara, OpenAnzo, and BigOWLIM, and for graph databases like HyperGraphDB, Neo4j, and InfoGrid?

BillBarnhill · February 22, 2011, 11:00pm

FWIW, here are a few links. I'd give more weight to the TW link, as I know of Jim Hendler and he's extremely sharp. That said, I don't see the final report there, so I'm not sure what the status is. The other links have info, but at first blush it's all out of date. I'm interested to see if anyone finds anything current. Might be an opportunity there if no one does. The ESW links seem like a good place to start.

http://esw.w3.org/LargeTripleStores

http://esw.w3.org/RdfStoreBenchmarking

http://www.bioontology.org/wiki/images/6/6a/Triple_Stores.pdf

http://tw.rpi.edu/wiki/Triple_Store_Comparison

http://www.w3.org/2001/05/rdf-ds/DataStore

http://data.semanticweb.org/conference/iswc/2008/paper/research/178/html

From personal experience, I've used Sesame, Jena+MySQL, and Virtuoso. Been meaning to try OpenAnzo. Sesame is the win for me, but it depends a lot on what your criteria are. I needed a store that was very extensible and that I could build on and customize a fair amount. Sesame really is more of a store creation kit with stackable layers called 'Sails', so that fit my itch. YMMV.
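To give a rough feel for what "stackable layers" means, here is a toy sketch in Python. It is not Sesame's API: every class name below (MemoryStore, LoggingLayer, SymmetricLayer) is made up for illustration, and the only assumption is that each layer wraps the one beneath it and exposes the same add/match interface.

```python
# Hypothetical illustration of stackable store layers. These class names
# are NOT Sesame's API; they are a toy model of the general idea.

class MemoryStore:
    """Bottom layer: keeps triples in a Python set."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        # None acts as a wildcard for that position.
        return [t for t in self._triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]


class LoggingLayer:
    """Stackable layer: records every add, then delegates downward."""

    def __init__(self, base):
        self.base = base
        self.log = []

    def add(self, s, p, o):
        self.log.append((s, p, o))
        self.base.add(s, p, o)

    def match(self, s=None, p=None, o=None):
        return self.base.match(s, p, o)


class SymmetricLayer:
    """Stackable layer: for chosen predicates, also stores the reverse triple."""

    def __init__(self, base, symmetric_predicates):
        self.base = base
        self.symmetric = set(symmetric_predicates)

    def add(self, s, p, o):
        self.base.add(s, p, o)
        if p in self.symmetric:
            self.base.add(o, p, s)

    def match(self, s=None, p=None, o=None):
        return self.base.match(s, p, o)


# Stack the layers: symmetric handling on top of logging on top of storage.
store = SymmetricLayer(LoggingLayer(MemoryStore()), {"knows"})
store.add("alice", "knows", "bob")
result = sorted(store.match(p="knows"))
```

Because each layer only talks to the one below it, you can add, remove, or reorder behaviour without touching the storage layer, which is the kind of customization I was after.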

mhgrove · February 22, 2011, 11:00pm

For the most part, Sesame is going to provide you with the best performance out of the box. If your KB is less than 10m triples, it will easily fit into memory and you can use their in-memory store, which is as fast as or faster than anything out there. If you move to a scale where you have to start using their native store, you're looking at a degradation in performance of an order of magnitude or more, and you're probably better off switching to ...

4Store (http://4store.org) is actually the fastest DB I've tested, usually by a fair amount, but when it can't answer a query, it really goes into the weeds: the query will either run forever until you kill the DB, or it will just segfault. But it's being actively worked on and Steve Harris is doing good work there, so it's constantly improving.

I wouldn't bother with any of the Jena stuff. It's a featureful API, but TDB and SDB are both painfully slow.

BigData, which has a Sesame interface, has been doing really well. I know the guys who work on it, and they're pretty good. There have been some big performance increases in the latest version. It's still not the fastest out there, but it's pretty quick, and it's supposed to scale very well.

Neo4j's performance in my limited tests was horrible. I was using it through their Sesame interface, and was very unhappy. It took over a day to load 6.5m triples into their database, and most queries took hours to answer, where other databases could answer the same queries in minutes or even seconds.
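To put that load time in perspective, here is the back-of-the-envelope arithmetic as a small Python sketch. The only inputs are the figures above (6.5m triples, one day); since the load took more than a day, the result is an upper bound on the real rate. The function name is just for illustration.

```python
def triples_per_second(triples, seconds):
    """Average load rate for a bulk load."""
    return triples / seconds


SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# 6.5m triples in (at least) one day: the true rate is no higher than this.
rate = triples_per_second(6_500_000, SECONDS_PER_DAY)
print(f"at most about {rate:.0f} triples/second")
```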