How to find the diff between RDF dumps?

Assuming you have two RDF dumps, is there a (standard) way or tool to find the diff between these two dumps in the sense of getting the triples contained in the one but not in the other and vice versa?

One obvious trick is to (i) convert to N-Triples, (ii) sort, and (iii) run through diff.

This has some drawbacks.

Most seriously, the identity of blank nodes won't match up on the two sides and will turn up spurious differences if you're using blank nodes.
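For example, two serializations of the same graph are free to label their blank nodes differently, so a textual diff sees two distinct lines even though the triple is the same (the labels below are made up):

    _:b0 <http://xmlns.com/foaf/0.1/name> "Alice" .
    _:genid42 <http://xmlns.com/foaf/0.1/name> "Alice" .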

Also, if you use a standard UNIX diff, this will break down once you have too many triples. I know it works for about 10,000 triples, and I know it's got no prayer of working for 500 million triples -- I haven't quite found the spot where it breaks down.

The good news is that because the files are sorted, you don't need a general-purpose diff; you could write something very scalable.
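For instance, here's a minimal single-pass sketch (the class name and the "<"/">" output format are mine, not a standard tool): because both inputs are sorted, you can walk them in lock-step like the merge step of a merge sort, using constant memory no matter how large the dumps are.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Diff two *sorted* N-Triples files line by line in a single pass.
    // Prints "< line" for triples only in the first file and "> line"
    // for triples only in the second, like a classic diff.
    public class SortedNtDiff {
        public static void main(String[] args) throws IOException {
            try (BufferedReader a = Files.newBufferedReader(Paths.get(args[0]));
                 BufferedReader b = Files.newBufferedReader(Paths.get(args[1]))) {
                String la = a.readLine();
                String lb = b.readLine();
                while (la != null || lb != null) {
                    int cmp = (la == null) ? 1
                            : (lb == null) ? -1
                            : la.compareTo(lb);
                    if (cmp < 0) {
                        System.out.println("< " + la);   // only in file 1
                        la = a.readLine();
                    } else if (cmp > 0) {
                        System.out.println("> " + lb);   // only in file 2
                        lb = b.readLine();
                    } else {
                        la = a.readLine();               // in both: skip
                        lb = b.readLine();
                    }
                }
            }
        }
    }

One caveat: String.compareTo has to agree with the order the files were sorted in, so sort them in plain byte order first (e.g. with LC_ALL=C).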

The blank nodes are a fundamental problem, but some data sets don't use blank nodes...

If you're using the Jena framework, you could load each dump into a Model and take the difference each way (api link). I don't know if it's standard, but it's pretty easy and darned flexible with regard to your desired output.
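For the record, a minimal sketch of that approach (the file names are mine; the package names are current Apache Jena's, older releases used com.hp.hpl.jena):

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    public class ModelDiff {
        public static void main(String[] args) {
            // Load each dump into an in-memory Model; the parser is
            // chosen from the file extension.
            Model a = RDFDataMgr.loadModel("dump1.nt");
            Model b = RDFDataMgr.loadModel("dump2.nt");

            // Triples in a but not in b, and vice versa.
            Model onlyInA = a.difference(b);
            Model onlyInB = b.difference(a);

            System.out.println("Only in dump1:");
            onlyInA.write(System.out, "N-TRIPLES");
            System.out.println("Only in dump2:");
            onlyInB.write(System.out, "N-TRIPLES");
        }
    }

Note that, as far as I know, Model.difference compares blank nodes by identity, so it doesn't solve the blank node problem either, and both dumps have to fit in memory.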

Heck, I guess you could even whip up a quick web service to encapsulate the whole thing, tell us all about it, and enjoy the karma.

TopBraid Composer has an RDF diff facility ("Compare With...") that is bnode-friendly. The results are defined in a model, so you can query them with SPARQL to create reports, etc.

Related questions:

Taming Protege OWL files in version control?

Ontology version control systems

They are about ontology version control, but diff is an essential part of any version control process, and ontologies are usually what RDF is used to describe.

I am the author of OntoVCS and would appreciate any feedback from you. The owl2diff tool does just what you need: it shows what was added and what was removed, but it does so at the axiom level, not the triple level. This eliminates the blank node problem.

There is a delta tool written in Python; it comes with cwm. Also, using the --patch flag of cwm, you can patch a given file with the diff produced by delta.py.

I think the ideal way would be to use, in some manner, the model provided by a triple store such as Sesame, Jena, etc.

Loading the triples is not enough, so you have to write some code on top of the API to implement the comparison logic. This could be tricky and difficult for huge datasets, and could require a lot of time for testing.

A simple way to start is to convert the files to triples and put them in svn or similar, in order to use a plain "syntax" match: the file has to list the triples in some constant order, as said before, and even that does not guarantee the process works well. Ideally you should split the triples and create a file for each resource (a sketch of that splitting step follows at the end of this answer). It's a very naive approach, but it could work for a big dataset with a simple schema.

Just an idea.
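To make that last idea concrete, here's a naive sketch of the splitting step (the class name, output directory, and escaping scheme are all mine): it assumes a sorted N-Triples file, so all triples for one subject are contiguous, and writes one file per subject that you could then commit to svn.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Writer;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Split a sorted N-Triples file into one small file per subject,
    // so a plain version control system can track each resource.
    public class SplitBySubject {
        public static void main(String[] args) throws IOException {
            Path outDir = Paths.get("resources");
            Files.createDirectories(outDir);
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                Writer out = null;
                String current = null;
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.isEmpty()) continue;
                    String subject = line.substring(0, line.indexOf(' '));
                    if (!subject.equals(current)) {
                        if (out != null) out.close();
                        // Crude file name: a real tool needs a safer
                        // escaping scheme to avoid collisions.
                        String name = subject.replaceAll("[^A-Za-z0-9]", "_") + ".nt";
                        out = Files.newBufferedWriter(outDir.resolve(name));
                        current = subject;
                    }
                    out.write(line);
                    out.write('\n');
                }
                if (out != null) out.close();
            }
        }
    }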