Sync large RDF data

I foresee a scenario where someone might download several RDF datasets into their own central RDF database (triple store), in order to query it.

Imagine I need some data from DBpedia and from Flickr wrappr (an RDF repository for Flickr images). I need this data so that the users of my application (probably a website) can query it, using SPARQL or through other means. What is the best way to enable this? It seems I would need to repeatedly download all of DBpedia's and Flickr wrappr's datasets, which are quite large, in order to keep my local copy up-to-date with the data at the source.

My question: is there a way to sync these large RDF datasets to my local database, without having to re-download them every single time? Maybe a way to only download the changes, and not the entire datasets?

Very interesting question. :)

This is something I've been working on a little with some colleagues (research/academia). There are options out there aside from wipe/reload because: (1) a lot of data is static; and (2) based on prior experience, it's not so difficult to distinguish which data is static and which is dynamic.

First off, some datasets simply don't change, so don't bother updating them.

Second, datasets often follow patterns in how they change. On the Web, changes within documents have been modelled using Poisson models (which, e.g., can be used to model people arriving at a checkout based on past history).
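To make the Poisson idea concrete, here's a minimal sketch (my own illustration, not from the paper) of how past monitoring can turn into a freshness estimate: estimate a change rate from how often a document changed during an observation window, then compute the probability it has changed since you last synced it.

```python
import math

def change_probability(changes_observed, observation_days, days_since_last_sync):
    """Probability that a document has changed at least once since the
    last sync, assuming changes arrive as a Poisson process whose rate
    is estimated from past monitoring."""
    lam = changes_observed / observation_days  # estimated changes per day
    # P(at least one change in t days) = 1 - e^(-lambda * t)
    return 1.0 - math.exp(-lam * days_since_last_sync)

# A document that changed 10 times over 100 days of monitoring:
# how likely is it to have changed in the week since we last fetched it?
p = change_probability(10, 100, 7)
```

Documents with a high probability of change get re-fetched first; documents with a negligible probability can be skipped on this sync round.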

Third, different datasets incur different types of change. Some tend to change in bulk: if you somehow find that one document changed, you're probably going to have to do a wipe/reload anyway. Others are effectively append-only: the exporter only ever adds triples and never removes them, so you only ever need inserts to keep up-to-date.
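For the append-only case, keeping up-to-date is conceptually just a set difference. A rough sketch (the `ex:` names are placeholders; in practice you'd compute this against a downloaded dump or a published changelog rather than holding everything in memory):

```python
def triples_to_insert(local_triples, remote_triples):
    """For an append-only dataset nothing is ever deleted, so the sync
    delta is exactly the set of remote triples we haven't stored yet."""
    return set(remote_triples) - set(local_triples)

local = {("ex:a", "ex:p", "ex:b")}
remote = {("ex:a", "ex:p", "ex:b"), ("ex:a", "ex:p", "ex:c")}
delta = triples_to_insert(local, remote)  # just the one new triple
```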

Fourth, other datasets have many documents that are static and a few documents that are highly dynamic. Again, past experience can tell you which is which.

Fifth, aside from datasets and documents, predicates can often tell you in fine-grained detail what data is likely to be stale. You should not trust your local index for triples with the predicate ex:currentTemperature, for example.

We have a paper on the general dynamics of Linked Data at ESWC this year that covers these topics and should be fairly accessible if you are interested:

Tobias Käfer, Ahmed Abdelrahman, Jürgen Umbrich, Patrick O'Byrne, Aidan Hogan. "Observing Linked Data Dynamics". In the Proceedings of the 10th Extended Semantic Web Conference (ESWC2013), Montpellier, France, 26–30 May, 2013. (to appear).

(We are also planning to add a site that tracks changes in Linked Datasets on a weekly basis and provides some API for external systems to check which sites' content changed in the previous week, hopefully in the next couple of months. We're having some pretty significant data-management problems at the moment with the weekly monitoring data we've collected so far.)

So what can you do about this? Well, you can assign different synchronisation strategies for different datasets.
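As a sketch of what "different strategies for different datasets" might look like (all names and intervals here are hypothetical, chosen to match the cases above):

```python
from dataclasses import dataclass

@dataclass
class SyncPolicy:
    strategy: str         # "none" | "insert-only" | "wipe-reload" | "live"
    interval_days: float  # how often to re-check the source

# Hypothetical policies, assigned from observed dynamics:
POLICIES = {
    "archival-dump":   SyncPolicy("none", float("inf")),  # never changes
    "append-only-log": SyncPolicy("insert-only", 1.0),    # daily inserts
    "bulk-export":     SyncPolicy("wipe-reload", 30.0),   # monthly reload
    "sensor-feed":     SyncPolicy("live", 0.0),           # query at runtime
}

def due_for_sync(dataset, days_since_last_sync):
    """Check whether a dataset's policy says it's time to re-sync."""
    return days_since_last_sync >= POLICIES[dataset].interval_days
```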

Or perhaps synchronisation isn't the way to go at all. For example, what's the point of continuously updating all the documents on a site that publishes current temperatures, if you only get occasional queries over them? Obviously it's better just to go to that site at runtime and fetch the temperature of the particular city the query is interested in.

I was also involved in work with some colleagues that looked at identifying the parts of a query (triple patterns) for which a warehouse is likely to have stale data, based on the predicate involved. These "dynamic" triple patterns (e.g., ?city ex:currentTemperature ?temp .) are not issued to the local index, but rather fetched live (which is slow, but gives fresh results). The rest of the query is issued to the local index (which is fast and should be fine for static patterns). We got our initial intuition and results together for an ISWC paper (which again should be fairly accessible):

Jürgen Umbrich, Marcel Karnstedt, Aidan Hogan and Josiane Xavier Parreira. "Hybrid SPARQL queries: fresh vs. fast results". In the Proceedings of the 11th International Semantic Web Conference (ISWC 2012), Boston, US, 11–15 November, 2012.
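The static/dynamic split at the heart of that hybrid approach can be sketched in a few lines (my own simplified illustration, not the paper's implementation; the dynamic-predicate list is an assumption you'd derive from monitoring):

```python
# Predicates assumed (e.g. from monitoring) to carry volatile values.
DYNAMIC_PREDICATES = {"ex:currentTemperature"}

def split_query(triple_patterns):
    """Partition a query's triple patterns: static ones go to the fast
    local index, dynamic ones are fetched live for fresh results."""
    local, live = [], []
    for s, p, o in triple_patterns:
        (live if p in DYNAMIC_PREDICATES else local).append((s, p, o))
    return local, live

patterns = [("?city", "rdfs:label", '"Montpellier"'),
            ("?city", "ex:currentTemperature", "?temp")]
to_local, to_live = split_query(patterns)
```

The two halves are then evaluated separately and their bindings joined, trading some speed on the dynamic patterns for freshness.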

Anyways, yeah, not meaning to peacock, but I just wanted to share my enthusiasm for the question. Other Semantic Web research is available. :)

EDIT: On that note, it's worth mentioning the DaDy vocabulary, which lets publishers describe the dynamics of their data to consumers. The catch is getting publishers to actually provide this information (assuming they're aware of it in the first place).

Finally, it's worth mentioning HDT, a pretty nifty initiative to create a compressed binary exchange format for RDF that is effectively a portable index. Imagine if, instead of (e.g.) 200 consumers of DBpedia each having to re-index the site when there are updates, DBpedia intermittently made an index available that consumers could download, unplugging the old index and plugging in the new one. That would be pretty cool and cost-effective: some up-front cost for the publisher saves the re-indexing costs of 200 consumers. HDT enables that.

Miguel A. Martínez-Prieto, Mario Arias Gallego, Javier D. Fernández: Exchange and Consumption of Huge RDF Data. ESWC 2012: 437-452

I was just discussing this sort of scenario the other day. There's some useful work in this area addressing this kind of data warehouse; I haven't delved much into the actual research yet, though (academics, help me out?).

But from what I recall, it looks at this situation and breaks it down. How big is each constituent dataset? How often is it updated? How big are the updates? Answering these sorts of questions can give you an idea of what kind of schedule you'd need for pulling in the data to keep it as accurate and up-to-date as possible.
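As a back-of-the-envelope sketch of that sizing exercise (all the figures here are assumptions you'd measure for your own sources and triple store):

```python
def refresh_cycle_hours(dump_size_gb, bandwidth_mbps, load_rate_gb_per_hour):
    """Rough time for one full wipe-and-reload cycle: download time
    plus bulk-load time. This bounds how often you could refresh."""
    download_hours = dump_size_gb * 8 * 1024 / (bandwidth_mbps * 3600.0)
    load_hours = dump_size_gb / load_rate_gb_per_hour
    return download_hours + load_hours

# e.g. a 10 GB dump over a 100 Mbps link, bulk-loading at 5 GB/hour:
hours = refresh_cycle_hours(10, 100, 5)
```

If a cycle takes a few hours and the source only changes monthly, a monthly schedule is comfortable; if the cycle time approaches the update interval, wipe-and-reload stops being viable.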

Now, unless those sources will tell you what's changed since the last time you pulled from them (i.e., a diff), you'd have to go with a wipe-and-load approach. But that's not the end of the world. You can do that work on the schedule you've already determined and perform it asynchronously, so that users are not affected by the warehoused DB going down to be updated. Then you just have to hope your triple store does good bulk writes =)
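One common shape for that asynchronous update is to bulk-load a replacement store off to the side and then swap it in atomically, so queries never see a half-loaded store. A toy sketch (a real deployment would swap named graphs or whole store instances, not a Python set):

```python
import threading

class SwappableStore:
    """Serve queries from the current index while a replacement is
    bulk-loaded in the background, then swap the two atomically."""

    def __init__(self, triples):
        self._store = set(triples)
        self._lock = threading.Lock()

    def query(self, predicate):
        with self._lock:
            return sorted(t for t in self._store if t[1] == predicate)

    def reload(self, fetch_dump):
        new_store = set(fetch_dump())  # the slow part: no lock held
        with self._lock:               # the swap itself is instant
            self._store = new_store

store = SwappableStore({("ex:a", "ex:p", "ex:b")})
store.reload(lambda: [("ex:a", "ex:p", "ex:c")])
```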

So the short answer is no: I think you'd have to re-ingest the entire dataset(s) each time, unless you can get something like an RSS feed of changes from the source database. Though that might be something you could bolt on as a service.