Existing work on mining/recovery of schemas/structure/graph patterns from large collections of RDF data?

Let's say I have a bunch of RDF data. Maybe it's crawled, maybe it's a lot of dumps that I loaded into a store, whatever.

Now I'd like to know what the typical structures in that data are. Probably a lot of this data features repetitive patterns.

For example, if I crawled a lot of FOAF from one social networking site, then there would be a lot of foaf:Person instances that have foaf:knows links to other foaf:Person instances, as well as foaf:name, foaf:nick, and maybe some other properties.

Or if I crawled a lot of GoodRelations data, then this would contain a lot of gr:BusinessEntity instances with gr:Offering instances that have gr:PriceSpecification instances attached, with their respective typical properties.

If I have DBpedia, then such an analysis wouldn't find much structure except that all resources have rdfs:label, rdfs:comment, dbp:wikiPage etc.

So one can discover typical patterns in almost any collection of RDF data. There will always be bits and pieces of data that “stick out” from underneath the pattern, but nevertheless I'd expect to be able to recover some essentially table-shaped structure in most datasets.
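To make "table-shaped structure" a bit more concrete, here is a minimal sketch of one naive way to look for it: group subjects by the exact set of predicates they use and count how often each set occurs. The triples are toy data held in plain tuples, and the function name `property_signatures` is made up for illustration; a real dataset would need a proper RDF parser and probably some tolerance for near-identical sets.

```python
from collections import Counter, defaultdict

# Toy (subject, predicate, object) triples; the ex: resources are invented.
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:bob", "foaf:knows", "ex:alice"),
    ("ex:carol", "rdf:type", "foaf:Person"),
    ("ex:carol", "foaf:name", "Carol"),
    ("ex:carol", "foaf:nick", "caz"),
]


def property_signatures(triples):
    """Count how many subjects share each exact set of predicates."""
    predicates_by_subject = defaultdict(set)
    for s, p, _ in triples:
        predicates_by_subject[s].add(p)
    # Each distinct predicate set is a candidate "table" with those columns.
    return Counter(frozenset(ps) for ps in predicates_by_subject.values())


signatures = property_signatures(triples)
for signature, count in signatures.most_common():
    print(count, sorted(signature))
```

Each frequent signature would correspond to one candidate table, with the predicates as columns and the subjects as rows; the rare signatures are the bits that "stick out".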

Are you aware of existing work in the literature that goes in this general direction? We could describe this as “schema mining”, “schema recovery”, “graph pattern detection”, “RDF summarization” and so on. Any pointers would be much appreciated.

Where to start... tell me when I'm getting warm. :)

There's been work on creating visual schema-level summaries of large, heterogeneous datasets based on the instance data.

For example:

An Interactive Map of Semantic Web Ontology Usage.
Sheila Kinsella, Uldis Bojars, Andreas Harth, John G. Breslin, Stefan Decker.
The 12th International Conference on Information Visualisation (IV08), London, UK. IEEE Computer Society. doi:10.1109/IV.2008.60

High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases.
Cliff Joslyn, Bob Adolf, Sinan al-Saffar, John Feo, Eric Goodman, David Haglin, Greg Mackey, and David Mizell
Billion Triple Challenge, 2010

Taking from the latter paper, these folks build graphs like this:

...fairly intuitive stuff. The latter paper also looks at frequent (1/2/3 length) predicate paths in the data.
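As a rough illustration of what counting predicate paths means (this is only a sketch, not the algorithm from the paper), the following walks the graph from every subject and counts the predicate sequences seen along directed paths of up to three edges. The toy triples and the function name `predicate_paths` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy GoodRelations-style triples; the ex: resources are invented.
triples = [
    ("ex:shop", "gr:offers", "ex:offer1"),
    ("ex:offer1", "gr:hasPriceSpecification", "ex:price1"),
    ("ex:price1", "gr:hasCurrencyValue", "9.99"),
    ("ex:shop", "gr:offers", "ex:offer2"),
    ("ex:offer2", "gr:hasPriceSpecification", "ex:price2"),
]


def predicate_paths(triples, max_length=3):
    """Count predicate sequences along directed paths of 1..max_length edges."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))

    counts = Counter()

    def walk(node, path, visited):
        for p, o in out_edges.get(node, []):
            new_path = path + (p,)
            counts[new_path] += 1
            # Stop at max_length and never re-enter a node already on the path.
            if len(new_path) < max_length and o not in visited:
                walk(o, new_path, visited | {o})

    for start in list(out_edges):
        walk(start, (), {start})
    return counts


path_counts = predicate_paths(triples)
for path, count in path_counts.most_common():
    print(count, " / ".join(path))
```

This brute-force enumeration only works on small graphs; at billion-triple scale you would need something far more careful.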

You may already have heard of voiD:

voiD is a vocabulary and a set of instructions that enables the discovery and usage of linked datasets. A dataset is a collection of data, published and maintained by a single provider, available as RDF, and accessible, for example, through dereferenceable HTTP URIs or a SPARQL endpoint. Based on the voiD vocabulary this document explains how to use voiD in a practical setup, for both data consumers and data providers.

There was a BTC 2010 paper on building voiD for the competition corpus:

Creating voiD Descriptions for Web-scale Data
Christoph Böhm, Johannes Lorey, Dandy Fenz, Eyk Kny, Matthias Pohl, Felix Naumann
Billion Triple Challenge, 2010 (winner)
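To give a feel for the kind of statistics a voiD description can carry, here is a small sketch that computes a few dataset-level counts and prints them as Turtle. It is not how the paper above does it; the function name `void_description`, the dataset URI and the toy triples are all placeholders, and the classes are approximated as the distinct objects of rdf:type.

```python
# Toy (subject, predicate, object) triples; the ex: resources are invented.
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
]


def void_description(triples, dataset_uri="http://example.org/dataset"):
    """Build a Turtle snippet with basic voiD statistics for the triples."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    properties = {p for _, p, _ in triples}
    # Assumption: a class is anything that appears as the object of rdf:type.
    classes = {o for _, p, o in triples if p == "rdf:type"}
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f"    void:triples {len(triples)} ;",
        f"    void:classes {len(classes)} ;",
        f"    void:properties {len(properties)} ;",
        f"    void:distinctSubjects {len(subjects)} ;",
        f"    void:distinctObjects {len(objects)} .",
    ]
    return "\n".join(lines)


description = void_description(triples)
print(description)
```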

There's a paper coming out at ESWC 2011 (Linked Data track) entitled:

Statistical Schema Induction
Johanna Voelker and Mathias Niepert

Don't have access to the paper, but from the title, it sounds like it will be very related.

Folks over at RPI won BTC 2009 for summarising large RDF corpora... a bit more into data-mining.

Scalable Reduction of Large Datasets to Interesting Subsets
Gregory Todd Williams, Jesse Weaver, Medha Atre, and James A. Hendler
Billion Triple Challenge, 2009 (winner)

RDF triple/quad stores often need to get a summary of their data in order to do effective query-processing. Although typically based on simple selectivity, there's some ongoing work on extracting frequent "graph patterns".
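By "simple selectivity" I mean something along these lines: keep per-predicate counts and use them to guess how many triples a pattern will match. The sketch below assumes objects are uniformly distributed per predicate, which real data rarely satisfies; it is not taken from any particular store, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples; the ex: resources are invented.
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:carol", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:carol", "foaf:name", "Carol"),
]


def predicate_statistics(triples):
    """Per predicate: triple count plus distinct subject and object counts."""
    raw = defaultdict(lambda: {"triples": 0, "subjects": set(), "objects": set()})
    for s, p, o in triples:
        entry = raw[p]
        entry["triples"] += 1
        entry["subjects"].add(s)
        entry["objects"].add(o)
    return {
        p: {
            "triples": e["triples"],
            "distinct_subjects": len(e["subjects"]),
            "distinct_objects": len(e["objects"]),
        }
        for p, e in raw.items()
    }


def estimate_bound_object(stats, predicate):
    """Estimate matches for (?s, predicate, some fixed object).

    Assumes every object of the predicate is equally frequent.
    """
    entry = stats.get(predicate)
    if entry is None:
        return 0.0
    return entry["triples"] / entry["distinct_objects"]


stats = predicate_statistics(triples)
type_estimate = estimate_bound_object(stats, "rdf:type")
name_estimate = estimate_bound_object(stats, "foaf:name")
print(type_estimate, name_estimate)
```

Estimates like these fall apart for joins across several triple patterns, which is where summaries based on frequent graph patterns come in.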

A little bit out-of-the-loop there, but:

Estimating the Cardinality of RDF Graph Patterns
Angela Maduko, Kemafor Anyanwu, Amit Sheth, Paul Schliekelman
WWW 2007 Poster

...might be related.

Similarly, there's work on partitioning data locally for distributed querying, based on frequent graph patterns:

Index Structures and Algorithms for Querying Distributed RDF Repositories
Heiner Stuckenschmidt, Richard Vdovjak, Geert-Jan Houben, Jeen Broekstra
WWW 2004.

Structure Index for RDF Data
Thanh Tran, Günter Ladwig
SemData workshop at VLDB 2010

...they also have a paper upcoming at ESWC 2011 Linked Data Track. (I'm sure there's more work in the area.)

Finally, there's work on summarising datasets for federated/remote/live querying; e.g.:

Comparing Data Summaries for Processing Live Queries over Linked Data.
Jürgen Umbrich, Katja Hose, Marcel Karnstedt, Andreas Harth, Axel Polleres
In WWW Journal, Special Issue "Querying the Data Web", 2011.

Summary Models for Routing Keywords to Linked Data Sources
Thanh Tran, Lei Zhang, Rudi Studer
9th International Semantic Web Conference (ISWC'10).

Linked Data Query Processing Strategies
Günter Ladwig, Thanh Tran
9th International Semantic Web Conference (ISWC'10).

...use summarisation techniques when performing live-querying over Linked Data sources.

Maybe this is a starting point?


You probably already know this DERI paper.