Killer applications for RDF data stores?

If you were speaking at a general programming conference, and you wanted to convince the audience that they should learn more about RDF data stores, what examples would you choose?

  • Applications built around Facebook's Open Graph? Google's RDFa search support?
  • Complex queries against graphs with millions of nodes?
  • Applications in bioinformatics?
  • Companies using RDF successfully in high-profile applications?

What do RDF data stores give you that SQL and other NoSQL databases don't?

I'm particularly interested in examples that could be used today, and not ones which depend on large numbers of people and organizations publishing RDF metadata in the future. :-)

(Related: "Is there any killer application for Ontology/semantics/OWL/RDF yet?" on Stack Overflow.)

If you were speaking at a general programming conference, and you wanted to convince the audience that they should learn more about RDF data stores, what examples would you choose?

What do RDF data stores give you that SQL and other NoSQL databases don't?

Within (i) a closed system with (ii) a known schema which rarely changes, the answer is very very little. Vary (i) or (ii) and you're getting closer to a potential use-case for RDF.

The main difference between RDF and SQL/NoSQL is that the data model is domain-agnostic and standardised. This has potential advantages and disadvantages (the schema-flexibility point is sketched in code after the two lists):

Advantages

  • Very flexible wrt. (minor) changes in schema ✓✓✓
  • Structurally interoperable with other indexes using triples/named graphs ✓✓✓
    • Can pull data from other sources ✓✓✓
    • Can publish data in a standardised way ✓✓
    • Same for schema! ✓✓✓✓✓✓
  • Standards for building an inferencing layer which smooths out possible heterogeneity in how data is created/added to the store ✓✓
  • Reusability of APIs and off-the-shelf tools ✓
  • Future is bright: active research, development and publishing ✓

Disadvantages

  • Generic triple storage often (but not always) implies less efficient lookups (special indexes can still be built, but this moves away from schema flexibility) ✗✗✗
  • Certain data not easily represented in RDF (esp. n-ary predicates) ✗✗✗
  • Practical disadvantages with respect to (relatively) immature RDF storage systems and tools and porting over existing systems ✗✗
  • High overhead for developers to get the necessary expertise to do a good job ✗✗
  • Only non-standard solutions available for declaratively specifying (common types of CWA) constraints ✗
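To make the schema-flexibility point concrete, here is a minimal sketch using Python's rdflib library (the EX namespace and the item/predicate names are invented for illustration). A new attribute is just another triple; there is no ALTER TABLE equivalent:

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")  # hypothetical vocabulary
g = Graph()

# The "schema" is implicit in the triples themselves, so a predicate
# invented later is stored exactly like a pre-planned one.
g.add((EX.item1, EX.label, Literal("First item")))
g.add((EX.item1, EX.reviewedBy, EX.alice))  # added later, no migration

# Query without knowing the full set of predicates in advance.
for p, o in g.predicate_objects(EX.item1):
    print(p, o)
```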

I'm particularly interested in examples that could be used today, and not ones which depend on large numbers of people and organizations publishing RDF metadata in the future. :-)

If they're not interested in sharing their data or pulling in external data, then the only real value proposition for RDF is flexibility in schema, or using inferencing to smooth out heterogeneity when multiple parties contribute data (where the data mean the same thing but are described in different ways). On that point, you can use, in particular, the BBC World Cup example, nicely described by mhermans in the answer here; another article here.
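As a rough illustration of that smoothing, suppose (hypothetically) two parties describe authorship with different predicates, and schema triples declare both to be sub-properties of a common one. An RDFS reasoner then lets a single query see both; the sketch below applies the relevant inference rule by hand, where a real store would do it automatically:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS

A = Namespace("http://partyA.example/")       # hypothetical vocabularies
B = Namespace("http://partyB.example/")
COMMON = Namespace("http://common.example/")

g = Graph()
# Two sources say the same thing in different ways.
g.add((A.doc1, A.writtenBy, Literal("Alice")))
g.add((B.doc2, B.hasAuthor, Literal("Bob")))
# Schema triples map both predicates onto a common one.
g.add((A.writtenBy, RDFS.subPropertyOf, COMMON.author))
g.add((B.hasAuthor, RDFS.subPropertyOf, COMMON.author))

# Apply the RDFS sub-property rule (rdfs7) by hand.
for sub, _, sup in list(g.triples((None, RDFS.subPropertyOf, None))):
    for s, _, o in list(g.triples((None, sub, None))):
        g.add((s, sup, o))

# One query over the common vocabulary now sees both sources.
for s, o in g.subject_objects(COMMON.author):
    print(s, o)
```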

If you were speaking at a general programming conference, and you wanted to convince the audience that they should learn more about RDF data stores...

I wouldn't try to convince them, to be honest. :)

(I'm not saying that this is what you intend, but I feel it's worth mentioning that...)

Blindly evangelising RDF stores won't help adoption! Tell them the disadvantages—that they probably already know, or think they know, and in any case should know—to get them on your side. Get even the hardcore SW cynics to agree with you. Then, lure them in with the advantages, and give the examples of successful use-cases, showing how RDF benefited them.

If they have a genuine use-case, make them think that RDF was their idea, not yours.

I see two areas of application: one is a fit for startups and lifestyle businesses, the other for government, big businesses, and the consultants who serve them.

(1) Using data that's already available in RDF or RDF-like format.

The most important examples in my mind are DBpedia and Freebase. There are hundreds of fun and informative web sites to be created based on those data sets, although you'll need to invest in data cleaning to make anything commercially viable.
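For instance, a few lines of Python with the SPARQLWrapper library are enough to pull data from DBpedia's public SPARQL endpoint (the query below is a made-up example, and endpoint availability and exact property names can vary):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
# Example: German cities with more than a million inhabitants.
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?city ?pop WHERE {
        ?city a dbo:City ;
              dbo:country dbr:Germany ;
              dbo:populationTotal ?pop .
        FILTER(?pop > 1000000)
    }
""")
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["city"]["value"], row["pop"]["value"])
```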

Generic databases can also provide world knowledge to NLP systems -- DBpedia Spotlight, for instance, is a named-entity recognition system based on world knowledge that's (almost) entirely ignorant of grammar. After a year of development it's competitive with commercial grammar-based systems that have been under development for decades, and I believe it will make more progress in the next two years than conventional systems will.

Unfortunately, Facebook's Open Graph protocol/API isn't open enough to enable interesting applications. For instance, Facebook has created pages for most topics in Wikipedia -- these can, on an individual basis, be reconciled to DBpedia topics, but there's the complication that the "official" page recorded for something in Freebase is often a page managed by the entity, not the one derived from Wikipedia. With a reconciliation API or NT file, it would be easy to crosswalk Facebook pages with DBpedia and Freebase data, but as it is, it's not easy.

(2) "Enterprise" Data Integration

All the time there's some story in the news about how a big organization is doing a project that involves integrating a large number of legacy systems. For instance, CalPERS (California's state pension system) is integrating more than 60 legacy systems to make a system that will be helpful to state employees as well as CalPERS workers. Projects like this always run over budget and behind schedule, and it would be great to have something that lowers cost and squeezes out risk.

I think a system based on declarative and logic programming could be a big improvement over the status quo. RDF is a good candidate for a 'universal' data model that can express information that exists in different kinds of databases.
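As a toy sketch of that 'universal data model' idea (table, column, and namespace names are all invented here), a relational row maps naturally onto triples, with one subject per row and one predicate per column:

```python
from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.org/crm/")  # hypothetical integration vocabulary
g = Graph()

# A row from a hypothetical legacy CUSTOMER table...
row = {"id": 42, "name": "ACME Corp", "region": "West"}

# ...becomes triples: one subject per row, one predicate per column.
subject = URIRef(f"http://example.org/crm/customer/{row['id']}")
for column, value in row.items():
    g.add((subject, EX[column], Literal(value)))

print(g.serialize(format="turtle"))
```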

One mode is to deal with a limited amount of information at a time (say, a single customer) so that we don't run up against the scalability limits of highly expressive reasoners. The Amdocs system that Craig brings up is a good example.

Another mode is to 'throw it all into a triple store.' Many data warehousing applications use star schemas and bitmap indices, which are similar in organization and performance to many triple stores -- RDF technology can be more competitive in this space than it is in OLTP.
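To illustrate, a typical warehouse-style rollup is directly expressible as a SPARQL aggregate (the sales vocabulary is invented for the example):

```python
from rdflib import Graph

g = Graph()
# Assume g has been loaded with sales "fact" triples, e.g.:
# g.parse("sales.ttl")

# A star-schema-style rollup: total sales amount per region.
query = """
    PREFIX ex: <http://example.org/sales/>
    SELECT ?region (SUM(?amount) AS ?total) WHERE {
        ?sale ex:region ?region ;
              ex:amount ?amount .
    }
    GROUP BY ?region
"""
for row in g.query(query):
    print(row.region, row.total)
```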

We have a situation like the following in a project: we are developing a repository for a certain kind of artifact (let's say a particular kind of document), and for each stored artifact it should be possible to store key/value-style metadata. A few key names and their value spaces are pre-defined. Other key names are still in flux and may eventually become fixed, or may be changed or dropped over the course of development. Most importantly, though, it should be possible for users to add their own custom metadata to artifacts. So, apart from the general key/value format, it is completely open what kind of data will be assigned to an artifact, and a key used for one artifact need not be used for another. It should be possible to add new metadata to a stored artifact at any time, which means the repository has to deal with new key names at any time.

There is a schema (an OWL ontology) for the fixed key names (to give some guidance in the first place, not so much for inferencing), and this schema may evolve over time to cover the still-in-flux key names. There will probably be no such schema for any of the custom key names. Ideally, neither the implementers nor the users/admins of the repository should have to deal specially with newly defined keys; the repository should be completely transparent in this respect.

Well, as you can guess, we use a triple store for storing this metadata, where each triple consists of the URI assigned to a stored artifact, the key name (or, if needed, a URIfied variant of it), and the value. And, obviously, the key/value pairs should not only be stored but also be retrievable in a flexible way. This calls for SPARQL.
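A minimal sketch of this pattern with rdflib (the artifact URI and key names are invented, and a production system would use a persistent triple store rather than an in-memory graph):

```python
from rdflib import Graph, Namespace, Literal, URIRef

META = Namespace("http://example.org/meta/")  # hypothetical key namespace
g = Graph()

artifact = URIRef("http://example.org/artifact/123")
# Pre-defined keys and user-invented keys are stored identically.
g.add((artifact, META.status, Literal("approved")))
g.add((artifact, META.reviewCycle, Literal("2011-Q3")))  # a custom key

# Retrieve all metadata for an artifact without knowing its keys up front.
results = g.query("""
    SELECT ?key ?value
    WHERE { <http://example.org/artifact/123> ?key ?value }
""")
for key, value in results:
    print(key, value)
```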

I think this use case practically excludes SQL/RDBs. However, I don't know what the situation is for NoSQL, since there is a variety of different NoSQL approaches, and I am not familiar with all of them.

Btw, I would /not/ go so far as to call my use case a killer application. :-)

AIDA - Semantic Real Time Intelligent Decision Automation

A telecom use case, and the topic of a recent webcast at Semanticweb.com.

Links to the webcast, SemTech video, and slides:

http://www.franz.com/agraph/amdocs/