Advantages of RDF over Relational Databases

I am still looking for good arguments for choosing RDF over relational databases as a way to model and publish data on the internet. I know that it depends a lot on the use case, and I am also aware of the advantages a graph structure such as RDF offers:

  • decentralized
  • distributed
  • enable seamless interoperability of the data on the web
  • can capture all kinds of data: information from unstructured, semi-structured or structured sources
  • graph structure allows modular operations
  • bottom-up approach (relational databases are top-down)

But all of this sounds very dry. Can someone provide advantages from a practical point of view that would illustrate the listed points, or are there other advantages? This need not be limited to RDF; the semantic web in general is also of interest.

One thing I would say is that it is easy to embed additional information about certain entities: for instance, if you offer a web interface that lists cities, DBpedia data can easily be integrated.

I hope this question fits the Q&A style.

My answer is tainted by my experience of data warehousing in pharma and the life sciences, as well as working on a website with millions of hits on a very bad day.

Data warehousing

At some point your system needs to answer a question that takes into account information outside your central database. As a user of a classical relational database you can't do federated queries, so loading the extra data (and expanding the schema) is the only option. A SPARQL 1.1 user, with the SERVICE keyword, can have part of a query answered by a remote system, as in the sketch below.
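A minimal federated-query sketch; the local class and the remote endpoint URL are made up for illustration:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/>
    SELECT ?compound ?label
    WHERE {
      ?compound a ex:Compound .                      # answered by the local store
      SERVICE <http://remote.example.org/sparql> {   # answered by the remote endpoint
        ?compound rdfs:label ?label .
      }
    }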

At some point, expanding the relational schema becomes too expensive compared to the value of the answered query, and the amount of work required before you can even run a query becomes prohibitive. SPARQL allows cheaper experimentation.

In the life-science field, hundreds of organizations have created data warehouses combining public and private information. For example, a lab might want to query over ChEMBL, UniProt, ArrayExpress and PDB plus their internal data. They would hire a bunch of programmers (from students to professionals) who would start integrating: write parsers, design a relational schema, tune the database. And after about a year they would have a working system.

And then the file formats would change... meaning a parser change... meaning a schema change... meaning queries need to be fixed. They keep up with this for the first year, but soon the updates stop happening and bit rot sets in. Requests for new data sources start to get denied, and the system never lives up to its promises.

In the RDF/SPARQL world there is no need for a parser to change (triples are triples), and with no schema there are no tables to alter; only your queries need updating. A third of the work! The same is true for the start-up costs: UniProt is available as an RDF download and as a public SPARQL endpoint (in beta), and the same goes for ChEMBL and ArrayExpress. PDB you would need to load into your own endpoint, but all you need to do there is tune your RDF store and you're good.
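As a small illustration, a query along these lines (a sketch using the UniProt core vocabulary; check the endpoint's documentation for the current terms) runs against the public UniProt endpoint without any loading, parsing or schema work on your side:

    PREFIX up: <http://purl.uniprot.org/core/>
    SELECT ?protein ?mnemonic
    WHERE {
      ?protein a up:Protein ;           # every UniProt entry is an up:Protein
               up:mnemonic ?mnemonic .  # e.g. "INS_HUMAN"
    }
    LIMIT 10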

And as we are working in the real world: at some point someone is going to come to your desk with an Excel file, wanting to combine the data in that file with your relational database. That's possible thanks to ODBC and some SQL magic. But you will never be able to combine it with public resources.

Excel and R

The SPARQL/RDF Excel story is easier today. In Excel, save as tab-separated or CSV. Use biotable or an equivalent tool to turn the Excel data into RDF. Load it into an in-memory SPARQL store such as the one that ships with Jena, and query away, combining any SPARQL endpoint available via HTTP (a sketch follows). The same is true for data analysts using R or another statistical language.
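A sketch of what the combined query could look like once the spreadsheet rows are triples; all names and the endpoint URL here are hypothetical:

    PREFIX ex:   <http://example.org/sheet/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?item ?label
    WHERE {
      ?row ex:refersTo ?item .                       # triples converted from the spreadsheet
      SERVICE <http://public.example.org/sparql> {   # any public endpoint reachable over HTTP
        ?item rdfs:label ?label .
      }
    }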

No data left behind

SQL, with its long history, has locked up a lot of information. This is best explained with an example. Assume you have a legacy Oracle Forms 6 application running in a corner of your enterprise. Accounting wants to merge some of that data with your current CRM (SAP/Sybase) installation. How do you do this using only relational technology? With SPARQL, you can put a relational-to-RDF mapper on both the old Oracle database and the latest Sybase release. Then you can use the SERVICE keyword to collate information from both sources with minimal pain and effort, as sketched below. Relational technology alone does not offer this option; only the addition of SPARQL-to-SQL translation gives this capability.
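A sketch of such a query, assuming each database is exposed through a relational-to-RDF mapper (e.g. D2RQ or an R2RML processor); the endpoint URLs and predicates are hypothetical:

    PREFIX ex: <http://example.org/>
    SELECT ?customer ?invoice ?order
    WHERE {
      SERVICE <http://legacy-forms.example.org/sparql> {  # mapped Oracle Forms data
        ?customer ex:hasInvoice ?invoice .
      }
      SERVICE <http://crm.example.org/sparql> {           # mapped Sybase CRM data
        ?customer ex:hasOrder ?order .
      }
    }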

SPARQL is more consistent than SQL

SPARQL is extremely standardized across all data sources, not at all like SQL. SQL has too many dialect differences from one implementation to the next for casual users to explore schemas. I know how to show a list of tables in Oracle and MySQL, but I always forget how in PostgreSQL and DB2. A semi-trained user with experience in one vendor's SQL syntax is lost in the next. A semi-trained SPARQL user can use any SPARQL endpoint, no matter what the implementation is or what the schema looks like.
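For example, this introspection query lists the classes in use with their instance counts, and it works unchanged on any conformant endpoint, whereas the SQL equivalent (SHOW TABLES, ALL_TABLES, information_schema, ...) differs per vendor:

    SELECT ?class (COUNT(?instance) AS ?instances)
    WHERE { ?instance a ?class }
    GROUP BY ?class
    ORDER BY DESC(?instances)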

Lower cost

So in the end, SPARQL-based information systems have lower maintenance costs in fields where information changes rapidly. They have lower building costs for integrative data (RDF as source compared to XML as source, with similarly experienced developers). And they allow for experiments with public or highly private information in the enterprise, e.g. combining the HR database and the LDAP database to find who has been fired but still has an active e-mail account, as sketched below.
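A hypothetical sketch of that HR/LDAP experiment, assuming both sources have been exposed as named graphs in one endpoint; all names are made up:

    PREFIX ex: <http://example.org/>
    SELECT ?person ?mail
    WHERE {
      GRAPH <http://example.org/hr> {
        ?person ex:employmentStatus ex:Terminated .   # from the HR database
      }
      GRAPH <http://example.org/ldap> {
        ?person ex:accountStatus ex:Active ;          # from the LDAP directory
                ex:mail ?mail .
      }
    }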

Allows competitive tendering

For those who are used to Oracle or DB2 prices, SPARQL brings another advantage: much lower switching costs. This reduces the number of decisions developers and project leaders need to make ahead of time. People ask me which store they should use; I answer: use a simple free store that works well enough for your test data, and communicate with the store via HTTP. If at some point that free store no longer works (not enough performance, a missing feature), then run a competitive benchmark with different stores. For the UniProt endpoint I tested five stores in depth and in the end chose the one with the best performance/price combination. I still regularly test other store/hardware combinations because it costs me so little time to do so.

Better performance for naturally grouped data

Imagine a web page for a database record such as an ENA or UniProt record. Each web page corresponds to a single record, and each record consists of many fields which are again records. In normalized SQL you end up with 50-70 tables with n:m connections that need to be joined to display each page. Often you know exactly which data belongs on that page ahead of time, so you could imagine using a key-value store there (which ENA and UniProt do). The unfortunate side effect is that you would lose your SQL search capabilities.

SPARQL, because there is a graph next to the triple, does not suffer from this problem. If some data is needed for a web page, just attach the URI of the web page as a graph context. This allows you to use your SPARQL endpoint as a key-value or document store, where the key is the graph ID and the value is the set of triples in the graph.
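Fetching everything needed to render one record page then becomes a single lookup by graph name (the record URI here is illustrative):

    # Return all triples stored under the page's graph context
    CONSTRUCT { ?s ?p ?o }
    WHERE {
      GRAPH <http://example.org/record/P12345> { ?s ?p ?o }
    }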

This can give far superior performance over a purely relational approach. Whether it is fast of course depends on the specific implementation of your store. But again, because your access is standardized, you get to choose an implementation that fits your specific needs.

Better caching

SPARQL over HTTP offers another benefit: very cheap caching. Most sysadmins have installed an HTTP cache at some point in their career. A database cache layer such as the one included with Hibernate is much more specialized software, with the associated complexity and price.

SADI

For many queries a user needs to combine information in one database, feed it into a program, and then get more information out of a second database. SQL can never drive the program; SPARQL can. See the wonderful SADI framework for more details. But basically, given a SPARQL query, some information in RDF, and an RDF description of a program's input and output, SADI can combine all of this and answer questions for which you do not have the data. This is possible because SPARQL/RDF technology has first-order logic reasoners using OWL. Nothing comparable exists for relational technology.

A simple SADI example

Given a database with only people's heights and weights, you want to query for obese people. Your database does not know about obesity. However, any women's magazine with a website has a BMI (body mass index) calculator that knows about the concept. SADI can combine the people information in your database with the BMI calculator to return whether each person is obese or not. It can do this because the service declares that, given the input "a person, with a height and a weight", it gives the output "a person, with a BMI value". The SPARQL engine can then answer the query "give me all people with a BMI over 26", as in the sketch below.
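The final query the user writes could be as simple as the following sketch (the vocabulary is hypothetical); SADI takes care of invoking the BMI service so that ex:bmi values exist to match against:

    PREFIX ex: <http://example.org/>
    SELECT ?person
    WHERE {
      ?person ex:bmi ?bmi .    # computed on demand by the SADI-wrapped BMI service
      FILTER (?bmi > 26)
    }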

RDF puts relations first

While relational databases are called relational, they do not put their relations front and center; the relations are only implicit. In RDF, the predicate of a triple is the relation, and it is explicit in its meaning. A foreign-key relationship between two tables only has meaning derived from context.
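Compare a hypothetical triple, where the relation is itself a first-class, dereferenceable IRI, with a foreign-key column whose meaning lives only in application code:

    PREFIX ex: <http://example.org/>
    INSERT DATA {
      # The predicate ex:worksFor IS the relation: it can be looked up,
      # documented and reasoned over in its own right.
      ex:employee42 ex:worksFor ex:acme .
    }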

Conclusions

Standardisation in SPARQL allows flexibility: for users, for developers, for admins, for buyers.