Publish RDF or provide a SPARQL endpoint?

I was wondering: how do you decide between making your data available as static RDF and creating a SPARQL endpoint? Should you do both? What things should you consider when making that decision?

Doing both is the ideal solution: publish RDF with dereferenceable URIs for easy data retrievability and provide a SPARQL endpoint for flexible data access.

Note that publishing RDF doesn't have to mean static files: you can use SPARQL DESCRIBE queries to collect the data that is to be served under a dereferenceable URI. Also, by describing your dataset (with VoID), linking from the published RDF data to that description, and linking to the SPARQL endpoint, you create a complete picture and increase discoverability.
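To make that concrete, here is a rough sketch of the DESCRIBE approach in Python using the requests library and the standard SPARQL protocol. The endpoint and resource URIs are made-up placeholders, and whether you actually get Turtle back depends on what the endpoint supports, so treat it as an illustration rather than a recipe.

    import requests

    # Hypothetical endpoint and resource URIs -- substitute your own.
    ENDPOINT = "http://data.example.com/sparql"
    RESOURCE = "http://data.example.com/category/name"

    # Ask the endpoint to DESCRIBE the resource and return Turtle
    # (standard SPARQL protocol: the query goes in the "query" parameter).
    response = requests.get(
        ENDPOINT,
        params={"query": f"DESCRIBE <{RESOURCE}>"},
        headers={"Accept": "text/turtle"},
    )
    response.raise_for_status()

    # The result can be cached and served as the document behind the
    # resource's dereferenceable URI.
    with open("name.ttl", "w", encoding="utf-8") as f:
        f.write(response.text)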

But really, this is all just the ideal solution and both approaches have their own benefits, so it depends on what you want to do and what you can easily do. Are you organising a large dataset in a triple store anyway? Then it's probably not too hard to make it accessible through an open SPARQL endpoint. But if you mainly have your data organised in files, then publishing static content is the easiest and quickest thing to do, and better than nothing. You can even serve nice dereferenceable Linked Data URIs from static files and a simple webserver.
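For the static-file route, about the only thing you need to get right is the media type. A minimal sketch with Python's built-in http.server (the port and directory layout are arbitrary):

    from http.server import HTTPServer, SimpleHTTPRequestHandler

    class RDFHandler(SimpleHTTPRequestHandler):
        # Make sure RDF files are served with the proper media types;
        # everything else falls back to the default guesses.
        extensions_map = {
            **SimpleHTTPRequestHandler.extensions_map,
            ".ttl": "text/turtle",
            ".rdf": "application/rdf+xml",
            ".nt": "application/n-triples",
        }

    # Serve the current directory: a file like ./category/name.ttl becomes
    # dereferenceable at http://localhost:8000/category/name.ttl
    HTTPServer(("", 8000), RDFHandler).serve_forever()

With a real webserver (Apache, nginx and the like) the equivalent is a one-line media-type mapping, plus rewrite rules if you want content negotiation on extension-less URIs.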

Consider the size of your dataset and the expected load. If you provide a SPARQL endpoint, you run the risk of your service becoming unresponsive due to complex (or poorly written) SPARQL queries, particularly if your dataset is large.

This risk might be mitigated by requiring users to register to use the endpoint, while the static RDF remains open access. If you are not expecting heavy usage of your data, you might consider that the benefits of providing a SPARQL endpoint outweigh the risk.

This is a question that applies equally to relational databases and any other large system that uses some kind of shared data store. The problem is to process your query in the place where it will have the least impact on system performance (whilst still meeting your global architecture goals).

Another way to think about this problem is to ask yourself: which makes more sense - to take the processing to the data or take the data to the processing? This will depend on what your bottleneck resource is. If it's bandwidth, then that might suggest using SPARQL. If it's CPU load (or some other scarce resource) on your data store, then perhaps it makes more sense to send the data to somewhere where that resource is less stressed.

As Simon points out, the ideal scenario is to provide for both. This is particularly true if you're optimizing a distributed system. In that case, the bottleneck will tend to move between different parts of your system.

Let me just add that RDF documents, as opposed to a SPARQL endpoint, need not be static files. It should be relatively easy to create a crude app that publishes your semantic data as RDF documents.

For example, you can have a resource with the URI http://data.example.com/category/name. Its Turtle representation could be served from a URL like http://www.example.com/category/name.ttl (note how the host changed and the .ttl extension was added). Similarly, this resource could be served as HTML, RDF/XML or JSON by simply changing the extension part (underneath, an HTTP redirect with content negotiation could happen).
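To illustrate, here is a sketch of such a crude app; the URI scheme follows the example above, and Flask plus rdflib are just one convenient choice, not the only way to do it. It keeps the dataset in a single graph, 303-redirects the extension-less resource URI to a concrete document based on the Accept header, and serialises the relevant triples on the fly.

    from flask import Flask, Response, abort, redirect, request
    from rdflib import Graph, URIRef

    app = Flask(__name__)

    # The whole dataset loaded from one dump file; a triple store queried
    # per request would work just as well.
    graph = Graph()
    graph.parse("dataset.ttl", format="turtle")

    # Map URL extensions to rdflib serialisation formats and media types.
    # (JSON-LD needs rdflib 6+ or the rdflib-jsonld plugin.)
    FORMATS = {
        "ttl": ("turtle", "text/turtle"),
        "rdf": ("xml", "application/rdf+xml"),
        "json": ("json-ld", "application/ld+json"),
    }

    @app.route("/<category>/<name>")
    def resource(category, name):
        if "." not in name:
            # Resource URI requested: redirect (303) to a concrete document,
            # picking the representation from the Accept header.
            accept = request.headers.get("Accept", "")
            ext = "rdf" if "rdf+xml" in accept else "ttl"
            return redirect(f"/{category}/{name}.{ext}", code=303)

        local, ext = name.rsplit(".", 1)
        if ext not in FORMATS:
            abort(404)
        rdflib_format, media_type = FORMATS[ext]

        # Serve all triples that have this resource as subject; a DESCRIBE
        # query against the store could build a richer description instead.
        subject = URIRef(f"http://data.example.com/{category}/{local}")
        doc = Graph()
        for triple in graph.triples((subject, None, None)):
            doc.add(triple)
        if len(doc) == 0:
            abort(404)
        return Response(doc.serialize(format=rdflib_format), mimetype=media_type)

    if __name__ == "__main__":
        app.run(port=8080)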

This way you don't have to maintain individual files, which will become a burden as your dataset grows.

By the way, have a look at the Linked Data Patterns book. It addresses these and other common issues.