How to use large databases in semantic web applications?


I need some hints on how to use the data from large databases in ontologies for reasoning.

I have a really big database (billions of customer records). As far as I understand the Semantic Web, I have to import the complete information (as RDF?) from the database into an ontology in order to make inferences with a reasoner (for example, Pellet). Is that true? Do I have to import all the data from the database? And if so, is that not a serious memory (RAM) problem?

Or are there other ways to use large databases in Semantic Web applications? How can I make this huge amount of information available for use in ontologies?



There are several ways to use (relational) databases in SW applications. Which solution is most suitable for you depends on the specifics of what you want to achieve. Without claiming to be complete, I'll give a few pointers to get you started:

  1. You can do a one-time conversion of your data to RDF/OWL and load that into a triplestore or OWL reasoner;
  2. You can engineer a custom wrapper/translator around your relational database that converts calls on some Semantic Web API to calls on your database dynamically (for example, Sesame's SAIL API can be used for this purpose);
  3. You can use a declarative relational-to-RDF mapping tool, like D2RQ, to create either a one-time conversion or a dynamic mapping.

If the database you want to use is stable (that is, the data does not change), a one-time conversion might be the easiest solution. As an aside: converting to RDF and using a triplestore and/or OWL reasoner does not automatically mean that you have to keep everything in RAM - there are various good and scalable triplestore/reasoner combinations that have their own on-disk storage solutions.
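As a rough illustration of the one-time conversion, here is a minimal sketch that streams rows out of a relational table and emits N-Triples lines that a triplestore could then bulk-load. The `customer` table, its columns, and the `http://example.org/` namespace are assumptions made up for this example, and SQLite stands in for whatever RDBMS you actually use. Note that rows are processed one at a time, so the conversion itself does not need the whole dataset in RAM.

```python
import sqlite3
from typing import Iterator, List

BASE = "http://example.org/"  # hypothetical namespace, not from the question
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value: str) -> str:
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rows_to_ntriples(conn: sqlite3.Connection) -> Iterator[str]:
    """Yield N-Triples lines for each row of the (assumed) customer table."""
    cursor = conn.execute("SELECT id, name, city FROM customer ORDER BY id")
    for cust_id, name, city in cursor:  # the cursor streams row by row
        subject = f"<{BASE}customer/{cust_id}>"
        yield f"{subject} <{RDF_TYPE}> <{BASE}Customer> ."
        yield f'{subject} <{BASE}name> "{escape_literal(name)}" .'
        yield f'{subject} <{BASE}city> "{escape_literal(city)}" .'


def demo() -> List[str]:
    """Build a tiny in-memory database and convert it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                     [(1, "Alice", "Berlin"), (2, 'Bob "B"', "Paris")])
    return list(rows_to_ntriples(conn))


if __name__ == "__main__":
    for line in demo():
        print(line)
```

In a real conversion you would write the lines to a file instead of collecting them in a list, and then use your triplestore's bulk loader on that file.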

However, if you require that changes in your relational database are immediately reflected in the Semantic Web application you have in mind, you will need to look at a more dynamic mapping option, like 2 or 3.
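To make the idea of a dynamic wrapper (option 2) more concrete, here is a minimal sketch that answers triple-pattern lookups by translating each one into an SQL query at call time. This is not Sesame's SAIL API or D2RQ; the `CustomerTripleView` class, the `customer` table, and the predicate-to-column mapping are all hypothetical names invented for the example, with SQLite standing in for the real database.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple

BASE = "http://example.org/"  # hypothetical namespace
# Hypothetical mapping from predicate IRIs to columns of the customer table.
PREDICATE_TO_COLUMN = {BASE + "name": "name", BASE + "city": "city"}

Triple = Tuple[str, str, str]


class CustomerTripleView:
    """Answers (subject, predicate, object) patterns straight from SQL."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def match(self, subject: Optional[str], predicate: str,
              obj: Optional[str]) -> Iterator[Triple]:
        """Translate one triple pattern (None = wildcard) into an SQL query."""
        # The column name comes from a fixed whitelist, so interpolating it
        # into the SQL string is safe; values are passed as parameters.
        column = PREDICATE_TO_COLUMN[predicate]
        sql = f"SELECT id, {column} FROM customer"
        conditions: List[str] = []
        params: List[object] = []
        if subject is not None:
            conditions.append("id = ?")
            params.append(int(subject.rsplit("/", 1)[1]))
        if obj is not None:
            conditions.append(f"{column} = ?")
            params.append(obj)
        if conditions:
            sql += " WHERE " + " AND ".join(conditions)
        sql += " ORDER BY id"
        for cust_id, value in self.conn.execute(sql, params):
            yield (f"{BASE}customer/{cust_id}", predicate, value)


def make_demo_view() -> CustomerTripleView:
    """Create a small in-memory database wrapped by the view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                     [(1, "Alice", "Berlin"), (2, "Bob", "Paris")])
    return CustomerTripleView(conn)


if __name__ == "__main__":
    view = make_demo_view()
    print(list(view.match(None, BASE + "city", "Berlin")))
    # A change in the relational data is visible on the next lookup.
    view.conn.execute("UPDATE customer SET city = 'Berlin' WHERE id = 2")
    print(list(view.match(None, BASE + "city", "Berlin")))
```

Because nothing is copied out of the database, an update to a row shows up in the next lookup without any re-conversion step.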

...tying in with point 2 in Jeen's answer...

I'm pretty sure that you're not going to be using Pellet over billions of assertions.

Just a pointer that you may want to look at OWL 2 QL: an OWL 2 profile intended to work well with a relational database:

It is designed so that data (assertions) that is stored in a standard relational database system can be queried through an ontology via a simple rewriting mechanism, i.e., by rewriting the query into an SQL query that is then answered by the RDBMS system, without any changes to the data.

It's designed to be implemented within a wrapper that sits over your database. How scalable or efficient it would be is an open question:

The OWL 2 QL profile is designed so that sound and complete query answering is in LOGSPACE (more precisely, in AC0) with respect to the size of the data (assertions)...

In particular, it might create SQL queries that are very expensive to run over your data.
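The rewriting idea can be sketched in a very reduced form. The example below handles only subclass axioms, which is a small fragment of what OWL 2 QL allows, and it is not how any particular OWL 2 QL system is implemented. The class names, tables, and the class-to-SQL mapping are hypothetical. A query for all instances of a class is rewritten into a `UNION` over the SQL queries for that class and every subclass, and the data itself is never modified.

```python
import sqlite3
from typing import List

# Hypothetical ontology: each key is a subclass of its value.
SUBCLASS_OF = {"PremiumCustomer": "Customer", "TrialCustomer": "Customer"}

# Hypothetical mapping from ontology classes to SQL queries returning ids.
CLASS_TO_SQL = {
    "Customer": "SELECT id FROM customer",
    "PremiumCustomer": "SELECT customer_id AS id FROM premium_contract",
    "TrialCustomer": "SELECT customer_id AS id FROM trial_signup",
}


def subclasses_of(cls: str) -> List[str]:
    """Return cls plus all direct and indirect subclasses, sorted by name."""
    found = {cls}
    changed = True
    while changed:  # iterate to a fixpoint over the subclass axioms
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in found and sub not in found:
                found.add(sub)
                changed = True
    return sorted(found)


def rewrite_instance_query(cls: str) -> str:
    """Rewrite 'all instances of cls' into a single SQL query."""
    parts = [CLASS_TO_SQL[c] for c in subclasses_of(cls) if c in CLASS_TO_SQL]
    return " UNION ".join(parts) + " ORDER BY 1"


def demo_connection() -> sqlite3.Connection:
    """Build a small in-memory database matching the mapping above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE premium_contract (customer_id INTEGER)")
    conn.execute("CREATE TABLE trial_signup (customer_id INTEGER)")
    conn.executemany("INSERT INTO customer VALUES (?)", [(1,), (2,)])
    conn.execute("INSERT INTO premium_contract VALUES (3)")
    conn.execute("INSERT INTO trial_signup VALUES (4)")
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    sql = rewrite_instance_query("Customer")
    print(sql)
    print([row[0] for row in conn.execute(sql)])
```

Even in this toy version the rewritten query grows with the number of subclasses, which hints at why the generated SQL can become expensive on a large schema.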

One place to look for hints is a previous question, "Common Semantic Web misconceptions you've encountered?". One of these misconceptions is that OWL and the Semantic Web are synonymous; they are not. In particular, for operations on large-scale data, I'd suggest looking to the SPARQL standards instead of OWL. All triplestore back-ends that are worthwhile will be able to handle SPARQL queries. There are also RDF front-ends to relational data, D2R being the most prominent, that will apply SPARQL queries to relational back-ends without having to translate the relational data into RDF.

SPARQL 1.0 is a bit limited, but the SPARQL 1.1 standard, which handles things like updates (which can't be done in OWL), is a significant improvement.
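For a feel of what this looks like in practice, here is a small sketch of a SPARQL query and a SPARQL 1.1 update, wrapped in HTTP requests in the style of the SPARQL 1.1 Protocol (a `query` parameter on GET, an `update` form parameter on POST). The endpoint URL and the `ex:` vocabulary are assumptions made up for the example, and the requests are only built, not sent. Whether a given endpoint accepts updates depends on the store.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

# Read-only query; the ex: vocabulary is assumed for illustration.
SELECT_QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?customer ?name WHERE {
  ?customer a ex:Customer ;
            ex:city "Berlin" ;
            ex:name ?name .
}
LIMIT 100
"""

# SPARQL 1.1 Update request adding one triple.
UPDATE = """
PREFIX ex: <http://example.org/>
INSERT DATA { <http://example.org/customer/42> ex:city "Paris" . }
"""


def build_query_request(endpoint: str, query: str) -> Request:
    """Build a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def build_update_request(endpoint: str, update: str) -> Request:
    """Build a POST request carrying the update as a URL-encoded form field."""
    body = urlencode({"update": update}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    query_request = build_query_request(ENDPOINT, SELECT_QUERY)
    update_request = build_update_request(ENDPOINT, UPDATE)
    print(query_request.get_method(), query_request.full_url[:60])
    print(update_request.get_method(), update_request.data[:40])
```

To actually run either request you would pass it to `urllib.request.urlopen` against a real endpoint; many stores expose queries and updates on separate URLs.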

Could you please specify which kind of data store you have? Is it a relational database, such as MySQL, Oracle or SQL Server?

Assuming your source data resides in an RDBMS, you can leverage the RDF Views functionality of Virtuoso. This feature enables the following:

  1. Transient RDF-based Linked Data Views over ODBC- or JDBC-accessible Data Sources via their Data Source Names (DSNs)

  2. Materialized RDF-based Linked Data Views over ODBC- or JDBC-accessible Data Sources.

With regard to item #2, you end up with a solution that is still change-sensitive while also working with Virtuoso's built-in backward-chaining reasoner and its faceted navigation engine. Both of these features scale linearly, so you will have consistent performance as the datasets in the Virtuoso store grow.

Here are some links that show you how to build RDF Views atop ODBC-accessible Data Sources using a number of leading RDBMS engines as back-ends:

  1. RDF Views Page

  2. Sample Page showing Northwind RDBMS Schema based Materialized Views via a Faceted Browser Interface (click on property or property value links for full effect)

  3. Sample Descriptor Page showing an Entity derived from a Row in the Northwind RDBMS schema

  4. Tutorial excerpt showing how to use backward-chained reasoning (conditionally) via Virtuoso.