What are the most important inference capabilities for a triple store?

I'm interested in what the community feels are the most important things, for example:

  1. SubClasses - Ability to find all instances of a given type and sub types.
  2. SubProperties - Ability to use a super property (that includes sub properties) in queries.
  3. SymmetricProperties - Ability to infer assertions based on assertions with a symmetric property.
  4. etc...

Just curious where the focus is. What are the most important features and why? What is not so important and why?

Thanks

I'd say it very much depends on what you want to do with your triple store and your data. As far as I know, most triple stores focus on inferences that can be produced by rule-based reasoning, and to this end they implement the OWL 2 RL/RDF rules or a portion of them. These rules include the RDFS rules, so such stores support features like rdfs:domain, rdfs:range, rdfs:subClassOf and rdfs:subPropertyOf. I'd say these features are quite well supported and used by the community. owl:sameAs is often considered a very important feature of OWL, to the point that some people would like to see it as part of RDF itself.
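To make the rule-based approach concrete, here is a minimal sketch of forward chaining over a few RDFS rules. It is not how any particular triple store implements it: triples are plain Python tuples, the `rdfs_closure` function and the `ex:` data are made up for illustration, and only a handful of rules are covered.

```python
# Triples are plain (subject, predicate, object) tuples of strings.
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Apply a handful of RDFS rules until no new triple appears."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # rdfs9: an instance of a subclass is an instance of the superclass
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # rdfs7: a sub-property assertion implies the super-property one
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
                # rdfs2: the subject of a property is typed by its domain
                if p2 == DOMAIN and p == s2:
                    new.add((s, TYPE, o2))
                # rdfs3: the object of a property is typed by its range
                if p2 == RANGE and p == s2:
                    new.add((o, TYPE, o2))
                # rdfs11 / rdfs5: subClassOf and subPropertyOf are transitive
                if p == p2 and p in (SUBCLASS, SUBPROP) and o == s2:
                    new.add((s, p, o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical data, for illustration only.
data = {
    ("ex:knowsWell", SUBPROP, "ex:knows"),
    ("ex:knows", DOMAIN, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:alice", "ex:knowsWell", "ex:bob"),
}
closure = rdfs_closure(data)
```

Here the closure contains `ex:alice ex:knows ex:bob`, and from that, via the domain and subclass rules, `ex:alice rdf:type ex:Agent`. The nested loop is quadratic per pass, which is fine for a sketch but not something a real store would do.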

Now, these are popular, but that does not mean they are "important" per se, and there are applications that would not find them so interesting. I'm thinking of applications where the data is updated a lot (especially with deletions), so that reasoning should not materialise inferences that come from knowledge that may be deleted afterwards. In this case, reasoning at query time is more important, and this can be done by query rewriting mechanisms, which do not suit rule-based engines well. For this, the OWL 2 QL fragment of OWL is a good candidate to look at.
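As a rough illustration of query-time reasoning, the sketch below rewrites a lookup on one property into a lookup on that property plus all of its sub-properties, without ever materialising anything. It covers only the sub-property case, not the full OWL 2 QL rewriting; `sub_properties`, `ask`, the schema pairs and the data are all assumptions made for the example.

```python
def sub_properties(prop, schema):
    """Return prop plus every property declared (transitively) beneath it."""
    found, stack = {prop}, [prop]
    while stack:
        current = stack.pop()
        for child, parent in schema:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def ask(subject, prop, data, schema):
    """Answer (subject, prop, ?o) by rewriting prop into its sub-properties."""
    wanted = sub_properties(prop, schema)
    return {o for s, p, o in data if s == subject and p in wanted}


# Hypothetical schema: (sub-property, super-property) pairs.
schema = {("foaf:homepage", "foaf:page"), ("ex:blogHome", "foaf:homepage")}
data = {
    ("ex:something", "foaf:homepage", "http://example.org/home"),
    ("ex:something", "ex:blogHome", "http://example.org/blog"),
    ("ex:something", "foaf:name", "Something"),
}
pages = ask("ex:something", "foaf:page", data, schema)

# Deleting a triple needs no clean-up of previously inferred triples.
data.discard(("ex:something", "ex:blogHome", "http://example.org/blog"))
pages_after_delete = ask("ex:something", "foaf:page", data, schema)
```

Because nothing is stored beyond the asserted triples, a deletion is reflected in the very next query, which is the property that matters for update-heavy applications.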

One can also imagine the case of an application where only classification of instances is important, so that an OWL 2 EL reasoner would do the job efficiently.

Anyway, I can also answer in a more general way to the last question:

What is not so important and why?

There are some inferences that are clearly of weak interest, such as deducing that everything is a resource (?u a rdfs:Resource for all ?u), that everything is equal to itself (?u owl:sameAs ?u), that all classes are equivalent to themselves, that all properties are equivalent to themselves or that every URI in the predicate position is an rdf:Property. I'd also say that owl:qualifiedCardinality is not very important in general because it is rarely used and it's difficult to get anything useful from it in most cases. There are probably other things that are not so important but I'll let other people complete the list.

Chapter 7 of Allemang and Hendler's Semantic Web for the Working Ontologist describes RDFS+ that was based on their interpretation of trends in the industry. The subset they describe could be a useful starting point. All of RDFS is included, along with the following from OWL:

  • owl:inverseOf
  • owl:SymmetricProperty
  • owl:TransitiveProperty
  • owl:equivalentClass
  • owl:sameAs
  • owl:equivalentProperty
  • owl:FunctionalProperty
  • owl:InverseFunctionalProperty
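For the property-oriented items in that list, a small sketch of what the rules amount to might look like the following. It handles only owl:inverseOf, owl:SymmetricProperty and owl:TransitiveProperty, treats triples as plain tuples, and uses a made-up `property_closure` function and made-up `ex:` data.

```python
TYPE = "rdf:type"


def property_closure(triples):
    """Apply inverseOf, SymmetricProperty and TransitiveProperty rules to a fixpoint."""
    inferred = set(triples)
    while True:
        inverse = {(s, o) for s, p, o in inferred if p == "owl:inverseOf"}
        symmetric = {s for s, p, o in inferred
                     if p == TYPE and o == "owl:SymmetricProperty"}
        transitive = {s for s, p, o in inferred
                      if p == TYPE and o == "owl:TransitiveProperty"}
        new = set()
        for s, p, o in inferred:
            if p in symmetric:
                new.add((o, p, s))
            for a, b in inverse:
                # owl:inverseOf works in both directions
                if p == a:
                    new.add((o, b, s))
                if p == b:
                    new.add((o, a, s))
            if p in transitive:
                for s2, p2, o2 in inferred:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical data, for illustration only.
data = {
    ("ex:hasPart", "owl:inverseOf", "ex:partOf"),
    ("ex:partOf", TYPE, "owl:TransitiveProperty"),
    ("ex:marriedTo", TYPE, "owl:SymmetricProperty"),
    ("ex:wheel", "ex:partOf", "ex:car"),
    ("ex:bolt", "ex:partOf", "ex:wheel"),
    ("ex:alice", "ex:marriedTo", "ex:bob"),
}
closure = property_closure(data)
```

With this data the closure adds, among others, `ex:bolt ex:partOf ex:car`, `ex:car ex:hasPart ex:bolt` and `ex:bob ex:marriedTo ex:alice`.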

I would want to make an argument for owl:disjointWith for consistency checking.
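To show what such a consistency check could look like with OWL's owl:disjointWith property, here is a minimal sketch. It only looks at asserted types (a fuller check would first compute the subclass closure), and the `disjointness_violations` function and the `ex:` data are invented for the example.

```python
TYPE = "rdf:type"
DISJOINT = "owl:disjointWith"


def disjointness_violations(triples):
    """Return (instance, classA, classB) for instances typed with two disjoint classes."""
    types = {}
    for s, p, o in triples:
        if p == TYPE:
            types.setdefault(s, set()).add(o)
    violations = set()
    for a, p, b in triples:
        if p != DISJOINT:
            continue
        for instance, classes in types.items():
            if a in classes and b in classes:
                violations.add((instance, a, b))
    return violations


# Hypothetical data, for illustration only.
data = {
    ("ex:Person", DISJOINT, "ex:Document"),
    ("ex:alice", TYPE, "ex:Person"),
    ("ex:thing", TYPE, "ex:Person"),
    ("ex:thing", TYPE, "ex:Document"),
}
violations = disjointness_violations(data)
```

Only `ex:thing` is reported, because it is the only instance typed with both classes.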

There might be cases in which the triple store does not provide efficient inferencing capabilities, or the data changes so frequently that inferencing becomes prohibitively slow. A good alternative in those cases is often to put some of the "inferencing" into the query, e.g.

SELECT ?page 
WHERE {
    ?property rdfs:subPropertyOf* foaf:page .
    ex:something ?property ?page .
}

in SPARQL 1.1: the * operator does the transitive inferencing that is so commonly needed.

I'd say:

  1. Subclasses
  2. Subproperties
  3. Inverse properties
  4. rdfs:domain/rdfs:range
  5. owl:sameAs
  6. owl:InverseFunctionalProperty/owl:FunctionalProperty

... #1 and #2 because people often subclass (and "subproperty", but not sure that works as a verb?) other popular vocabs when designing theirs. #3 because some vocab authors insist on defining two properties representing the same relationship, which I consider an anti-pattern. (Pick one direction and stick with it!) #4 because so many vocabularies define rdfs:domain and rdfs:range for each of their properties, so you ought to get a lot of bang for your buck. #5 because of useful sources like sameas.org. #6 because FOAF uses them for properties like foaf:homepage, foaf:mbox, etc, and they're just so useful for figuring out when two URIs identify the same resource.
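For #6, the sketch below shows roughly how an inverse-functional property can be used to work out that two URIs identify the same resource. It is only an illustration under simple assumptions: triples are tuples, the caller supplies the set of inverse-functional properties, and the `same_as_from_ifp` function and the `ex:`/`other:` URIs are made up.

```python
from collections import defaultdict


def same_as_from_ifp(triples, inverse_functional):
    """Infer owl:sameAs pairs from shared values of inverse-functional properties."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional:
            subjects_by_value[(p, o)].add(s)
    inferred = set()
    for subjects in subjects_by_value.values():
        for a in subjects:
            for b in subjects:
                if a != b:
                    inferred.add((a, "owl:sameAs", b))
    return inferred


# Hypothetical data: two URIs that share a mailbox, one that does not.
data = {
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("other:a123", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:bob", "foaf:mbox", "mailto:bob@example.org"),
}
links = same_as_from_ifp(data, {"foaf:mbox", "foaf:homepage"})
```

The result links `ex:alice` and `other:a123` in both directions and leaves `ex:bob` alone.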

Like Antoine says, it depends heavily on the data the triple store will be indexing. The more heterogeneous the data, the better motivation there will be for reasoning. Thus, one of the best use-cases out there for reasoning at the moment is (ironically?) Linked Data.

Case in point: I want to ask for the pages about a resource ex:something using SPARQL over a heterogeneous corpus of Linked Data, and I want fairly complete answers:

SELECT ?page
WHERE {
 { ex:something foaf:page ?page . }
 UNION { ?page foaf:topic ex:something . }
 UNION { ?page foaf:primaryTopic ex:something . }
 UNION { ex:something foaf:isPrimaryTopicOf ?page . }
 UNION { ex:something foaf:weblog ?page . }
 UNION { ex:something foaf:homepage ?page . }
 UNION { ex:something foaf:tipjar ?page . }
 UNION { ex:something foaf:openid ?page . }
 UNION { ex:something po:microsite ?page . }
 UNION { ex:something mo:biography ?page . }
 UNION { ex:something mo:paid_download ?page . }
 UNION { ex:something mo:onlinecommunity ?page . }
 UNION { ex:something mo:discography ?page . }
 UNION { ex:something mo:myspace ?page . }
 UNION { ex:something mo:discogs ?page . }
 UNION { ex:something mo:fanpage ?page . }
 UNION { ex:something mo:review ?page . }
 UNION { ex:something mo:amazon_asin ?page . }
 UNION { ex:something mo:wikipedia ?page . }
 UNION { ex:something mo:musicmoz ?page . }
 UNION { ex:something mo:olga ?page . }
 UNION { ex:something mo:mailorder ?page . }
 UNION { ex:something mo:imdb ?page . }
 UNION { ex:something mo:download ?page . }
 UNION { ex:something mo:free_download ?page . }
 UNION { ex:something mo:musicbrainz ?page . }
 UNION { ex:something mo:preview_download ?page . }
 UNION { ex:something mo:event_homepage ?page . }
 UNION { ex:something mo:homepage ?page . }
 UNION { ex:something mo:freedownload ?page . }
 UNION { ex:something mo:licence ?page . }
 UNION { ex:something mo:paiddownload ?page . }
 UNION { ex:something doap:homepage ?page . }
 UNION { ex:something rail:departures ?page . }
 UNION { ex:something rail:arrivals ?page . }
 UNION { ex:something xfn:mePage ?page . }
 UNION { ex:something plink:profile ?page . }
 UNION { ex:something plink:content ?page . }
 UNION { ex:something plink:rss ?page . }
 UNION { ex:something plink:foaf ?page . }
 UNION { ex:something plink:atom ?page . }
 UNION { ex:something plink:addFriend ?page . }
 ...
 UNION { ex:somethingAlias foaf:page ?page . }
 UNION { ?page foaf:primaryTopic ex:somethingAlias . }
 ...
}

Making out that list just ruined my day. If I also wanted to get rdfs:labels, that would ruin my weekend. If I wanted to do this for all known aliases of ex:something, that would ruin my "youth".

With reasoning enabled, you'd get the same answers with the non-day/weekend/youth-ruining query:

SELECT ?page
WHERE {
 ex:something foaf:page ?page .
}

That's pretty much why you might want to do reasoning over a triple store.

Anyways, the best way of finding out what inferencing is useful would be to take real query-logs and figure out how many additional answers could be found with rule X enabled. Unfortunately, not many people are using SPARQL endpoints and seemingly no-one has query-logs to share.

So instead, I'll just add my own (subjective) prioritised list for Linked Data to the pile:

  • owl:sameAs
  • owl:InverseFunctionalProperty
  • owl:FunctionalProperty
  • owl:inverseOf
  • rdfs:subPropertyOf
  • owl:equivalentProperty
  • rdfs:subClassOf
  • owl:equivalentClass
  • owl:SymmetricProperty
  • owl:TransitiveProperty
  • rdfs:domain
  • rdfs:range

The above considers (i) perceived usefulness for query-answering, (ii) current prevalence of use.

I've put rdfs:domain and rdfs:range so low because I tend only to notice them when they cause unnecessary grief (esp. in FOAF). This is a good example where, IMO, prevalence != usefulness.

A high-level summary of priority would be:

  • instance-equality-centric reasoning
  • property-centric reasoning
  • class-centric reasoning

There are a lot of instances out there that need to be aligned. After that, property "hierarchies" are typically much more detailed (and more interesting/nuanced) than class "hierarchies" in heterogeneously instantiated vocabularies (if you disagree, try reformulating the above foaf:page example using classes from Linked Data vocabs). I think this trend will continue.
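One way to picture instance-equality-centric reasoning is as canonicalisation: pick one representative per owl:sameAs equivalence class and rewrite the data to use it. The sketch below does this with a small union-find; it is an assumption about how one might go about it, not a description of any particular store, and the `canonicalise` function and the URIs are invented.

```python
def canonicalise(triples):
    """Rewrite every term to one representative of its owl:sameAs class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            a, b = sorted((find(s), find(o)))
            parent[b] = a  # keep the lexicographically smallest root

    return {
        (find(s), p, find(o))
        for s, p, o in triples
        if p != "owl:sameAs"
    }


# Hypothetical data: three aliases of the same thing.
data = {
    ("ex:dublin", "owl:sameAs", "other:Dublin"),
    ("other:Dublin", "owl:sameAs", "third:Dublin"),
    ("third:Dublin", "foaf:page", "http://example.org/dublin"),
    ("other:Dublin", "rdfs:label", "Dublin"),
}
merged = canonicalise(data)
```

After merging, both the page and the label hang off `ex:dublin`, so a query against any one alias (once mapped to its representative) sees everything said about all of them.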