What is skolemization?

What is skolemization? And, what is an example of skolemizing something?

...a not so interesting intro to blank-nodes and Skolemisation

Skolemisation (in the simplest case) refers to replacing existential variables with unique constants. With regards RDF, this particularly applies to blank-nodes.

Under the RDF semantics, blank-nodes are existential variables... there exists something, but that thing isn't specifically named. Thus, given:

_:signified rdf:type sioc:UserAccount .    (1)

We know that there exists something which is a member of sioc:UserAccount. In fact, according to simple entailment, the above triple also entails the following:

_:signified rdf:type _:someclass .         (2)
_:blah rdf:type _:blah2 .                  (3)

...the first triple reads as "there exists something which is a member of some class" and the second triple reads precisely the same—under RDF semantics, the above two triples represent the exact same information (RDF graphs which don't contain such "redundancy" are called lean graphs).

Skolemisation can be used to state that the existential variable _:signified actually denotes something in particular (i.e., that it's a constant). Essentially, you're replacing the blank-nodes with a unique constant not appearing elsewhere. Although this Skolemisation falls outside of the standard RDF semantics, many implementations interpret blank-nodes as "Skolem constants".. Sometimes, people wrap angle-brackets around the blank-nodes <_:signified> to indicate that they're Skolems, or replace them with some IRI like <genid:signified>... others just leave the blank-nodes in their natural syntax and interpret them directly as constants.

Under such interpretations of blank-nodes, triples (2,3) would not be entailed from (1) since _:someclass may refer to something concrete other than sioc:UserAccount since _:someclass is now a (Skolem) constant. Much like a URI, it now refers to something particular.

...a not so brief history of blank-nodes and Skolemisation

...here taking a little poetic licence.

In the beginning, there was RDF, and it was decided that URIs should name things. However, if you (the RDF publisher) were feeling lazy (or if you just needed a little syntactic sugar for lists, etc.), you could omit the URI in the RDF/XML markup and a blank-node would be born...

...and so it soon became time to bestow upon RDF(S) a semantics... a meaning. And so, it was asked, what do these blank-nodes mean? The brightest logicians of the day conferred and decided that blank-nodes should be interpreted as existentially-quantified variables... these nodes exist, but they don't have names. You can look, but you can't touch.

....this also captured the notion of "locality" for blank-nodes such that they could be scoped to the graphs containing them: when merging two RDF graphs, you could keep the blank node labels (labels ...not identifiers or names) unique such that blank-nodes with the same labels in different graphs were different blank-nodes (called an "RDF merge"). This meant that I could use blank-nodes to locally encode, say, a "well-formed" RDF list—with a single value for rdf:first, and a single sub-list value for rdf:rest, and the list being terminated with rdf:nil—and no matter what external graphs would do in the Open World, that local RDF list would always be well-formed and "closed" (assuming the data graphs were ultimately RDF merged together).

...along with this existential/local interpretation of blank-nodes came simple entailment (as above).

...and on top of that was built RDF and RDFS entailment.

...then along came OWL, which was on a slightly different (but closely related) tangent, and folks needed a way of writing OWL down as RDF. In fact, the locality of blank-nodes made them useful for encoding RDF lists (used for OWL union/intersection/enumeration classes), and other n-ary OWL constructs in RDF (for restriction classes like the owl:someValuesFrom construct below)... if the RDF representation was "well-formed" in the local document, it would be "well-formed" in the Open World... for example, an external graph couldn't just stick a new member into your local union-class if you used a blank node.

...and again, the brightest logicians of the day decided that OWL (RDF-based semantics) would also interpret blank-nodes as existential variables, giving rise to the notion of "anonymous individuals", an obvious example of which would be Beatrice's mother. Let's say that we know (1) Beatrice is human (:Beatrice rdf:type :Human), and that (2) all humans have mothers, who are themselves human (:Human owl:equivalentClass [ owl:someValuesFrom :Human , owl:onProperty :mother ]). Given this information, an OWL reasoner can figure out Beatrice has a human mother; even though it can't name that mother, it can still represent the entailment (if needs be) using an existential blank-node: :Beatrice :mother _:anon . _:anon rdf:type :Human ..

...in fact, you could model something like SPARQL ASK queries (in the days before SPARQL) using blank-nodes as variables. Just ask the (simple|RDF|RDFS|OWL RDF-Based) reasoner if the triple with the blank-node is entailed—you could ask "is there a human with a human mother" by asking a reasoner if the RDF claim _:exvar1 rdf:type :Human ; :mother _:exvar2 . _:exvar2 rdf:type :Human . was true.

...but there was some practical problems associated with making reasoners this smart, and allowing them to ponder upon existential matters. In practice, simple entailment became not-so-simple-NP-complete entailment... figuring out if one bunch of data with a lot of blank nodes was entailed by another bunch of data with a lot of blank nodes could get tricky. To get an idea of the problem, imagine a directed graph where each node is unlabelled (a blank-node), and these nodes are connected by directed links... now imagine an arbitrary second graph of the same type. Now, try figuring out if you can "overlay" the second graph somewhere into the first graph, preserving the link structure of both graphs. Roughly speaking, you're looking at the graph homomorphism problem (a.k.a. ~simple entailment). If you can overlay the second graph on top of the first graph (you don't need to cover the first graph), then you have entailment... the second graph is true if the first graph is true: you can "map" the blank-nodes in the second graph such that it becomes a sub-graph of the first graph. [It should be noted that there are fairly efficient algorithms for handling most cases of this problem, but still, it sucks to implement in the general case.]

...furthermore, you may already have spotted that Beatrice's mother is human, and so she must have a mother who is human, who herself must have... ad infinitum. People started dumbing down their reasoners, curtailing or forgetting about existential variables... the logician's hack of choice to overcome this problem was Skolemisation: for example, one could replace existential variables with unique constants and only do reasoning over named individuals (now including the Skolemised blank-nodes). More and more reasoners began to forget about Beatrice's mother and her extended, existential, human maternal lineage (esp. those reasoners that wanted to write entailments down) ...Michael's answer covers this in more depth and with more accuracy.

...all the while, publishers on the Web—mostly FOAF hackers in those days—were struggling with assigning URIs to things that weren't web-pages or email addresses. That's another sub-story (minting etiquette/centralised naming/dereferenceability/http-range-14 [hash vs. slash]), but the upshot was avoidance of URIs and a proliferation of blank-nodes on the Web, mostly representing people and users of blogging platforms. It's safe to say that these publishers couldn't give an existential crap about existentially quantified variables... they were just lazy/confused about minting URIs, and in RDF/XML, you can just omit the rdf:about/rdf:ID attribute and let someone else worry about the blank-node problem.

...and worry other people were forced to. If I parsed an RDF/XML FOAF profile with one parser, and you parsed the same document with another parser, we would need to do a simple-entailment check just to figure out if we were on the same page or not. Also, given the nature of RDF/XML where blank nodes are often assigned a label based on their ordering in a document, if one were unfortunate enough to have to try and monitor the changes of a particular RDF/XML document over time, sticking an extra blank-node into the start of the document would make it look like the whole document changed.

...then Linked Data entered the picture, and URIs became Cool, and you had a choice of 303s or hash URIs, and people started using some URIs some of the time. And blank-nodes were largely villified, with some minor exceptions: they weren't dereferenceable, and they certainly weren't cool. But still, blank-nodes persisted on the Web, be they due to "laziness", "legitimate purposes", "syntactic sugar", or a relic of older data... blank nodes were not going away.

...and then SPARQL came along, and it too sort-of-echoed the existential-variable-blank-node-business, where blank-nodes in queries were interpreted as "non-distinguished variables" (~existential variables). Things were also complicated by the scoping of blank-nodes in named-graphs, queries and results. Now, if I asked a query like "give me all friends of Beatrice", and the result set included a blank node _:bnode14, I couldn't find out more information about this particular friend since _:bnode14 is not a constant, but a variable, and since the _:bnode14 in the query is scoped differently from the one in the named graph representing Beatrice's FOAF file, and since—in any case—the SPARQL engine is allowed to make up any blank-node label as it sees fit.

...and so SPARQL implementors quickly created their own engine-specific solutions to create "hooks" for querying blank nodes, all revolving around Skolemising blank-nodes. Engines began rewriting blank nodes to <genid:bnode14> or <_:bnode14> or just interpreting _:bnode14 as a constant. This allowed for querying blank-nodes in a manner analogous for URIs. Different vendor-specific means of uniquely Skolemising and scoping blank-nodes (in the data, results and queries) began to emerge. Data exchange/federated-querying/etc. involving blank-nodes became a quagmire.

...a not so pretty prognosis for blank-nodes and Skolemisation

Blank-nodes have (arguably) long been an open wound festering in the Semantic Web standards—how severe the issue is is a matter of opinion.

Blank nodes are poorly understood, poorly used, difficult to handle—they have added a lot of complexity to the nuts-and-bolts of the standards built upon them, and the implementations consuming them. What started as a simple "shortcut" for publishers has gotten a little out of hand. Blank-nodes have their uses, but people are beginning to question whether they are worth the effort and what alternatives might exist. People are discussing amputating blank nodes from RDF, but this is not so straightforward.

First off, the standard interpretation of blank-nodes as existential variables is deeply ingrained in the standards: this interpretation was originally used as the only meaningful way the logicians of the day saw to formalise the locality of blank nodes (and to allow for denoting the existence of something you couldn't quite name). Blank nodes were then adopted/supported/mistreated by later standards (sometimes with significant cost) to maintain some form of backwards-compatibility or to meet user needs. Revising these standards now would come at a great cost (esp. wrt. implementations).

With respect to publishing, blank nodes are also used as convenient syntactic shortcuts in Turtle for containers, and for representing OWL in RDF. Blank-nodes are also useful for representing transient information as RDF, or for describing various resources which should not be externally (de)referenceable. Also, the original use-case for blank-nodes still exists: blank-nodes allow publishers to postpone/forego minting URIs for their data.

Given the complexity of the issue, proposed next-steps for dealing with blank-nodes are widely divergent:

  • leave blank-nodes as they are and continue to interpret them as existential variables;
  • leave blank-nodes as they are, but interpret them directly as Skolem constants;
  • leave blank-nodes as they are, but propose Skolemisation schemes for converting blank-nodes to (globally unique) URIs;
  • deprecate blank-nodes and propose Skolemisation schemes for converting legacy blank-nodes to (globally unique) URIs;
  • remove blank-nodes and get publishers to directly use URIs.

...there are various strong opinions out there, but no clear-cut answers.

In addition to Signified's long answer on what Skolem's are and how they entered the RDF world, here is a likewise long answer on the technical properties of Skolems. I will say under which conditions using Skolems is harmless, when it is mostly harmless, and which usage pitfalls exist. I will also talk about the more general case of Skolem functions in contrast to Skolem constants.

Equisatisfiability, or: When Skolems are the Right Choice

Dealing with existential variables in reasoners can become a challenge, but when the task is to check satisfiability (semantic consistency) of a logical formula (e.g. of an OWL ontology), then Skolemization is the right way to proceed, as Skolemization retains "equisatisfiability". This means that if the original formula is satisfiable (unsatisfiable), then the Skolemized formula is also satisfiable (unsatisfiable). That's weaker than logical equivalence and, in fact, a Skolemized formula is generally not equivalent to the original formula, but equisatisfiability is fully sufficient for satisfiability checking, and Skolemization significantly eases this task, as there are no existential variables anymore afterwards.

So far for satisfiability checking, but what about entailment checking? Well, provided a sufficiently expressive logic language, one can translate every entailment checking task into a corresponding satisfiability checking task, where the entailment holds if and only if the translated formula is un-satisfiable. So, in these cases of expressive languages, Skolemization can be used for entailment checking as well, although only indirectly after successful transformation of the original entailment.

How does the transformation look like? If you start from an entailment query "A |= B", where "A" and "B" are formulas of the language being considered, then the translated formula is "A and not B" ("A&~B"): A implies B if and only if "A&~B" is wrong under all possible interpretations, hence unsatisfiable (or semantically inconsistent). And if there are existential variables in the target(!) formula "A&~B", then it is safe to replace them by Skolems, since this will not change it's state of (un)satisfiability.

Under which conditions is such a translation possible? As one can see from the translation above, one needs conjunction ("&") and negation ("~"). Negation is the relevant language feature here, as it has to apply to all possible formulas of the language - or at least to all possible formulas that one is interested in to see in the conclusion part of an entailment query. For standard first-order logic, this is clearly the case. For some description logics it also is. But unfortunately not for RDF, as there is no negation of graphs and triples. (Well, it could become possible with a proper embedding of RDF into a more expressive language, but I'll better don't touch this topic here. :-))

Skolemization Lemma, or: the Mostly Harmless Use of Skolemization

So, we now know that heavy-weight logic-based reasoning can use Skolemization to properly and efficiently deal with existential variables. But the translation into a satisfiability checking task that is possible for these languages does not work for RDF entailment. Can Skolemization still be used without tears? Yes, pretty well, when Skolemization is restricted to the premise of an entailment check. In this case, while some semantic change is happening, this change can well be characterized formally and can be considered "mostly harmless". This is the content of the "Skolemization lemma".

This lemma may look a bit complicated to non-logicians but what it essentially says is that you do not learn much news from an RDF graph after being Skolemized. Of course, you get all the statements of the graph including Skolems as entailments as well. Also, Skolems, being constants, are formally more specific and therefore more "expressive" than existential variables. But as the Skolems have to be "fresh constants", i.e. they are neither used in the graph nor in the background theory, these entailments do not provide much additional information.

Pitfall 1: Skolemization versus Culture

Still, one should be careful with the Skolemization lemma in practice, as it may produce results beyond intuition. For example, imagine the following RDF graph:

G := { _:x owl:sameAs _:y . }

Under RDF semantics (plus equality, as provided by OWL), this just says that there exist two things that are the same, which is certainly true! However, the following happens to be a valid Skolemization of G:

sk(G) := { 

Big news apparently! But there's nothing wrong here: neither TimBL's nor Danbri's FOAF URI is part of G or the RDF background theory and, hence, they are proper Skolem constants.

One has to keep in mind: Skolemization is a purely logical/technical method, which doesn't care about what people have in mind when they see certain names being used as Skolem constants. One could, of course, restrict the set of names being applicable as Skolem constants.

Pitfall 2: non-fresh constants

While the above Skolemization by TimBL's and DanBri's URIs was technically ok (although slightly questionable from a cultural point of view :-)), what one really has to take care of is to not reuse names: a Skolem constant has to be "fresh"!

The freshness is not a vague concept but, rather, it is clearly specified in terms of the vocabulary being used for creating formulas in the language, including the background theory, if such exists. In a formal context, one always has to provide this vocabulary, and then it is easy to find out what names may (technically) act as a Skolem constant: every name not occurring in this vocabulary.

In RDF, the vocabulary is the set of terms used to build the graph, the vocabulary terms of RDF(S), such as "rdf:type", and all possible literals.

But why at all this "freshness" requirement? Because the Skolem constant must be a legitimate replacement for the original existential variable. There must not be any constraints on the existential variable other than those constraints that already exist on it. If you would reuse some URI that is used in the graph to replace some existential variable, there would at least be one RDF triple using this URI in the original RDF graph, and this triple would act as an additional constraint on the Skolem constant that has not originally existed for the existential variable.

Here is an example:

G := {
    :timbl owl:differentFrom :danbri .
    _:x owl:sameAs :danbri .

sk(G) := {
:timbl owl:differentFrom :danbri .
:timbl owl:sameAs :danbri . # invalid Skolemization

Here, I have incorrectly re-used the URI ":timbl", which is already used elsewhere in the graph G, to replace the blank node "_:x". Apart from the obvious strangeness of doing so, one can even see that this leads to a technical problem: while G was satisfiable in OWL (Full), sk(G) is unsatisfiable. As we have already seen that (proper) Skolemization retains satisfiability, this example cannot represent proper Skolemization!

Pitfall 3: Skolems in the Conclusion

While the Skolemization lemma essentially gives go to applying Skolemization to the premise of an entailment query, it does not talk about the conclusion part of the entailment, and this for good reason, as Skolemization of the conclusion will in most cases destroy an entailment. If there is an existential variable in the conclusion, then Skolemization will replace it by some fresh constant that does not occur elsewhere, in particular not in the premise ontology. This Skolem constant can then be interpreted pretty arbitrarily, and this semantic freedom will generally be deadly for an entailment.

For example, the following is a simple entailment, having an existentially interpreted blank node in its conclusion:

{ :danbri foaf:knows :timbl } 
|= { _:x foaf:knows :timbl }

This is a valid (simple) entailment, since the conclusion just says that there is someone (or something) who knows :timbl, and we already know from the premise that :danbri is an existing example of someone knowing :timbl. Now, after Skolemizing the conclusion, we receive:

{ :danbri foaf:knows :timbl } 
|= { sk:42 foaf:knows :timbl }

Here, "sk:42" is the chosen fresh Skolem constant being used to replace the blank node "_:x". It is easy to see that the conclusion does not logically follow from the premise, as "sk:42" can be freely interpreted, even by someone who does not know :timbl (e.g. my mother :-)).

Skolem Functions

For completeness, I wanted to note that for languages more expressive than RDF, Skolemization often leads to Skolem functions instead of Skolem constants, where function terms of the form "f(?x,?y,?z)" are used to replace existential variables in formulas. Actually, Skolem constants can be seen as null-ary Skolem function terms.

Skolem function terms are being used when formulas can have both existentially and universally quantified variables. For example, the statement "every human has a mother" can be formulated using explicit quantifiers as follows:

    Human(?x) => hasMother(?x,?y)

Now, if we naively Skolemize the existential variable "?y" by a Skolem constant "sk:y", we receive

// invalid Skolemization via a Skolem constant
    Human(?x) => hasMother(?x, sk:y)

This essentially reads that there is something denoted by sk:y, such that for every human ?x, sk:y is his/her mother. This is very different from the original assertion: originally, there was for every person some dedicated mother. Now, there is suddenly a single mother for everyone!

The problem is that the Skolem constant is independent of the still existing (universal) variable "?x". The remedy is to formally introduce such a dependency by building a function "sk:y(?x)":

// correct Skolemization via a Skolem function
    Human(?x) => hasMother(?x, sk:y(?x))

Now, for any instantiation ?x of a Human, the "hasMother" predicate relates ?x to the mother of ?x, who is denoted by the function term "sk:y(?x)".