How do you handle non IRI-compatible URIRefs?

JeenBroekstra · April 5, 2011, 10:00pm

This question needs a bit of introduction, bear with me please.

In RDF, resources are identified using URI references. The notion of URI reference was added in anticipation of the standardization of IRIs. The SPARQL spec builds on this and in fact adopts the IRI standard as part of its spec: RDF terms in SPARQL queries are identified using IRIs.

Unfortunately, there is an incompatibility: RDF URI references may contain "<", ">", '"' (double quote), space, "{", "}", "|", "\", "^", and "`", but these are not allowed in IRIs (see SPARQL's IRI syntax). The upshot of this is that while it is perfectly legal to have an RDF triple of the following form:

<http://example.org/a b> a ex:Foo .

you cannot directly query such a resource using SPARQL, e.g.:

SELECT * WHERE { <http://example.org/a b> ?P ?Y . }

is not a syntactically valid SPARQL query.
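
To make the incompatibility concrete, here is a minimal Python sketch that checks whether a URI reference could be written between angle brackets in a SPARQL query. It assumes the character exclusions of SPARQL's IRI_REF production (the punctuation listed above plus everything from U+0000 through U+0020); the function names are hypothetical and not part of any toolkit.

```python
import re

# Characters excluded by SPARQL's IRI_REF production, as assumed here:
#   '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
_FORBIDDEN = re.compile(r'[<>"{}|^`\\\x00-\x20]')


def forbidden_chars(uri):
    """Return the sorted, distinct characters that SPARQL would reject."""
    return sorted(set(_FORBIDDEN.findall(uri)))


def is_sparql_iri_ref(uri):
    """True if the string contains none of the excluded characters."""
    return _FORBIDDEN.search(uri) is None


if __name__ == "__main__":
    for candidate in ("http://example.org/a b", "http://example.org/a%20b"):
        print(repr(candidate), is_sparql_iri_ref(candidate), forbidden_chars(candidate))
```

This only tests for the excluded characters; it does not validate the full IRI grammar.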

All of this is pretty well known of course. The reason I am introducing it is that I would like to ask some "best practice" type questions related to this issue:

1. Have you ever encountered this problem in practice, that is, have you ever had to work with a dataset that contained such non-compatible URIrefs?
2. How do you deal with this incompatibility? Do you query around it? Does your triplestore/parser toolkit of choice offer you some kind of workaround for this problem? Or do you simply convert the offending data?

I'm not so much looking for theoretical solutions, I'm more interested in what has been done out there in practice, already.

This question was inspired by a recent discussion on the Sesame mailing list, by the way, just in case you thought it looked familiar :)

Nathan · April 5, 2011, 10:00pm

a note: U+0020 SPACE is not a valid IRI char, it must be percent encoded.

for reference:

               ----------------------------------------
              |  U+0009 \t
              |  U+000A \n
              |  U+000B \v
% encoded --> |  U+000C \f
              |  U+000D \r
              |  U+0020 SPACE
              |  U+0085 NEL (NEXT LINE)
               ----------------------------------------
              |  U+00A0 NBSP (NO-BREAK SPACE)
              |  U+1680 OGHAM SPACE MARK
              |  U+180E MONGOLIAN VOWEL SEPARATOR
              |  U+2000 EN QUAD
              |  U+2001 EM QUAD
              |  U+2002 EN SPACE
   allowed -->|  U+2003 EM SPACE
              |  U+2004 THREE-PER-EM SPACE
              |  U+2005 FOUR-PER-EM SPACE
              |  U+2006 SIX-PER-EM SPACE
              |  U+2007 FIGURE SPACE
              |  U+2008 PUNCTUATION SPACE
              |  U+2009 THIN SPACE
              |  U+200A HAIR SPACE
              |  U+2028 LINE SEPARATOR
              |  U+2029 PARAGRAPH SEPARATOR
              |  U+202F NARROW NO-BREAK SPACE
              |  U+205F MEDIUM MATHEMATICAL SPACE
              |  U+3000 IDEOGRAPHIC SPACE
               ----------------------------------------
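
As a rough illustration of the table, the following Python sketch percent-encodes the characters in the "% encoded" group and leaves the "allowed" ones alone. The exact character set is an assumption: it covers U+0000 through U+0020, U+007F through U+009F (which includes U+0085 NEL), and the punctuation that SPARQL rejects. The name encode_for_iri is hypothetical. Keep in mind that the encoded string is a different identifier from the original, so this amounts to converting the data rather than querying around it.

```python
import re

# Assumed set of characters to percent-encode:
#   U+0000-U+0020  C0 controls and SPACE
#   U+007F-U+009F  DEL and C1 controls (includes U+0085 NEL)
#   < > " { } | ^ ` \  punctuation excluded from SPARQL IRIs
# Characters from U+00A0 upward (NBSP, EM SPACE, ...) are left untouched.
_NEEDS_ENCODING = re.compile(r'[\x00-\x20\x7f-\x9f<>"{}|^`\\]')


def _percent_encode(match):
    """Encode one character as %XX for each of its UTF-8 bytes."""
    return "".join("%{:02X}".format(b) for b in match.group(0).encode("utf-8"))


def encode_for_iri(uri):
    """Percent-encode only the characters matched by _NEEDS_ENCODING."""
    return _NEEDS_ENCODING.sub(_percent_encode, uri)


if __name__ == "__main__":
    print(encode_for_iri("http://example.org/a b"))
    print(encode_for_iri("http://example.org/a\u0085b"))
    print(encode_for_iri("http://example.org/a\u00a0b") == "http://example.org/a\u00a0b")
```

Because "%" itself is not in the set, a string that is already encoded passes through unchanged.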