URIs vs. IRIs in Semantic Web standards and tools

Signified · November 30, 2011, 11:00pm

So, IRIs generalise URIs to allow a wider range of Unicode characters. Cool.

However, it seems that different Semantic Web standards make different choices here. If I understand correctly

So far, I've not really worried about the distinction, but now I'm wondering what the practical implications of supporting IRIs are.

Do they require special consideration in tools and parsers? For example, will an RDF 1.0 parser be able to parse RDF 1.1 data?

Also, for example, do I need to introduce IRIs into the acronym soup of various talks and papers (instead of the trusty URI)?

In summary: when do I need to worry about the distinction between IRIs and URIs?

cygri · November 30, 2011, 11:00pm

RDF never actually used “URIs” but “RDF URI references”, a made-up concept in the RDF spec that tried to reflect what was then expected to be the output of the IRI activity. After RDF was finished, IRIs actually became a reality, and in SPARQL and all subsequent specs, this concept of “URI references” was dropped and replaced by “IRIs”. RDF 1.1 will follow this lead, drop the term “URI reference”, and use “IRI” instead.

One practical consequence of this is pointed out in the draft spec:

Previous versions of RDF used the term “RDF URI Reference” instead of “IRI” and allowed additional characters: “<”, “>”, “{”, “}”, “|”, “\”, “^”, “`”, ‘“’ (double quote), and “ ” (space). In IRIs, these characters must be percent-encoded as described in section 2.1 of [URI].

Besides that, it's a simple search-and-replace change in the specifications – starting in late 2012 or early 2013, RDF will be based on IRIs instead of the weird “URI references”.
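To make the percent-encoding requirement from the quoted note concrete, here is a minimal sketch in Python. The function name `escape_legacy_chars` and the example address are made up for illustration; the sketch only touches the ten characters listed in the note and is not taken from any RDF library.

```python
# The ten characters that RDF 1.0 "RDF URI references" allowed
# but IRIs do not, as listed in the RDF 1.1 draft note quoted above.
LEGACY_CHARS = '<>{}|\\^`" '

def escape_legacy_chars(ref: str) -> str:
    """Percent-encode only the listed characters; leave everything else untouched."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in LEGACY_CHARS else ch
        for ch in ref
    )

# Hypothetical RDF 1.0 style reference containing a space and a pipe.
print(escape_legacy_chars("http://example.org/my file|v1"))
# http://example.org/my%20file%7Cv1
```

Non-ASCII characters pass through unchanged, since they are legal in IRIs.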

When do you have to worry about the distinction? Again, the RDF 1.1 draft spec answers this question:

When IRIs are used in operations that are only defined for URIs, they must first be converted according to the mapping defined in section 3.1 of [IRI]. A notable example is retrieval over the HTTP protocol. The mapping involves UTF-8 encoding of non-ASCII characters, %-encoding of octets not allowed in URIs, and Punycode-encoding of domain names.

To summarize: In the abstract RDF data model (version 1.1, which you should already anticipate), there are only IRIs. IRIs allow most Unicode characters beyond the US-ASCII range. In some situations – notably HTTP retrieval – non-US-ASCII characters may not be transmitted in the network identifier, so the IRI has to be converted to a URI using the process sketched in the note above and formally defined in RFC 3987.
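For what it's worth, the conversion can be approximated with the Python standard library. This is a rough sketch, not a full implementation of section 3.1 of RFC 3987: the function name `iri_to_uri` is my own, userinfo and IPv6 hosts are ignored, the host is lowercased by `urlsplit`, and Python's built-in `idna` codec follows the older IDNA 2003 rules.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# ASCII characters left alone when percent-encoding; "%" is included
# so that escapes already present in the IRI are not encoded twice.
SAFE = "/%:@!$&'()*+,;=~-._"

def iri_to_uri(iri: str) -> str:
    """Approximate IRI-to-URI mapping (simplified; see assumptions above)."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Punycode-encode the domain name via the built-in "idna" codec.
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += ":" + str(parts.port)
    # quote() UTF-8 encodes non-ASCII characters, then %-encodes the octets.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

# Hypothetical IRI with non-ASCII characters in host, path and query.
print(iri_to_uri("http://bücher.example/käse?q=ü#frag"))
# http://xn--bcher-kva.example/k%C3%A4se?q=%C3%BC#frag
```

A plain ASCII URI comes out unchanged, which is why most tools never notice the difference until a non-ASCII identifier has to go over HTTP.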

And that's pretty much all.

In public-facing communication, I suggest ignoring the whole issue. Just keep doing whatever you do. The distinction between URIs and IRIs is easily managed by anyone who actually needs to know about it, so it's perhaps best to gloss over it for the benefit of the majority who don't.