When to mint new URIs?

When do you reuse existing URIs for concepts (eg, dbpedia URIs), and when do you mint your own and owl:sameAs to the more canonical URI ?

Great question! I think it can only be answered by considering the costs involved and who incurs them. The costs here are integration costs associated with linking and querying across multiple sources of data.

Minting a new URI is low cost for the publisher because they can simply focus on modelling and publishing their data. They can add owl:sameAs links at another time so it decouples the cost of locating equivalent resources from the task of publishing. That's a big win for the publisher but the consumer has to bear the cost of integration. If the publisher defers the addition of owl:sameAs then the consumer will have to source them. When the consumer is dealing with a large number of independently published datasets there is no guarantee that the owl:sameAs linkage will be comprehensive enough to completely integrate the various datasets. So while a machine can work through all the equivalences, they have to be there in the first place. Simplisticly every consumer that wants to work with the data has to bear this integration cost.

The alternate approach is to reuse URIs as much as possible. Here the publisher bears all the cost and the consumer simply uses the data as-is. The problem with this approach is that the publisher becomes bound up in searching for equivalences before they can publish their data. The long term advantage is that many more consumers can work with the data straight off. If we assume there are going to be many more consumers than publishers then the total cost of integration across all the parties is much much smaller in this scenario when compared to the first because it's only done once per dataset.

The advice from the LOD project has been to mint new URIs. That's completely the right advice in the bootstrapping phase, where the imperative is to remove obstacles to publishing the data. However I think we are moving beyond the bootstrapping phase and we need the lower the costs for each consumer, so I think we now ought to be encouraging reuse of URIs by publishers and optimise for consumption.