Quality Indicators for Linked Data Datasets

There's an increasing variety of data available as Linked Data coming from a range of different sources. I'm wondering what indicators we might use to judge the "quality" of a dataset.

Clearly quality is a subjective thing, but I'd be interested to know what factors people might use to indicate whether a dataset was trustworthy, well modelled, sustainable, etc.

I've marked this as a community wiki question so we can collaborate on a list.

Summary of some of the key points from these answers and discussion on the LOD list:

  1. Accuracy - are facts actually correct?
  2. Intelligibility - are there human readable labels on things?
  3. Referential correspondence - are resources identified consistently without duplication?
  4. Completeness - do you have all the data you expect?
  5. Boundedness - do you have just the data you expect or is it polluted with irrelevant data?
  6. Typing - are nodes properly typed as resources or just string literals?
  7. Modeling correctness - is the logical structure of the data correct?
  8. Modeling granularity - does the modeling capture enough information to be useful?
  9. Connectedness - do combined datasets join at the right points?
  10. Isomorphism - are combined datasets modeled in a compatible way?
  11. Currency - is it up to date?
  12. Directionality - is it consistent in the direction of relations?
  13. Attribution - can you tell where portions of the data came from?
  14. History - can you tell who edited the data and when?
  15. Internal consistency - does the data contradict itself?
  16. Licensed - is the license for use clear?
  17. Sustainable - is there a credible basis for believing the data will be maintained?
  18. Authoritative - is the provider of the data a credible authority on the subject?

We could define LODrank as a PageRank-like measure: a function of the number of links to/from other LOD datasets, weighted by their LODrank. Alternatively, it might be divided by the number of linkable instances in the collection, so that large datasets don't gain an advantage from size alone. This metric addresses quality indirectly, relying on the assumption that other LOD collections will link to a dataset if it is useful, which in turn depends on how much data it exposes, how it's encoded, the data's quality, etc. (a rough sketch follows).
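
To make that concrete, here's a minimal sketch of what such a measure could look like. The link graph, instance counts, and dataset names below are invented for illustration; a real implementation would derive them from voiD descriptions or crawl data.

```python
# Illustrative LODrank sketch: simplified PageRank over a toy
# dataset-level link graph (dangling-node mass is ignored for brevity).
# All names and numbers here are made up, not real LOD statistics.

# Directed links between datasets: source -> set of targets it links to.
links = {
    "dbpedia":      {"geonames", "musicbrainz"},
    "geonames":     {"dbpedia"},
    "musicbrainz":  {"dbpedia"},
    "tiny-dataset": {"dbpedia"},
}

# Hypothetical counts of linkable instances, used to normalise scores
# so that sheer size does not dominate (as suggested above).
instances = {"dbpedia": 1_000_000, "geonames": 100_000,
             "musicbrainz": 50_000, "tiny-dataset": 500}

def lod_rank(links, damping=0.85, iterations=50):
    """Iterative PageRank over the dataset-level link graph."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

raw = lod_rank(links)
# Size-adjusted variant: divide by the number of linkable instances.
adjusted = {n: raw[n] / instances[n] for n in raw}
for n in sorted(raw, key=raw.get, reverse=True):
    print(f"{n:14s} raw={raw[n]:.3f} size-adjusted={adjusted[n]:.2e}")
```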

Brilliant question!

Indeed, there is a lot to do in this area. For a start, yes, I agree with Tim re the ranking (see also our LDOW09 paper "DING! Dataset Ranking using Formal Descriptions", essentially utilising voiD to do the job, now implemented in Sindice).

However, I understand that there are several aspects (as suggested by Leigh) one has to take into account:

  1. Quality of the raw data (the crap-in-crap-out syndrome); tools such as Gridworks can help here
  2. Quality of the RDFisation process (IN: some raw data, OUT: RDF using some vocabularies); here the main issue, I think, is the lack of (good) vocabs and also the discovery process (lots of people reinvent the wheel rather than reusing mature vocabs, where available)
  3. Quality of the interlinking process (IN: RDF dataset, OUT: RDF dataset + links to other datasets); here the main issue IMO is the lack of interlinking frameworks. The only reliable, usable one I'm aware of is Silk, though it's limited to owl:sameAs links (see the sketch after this list).
  4. Quality of making more data explicit (aka turning data into information, or simply put: reasoning) - there are ongoing efforts such as Scalable Authoritative OWL Reasoning for the Web, but that's not my domain of expertise, so I'd better stop here ;)
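
To make the interlinking step concrete, here's a deliberately naive sketch that generates owl:sameAs links by exact label matching. The datasets and namespaces are invented; this stands in for what a framework like Silk does with configurable similarity metrics and thresholds, not for Silk's actual behaviour.

```python
# Naive interlinking sketch (step 3 above): propose owl:sameAs links
# between two RDF datasets by exact rdfs:label match. Example data only.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS, OWL

A = Namespace("http://example.org/datasetA/")
B = Namespace("http://example.org/datasetB/")

src, tgt = Graph(), Graph()
src.add((A.berlin, RDFS.label, Literal("Berlin")))
tgt.add((B.city123, RDFS.label, Literal("Berlin")))

# Index the target dataset by lower-cased label, then match.
by_label = {}
for s, label in tgt.subject_objects(RDFS.label):
    by_label.setdefault(str(label).lower(), []).append(s)

links = Graph()
for s, label in src.subject_objects(RDFS.label):
    for candidate in by_label.get(str(label).lower(), []):
        links.add((s, OWL.sameAs, candidate))

print(links.serialize(format="turtle"))
```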

An accompanying activity is the Pedantic Web group, providing input and support re the quality of Web data.

Thanks, Leigh, for setting the ball rolling - we should definitely continue working on this issue on a collaborative basis.

As an aside: in September 2010 we'll kick off a project that mainly focuses on addressing this issue. If you're interested in the details, ping me.

Glenn McDonald provided an excellent list of 15 ways to think about data quality (a sketch of machine-checking a couple of these follows the list):

  1. Accuracy - are facts actually correct?
  2. Intelligibility - are there human readable labels on things?
  3. Referential correspondence - are resources identified consistently without duplication?
  4. Completeness - do you have all the data you expect?
  5. Boundedness - do you have just the data you expect or is it polluted with irrelevant data?
  6. Typing - are nodes properly typed as resources or just string literals?
  7. Modeling correctness - is the logical structure of the data correct?
  8. Modeling granularity - does the modeling capture enough information to be useful?
  9. Connectedness - do combined datasets join at the right points?
  10. Isomorphism - are combined datasets modeled in a compatible way?
  11. Currency - is it up to date?
  12. Directionality - is it consistent in the direction of relations?
  13. Attribution - can you tell where portions of the data came from?
  14. History - can you tell who edited the data and when?
  15. Internal consistency - does the data contradict itself?
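
A couple of these lend themselves to simple automated checks. Here's a rough sketch, assuming rdflib and a local Turtle file ("data.ttl" is a placeholder), that measures intelligibility (2) as rdfs:label coverage and typing (6) as rdf:type coverage:

```python
# Rough, machine-checkable proxies for two of the indicators above:
# intelligibility = fraction of resources with an rdfs:label,
# typing = fraction of resources with an rdf:type.
from rdflib import Graph, URIRef
from rdflib.namespace import RDF, RDFS

g = Graph()
g.parse("data.ttl", format="turtle")  # placeholder path

subjects = {s for s in g.subjects() if isinstance(s, URIRef)}
labelled = {s for s in subjects if g.value(s, RDFS.label) is not None}
typed    = {s for s in subjects if g.value(s, RDF.type) is not None}

if subjects:
    print(f"intelligibility: {len(labelled)}/{len(subjects)} "
          f"resources have an rdfs:label")
    print(f"typing:          {len(typed)}/{len(subjects)} "
          f"resources have an rdf:type")
```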

Re. defining data quality (information quality) statements, the Info Service Ontology (1,2,3,4) could perhaps help: it offers a hook for defining information service quality ontologies, on which information service quality ratings could be based. A dataset can also be seen as an information service (see the definition of the term 'information service' (5)). Hence, the ontology might be of interest for dealing with quality ratings, which might also come from different information service quality rating agencies (an illustrative sketch follows the references below).

Cheers,

Bob

(1) The Info Service Ontology specification
(2) The concepts and relations of the Info Service Ontology
(3) A proof-of-concept example of the Info Service Ontology
(4) The blog of the Info Service Ontology
(5) Definition of the term 'information service'
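
Purely as an illustration of the idea, here's how such a quality statement might be written down with rdflib. The namespace and property names are hypothetical stand-ins, not the actual terms of the Info Service Ontology; see the specification (1) for the real vocabulary:

```python
# Illustrative only: attach a quality rating to a dataset treated as an
# information service. EX and ISQ are hypothetical namespaces, NOT the
# actual Info Service Ontology terms (consult reference 1 for those).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX  = Namespace("http://example.org/")      # hypothetical
ISQ = Namespace("http://example.org/isq#")  # hypothetical

g = Graph()
g.add((EX.myDataset, RDF.type, ISQ.InformationService))
g.add((EX.myDataset, ISQ.qualityRating, Literal(0.8)))
g.add((EX.myDataset, ISQ.ratedBy, EX.someRatingAgency))

print(g.serialize(format="turtle"))
```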

Here are 6 subjective factors that affect the quality of Linked Data:

  1. Accuracy
  2. Navigability
  3. Access Protocols (they should be platform independent, e.g. HTTP; see the dereferencing sketch below)
  4. Data Representation Formats (the more the merrier)
  5. Provenance
  6. Change Sensitivity (Freshness)

Basically, this is why I say you can look at a unit of Linked Data value in the same way you look at a Cube of Sugar :-)

Note, this is about the data, not the place serving up the data or the data consumption context halo.
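
As a rough illustration of factors 3 and 4, here's a small sketch that probes whether a resource URI dereferences over plain HTTP and in which RDF serialisations it can be served. The URI and Accept values are just examples:

```python
# Probe HTTP dereferenceability and content negotiation for a resource
# URI. Redirects (e.g. 303s) are followed automatically by urllib.
import urllib.request

uri = "http://dbpedia.org/resource/Berlin"  # example resource
for accept in ("application/rdf+xml", "text/turtle"):
    req = urllib.request.Request(uri, headers={"Accept": accept})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{accept:22s} -> HTTP {resp.status}, "
                  f"served as {resp.headers.get('Content-Type')}")
    except Exception as e:
        print(f"{accept:22s} -> failed: {e}")
```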

Kingsley