Advice on vocabularies for a data catalog

tumgoren · May 12, 2021, 4:49pm

Hello all,
My organization provides a basic data platform where journalists can upload and share static data (typically CSVs). We’ve begun exploring DCAT and vocabularies such as Data Cube as a way to build a knowledge graph and improve the discoverability and interoperability of data on the platform. We’re in the earliest research stages and I was hoping folks here could provide us with a sanity check and some guidance on a few basic questions related to cataloguing.

Specifically, there are some concepts built into our platform that don’t seem to be reflected in DCAT (even the latest v3). For example, our top-level concept is a Project, which typically includes one or more Datasets and their related Distributions. Projects also typically have related resources such as READMEs, training videos, and other supplementary material that help users understand how to interpret and work with the Datasets contained in a project. These related resources also do not seem to have clear counterparts in DCAT. I’m wondering how more experienced folks on this list would handle this situation. Would you recommend creating a DCAT Profile that extends the base vocabulary? Or perhaps there’s an existing DCAT profile or an entirely different vocabulary that covers these concepts?

We’re new to linked data in general, so any and all advice is greatly appreciated!
Best regards,
Serdar

tumgoren · May 14, 2021, 1:36am

Just discovered foaf:Project! Seems like this should cover our basic use case, though happy to hear thoughts if folks disagree and can recommend a better alternative.
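
For anyone following along, here is a rough sketch of how a foaf:Project might be tied to DCAT datasets and distributions in JSON-LD, built with plain Python. The `https://example.org/` IRIs and titles are placeholders, and two modelling choices are assumptions rather than anything DCAT or FOAF prescribes: `dcterms:hasPart` to link a project to its datasets, and `foaf:page` to point at supplementary material such as a README.

```python
import json

# Hypothetical base IRI for the platform; replace with your own namespace.
BASE = "https://example.org/"

project = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dcterms": "http://purl.org/dc/terms/",
        "foaf": "http://xmlns.com/foaf/0.1/",
    },
    "@id": BASE + "project/1",
    "@type": "foaf:Project",
    "dcterms:title": "Example project",
    # Assumption: dcterms:hasPart links the project to its datasets.
    "dcterms:hasPart": [
        {
            "@id": BASE + "dataset/1",
            "@type": "dcat:Dataset",
            "dcterms:title": "Example dataset",
            "dcat:distribution": {
                "@id": BASE + "dataset/1/csv",
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": BASE + "files/data.csv"},
                "dcat:mediaType": {
                    "@id": "https://www.iana.org/assignments/media-types/text/csv"
                },
            },
        }
    ],
    # Assumption: foaf:page points at supplementary documents (README, videos).
    "foaf:page": [{"@id": BASE + "project/1/README"}],
}

# Serialize the JSON-LD document for publishing or loading into an RDF store.
print(json.dumps(project, indent=2))
```

Whether `dcterms:hasPart` is the right link between a project and its datasets is a judgment call; a dedicated property in a DCAT profile would be the alternative.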

emidiostani · June 2, 2021, 9:29pm

Hello,

I am aware of the DINGO ontology, which covers the concepts of project, roles and grants, but nothing prevents you from extending DCAT. I would also look at the PRINCE2 or PM² terminology.

aaranged · June 2, 2021, 10:44pm

While you’re exploring DCAT, it seems to me like a lot of your use cases might be better accommodated by schema.org/Dataset and related vocabulary (which includes DataCatalog - “A collection of datasets”).

Even if you’re committed to DCAT, the spec says “Complementary vocabularies can be used together with DCAT to provide more detailed format-specific information. For example, properties from the VoID vocabulary [VOID] can be used within DCAT to express various statistics about a dataset if that dataset is in RDF format.” (Though I wouldn’t know how to go about it. This might be helpful.)
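
One possible reading of that passage, as a minimal sketch: type the same resource as both `dcat:Dataset` and `void:Dataset`, and attach VoID statistics such as `void:triples` next to the DCAT properties. The IRIs, the title, and the triple count below are placeholders, and this only applies to a dataset that is itself published as RDF.

```python
import json

# Placeholder IRI and statistics; the triple count is illustrative, not real data.
rdf_dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dcterms": "http://purl.org/dc/terms/",
        "void": "http://rdfs.org/ns/void#",
    },
    "@id": "https://example.org/dataset/rdf-1",
    # The same resource carries both a DCAT type and a VoID type.
    "@type": ["dcat:Dataset", "void:Dataset"],
    "dcterms:title": "Example RDF dataset",
    # VoID statistic sitting alongside the DCAT description.
    "void:triples": 1200,
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": {"@id": "https://example.org/files/data.ttl"},
    },
}

print(json.dumps(rdf_dataset, indent=2))
```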

See also the Google Developers reference for datasets.
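
To give a feel for the schema.org route, here is a small sketch that builds a `Dataset` description as JSON-LD and wraps it in the `application/ld+json` script tag used to embed it in a web page. The names, description, and URLs are placeholders, and `to_script_tag` is a hypothetical helper, not part of any library.

```python
import json

# Placeholder values throughout; only the schema.org types and properties are real.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "description": "A CSV uploaded to the platform.",
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Example data platform",
    },
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/files/data.csv",
        }
    ],
}


def to_script_tag(doc: dict) -> str:
    """Hypothetical helper: wrap a JSON-LD document for embedding in HTML."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(to_script_tag(dataset))
```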