Merging / standardising / aggregating taxonomies

robertmuil · April 8, 2021, 3:32pm

What approaches have people used to bring multiple overlapping taxonomies together?

Imagine multiple parties have all constructed their own taxonomy for the same domain. These parties have come together and want to use these individual taxonomies to inform (or, better, to automate) the construction of a new, standard taxonomy that all can use. This standard taxonomy should minimise the collective ‘distance’ to all individual taxonomies, where ‘distance’ is some well-defined metric computable between any taxonomy pair.

There seem to be three main things to consider, and I’m currently most interested in the first and second:

1. Designing the distance metric: when comparing two taxonomies, it should be possible to identify matches/overlaps/clashes at the node level, and also to quantify the overall ‘distance’ between the two taxonomies as a whole.
2. Constructing the new standard taxonomy, as automatically as possible, to yield something that minimises the sum of the distances between the new standard and each individual taxonomy. This could be done manually, or automatically through inference mechanisms and NLP, or perhaps in some embedded space with statistical techniques?
3. Representing the new taxonomy: probably with a formal standard like SKOS or OWL, and ideally with links to the individual taxonomies (owl:sameAs, subclass, disjoint etc.). The degree to which mapping to the individual taxonomies is possible will be tightly linked to the construction: if it’s a statistical technique on embeddings, this may not be possible at all?
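
To make the first two points a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: each taxonomy is a plain dict mapping a node label to its parent label (None for the root), labels are taken to be already normalised across parties, and the ‘distance’ is simply a Jaccard distance over (ancestor, descendant) pairs. The function names and the toy data are hypothetical. Note that the last function does not construct a new taxonomy; it only picks the existing taxonomy with the smallest summed distance to the others (a medoid), which could serve as a baseline or starting point.

```python
def ancestor_pairs(taxonomy):
    """Return all (ancestor, descendant) label pairs, including (node, node)."""
    pairs = set()
    for node in taxonomy:
        current, seen = node, set()
        # Walk up towards the root; 'seen' guards against accidental cycles.
        while current is not None and current not in seen:
            seen.add(current)
            pairs.add((current, node))
            current = taxonomy.get(current)
    return pairs


def distance(tax_a, tax_b):
    """Jaccard distance between the ancestor-pair sets (0 = identical, 1 = disjoint)."""
    a, b = ancestor_pairs(tax_a), ancestor_pairs(tax_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def node_report(tax_a, tax_b):
    """Node-level view: same parent (match), different parent (clash), or unshared."""
    shared = set(tax_a) & set(tax_b)
    return {
        "matches": sorted(n for n in shared if tax_a[n] == tax_b[n]),
        "clashes": sorted(n for n in shared if tax_a[n] != tax_b[n]),
        "only_a": sorted(set(tax_a) - shared),
        "only_b": sorted(set(tax_b) - shared),
    }


def medoid(taxonomies):
    """Name of the taxonomy with the smallest summed distance to all the others."""
    return min(
        taxonomies,
        key=lambda name: sum(
            distance(taxonomies[name], other) for other in taxonomies.values()
        ),
    )


# Hypothetical toy data: three parties modelling the same small domain.
party_a = {"animal": None, "mammal": "animal", "dog": "mammal", "cat": "mammal"}
party_b = {"animal": None, "dog": "animal", "cat": "animal"}
party_c = {"animal": None, "mammal": "animal", "dog": "mammal", "bird": "animal"}
taxonomies = {"a": party_a, "b": party_b, "c": party_c}

print(node_report(party_a, party_b))
print(round(distance(party_a, party_b), 3))
print(medoid(taxonomies))
```

In practice the hard part is the label alignment this sketch assumes away: matching “heart attack” to “myocardial infarction” needs a synonym list, embeddings, or a human, and the choice of pair-based Jaccard over, say, a tree edit distance is just one option among several.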

BobDuCharme · April 9, 2021, 2:01pm

I would recommend SKOS over OWL because this is exactly what SKOS is designed for and OWL, like in so many other situations, is overkill here. (The Library of Congress makes good use of it; see https://id.loc.gov/download/.)
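
As a rough illustration of how SKOS could record the links from a new standard concept back to each party’s concept, the sketch below writes SKOS mapping statements as Turtle using plain string formatting. The mapping property names (exactMatch, closeMatch, broadMatch, narrowMatch, relatedMatch) are from the SKOS vocabulary; the example.org IRIs, the function name and the input format are made up for the example, and a real project would more likely use an RDF library than string building.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."

# Mapping properties defined by SKOS for linking concepts across schemes.
SKOS_MAPPING_PROPERTIES = {
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
}


def mappings_to_turtle(mappings):
    """Serialise (standard_iri, relation, source_iri) tuples as Turtle triples."""
    lines = [SKOS_PREFIX, ""]
    for standard_iri, relation, source_iri in mappings:
        if relation not in SKOS_MAPPING_PROPERTIES:
            raise ValueError(f"not a SKOS mapping property: {relation}")
        lines.append(f"<{standard_iri}> skos:{relation} <{source_iri}> .")
    return "\n".join(lines)


# Hypothetical IRIs: one standard concept linked to two parties' concepts.
example = [
    ("https://example.org/standard/dog", "exactMatch",
     "https://example.org/partyA/dog"),
    # broadMatch: the target (partyB's concept) is broader than the subject.
    ("https://example.org/standard/dog", "broadMatch",
     "https://example.org/partyB/pet"),
]

print(mappings_to_turtle(example))
```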

Even if you’re going to automate, it would be good to get a real taxonomist involved to help ensure that you’re automating appropriately. Sometimes the same concept has two different names for a good reason; “myocardial infarction” may be more technically correct than “heart attack” but may be technical mumbo-jumbo to some audiences. As with software applications, a good understanding of the intended audience is an important part of taxonomy design.

I recommend Heather Hedden’s book “The Accidental Taxonomist” and her blog at http://accidental-taxonomist.blogspot.com.

hhedden · April 9, 2021, 7:33pm

I co-presented a talk last month at the ENDORSE conference, “Building, Enhancing, and Integrating Taxonomies.” Slides are at: http://www.hedden-information.com/wp-content/uploads/2021/03/ENDORSE_-Building_Enhancing_and_Integrating_Taxonomies.pdf
Part 3: Integrating Existing Taxonomies starts at slide 50
Recording is at: ENDORSE 2021, Day 2, 17 March, Law as code, Track 2 - Fundamentals - YouTube
Come in at timestamp 2:07:00 (two hours, seven minutes in) for the part on integrating taxonomies.

Thanks, Bob, for recommending my book.