I keep an OWL2 ontology in a git repository, and I've noticed that every time I edit and save the file in Protege, there's some non-deterministic behavior.
For example, OWL restrictions appear to be serialized to RDF/XML in an arbitrary order each time I save, so a very large percentage of the lines in my .owl file change even when I make a tiny edit. This makes it impossible to tell what has happened using git diff, for instance.
Could anyone share or suggest approaches to using Protege with version control? In an ideal world I'd love to be able to make changes in multiple branches and merge -- but short of that, I'd at least like to be able to tell what's changed :-)
My question is very similar to yours. Thank you for pinpointing the problem.
One should probably not worry too much about the internals of version control systems. I have managed to put all versions of the NCI Thesaurus (about 8 GB of RDF/XML) into a Mercurial repository. The whole repository (the .hg folder) is only 112 megabytes. Git, on the other hand, stores all versions, not diffs (a Git repository with the NCI Thesaurus takes about 1 GB).
The real problem, as you mentioned, is to tell what has changed and to perform a three-way merge.
I did some research, and the only tool I found for diffing OWL ontologies (not RDF graphs) was OWLDiff. It does not come with any scripts to integrate it with version control systems, although it is possible to write them. Another shortcoming of OWLDiff is that it only compares the logical content, i.e. the axioms, and does not take into account changes to namespace prefixes, imports, ontology format, etc.
So I had to develop a tool to tame version control systems for ontology development. You can download it at http://code.google.com/p/ontovcs. It is still in beta and contains some flaws (especially the 3-way merge tool), and I would appreciate any feedback from you, positive or negative.
I have started rewriting the tool, which now better matches the OWL API.
You can find the latest version of owl2vcs at https://github.com/utapyngo/owl2vcs.
In terms of serializing output in a deterministic way, TopBraid Composer currently offers a Sorted Turtle option and some internal tools to find the graph-based differences in triples. The next version of the tool (TBC 3.6 in January) will support two serialization formats for version control systems, Sorted Turtle and c14n.
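To illustrate the general idea of a sorted serialization (this is not TopBraid's Sorted Turtle, just a minimal sketch), assume the ontology has already been exported to N-Triples, where each line holds exactly one triple. Sorting those lines makes the output independent of the order in which the editor wrote them. The function name sorted_ntriples is made up for this example, and blank node labels (which OWL restrictions rely on) may still change between saves, so this only stabilizes the order of statements about named resources.

```python
def sorted_ntriples(text: str) -> str:
    """Return an N-Triples document with its statements deduplicated and sorted.

    Assumption: the input is valid N-Triples, i.e. one triple per line.
    """
    statements = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments; every other line is one triple.
        if not line or line.startswith("#"):
            continue
        statements.add(line)
    # Lexicographic order is arbitrary but stable, which is all a
    # line-based diff needs.
    return "\n".join(sorted(statements)) + "\n"
```

Two exports that contain the same triples in a different order then produce identical text, so git diff only shows the lines for triples that were really added or removed.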
An RDF file serializes a graph. Diffing a graph is fundamentally much more difficult than diffing a line-based text file.
I don't think there's much you can do here really, except ditching Protégé and editing your files by hand in Turtle (and that's not a totally insane suggestion.)
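A small sketch of why diffing a graph is harder than diffing text: treat each graph as a set of N-Triples lines and take the set difference. This works for triples that only mention named resources, but a blank node that is merely relabelled shows up as a change even though the graph means the same thing. The helper names and the example.org IRIs are invented for this illustration.

```python
def triples(text: str) -> set:
    """Parse N-Triples text into a set of statement lines (naive: one per line)."""
    return {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    }


def diff_triples(old: str, new: str):
    """Return (added, removed) statements as sorted lists."""
    old_set, new_set = triples(old), triples(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)


OLD = """\
<http://example.org/Pizza> <http://www.w3.org/2000/01/rdf-schema#subClassOf> _:b0 .
_:b0 <http://www.w3.org/2002/07/owl#onProperty> <http://example.org/hasTopping> .
"""
# Same graph, but the serializer picked a different blank node label.
NEW = OLD.replace("_:b0", "_:b1")

added, removed = diff_triples(OLD, NEW)
print(len(added), len(removed))  # 2 2, although nothing changed semantically
```

A real graph diff has to match blank nodes by their structure rather than by label, which is the part that plain set difference (and any line-based diff) cannot do.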
If you don't care so much about expressivity you could use obo-format (http://oboformat.org). It has a deterministic serialization. You can see examples of diffs here:
This answer is 75% tongue-in-cheek as obo-format is deprecated and everyone involved with obo-format is urging users to switch to owl. One of the main obstacles is the diffs.
ontovcs is great, you should upvote @utapyngo. Some way to get this into web SVN views would be great.
I do think that ontologies should be treated like source code. However, I don't think a new VCS stack should be developed to fit the IDEs; rather, the IDEs should adapt to the stack.