How to split .n3/.ttl/.rdf files into smaller chunks

Suppose I have a large file of about 20 lakh (2 million) triples in .n3, .ttl, or .rdf format. Which technique is suitable for breaking it into smaller chunks?

It is possible, but what is the best way to do it?

Yes, it’s possible. People (and machines) generally serialize RDF triples (.n3, .ttl, .rdf) in forms that are relatively “normalized”. In other words, a subject may have multiple predicates using the ; syntax, while a (subject, predicate) pair may have multiple objects using the , syntax. However, internally these are “denormalized” into triples. If you think of a graph that way, there’s really no reason one cannot partition it, as long as you don’t “sever” any of the triples.
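
Here is a minimal sketch of that denormalization, using rdflib (my choice of library; any RDF toolkit behaves the same way): the `;` and `,` abbreviations below expand to three distinct triples when parsed.

```python
from rdflib import Graph

# Compact Turtle: one subject, abbreviated with ";" and ","
turtle = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob , ex:carol ;
         ex:name  "Alice" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# Internally the graph is just a set of (subject, predicate, object)
# triples, so any partition that keeps each triple intact is valid.
for s, p, o in g:
    print(s, p, o)   # prints three denormalized triples
```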

Also, when reading in (deserializing) graphs, most of the libraries that I’ve seen support incremental additions, i.e., loading from multiple files. I use that pattern a lot, e.g., one file for the “public” parts of a graph, then multiple other files for the different “private” data.
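
For instance, with rdflib (again an assumption on my part; the file names are hypothetical), each successive parse adds triples into the same graph:

```python
from rdflib import Graph

g = Graph()
for path in ["public.ttl", "private-a.ttl", "private-b.ttl"]:
    g.parse(path, format="turtle")   # each call adds to the same graph

print(len(g), "triples after merging all files")
```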

Also, I’m trying to make Parquet (and related alternatives) more of a first-class serialization format for KGs, which encourages file partitioning that can be leveraged for better parallel processing downstream.


This is where the verbosity of n3 files is your friend. When they store one complete triple per line (the N-Triples style), you can break an n3 file at any line break and the result will be two syntactically valid n3 files.

Turtle files are more concise because, for example, you can declare and use namespace prefixes, but if you split a Turtle file at a line break you may be separating the use of a prefix from its declaration, so a parser won’t know what the prefix stands for. Similar syntax abbreviation conveniences in Turtle can also be broken by arbitrary line splitting.
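
A quick sketch of that failure mode, using rdflib as the parser (an assumption; the chunk contents are illustrative). The second half of a naive split has no `@prefix` line, so it won’t parse:

```python
from rdflib import Graph

top = "@prefix ex: <http://example.org/> .\nex:a ex:p ex:b .\n"
bottom = "ex:c ex:p ex:d .\n"   # naive second chunk: prefix declaration lost

Graph().parse(data=top, format="turtle")       # parses fine
try:
    Graph().parse(data=bottom, format="turtle")
except Exception as e:                         # undefined prefix "ex:"
    print("second chunk fails to parse:", e)
```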

I found the self-contained nature of each line of n3 to be handy when I split up some RDF for a MapReduce Hadoop experiment described at http://www.bobdc.com/blog/driving-hadoop-data-integratio/.

To answer the question in your subject line more directly: Apache Jena includes an ntriples command-line utility that converts RDF to ntriples, and the Linux split utility can split a big text file into smaller ones pretty painlessly.
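
If you would rather stay in one script than pipe a converter’s output through `split`, here is a rough Python equivalent (rdflib 6+ assumed; `big.ttl` and the chunk size are placeholders): convert to line-oriented N-Triples, then cut on line count.

```python
from rdflib import Graph

g = Graph()
g.parse("big.ttl", format="turtle")    # hypothetical input file

# rdflib 6+ returns a string here; each line is one complete triple.
lines = g.serialize(format="nt").splitlines(keepends=True)

CHUNK = 100_000                        # triples per output file
for i in range(0, len(lines), CHUNK):
    with open(f"chunk_{i // CHUNK:04d}.nt", "w") as f:
        f.writelines(lines[i:i + CHUNK])
```

Note that this parses the whole graph into memory first; for very large inputs, a streaming converter like Jena’s plus the split utility is the better tool.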

  • ntriples is trivial to split or cat: you can do it at any line break
  • turtle is:
    • easy to split: most tools produce “paragraphs”, i.e., blank lines separating s-p-o clusters; but TopBraid separates clusters with lines holding a single dot, and you may have newlines inside """-delimited strings. Plus you need to replicate the prefixes into each chunk. In other words, you can do it by hand, but not completely automatically; see the sketch after this list
    • trivial to cat: you can have/redefine base and prefixes in the middle of a file
  • rdfxml is hard to split or cat
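
To make the “by hand” Turtle split above concrete, here is a sketch (the function name and chunk size are mine; it assumes blank-line-separated clusters, prefixes only at the top of the file, and no newlines inside """-quoted strings, i.e., exactly the caveats listed above):

```python
def split_turtle(text: str, paragraphs_per_chunk: int = 1000) -> list[str]:
    # Collect the @prefix/@base header at the top of the file.
    header, body_lines, in_header = [], [], True
    for line in text.splitlines():
        stripped = line.strip()
        if in_header and (not stripped or stripped.startswith(
                ("@prefix", "@base", "PREFIX", "BASE"))):
            if stripped:
                header.append(line)
            continue
        in_header = False
        body_lines.append(line)

    # Cut the body at blank lines into "paragraphs" (s-p-o clusters).
    paragraphs, current = [], []
    for line in body_lines:
        if line.strip():
            current.append(line)
        elif current:
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))

    # Replicate the shared header into every chunk.
    prefix_block = "\n".join(header) + "\n\n" if header else ""
    return [
        prefix_block + "\n\n".join(paragraphs[i:i + paragraphs_per_chunk]) + "\n"
        for i in range(0, len(paragraphs), paragraphs_per_chunk)
    ]
```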

But may I ask WHY you need to split?

  • e.g., GraphDB’s fast loader, loadrdf, does the splitting for you and loads in parallel
  • (disclosure: I work for Ontotext)