Fast tool to convert TTL to NTriples?

database_animal · January 23, 2012, 11:00pm

I'm wondering if there's a really fast and scalable command-line tool to convert TTL files to NTriples -- I'm aware of the Jena command line tools, but I'm sure at least a factor of two could be gained by eliminating the Java tax. One critical feature is that it has to be streaming... It can't load everything into a model and then spit the triples out.

oesxyl · January 23, 2012, 11:00pm

serdi (http://drobilla.net/software/serd/) is a streaming tool, so it doesn't need much memory and can work with huge input files. It converts in both directions: Turtle to NTriples and NTriples to Turtle.

mhgrove · January 23, 2012, 11:00pm

I'm not sure if you consider Java itself taxing, or just using it, but Sesame's RIO parsers & writers are streaming. Writing that conversion program would be a handful of lines of code at most; it would probably take you five minutes to write something that takes the statements reported by the Turtle parser and runs them out through the NTriples writer. If you don't want to use Java, that won't help you, but I'd go with RIO.

bobdc · January 23, 2012, 11:00pm

Try the redland utilities like rapper (http://librdf.org/raptor/). It's written in C and very fast. The Windows version is pretty old, but still does the job, and there are much more up-to-date Linux versions.

Bob
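For anyone who wants to script this, here is a minimal Python sketch of driving rapper or serdi as a child process, so the converted output goes straight to a file without passing through Python. It assumes the binaries are installed and on the PATH under those names. `build_command` and `convert` are hypothetical helper names, not part of either tool; the `-i turtle -o ntriples` options are the input/output syntax selectors that both tools accept.

```python
import subprocess
import sys

# Assumption: both tools are on PATH under these names.
# -i / -o select the input and output syntax; rapper's -q suppresses its status messages.
COMMANDS = {
    "rapper": ["rapper", "-q", "-i", "turtle", "-o", "ntriples"],
    "serdi": ["serdi", "-i", "turtle", "-o", "ntriples"],
}


def build_command(tool, ttl_path):
    """Return the argv list for converting ttl_path with the chosen tool."""
    return COMMANDS[tool] + [ttl_path]


def convert(tool, ttl_path, nt_path):
    """Run the converter, letting it write directly to nt_path."""
    with open(nt_path, "wb") as out:
        # The child process writes to the file itself, so Python buffers nothing.
        subprocess.run(build_command(tool, ttl_path), stdout=out, check=True)


if __name__ == "__main__":
    # Usage: python convert.py rapper input.ttl output.nt
    convert(sys.argv[1], sys.argv[2], sys.argv[3])
```

Whether this beats the Java tools on a given dataset is something to measure rather than assume.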

AndyS · January 23, 2012, 11:00pm

In Jena RIOT, N-Triples is faster to parse than Turtle (a lot of system-related effects, not just Java ones: despite more bytes, the effect of sitting in a tight loop outweighs the byte count because of stream I/O). Note that the RIOT parsers need enabling: if you just use Jena core code, with no ARQ etc., then they may not be wired in. There are command-line tools which stream (modulo bNode labels, which is a language design issue) from Turtle to N-Triples. Validation can be added with --validate (e.g. it checks lexical forms of XSD literals; this, unsurprisingly, slows parsing down).

A good practice is to always parse new data to N-Triples with as much checking as possible, then store compressed N-Triples (expect x8-x10 compression). RIOT reads from gzip files.
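To illustrate why gzipped N-Triples is convenient to work with, here is a rough Python sketch (not RIOT, just the standard library) that streams a compressed file one line at a time. Since N-Triples puts one statement per line, nothing beyond the current line has to be held in memory. `iter_ntriples` and `count_triples` are hypothetical names, and the sketch does no validation of the triples themselves.

```python
import gzip


def iter_ntriples(path):
    """Yield each statement line from a gzipped N-Triples file, one at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines and whole-line comments.
            if line and not line.startswith("#"):
                yield line


def count_triples(path):
    """Count statements without loading the file into memory."""
    return sum(1 for _ in iter_ntriples(path))
```

Called as `count_triples("data.nt.gz")`, this decompresses on the fly and never builds a model of the data.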

RobVesse · January 23, 2012, 11:00pm

There is rdf2rdf, which wraps the Sesame RIO parsers and writers, but I know from experience within our company that they still use a lot of memory over time for very large data. This may be an artifact of the data (lots of BNodes), but in our experience RIO only scales so far.

We've had decent success with RIOT (from the Jena ARQ package), and that's probably our go-to tool for conversion.

My own toolkit has a command-line conversion tool, though it's nowhere near as performant as Sesame RIO or Jena RIOT, in the released version at least. It certainly struggles on large data (millions of triples), and the current version has a bug where BNode IDs may not be written correctly in NTriples when converting from some formats.