Best practices for processing RDF data using MapReduce (via Hadoop)?

What are the best practices for processing RDF data using MapReduce (via Hadoop)?

I am looking for people sharing their suggestions and lessons learned, such as:

  • Use N-Triples and/or N-Quads; don't hurt yourself with RDF/XML or Turtle
  • If you really want to use RDF/XML and/or Turtle, process one file at a time; don't split them.
  • ...

Using N-Triples or N-Quads will work and has a lot of benefits, as the original answer points out. The problem, though, is that the meaning of the data is often not at the granularity of a single statement. For example, let's say we want to state that ball #1 has a color with an RGB value of 255,0,0:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <urn:myont:rel:> .
@prefix class: <urn:myont:class:> .
@prefix ball: <urn:myont:class:Ball:> .

ball:1
      a class:Ball;
      rel:hasColor [
           a class:Color;
           rel:hasRedComponent "255"^^xsd:integer;
           rel:hasGreenComponent "0"^^xsd:integer;
           rel:hasBlueComponent "0"^^xsd:integer
           ].

How would you meaningfully distribute the statements above while staying at statement-level granularity?

I'd recommend using the concept of RDF Molecules. An RDF molecule is the minimal RDF graph you can create such that you do not lose information when distributing it (my summary, not a formal definition).
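To make that concrete, here is a rough Jena-based sketch of the idea (my own illustration, not a formal molecule algorithm): split a model into per-subject subgraphs, following blank nodes so that structures like the color above travel together with their ball.

// Sketch: split a Jena Model into per-subject "molecules", following blank
// nodes so bnode-rooted structures stay with the subject that owns them.
import org.apache.jena.rdf.model.*;
import java.util.*;

public class MoleculeSplitter {
    public static List<Model> split(Model input) {
        List<Model> molecules = new ArrayList<>();
        ResIterator subjects = input.listSubjects();
        while (subjects.hasNext()) {
            Resource s = subjects.next();
            if (s.isAnon()) continue;                 // bnodes are pulled in below
            Model m = ModelFactory.createDefaultModel();
            addClosure(input, s, m, new HashSet<>());
            molecules.add(m);
        }
        return molecules;
    }

    // Copy all statements about 'r', recursing into blank-node objects.
    private static void addClosure(Model in, Resource r, Model out, Set<Resource> seen) {
        if (!seen.add(r)) return;                     // guard against bnode cycles
        StmtIterator it = in.listStatements(r, null, (RDFNode) null);
        while (it.hasNext()) {
            Statement st = it.next();
            out.add(st);
            if (st.getObject().isAnon()) {
                addClosure(in, st.getObject().asResource(), out, seen);
            }
        }
    }
}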

Some links on RDF molecules:

ftp://ksl.stanford.edu/pub/KSL_Reports/KSL-05-06.pdf

http://www.w3c.rl.ac.uk/SWAD/papers/RDFMolecules_final.doc

http://www.itee.uq.edu.au/~eresearch/presentations/Hunter_BioMANTA_UK_eScience.pdf

http://jrdf.sourceforge.net/status.html

http://morenews.blogspot.com/2008/07/yads-and-rdf-molecules.html

I developed a Map/Reduce framework that runs on a single computer to do the data processing that creates :BaseKB from the Freebase quad dump. I get a nearly linear speedup with up to 8 CPU cores.

The system works in several modes. Mappers tend to be written as Java functions that operate on Jena Triple objects. For instance, if I want to filter out triples that have a certain predicate, or rewrite identifiers, I just write Java to do it.
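For example, a predicate filter in that style might look roughly like this (a sketch against Jena's Triple class; the class name and the way triples are fed in and out are illustrative, not my framework's actual API):

// Illustrative mapper-style filter written against Jena's Triple class:
// keep every triple except those with one unwanted predicate.
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;

public class DropPredicateFilter {
    private final Node unwanted;

    public DropPredicateFilter(String predicateUri) {
        this.unwanted = NodeFactory.createURI(predicateUri);
    }

    public boolean keep(Triple t) {
        return !t.getPredicate().equals(unwanted);
    }
}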

The usual design for a reducer is that it groups statements by subject or object and loads them into a Jena model. The reducer then fills another model with statements, primarily by running SPARQL CONSTRUCT queries. The new model then gets flushed down the stream to the next processor. This is a comfortable style of programming, although SPIN might be even nicer.
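A sketch of that reducer shape, assuming the group's statements have already been gathered into a Jena Model (the CONSTRUCT query is just a placeholder based on the ball example earlier in the thread):

// Sketch of a group-by-subject reducer: run a CONSTRUCT over the group's
// model and hand the resulting model to the next stage of the pipeline.
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;

public class ConstructReducer {
    // Placeholder query: lift a color's red component up onto the ball itself.
    private static final String QUERY =
        "PREFIX rel: <urn:myont:rel:> " +
        "CONSTRUCT { ?ball rel:hasRedComponent ?r } " +
        "WHERE { ?ball rel:hasColor ?c . ?c rel:hasRedComponent ?r }";

    public Model reduce(Model group) {
        try (QueryExecution qe = QueryExecutionFactory.create(QUERY, group)) {
            return qe.execConstruct();   // the new model gets flushed downstream
        }
    }
}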

This model can be supplemented in many ways. For instance, a rulebox can be inserted into the model, or you could enhance the model by inserting triples from another source. Freebase, for instance, uses compound value types (CVTs) to create something like an "RDF Molecule", and if the CVTs were shoved into a key-value store in compressed form, you could get something with better efficiency and scalability than a conventional triple store. Reducers could reconstruct the "CVT Halo" around a subject and thus work with an enhanced graph.
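For the rulebox case, Jena's rule engine makes that a small wrapper around the model (a sketch; the rule itself is made up for illustration):

// Sketch: wrap a reducer's model in a rule-backed InfModel before querying it.
import java.util.List;
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.reasoner.Reasoner;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

public class RuleboxExample {
    public static InfModel withRules(Model base) {
        // Made-up rule: anything that has a color gets typed as a ColoredThing.
        String rules =
            "[colored: (?x <urn:myont:rel:hasColor> ?c) -> (?x rdf:type <urn:myont:class:ColoredThing>) ]";
        List<Rule> parsed = Rule.parseRules(rules);
        Reasoner reasoner = new GenericRuleReasoner(parsed);
        return ModelFactory.createInfModel(reasoner, base);
    }
}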

The above system is able to use SPARQL on single subjects or objects, but can't do general SPARQL queries.

That's one of the reasons I've been using Hadoop lately. Pig and Hive let you write computation pipelines by using relational operators and then automatically compile them to parallel Java code.

I've been using Pig extensively and the only complaint I have has to do with data structures. I'd really like to see something that uses native RDF data types. Right now I need to parse the RDF, convert it to standard data types, and then remember what everything is supposed to be when I'm writing RDF back out.

Another fundamental issue is identifier compression. If I'm working with Freebase or DBpedia, I can just whack off the prefix of the URI to lower the memory/disk/network cost of the data. It's even better if you can map the identifiers to integers (easy in either of those cases).

In general it would be nice to have some system that compresses URIs when they go into the system and decompresses them when they come out, just like most triple stores do. You can do this with a K/V store, or with algorithms based on sorting and joining. The second approach might be a little less flexible (in particular, it may be hard to look up identifiers that are specified in your code), but it ought to be scalable.
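A minimal in-memory sketch of the dictionary idea (a real pipeline would back this with a K/V store or a sort/join pass; the Freebase prefix in the comment is just an example):

// Sketch of dictionary encoding for identifiers: strip a known namespace
// prefix and intern what remains to a small integer id.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UriDictionary {
    private final String prefix;                        // e.g. "http://rdf.freebase.com/ns/"
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> keys = new ArrayList<>();

    public UriDictionary(String prefix) { this.prefix = prefix; }

    public int compress(String uri) {
        // keep only the local part when the URI is in the expected namespace
        String key = uri.startsWith(prefix) ? uri.substring(prefix.length()) : uri;
        Integer id = ids.get(key);
        if (id == null) {
            id = keys.size();
            keys.add(key);
            ids.put(key, id);
        }
        return id;
    }

    public String decompress(int id) {
        String key = keys.get(id);
        // crude heuristic: local parts carry no scheme, so re-attach the prefix
        return (key.contains("://") || key.startsWith("urn:")) ? key : prefix + key;
    }
}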

Working with Pig, you'll probably use both s-p-o triples and rows that look like SPARQL result sets.

If you don't want to use Pig, you can write your own mappers and reducers that take Triples as input. The exact same approach used in my old framework should work well.
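A skeleton of that in plain Hadoop terms (a sketch: the mapper keys each N-Triples line by its subject, which can never contain whitespace, so a reducer sees all of a subject's triples together and can load them into a Jena model as described above):

// Skeleton job pieces: group N-Triples lines by subject.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TriplesBySubject {

    public static class SubjectMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String s = line.toString().trim();
            if (s.isEmpty() || s.startsWith("#")) return;   // skip blanks and comments
            int firstSpace = s.indexOf(' ');                 // subject is an IRI or bnode, no spaces
            if (firstSpace < 0) return;                      // ignore malformed lines in this sketch
            ctx.write(new Text(s.substring(0, firstSpace)), new Text(s));
        }
    }

    public static class SubjectReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text subject, Iterable<Text> triples, Context ctx)
                throws IOException, InterruptedException {
            // A real reducer would parse these lines into a Jena Model and run
            // a CONSTRUCT; this skeleton just writes them back out grouped.
            for (Text t : triples) {
                ctx.write(subject, t);
            }
        }
    }
}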

I've not done MapReduce specifically, but I have done other parallel RDF work (on clusters and supercomputers), so I'm not sure whether this will apply to you directly (hopefully it does). I can say: yes, definitely use N-Triples or N-Quads. It takes all the complexity out of parsing and serializing, and you can split and concatenate the files without worrying about namespace declarations or about splitting right in the middle of a complex structure (as you might find in RDF/XML or Turtle).

Here is another MapReduce approach for Linked Data: mrlin - MapReduce processing of Linked Data