I developed a Map/Reduce framework that runs on a single computer to do the data processing that creates :BaseKB from the Freebase quad dump. I get nearly linear speedup on up to 8 CPU cores.
The system works in several modes. Mappers tend to be written as Java functions that operate on Jena Triple objects. For instance, if I want to filter out triples that have a certain predicate, or if I want to rewrite identifiers, I just write Java to do it.
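Here's a rough sketch of what that kind of mapper looks like. It is not the framework's actual code: SimpleTriple is a hypothetical stand-in for a Jena Triple so the example runs without any libraries, and TripleMappers, dropPredicate, rewritePrefix and run are names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for a Jena Triple: three plain strings.
final class SimpleTriple {
    final String subject;
    final String predicate;
    final String object;

    SimpleTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

final class TripleMappers {
    // Mapper that drops every triple whose predicate equals the given URI.
    static Function<SimpleTriple, Optional<SimpleTriple>> dropPredicate(String predicate) {
        return t -> t.predicate.equals(predicate) ? Optional.empty() : Optional.of(t);
    }

    // Mapper that rewrites identifiers by swapping one URI prefix for another
    // in the subject and object positions.
    static Function<SimpleTriple, Optional<SimpleTriple>> rewritePrefix(String from, String to) {
        return t -> Optional.of(
                new SimpleTriple(swap(t.subject, from, to), t.predicate, swap(t.object, from, to)));
    }

    private static String swap(String term, String from, String to) {
        return term.startsWith(from) ? to + term.substring(from.length()) : term;
    }

    // Applies the mappers in order; a triple is dropped as soon as one mapper returns empty.
    @SafeVarargs
    static List<SimpleTriple> run(List<SimpleTriple> input,
                                  Function<SimpleTriple, Optional<SimpleTriple>>... mappers) {
        List<SimpleTriple> out = new ArrayList<>();
        for (SimpleTriple t : input) {
            Optional<SimpleTriple> current = Optional.of(t);
            for (Function<SimpleTriple, Optional<SimpleTriple>> mapper : mappers) {
                current = current.flatMap(mapper);
            }
            current.ifPresent(out::add);
        }
        return out;
    }
}
```

Each mapper only ever sees one triple, which is why this step is easy to spread across cores.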
The usual design for a reducer is that it groups statements by subject or object and loads them into a Jena model. The reducer then fills up another model with statements, primarily using CONSTRUCT SPARQL queries. The new model then gets flushed down the stream to go to the next processor. This is a comfortable style of programming, although SPIN might be even nicer.
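The grouping half of that reducer design can be sketched roughly like this. Assumptions: the input is already sorted by subject, Stmt is a hypothetical stand-in for a Jena statement, and a plain Java function takes the place of the "load into a model and run CONSTRUCT queries" step, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical stand-in for a Jena statement: subject, predicate, object as strings.
final class Stmt {
    final String s;
    final String p;
    final String o;

    Stmt(String s, String p, String o) {
        this.s = s;
        this.p = p;
        this.o = o;
    }
}

final class SubjectGrouper {
    // Walks a subject-sorted list, hands each subject's statements to the reducer,
    // and concatenates whatever the reducer emits. In the design described above,
    // the reducer would build a model from the group and fill a second model
    // with CONSTRUCT results; here it is just a function.
    static List<Stmt> reduceBySubject(List<Stmt> sorted,
                                      BiFunction<String, List<Stmt>, List<Stmt>> reducer) {
        List<Stmt> out = new ArrayList<>();
        List<Stmt> group = new ArrayList<>();
        String current = null;
        for (Stmt t : sorted) {
            // A change of subject closes the current group.
            if (current != null && !current.equals(t.s)) {
                out.addAll(reducer.apply(current, group));
                group = new ArrayList<>();
            }
            current = t.s;
            group.add(t);
        }
        // Flush the last group, if there was any input at all.
        if (current != null) {
            out.addAll(reducer.apply(current, group));
        }
        return out;
    }
}
```

Only one subject's statements are in memory at a time, so the per-group model stays small no matter how big the dump is.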
This model can be supplemented in many ways. For instance, a rulebox can be inserted into the model or you could enhance the model by inserting triples from another source. Freebase, for instance, uses compound value types (CVTs) to create something like an "RDF Molecule" and if the CVTs were shoved into a key-value store in compressed form, you could get something with better efficiency and scalability than a conventional triple store. Reducers could reconstruct the "CVT Halo" around a subject and thus be able to work with an enhanced graph.
The above system is able to use SPARQL on single subjects or objects, but can't do general SPARQL queries.
That's one of the reasons I've been using Hadoop lately. Pig and Hive let you write computation pipelines by using relational operators and then automatically compile them to parallel Java code.
I've been using Pig extensively and the only complaint I have has to do with data structures. I'd really like to see something that uses RDF native data types. Right now I need to parse the RDF, convert it to standard data types, then remember what everything is supposed to be when I'm writing RDF.
Another fundamental issue is identifier compression. If I'm working with Freebase or DBpedia, I can just whack off the prefix of the URI to lower the memory/disk/network bandwidth of the data. It's even better if you can map the identifiers to integers. (That's easy in either of those cases.)
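A minimal sketch of the prefix trick, assuming one dominant namespace per data set. PrefixCompressor is a made-up name, and wrapping non-matching URIs in angle brackets is just one way to keep the mapping reversible.

```java
// Hypothetical helper: strips a known namespace prefix on the way in
// and restores it on the way out.
final class PrefixCompressor {
    private final String prefix;

    PrefixCompressor(String prefix) {
        this.prefix = prefix;
    }

    // URIs in the known namespace shrink to their local part; anything else
    // is kept whole inside angle brackets so it can be told apart later.
    String compress(String uri) {
        return uri.startsWith(prefix) ? uri.substring(prefix.length()) : "<" + uri + ">";
    }

    // Reverses compress(): bracketed values are unwrapped, bare values get the prefix back.
    String decompress(String value) {
        if (value.startsWith("<") && value.endsWith(">")) {
            return value.substring(1, value.length() - 1);
        }
        return prefix + value;
    }
}
```

With the prefix http://dbpedia.org/resource/, the URI http://dbpedia.org/resource/Berlin is stored as just Berlin.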
In general it would be nice to have some system that compresses URIs when they go into the system and decompresses them when they come out, just like most triple stores do. There's the option of doing this with a K/V store or by doing it with algorithms that involve sorting and joining. The second one might be a little less flexible (in particular it may be hard to look up identifiers that are specified in your code), but it ought to be scalable.
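The K/V store option can be sketched as a two-way dictionary. This is only an illustration: UriDictionary is a hypothetical name, and an in-memory HashMap stands in for whatever key-value store you'd really use at scale.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-way dictionary between URIs and dense integer ids.
final class UriDictionary {
    private final Map<String, Integer> idsByUri = new HashMap<>();
    private final List<String> urisById = new ArrayList<>();

    // Returns the existing id for a URI, or assigns the next free one.
    int encode(String uri) {
        Integer id = idsByUri.get(uri);
        if (id == null) {
            id = urisById.size();
            idsByUri.put(uri, id);
            urisById.add(uri);
        }
        return id;
    }

    // Looks the URI back up when data leaves the system.
    String decode(int id) {
        return urisById.get(id);
    }

    // Number of distinct URIs seen so far.
    int size() {
        return urisById.size();
    }
}
```

Because encode() works on any URI at any time, this style makes it easy to look up identifiers that are hard-coded in your program, which is the flexibility the sort-and-join approach gives up.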
Working with Pig you'll probably use both s-p-o triples and also rows that are like SPARQL result sets.
If you don't want to use Pig, you can write your own mappers and reducers that take Triples as input. The exact same approach used in my old framework should work well.