For a while now, I have been observing a trend that a considerable portion of Semantic Web research is about extremely large-scale data processing. Triple stores seem to compete primarily on how many billions of triples they can handle. Query engines have to cope with these workloads as well. Many recent papers on reasoning focus on how to efficiently handle hundreds of millions of statements. The EU project LarKC is dedicated to the development of a "platform for massive distributed incomplete reasoning". And, of course, there is the Billion Triple Challenge, which appears to me to be one of the most prominent events in the Semantic Web year.
Is very-large-scale data processing really such an important topic? Why? For whom? To me, it looks strange and without precedent, either on the "traditional" Web or in the good old "database-driven world". Certainly, there is Google and a handful of other companies doing processing at web scale, but they solve their particular problems themselves, and I believe that by far the majority of today's web applications deal with comparatively small data sets, focusing on making clever use of data rather than on handling ultra-large data masses.
So what do you think? Is it justified to have so much research in web-scale processing?
I am particularly interested in your opinion about the rationale for large-scale reasoning. Or, to put it this way: if I claimed that a reasoner that can efficiently handle 100,000 assertions will be sufficient for 99.99% of all use cases to come, what would a large-scale advocate answer?
Yes, it is justified, though I'm not sure that many of the systems built on the results of such research will necessarily be public, i.e. they will be part of Semantic Intranets rather than the Semantic Web.
It's a fact of our modern information age that most companies collect/generate vast quantities of data about their customers, clients, projects, research, etc., most of which lives in traditional RDBMS or data warehousing applications. Semantic Web technologies provide a way for a company to integrate all these disparate systems into one (even if only virtually, e.g. using D2R or similar to map an RDBMS to RDF). When they do this, large-scale reasoning is often very important for things like automated data clean-up and validation, automated entity disambiguation/matching, and the traditional reasoning use case, i.e. inferring new knowledge from your existing knowledge.
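To make the "inferring new knowledge from existing knowledge" point concrete, here is a minimal sketch in plain Python (not a real triple store, and all the `ex:` data is made up) of the kind of rule-based materialisation a reasoner performs, using the rdfs:subClassOf entailment rule:

```python
def materialise(triples):
    """Repeatedly apply the rdfs:subClassOf rule until no new triples appear:
    if (x rdf:type C) and (C rdfs:subClassOf D), then (x rdf:type D)."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        types = [(s, o) for s, p, o in inferred if p == "rdf:type"]
        for inst, cls in types:
            for sub, sup in subclass:
                new = (inst, "rdf:type", sup)
                if cls == sub and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred

# Hypothetical company data: one instance, one small class hierarchy.
data = {
    ("ex:acme", "rdf:type", "ex:Customer"),
    ("ex:Customer", "rdfs:subClassOf", "ex:LegalEntity"),
}
result = materialise(data)
# ("ex:acme", "rdf:type", "ex:LegalEntity") is now entailed
```

A production reasoner does essentially this fixpoint computation, only with many rules and billions of triples, which is where the scaling research comes in.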
Now, most companies are not going to want to make all this data publicly available, as much of it may be commercially sensitive, so perhaps you'll hear very little about such systems. I believe there is definitely a need for this kind of system, and they need some level of reasoning capability (even if only rule-based) to be truly effective, IMHO. Personally, I'm aware of several companies already building or using this kind of system, and I'm currently working on one myself.
Try searching for UCB Celltech/Amdocs in conjunction with SemTech for a couple of good examples of exactly what I've outlined. Also, if you're familiar with the BBC's Semantic Web stack backed by OWLIM, that's a more public-facing example of large-scale reasoning doing something useful!
From my viewpoint, there's a certain scale that Linked Data works at.
Think of the game of "20 questions"; if you can really find what I'm thinking of by asking 20 binary questions, that means there are a million or so things in our collective consciousness. That's not far off from the 3 million articles in Wikipedia.
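The arithmetic behind the "20 questions" point, as a quick sanity check:

```python
import math

# 20 yes/no questions can distinguish at most 2**20 items.
distinguishable = 2 ** 20  # 1,048,576 -- about a million, as claimed

# Conversely, picking one of Wikipedia's ~3 million articles would need
# ceil(log2(3,000,000)) binary questions.
questions_for_wikipedia = math.ceil(math.log2(3_000_000))  # 22
```

So "a million or so things" and "3 million articles" really are within a question or two of each other.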
Freebase has about 850 million triples worth of data in it. It's very easy to find a few billion triples, so this is a natural scale to be working at.
What's interesting about RDF is that it offers a lot of choices. You might have a super-expressive reasoner that works on 10,000 triples, and you might have a system with 10,000,000,000 triples and very light inference. These two might be part of the same system. There are lots of cool things you can do with a million triples, but you can't fault anybody for climbing the billion-triple mountain "because it is there".
My personal opinion is that federation is more important. Being able to query hundreds of decentralised, independent databases in a single operation is one of the key use cases of the Semantic Web. I always thought decentralisation was a major aspect of the Semantic Web. I guess it comes down to the sorts of questions people want to ask, but in my view, comparison and syndication come up pretty high, along with linking data across triple stores. The primary goal of reasoning/inference is to make this possible via materialisation and faceting.
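The federation idea can be sketched in a few lines of plain Python (a toy model with made-up `ex:` data, not real SPARQL endpoints): each "store" is a set of triples held by an independent party, and a federated query asks every store the same pattern and unions the answers.

```python
# Two independent stores; neither holds the complete picture.
store_a = {("ex:alice", "foaf:knows", "ex:bob")}
store_b = {("ex:bob", "foaf:knows", "ex:carol")}

def federated_match(stores, predicate):
    """Send the same pattern to every store and merge the results."""
    results = set()
    for store in stores:
        results |= {(s, o) for s, p, o in store if p == predicate}
    return results

links = federated_match([store_a, store_b], "foaf:knows")
# Only the federation, not either store alone, can answer "who knows whom?"
```

In real systems this is what SPARQL 1.1 federated queries do with the SERVICE keyword, dispatching sub-patterns to remote endpoints instead of local sets.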
A tool that works with a billion triples works very well with 100,000 triples. And as a person dealing with multi-billion-triple data sets, I very much appreciate the effort put into scaling past that size. In my opinion, a billion-triple data set is just not that uncommon; Excel, for example, allows more data points than that in a single worksheet.
We should not need to turn off features just to be able to deal with data size. Especially if your data is not unreasonably big ;)
Research into moving the point where one needs to make trade-offs is appreciated and, for me, still needed.
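The Excel comparison above holds up arithmetically: since Excel 2007, a single worksheet has 2^20 rows and 2^14 columns, so its cell count comfortably exceeds a billion.

```python
# Worksheet limits for Excel 2007 and later.
rows, cols = 2 ** 20, 2 ** 14   # 1,048,576 rows x 16,384 columns
cells = rows * cols             # about 17 billion cells per worksheet
```

So a "billion-triple" data set is smaller than what a single spreadsheet can nominally hold.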
I would argue that the original question is phrased backwards - ask not what big data processing can do for the semantic web, but what the semantic web can do for big data processing.
Rhetoric aside, I think big data and the semantic web are associated with each other because the semantic web promises to make HUGE amounts of structured information in any domain easily available. So the next question is what to do with all that data: well, I suppose we had better process it, do something with it, and show what it can do for our use cases.
I think that as the Semantic Web catches on, and people want to do more, and more powerful development tools become available, it will become evident that 100,000 triples is a rather paltry amount - imagine how many triples this site has!