I've got collections of documents of many types (web pages, blog entries, press releases, scientific paper abstracts, music lyrics) that contain terms that exist in Linked Data, concepts like
There are a number of commercial products that occasionally find these terms in text, such as OpenCalais, Zemanta, and AlchemyAPI, and there's also a remarkable open-source product called DBpedia Spotlight. What they all have in common is that the results are awful. My hunch is that all of these have a poor precision-recall curve, and that the commercial products suppress recall to (somewhat) avoid embarrassing problems in the precision department.
Tom Teague from OpenCalais wrote a blog article asking, "Is Open Calais Dead?" expressing the idea that "the market for semantic extraction is (im)mature."
How do we make this market clear? On one hand, I think we need much better text-analysis products; on the other hand, other barriers stand in the way of Linked Data use. When I introduced an API that lets people find photographs of DBpedia terms, I found that many potential users had no idea how to reconcile DBpedia identifiers with something they could use.
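For what it's worth, a first step of that reconciliation is often just string manipulation, since DBpedia resource URIs mirror the English Wikipedia article names they were derived from. A minimal sketch (the helper names are mine, not from any particular library):

```python
from urllib.parse import unquote

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"

def dbpedia_to_wikipedia(uri: str) -> str:
    """Map a DBpedia resource URI to the Wikipedia page it was derived from."""
    if not uri.startswith(DBPEDIA_PREFIX):
        raise ValueError("not a DBpedia resource URI: " + uri)
    return WIKIPEDIA_PREFIX + uri[len(DBPEDIA_PREFIX):]

def dbpedia_label(uri: str) -> str:
    """Recover a human-readable label from the URI's local name."""
    local = uri[len(DBPEDIA_PREFIX):]
    return unquote(local).replace("_", " ")
```

For anything beyond labels (types, abstracts, links) you'd query the DBpedia SPARQL endpoint instead, but this alone gets a URI into a form people recognize.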
The core problem with the text extraction / linked data combo is disambiguation. Text extraction tools have become quite good at picking up significant terms and named entities (people, places, etc.). But even once you've recognized that the ngram "Chicago" is a location, you're still nowhere near figuring out which linked data concept it corresponds to (at the risk of stating the obvious: there's more than one place called Chicago).
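To make the ambiguity concrete, here's a toy candidate lookup in Python. A real system would build the surface-form dictionary from Wikipedia redirects, disambiguation pages, and anchor text; the entries below are just illustrative:

```python
# Toy surface-form dictionary: one mention string, several candidate URIs.
CANDIDATES = {
    "chicago": [
        "http://dbpedia.org/resource/Chicago",              # the city in Illinois
        "http://dbpedia.org/resource/Chicago_(band)",       # the rock band
        "http://dbpedia.org/resource/Chicago_(2002_film)",  # the musical film
    ],
}

def candidates(surface_form: str) -> list:
    """Return every linked-data concept a mention could refer to."""
    return CANDIDATES.get(surface_form.lower(), [])
```

Recognition hands you the key ("Chicago is a named entity"); the hard part is choosing among the values.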
There are tricks and heuristics that can be applied to help with disambiguation: things like co-occurrence analysis on terms occurring in the text and/or on sibling concepts in the linked data source. This can get you a long way, but as far as I know there are no ready-made tools that support that kind of thing (though perhaps GATE Developer comes close; it's still on my to-do list to check it out again, as it's been a while since I last played with it).
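A sketch of that co-occurrence idea, with made-up context profiles standing in for whatever you'd actually mine from the candidate's article text or from sibling concepts in the linked data source:

```python
# Hypothetical bags of words associated with each candidate concept; in
# practice these would be harvested from the linked data source itself.
PROFILES = {
    "http://dbpedia.org/resource/Chicago":        {"city", "illinois", "lake", "mayor"},
    "http://dbpedia.org/resource/Chicago_(band)": {"band", "album", "rock", "horns"},
}

def disambiguate(context_words, candidate_uris):
    """Pick the candidate whose profile overlaps most with the document context."""
    ctx = {w.lower() for w in context_words}
    def score(uri):
        return len(ctx & PROFILES.get(uri, set()))
    return max(candidate_uris, key=score)
```

Given the context "the band released a new album", the band wins; given "the city on Lake Michigan", the city does. Real systems weight the overlap (e.g. TF-IDF) rather than counting raw matches, but the shape of the heuristic is the same.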
Don't you think the market is very clear? If I'm not wrong, what you're talking about is the core problem of web search, text classification, etc. It's also certainly one of the main concerns of the Semantic Web: since most human knowledge will be published not as semantic data but in natural language, we need tools to make the links between natural language and ontologies.
Using controlled vocabularies and ontologies to tackle the disambiguation and text classification problems gives very interesting results according to recent papers (you answered on my post where I cite some projects about this), but since this is the core business of many companies (starting with Google..), it seems that good AND "free" tools aren't ready to come out, no?
Until then, I guess many of us are working on this very problem in our different labs. I'd be happy to share about it and join forces if there's any interest.
For reference, within GATE I found a "free" disambiguation tool, WSDGate, but it's a supervised one.