Obvious noise in DBpedia

I found some obvious noise in the DBpedia data (version 3.6). For example, http://dbpedia.org/page/News redirects to http://dbpedia.org/page/Network_News_Transfer_Protocol, and there are pages such as http://dbpedia.org/page/Been and http://dbpedia.org/page/Into, along with many other disambiguation pages ...

Is there any way to get rid of this noise?

The problem was related to splitting the Wikipedia URI into its namespace and article name parts (e.g. "File:Example.jpg" into "File" and "Example.jpg").

http://en.wikipedia.org/wiki/News: redirects to http://en.wikipedia.org/wiki/Network_News_Transfer_Protocol

When creating the DBpedia URI for the subject of this triple, the Java method "News:".split(":") is called. This returns the one-element Array("News"), just as if there were no colon at all, because split drops trailing empty strings by default. Calling split(":", -1) instead returns Array("News", ""), from which the correct subject URI http://dbpedia.org/resource/News: can be constructed.
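The difference between the two calls can be checked with a small standalone Java sketch. The class and method names (SplitDemo, naiveSplit, fullSplit) are made up for illustration and are not taken from the extraction framework; only String.split itself is the real API.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the DBpedia extraction framework.
final class SplitDemo {

    // Default split: trailing empty strings are dropped from the result.
    static String[] naiveSplit(String title) {
        return title.split(":");
    }

    // A negative limit keeps trailing empty strings in the result.
    static String[] fullSplit(String title) {
        return title.split(":", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(naiveSplit("News:")));            // [News]
        System.out.println(Arrays.toString(fullSplit("News:")));             // [News, ]
        System.out.println(Arrays.toString(fullSplit("File:Example.jpg")));  // [File, Example.jpg]
    }
}
```

With the default call, "News:" and "News" become indistinguishable after splitting, which is how the redirect ended up attached to the wrong subject.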

This bug is now fixed in the extraction framework.

DBpedia is extracted from an unverified, open, highly dynamic corpus, collaboratively edited by millions(?) of independently operating (and sometimes impudent) users. Noise (in many forms) is to be expected, and overcoming all forms of it is nigh-on impossible.

Still though, DBpedia is a pretty impressive dataset! (...at least from the perspective of someone who was sick of looking at livejournal exports.)

My advice is to tolerate its quirks (much like you would on Wikipedia)...

(...unless it's a quirk introduced by the DBpedia guys, and not coming from the underlying corpus, like their previous use of foaf:img for arbitrary resources... but that's another story.)


EDIT: So, it's not a quirk of Wikipedia, but a quirk of DBpedia. Apologies.

The particular case you speak of is because of the following URL in Wikipedia:

http://en.wikipedia.org/wiki/News://

...which redirects to the "Network News Transfer Protocol" article you speak of. It seems that DBpedia does some cleanup of URIs, which stripped the trailing characters off the above News:// and so incorrectly concluded that News redirects to Network_News_Transfer_Protocol.

Yes, the solution is pretty simple. There are two ways of doing this:

  1. Create reports highlighting the issues, discuss them with the Wikipedia community, and get the data source that the DBpedia data comes from cleaned up
  2. Head over to Wikipedia and fix the problems yourself

On top of that, there can be editorial policies that make the translation of Wikipedia into RDF tricky. Here too, I suggest bringing up the inconsistencies you see, proposing fixes, and discussing things with the community.

I cannot provide a general solution. However, why not use Category:News?