Configurable Semantic Web Crawlers

RobVesse · November 21, 2010, 11:00pm

Can anyone recommend some good Semantic Web crawlers?

Preferably they should be configurable to limit depth and if possible the type of links (i.e. predicates) that should be followed.

The more suggestions the better as I'm looking to compare the results of various crawlers as part of an experiment I'm designing.

I'm aware of the following already:

Anyone know of any others?

Note: I'm aware of the similar existing question (http://www.semanticoverflow.com/questions/1447/working-linked-data-crawler), but it lists only two possibilities.

Edit: Eliminated RDF Crawler from the list, as it is so old that it only supports the old version of RDF (its RDF API appears to be from 2000) and so can't actually crawl any RDF properly.

JeenBroekstra · November 21, 2010, 11:00pm

Have a look at the Aperture project. It contains crawlers, parsers and (RDF) metadata extractors for a plethora of formats, so it might be useful.

Barna · November 21, 2010, 11:00pm

Just read about Extractiv recently: http://extractiv.com/

It looks like their API is free for up to 1000 URLs.

Disclaimer: I have nothing to do with this company...

tobyink · November 21, 2010, 11:00pm

I don't think Kjetil is still maintaining it, but:

http://search.cpan.org/~kjetilk/RDF-Scutter/

It doesn't allow depth to be configured, but it does allow you to set a limit on the total number of URLs retrieved (which it crawls breadth-first). The only predicate it follows is rdfs:seeAlso, but since it is open source it should be pretty easy to patch to follow other predicates. (It's just a single SPARQL query to pull out the links, so you'd just need to add a few UNIONs.)
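To make the idea concrete, here is a rough Python sketch (not RDF::Scutter's actual Perl code or its real query) of a link-extraction query built from UNION branches, plus a breadth-first crawl with a cap on total URLs and an optional depth limit. The names `build_link_query`, `crawl`, `fetch_links`, `max_urls` and `max_depth` are all made up for illustration. `fetch_links` stands in for whatever fetches a URL, parses the RDF and runs the query against it.

```python
from collections import deque

RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def build_link_query(predicates):
    """Build one SPARQL query with a UNION branch per predicate to follow."""
    branches = ["{ ?s <%s> ?link }" % p for p in predicates]
    return "SELECT DISTINCT ?link WHERE { " + " UNION ".join(branches) + " }"


def crawl(seeds, fetch_links, max_urls, max_depth=None):
    """Breadth-first crawl from the seed URLs.

    fetch_links(url) is a hypothetical callable returning the URLs linked
    from url (e.g. the ?link bindings of the query above).
    Stops after max_urls URLs; links are not followed beyond max_depth.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # retrieved, but do not follow its links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

For example, `build_link_query([RDFS_SEE_ALSO, OWL_SAME_AS])` gives a query that follows both predicates, and `crawl(["http://example.org/foaf.rdf"], fetch_links, max_urls=100, max_depth=2)` would stop at 100 URLs or two hops from the seed, whichever comes first (the seed URL here is just a placeholder).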