Apache Nutch

Apache Nutch is a highly extensible and scalable open-source web crawler. It is written entirely in Java, but its data is stored in language-independent formats. Its modular architecture lets developers create plug-ins for media-type parsing, data retrieval, querying, and clustering. The fetcher (the "robot" or "web crawler") was written from scratch specifically for this project.

StormCrawler

StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is under Apache licens ...