Diffbot

Why Diffbot? We're focused exclusively on getting you better web data. Some of the reasons hundreds of customers make (hundreds of) millions of calls every month:

The Web's Best Content Extractor:

Diffbot works automatically, without rules or training. There's no better way to extract data from web pages. See how Diffbot stacks up against other content extraction methods: Feature Comparison, Text-Extraction Quality Shootout.

Identify Pages Automatically:

Use the Analyze API to automatically find and extract all products, articles, discussions or images while crawling any site.
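As a rough illustration of what a single Analyze call might look like, the sketch below only builds the request URL and sends nothing. The endpoint `https://api.diffbot.com/v3/analyze` and the `token` and `url` query parameters are assumptions that do not appear in the text above, and `build_analyze_url` is a hypothetical helper name; check the Analyze API documentation before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed base URL for the Analyze API; not stated in the text above.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_analyze_url(token: str, page_url: str) -> str:
    """Return a GET URL asking the Analyze API to classify and extract page_url.

    The parameter names `token` and `url` are assumptions for this sketch.
    """
    # urlencode percent-encodes the target URL so it survives as a query value.
    query = urlencode({"token": token, "url": page_url})
    return f"{ANALYZE_ENDPOINT}?{query}"


if __name__ == "__main__":
    # Placeholder token and page; substitute real values before sending.
    print(build_analyze_url("YOUR_TOKEN", "https://example.com/page"))
```

The resulting URL can be fetched with any HTTP client. Keeping URL construction separate from the network call makes the request easy to inspect and test.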

Detailed product data:

The Product API automatically returns complete product info, including all pricing data, product IDs, brand and full specification tables.

Clean text and HTML:

Articles, discussion threads, product descriptions and image captions are returned as plain text and sanitized HTML.

Structured Search:

Search structured content from any crawl on the fly using our Search API, which returns only the matching results.

Plus...

- All APIs execute JavaScript, so content is parsed the way a regular browser would render it.
- Works on most non-English pages thanks to visual processing.
- Date normalization: datestamps are normalized and presented in RFC 1123 (HTTP/1.1) standard format.
- Multipage articles are automatically joined together in a single API response.
- Entity extraction: automatic tagging identifies major topics and entities within article text.
- Fix any issues in real time with the API Toolkit.
- The Bulk API allows the extraction of hundreds to hundreds of thousands of pages.
- Access Crawlbot and Bulk job data in full JSON or CSV formats.
- Optionally crawl using a diverse array of IP addresses.
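Because datestamps arrive in RFC 1123 format, client code can parse them with standard library tools. The sketch below is one possible approach using Python's `email.utils.parsedate_to_datetime`. The helper name `parse_rfc1123` and the sample datestamp are made up for illustration and are not taken from an actual API response.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_rfc1123(datestamp: str) -> datetime:
    """Parse an RFC 1123 datestamp into a timezone-aware UTC datetime."""
    parsed = parsedate_to_datetime(datestamp)
    if parsed.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


if __name__ == "__main__":
    # Hypothetical datestamp in RFC 1123 form.
    print(parse_rfc1123("Wed, 18 Dec 2013 00:00:00 GMT").isoformat())
```

Converting to UTC up front keeps later comparisons and sorting independent of the zone each page originally used.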