I’m reduced to blegging (a lexical mashup of “blog” and “beg”). We run into this problem time and time again and have to tell customers we just don’t know the answer. If you do, please help by leaving a comment or sending me e-mail, and I’ll summarize on the blog.
How Can I Scrape Web Content?
I really don’t know a good general-purpose method of pulling the content out of arbitrary web pages and leaving the boilerplate, advertising, navigation, etc. behind.
The research papers I’ve seen work from large corpora of documents, building per-site background models to tell boilerplate apart from fresh content.
They also resort to techniques like 2D rendering just to group contiguous blocks of HTML together logically. And then there’s the problem of pulling out text that’s interrupted by advertisements, photos, figures, quotes, etc.
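To make it concrete, here’s a minimal sketch of the kind of shallow heuristic I mean as a baseline: keep blocks of text that are long and light on links, and drop short, link-heavy blocks (navigation, footers). It uses only Python’s standard-library `html.parser`; the block tags and thresholds are made-up illustrations, not anything from a real system.

```python
from html.parser import HTMLParser

# Tags we treat as block boundaries -- an illustrative choice, not exhaustive.
BLOCK_TAGS = {"p", "div", "td", "li", "blockquote"}

class BlockExtractor(HTMLParser):
    """Split a page into text blocks, tracking how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (text, chars_inside_links)
        self.text = []
        self.link_chars = 0
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        self.text.append(data)
        if self.in_link:
            self.link_chars += len(data)

    def flush(self):
        # Close out the current block, normalizing whitespace.
        text = " ".join("".join(self.text).split())
        if text:
            self.blocks.append((text, self.link_chars))
        self.text, self.link_chars = [], 0

def extract_content(html, min_len=80, max_link_ratio=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    p = BlockExtractor()
    p.feed(html)
    p.flush()
    return [text for text, links in p.blocks
            if len(text) >= min_len and links / len(text) <= max_link_ratio]
```

This is exactly the sort of thing that breaks on real pages (interleaved ads, content split across divs), which is why I’m asking whether something better exists off the shelf.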
Yes, I Can Tidy HTML
No, I Don’t Need Structured Parsing
Yes, I know there are tons of things out there that let you build custom rule-based scrapers for particular sites. It’s a pain, but we’ve done this for fixed content.
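The per-site rules we hand-code look roughly like the sketch below: grab the text inside the one container a given site happens to use for its articles. The `id` value `"article-body"` is a hypothetical example, not any real site’s markup, which is precisely the problem -- every site needs its own rule.

```python
from html.parser import HTMLParser

class SiteRule(HTMLParser):
    """Extract text from inside a single div with a known id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # div-nesting depth once we're inside the target
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def scrape(html, target_id="article-body"):
    p = SiteRule(target_id)
    p.feed(html)
    return " ".join("".join(p.chunks).split())
```

Fine for one site with stable markup; useless the moment you point it at 500 arbitrary search hits.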
But I’m talking about making a query on a search engine, downloading the HTML (or whatever) from the top 500 hits, and turning it into something you can feed to a natural language processing system for tasks like indexing or information extraction.
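The pipeline I have in mind is roughly the sketch below: take the result URLs from whatever search API you’re using (not shown here), fetch each page, and reduce it to plain text for downstream NLP. The `fetch` parameter is injectable so the download step can be swapped for `urllib`, a crawler, or a headless browser; the tag-stripping regexes are a crude stand-in for the real content extraction I’m asking about.

```python
import re
from urllib.request import urlopen

def to_plain_text(html):
    # Crude tag stripper: drop script/style bodies, then all tags.
    # This stands in for the real boilerplate-removal step.
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return " ".join(text.split())

def corpus_from_hits(urls, fetch=None):
    """Map each result URL to the plain text of its page."""
    if fetch is None:
        fetch = lambda u: urlopen(u).read().decode("utf-8", "replace")
    return {url: to_plain_text(fetch(url)) for url in urls}
```

Even this toy version dodges the hard parts: rendering AJAX, deduplicating, and deciding which of the stripped text was actually content.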
AJAX and Flash a Plus
Just thinking about getting the HTML as rendered given all the AJAX makes my head spin. I also hear that Flash allows some kind of searchability now, so there might be relevant text in there.
If there’s a service out there that does a decent job of this or software we can buy, we’ll gladly pony up. It’s not like we want to build this ourselves.