A lead from my previous post on scraping text out of boilerplate was the following paper, scheduled to be presented next month (specifically, Saturday 6 February 2010) right here in Brooklyn at ACM’s 3rd annual conference on web search and data mining, WSDM 2010:
- Kohlschütter, Christian, Peter Fankhauser, and Wolfgang Nejdl (2010) Boilerplate Detection Using Shallow Text Features. In WSDM ’10.
A Java API-based implementation of the system described in the paper is available under a generous Apache 2.0 license from:
- Google Code: Boilerpipe
It uses our favorite normalizer, NekoHTML, as a front end.
I’ll take it out for a spin in a separate review. The Java API nature and free licensing make it a very tempting option for the likes of us.
Kohlschütter et al. evaluate a range of features working directly over cleaned HTML for scraping content out of boilerplate like navigation and advertisements.
They do not consider HTML rendering solutions, which simplifies but also limits the applicability of the final system. Of course, there’s no reason you couldn’t plug in something like Crowbar (thanks for that link, too) and use its HTML instead of the direct download.
Newswire vs. Cleaneval
As Kohlschütter pointed out in a comment on the last post, he and his co-authors concluded that the Cleaneval bakeoff used highly non-representative web text, and thus wasn’t a good baseline. As an alternative, they collected a corpus of 250K English news articles downloaded from Google News links from different countries (USA, Canada, UK, South Africa, India, Australia) in four topics (world, technology, sports, entertainment) from over 7500 different sources.
The paper advertises the data’s availability from the following site:
But it’s not there yet (11 January 2010).
Four-Way and Two-Way Tasks
The authors consider the two-way task of boilerplate versus content and the four-way classification task of boilerplate, full-text/comments, headline, and supplemental.
Decision Tree Learner
They evaluated decision tree learners, though claim linear SVMs (fit with sequential minimal optimization, aka SMO) had similar performance.
They evaluated on a word-by-word basis, which is a kind of micro evaluation; a macro evaluation might instead score block by block.
The authors didn’t consider structure other than H, P, DIV, and A.
The features that seemed to work involved number of words and density of text relative to links.
The only collection-level features they use are global word frequencies (thus getting at something like stop-list features).
They consider all sorts of local features: average word length, total number of words, average sentence length (using a simple heuristic sentence detector), ratio of periods to other characters, number of capitalized words, ratio of capitalized to non-capitalized words, and so on.
One of their most useful features is link density: the number of words in anchor (A) spans divided by the total number of words in the block.
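To make the word-count and link-density features concrete, here’s a minimal sketch of computing them over a raw HTML block. The regex-based tag handling is a simplification for illustration only; it’s not Boilerpipe’s actual implementation, and the class and method names are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of two of the paper's shallow features for a text block:
// word count and link density (anchor-text words / total words).
public class ShallowFeatures {

    static final Pattern ANCHOR =
        Pattern.compile("<a\\b[^>]*>(.*?)</a>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // count whitespace-separated tokens
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // words inside A spans divided by total words in the block
    static double linkDensity(String html) {
        int anchorWords = 0;
        Matcher m = ANCHOR.matcher(html);
        while (m.find())
            anchorWords += wordCount(m.group(1).replaceAll("<[^>]+>", " "));
        int total = wordCount(html.replaceAll("<[^>]+>", " "));
        return total == 0 ? 0.0 : anchorWords / (double) total;
    }

    public static void main(String[] args) {
        String nav = "<a href=\"/\">Home</a> <a href=\"/news\">World News</a>";
        String para = "The quick brown fox jumped over the <a href=\"#\">lazy</a> dog.";
        System.out.println(linkDensity(nav));   // navigation: every word is anchor text
        System.out.println(linkDensity(para));  // content: mostly non-anchor words
    }
}
```

A navigation bar like the first example scores a link density near 1.0, while running text with an occasional link scores near 0, which is why the feature separates boilerplate from content so well.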
They also consider position in the document, though this is tricky with only the raw HTML rather than the rendered output, since a block’s visual position on the page may not match its position in the underlying HTML.
Quotient Features and Context
The authors extract these basic features from blocks, but then use the current (C), previous (P), and next (N) values in classification. Specifically, they find it useful to look at the quotient of the number of words in the previous block to the current block (their notation “P/C”).
The authors say “the text flow capturing variants of these features (number of words quotient, text density quotient, …), which relate the value of the current block to the previous one, provide the highest information gain, indicating that intradocument context plays an important role”.
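The quotient features are simple to sketch once the per-block counts are in hand. The method name and the zero-for-empty-block convention below are my assumptions, not the paper’s:

```java
// Sketch of the paper's "P/C" context feature: the quotient of the
// previous block's word count to the current block's.
public class QuotientFeatures {

    // wordsPrev / wordsCurrent; returns 0 when the current block is
    // empty, to avoid division by zero (an arbitrary convention here)
    static double wordCountQuotient(int wordsPrev, int wordsCurrent) {
        return wordsCurrent == 0 ? 0.0 : wordsPrev / (double) wordsCurrent;
    }

    public static void main(String[] args) {
        // hypothetical per-block word counts: nav, article, article, footer
        int[] blockWordCounts = {4, 120, 95, 3};
        for (int i = 1; i < blockWordCounts.length; i++) {
            double pc = wordCountQuotient(blockWordCounts[i - 1], blockWordCounts[i]);
            System.out.printf("block %d: P/C = %.2f%n", i, pc);
        }
    }
}
```

A small quotient means the current block is much longer than its predecessor (e.g., article text after a short nav block), which is exactly the kind of intradocument context the quote above is describing.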
It’d be nice if these papers came with a table of features with simple definitions to make it easy for someone to browse them.
Sequence Model Decoding?
The classification in this paper is apparently done independently in a block-by-block fashion. I’d think using a sequence model like structured SVMs or CRFs would provide higher accuracy. Given the small number of categories (i.e. boilerplate/not), it’d be very speedy, too, even with second-order models.
(Greedy) Information Gain
The authors rank features by information gain. The big winners are number of words quotient, text density quotient, number of words, and the link and text densities. Clearly these are related measures, but it’s prohibitive to consider all the possible subsets of features.
They conclude word count is most important. Coupled with link density, you’re most of the way to the best results achieved using all local features plus global frequency.
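In that spirit, a two-feature rule over just word count and link density might look like the following. The thresholds are illustrative guesses on my part, not the splits the authors’ decision tree actually learned:

```java
// A two-feature rule in the spirit of the paper's finding that word
// count plus link density gets most of the way to the full feature set.
public class TwoFeatureRule {

    // thresholds are made-up illustrations, not learned values
    static boolean isContent(int wordCount, double linkDensity) {
        if (linkDensity > 0.33) return false;  // link-heavy block: boilerplate
        return wordCount > 15;                 // short low-link block: boilerplate
    }

    public static void main(String[] args) {
        System.out.println(isContent(200, 0.02));  // long, few links -> content
        System.out.println(isContent(8, 0.9));     // short, link-heavy -> boilerplate
    }
}
```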
Or, Ask Ryan
Ryan Nitz, enterprise programmer par excellence, just happened to be over for dinner last night. I asked him if he’d dealt with this problem and he said he just used the ratio of punctuation to letters. That seems similar to but simpler and more portable than the stopword approach Aria Haghighi suggested in a comment in the last post.
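Ryan’s heuristic is easy to sketch as a single feature. Which characters count as punctuation is my assumption here:

```java
// Ratio of punctuation to letters in a block, per Ryan's heuristic.
// Running prose tends to score low; nav separators and copyright
// lines tend to score higher.
public class PunctRatio {

    static double punctToLetterRatio(String text) {
        int punct = 0, letters = 0;
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) letters++;
            // treat any non-letter, non-digit, non-space char as punctuation
            else if (!Character.isWhitespace(c) && !Character.isDigit(c)) punct++;
        }
        return letters == 0 ? Double.POSITIVE_INFINITY : punct / (double) letters;
    }

    public static void main(String[] args) {
        System.out.println(punctToLetterRatio("A normal sentence of running text."));
        System.out.println(punctToLetterRatio("Home | News | Sports | (c) 2010"));
    }
}
```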