A lead from my previous post on scraping text out of boilerplate was the following paper, scheduled to be presented next month (specifically, Saturday 6 February 2010) right here in Brooklyn at ACM’s 3rd annual conference on web search and data mining, WSDM 2010:
- Kohlschütter, Christian, Peter Fankhauser, and Wolfgang Nejdl (2010) Boilerplate Detection Using Shallow Text Features. In WSDM ’10.
Boilerpipe
A Java API implementation of the system described in the paper is available under a generous Apache 2.0 license from:
- Google Code: Boilerpipe
It uses our favorite normalizer, NekoHTML, as a front end.
I’ll take it out for a spin in a separate review. The Java API nature and free licensing make it a very tempting option for the likes of us.
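I haven’t run it yet, but judging from the quick start on the project site, basic usage looks roughly like the following sketch. The ArticleExtractor.INSTANCE.getText() call is from the project docs; the class name and the HTML fetching around it are my own.

```java
import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeSketch {
    public static void main(String[] args) throws Exception {
        // Fetch the raw HTML ourselves and hand Boilerpipe the string.
        URL url = new URL(args[0]);
        InputStream in = url.openStream();
        Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A");
        String html = scanner.hasNext() ? scanner.next() : "";
        scanner.close();

        // ArticleExtractor is tuned for news-style pages; the project also ships a
        // more generic DefaultExtractor.
        String mainText = ArticleExtractor.INSTANCE.getText(html);
        System.out.println(mainText);
    }
}
```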
The Approach(es)
Kohlschütter et al. evaluate a range of features computed directly over cleaned HTML for separating content from boilerplate such as navigation and advertisements.
HTML Only
They do not consider HTML rendering solutions, which simplifies the system but also limits its applicability. Of course, there’s no reason you couldn’t plug in something like Crowbar (thanks for that link, too) and use its rendered HTML instead of the direct download.
Newswire vs. Cleaneval
As Kohlschütter pointed out in a comment on the last post, he and his co-authors concluded that the Cleaneval bakeoff used highly non-representative web text and thus wasn’t a good benchmark. As an alternative, they collected a corpus of 250K English news articles downloaded from Google News links, drawn from over 7500 different sources in six countries (USA, Canada, UK, South Africa, India, Australia) and four topics (world, technology, sports, entertainment).
The paper advertises the data’s availability online, but it’s not there yet (11 January 2010).
Four-Way and Two-Way Tasks
The authors consider the two-way task of boilerplate versus content and the four-way classification task of boilerplate, full-text/comments, headline, and supplemental.
Decision Tree Learner
They evaluated decision tree learners, though they claim linear SVMs (fit with sequential minimal optimization, aka SMO) had similar performance.
Micro-F Evaluation
They evaluated on a word-by-word basis, which is a kind of micro evaluation; a macro evaluation might score block by block instead.
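To make the distinction concrete, here’s my own sketch (not from the paper) of the two averaging schemes for the two-way task, starting from hypothetical per-block word-level counts:

```java
// My own sketch (not from the paper) contrasting the two averaging schemes for the
// two-way task. Inputs are hypothetical per-block word-level counts for the
// "content" class: true positives, false positives, and false negatives.
public class MicroMacroF {

    static double f1(double tp, double fp, double fn) {
        if (tp == 0.0) return 0.0;  // avoid 0/0 when there are no true positives
        double precision = tp / (tp + fp);
        double recall = tp / (tp + fn);
        return 2.0 * precision * recall / (precision + recall);
    }

    // Micro-F: pool the word-level counts over all blocks, then compute a single F1,
    // so every word token counts once.
    static double microF1(int[] tp, int[] fp, int[] fn) {
        double sumTp = 0.0, sumFp = 0.0, sumFn = 0.0;
        for (int i = 0; i < tp.length; i++) {
            sumTp += tp[i];
            sumFp += fp[i];
            sumFn += fn[i];
        }
        return f1(sumTp, sumFp, sumFn);
    }

    // Macro-F: compute F1 per block, then average, so every block counts equally
    // no matter how many words it contains.
    static double macroF1(int[] tp, int[] fp, int[] fn) {
        double sum = 0.0;
        for (int i = 0; i < tp.length; i++) {
            sum += f1(tp[i], fp[i], fn[i]);
        }
        return sum / tp.length;
    }

    public static void main(String[] args) {
        // Toy counts for three blocks.
        int[] tp = {100, 3, 40};
        int[] fp = {5, 2, 10};
        int[] fn = {10, 1, 5};
        System.out.printf("micro-F1 = %.3f, macro-F1 = %.3f%n",
                          microF1(tp, fp, fn), macroF1(tp, fp, fn));
    }
}
```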
Features
The authors didn’t consider HTML structure other than the H, P, DIV, and A tags.
The features that seemed to work involved number of words and density of text relative to links.
The only global, collection-level features they use are global word frequencies (thus getting at something like stop-list features).
They consider all sorts of local features, such as average word length, total number of words, average sentence length (using a simple heuristic sentence detector), ratio of periods to other characters, number of capitalized words and ratio of capitalized to non-capitalized words, etc. etc.
One of their most useful features is link density, which is the number of words inside anchor (A) spans divided by the total number of words in the block.
They also consider position in the document, though this is tricky with only the raw HTML rather than the rendered output, as a block’s visual position on the rendered page may not match its position in the underlying HTML.
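To make a couple of these concrete, here’s a sketch of the word-count and link-density features over a hypothetical Block representation of my own (the paper works over blocks of cleaned HTML):

```java
// Sketch of two of the workhorse block-level features: total word count and link
// density. The Block class is hypothetical; in practice the tokens would come from
// a block of cleaned HTML, with a flag marking which tokens fall inside anchor (A) spans.
public class BlockFeatures {

    static class Block {
        final String[] words;      // word tokens in the block
        final boolean[] inAnchor;  // true if the corresponding token is inside an <a> span

        Block(String[] words, boolean[] inAnchor) {
            this.words = words;
            this.inAnchor = inAnchor;
        }
    }

    static int numWords(Block b) {
        return b.words.length;
    }

    // Link density: number of words inside anchor spans divided by total words.
    static double linkDensity(Block b) {
        if (b.words.length == 0) return 0.0;
        int anchorWords = 0;
        for (boolean in : b.inAnchor) {
            if (in) anchorWords++;
        }
        return anchorWords / (double) b.words.length;
    }

    public static void main(String[] args) {
        // A navigation-like block: every word is inside a link, so link density is 1.0.
        Block nav = new Block(
            new String[] {"Home", "World", "Sports", "Entertainment"},
            new boolean[] {true, true, true, true});
        System.out.printf("words=%d, linkDensity=%.2f%n", numWords(nav), linkDensity(nav));
    }
}
```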
Quotient Features and Context
The authors extract these basic features from blocks, but then use the current (C), previous (P), and next (N) values in classification. Specifically, they find it useful to look at the quotient of the number of words in the previous block to the current block (their notation “P/C”).
The authors say “the text flow capturing variants of these features (number of words quotient, text density quotient, …), which relate the value of the current block to the previous one, provide the highest information gain, indicating that intradocument context plays an important role”.
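In code, these context features are just ratios of a basic feature’s value across adjacent blocks. Here’s a minimal sketch of the number-of-words quotient; the naming and the zero guard are mine:

```java
// Minimal sketch of a context ("quotient") feature: the value of a basic feature in
// the previous block divided by its value in the current block (the paper's "P/C"),
// here using word count as the basic feature. Names and the zero guard are mine.
public class QuotientFeatures {

    static double numWordsQuotient(int previousNumWords, int currentNumWords) {
        return currentNumWords == 0 ? 0.0 : previousNumWords / (double) currentNumWords;
    }

    public static void main(String[] args) {
        int[] wordsPerBlock = {12, 85, 90, 7};  // toy sequence of block word counts
        for (int i = 1; i < wordsPerBlock.length; i++) {
            System.out.printf("P/C for block %d = %.2f%n",
                              i, numWordsQuotient(wordsPerBlock[i - 1], wordsPerBlock[i]));
        }
    }
}
```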
It’d be nice if these papers came with a table of features with simple definitions to make it easy for someone to browse them.
Sequence Model Decoding?
The classification in this paper is apparently done independently in a block-by-block fashion. I’d think using a sequence model like structured SVMs or CRFs would provide higher accuracy. Given the small number of categories (i.e. boilerplate/not), it’d be very speedy, too, even with second-order models.
(Greedy) Information Gain
The authors rank features by information gain. The big winners are number of words quotient, text density quotient, number of words, and the link and text densities. Clearly these are related measures, but it’s prohibitive to consider all the possible subsets of features.
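For reference, the information gain of a (binarized) feature is just the drop in label entropy after splitting on it, IG(Y;X) = H(Y) - H(Y|X). Here’s a sketch over hypothetical block counts:

```java
// Sketch of information gain for a binarized feature X and the two-way label Y
// (content vs. boilerplate): IG(Y;X) = H(Y) - H(Y|X). The counts are hypothetical:
// count[x][y] = number of blocks with feature value x and label y.
public class InfoGain {

    static double entropy(double... counts) {
        double total = 0.0;
        for (double c : counts) total += c;
        double h = 0.0;
        for (double c : counts) {
            if (c > 0.0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2.0));  // entropy in bits
            }
        }
        return h;
    }

    static double informationGain(double[][] count) {
        double[] labelTotals = new double[2];
        double total = 0.0;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                labelTotals[y] += count[x][y];
                total += count[x][y];
            }
        }
        double hY = entropy(labelTotals[0], labelTotals[1]);
        double hYgivenX = 0.0;
        for (int x = 0; x < 2; x++) {
            double nX = count[x][0] + count[x][1];
            hYgivenX += (nX / total) * entropy(count[x][0], count[x][1]);
        }
        return hY - hYgivenX;
    }

    public static void main(String[] args) {
        // Toy counts: feature "more than 10 words" vs. label content/boilerplate.
        double[][] count = { { 30.0, 250.0 },    // x = 0 (short block): 30 content, 250 boilerplate
                             { 400.0, 60.0 } };  // x = 1 (long block):  400 content, 60 boilerplate
        System.out.printf("IG = %.3f bits%n", informationGain(count));
    }
}
```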
Conclusions
They conclude that word count is the most important feature. Coupled with link density, it gets you most of the way to the best results achieved using all local features plus global frequency.
Or, Ask Ryan
Ryan Nitz, enterprise programmer par excellence, just happened to be over for dinner last night. I asked him if he’d dealt with this problem, and he said he just used the ratio of punctuation to letters. That seems similar to, but simpler and more portable than, the stopword approach Aria Haghighi suggested in a comment on the last post.
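I don’t know exactly how Ryan computes it, but I take the heuristic to be something like the following per-block ratio; the punctuation set and any thresholding are my guesses:

```java
// My guess at Ryan's heuristic: boilerplate blocks (menus, link lists, footers) tend
// to have little sentence punctuation relative to running text. The punctuation set
// and the idea of thresholding the ratio per block are my own assumptions.
public class PunctuationRatio {

    static double punctToLetterRatio(String block) {
        int punct = 0;
        int letters = 0;
        for (int i = 0; i < block.length(); i++) {
            char c = block.charAt(i);
            if (Character.isLetter(c)) {
                letters++;
            } else if (".,;:!?".indexOf(c) >= 0) {
                punct++;
            }
        }
        return letters == 0 ? 0.0 : punct / (double) letters;
    }

    public static void main(String[] args) {
        System.out.println(punctToLetterRatio("Home | About | Contact | Login"));
        System.out.println(punctToLetterRatio(
            "The quick brown fox jumped over the lazy dog. Then it ran away."));
    }
}
```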
January 30, 2010 at 4:22 pm
The data mentioned in the paper is now online:
http://www.L3S.de/~kohlschuetter/boilerplate/
The collection L3S-GN1 contains 621 web pages crawled via Google News, both as raw HTML and as HTML including human annotations (headline, full text, comments, etc.).
You may use the collection freely for your research purposes. Have fun!
February 1, 2010 at 1:23 pm
Thanks. I managed to download it, but unpacking the gzipped bundle produces something in “WARC format”. Ironically, the header links to a source that’s 404:
http://www.archive.org/documents/WarcFileFormat-1.0.html
There’s only one reference to it on Wikipedia, and it doesn’t have its own page there; the reference links to:
http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
Does anyone know how to decode this stuff on Windows, Linux, or a Mac?
February 1, 2010 at 1:57 pm
WARC is an ISO-standardized format for the digital preservation of web pages (see the accompanying Readme-L3S-GN1.txt file for details).
Details about WARC can be found here:
http://bibnum.bnf.fr/WARC/
http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-print.php?page=Working%20with%20WARC%20Files#Data_Format_on_Disk
I used the WARC-related classes in Heritrix ( http://crawler.archive.org/ ) to create the WARC file.
However, reading the file is really easy (especially if you want it quick and dirty):
Once unpacked (or after using a GZIPInputStream), you just need to split at lines containing “WARC/1.0”.
Just like HTTP, the lines after WARC/1.0 are record-specific headers, which you may parse (e.g., to get the original document’s URI, given at WARC-Target-URI). The content following an empty line is then the raw HTML body.
For each Target-URI there are two consecutive entries: one is the original HTML and one is the HTML version that contains the human assessments.
If you want to do it “right”, you should check out the WARCReader class in Heritrix. It’s really straightforward to use.
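For the record, a rough sketch of the quick-and-dirty approach described above (untested against the actual L3S-GN1 file; everything beyond the WARC/1.0 delimiter and the WARC-Target-URI header is an assumption):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

// Rough sketch of the quick-and-dirty approach: treat each line starting with
// "WARC/1.0" as the beginning of a record, read the record headers up to the first
// empty line (grabbing WARC-Target-URI along the way), and collect the rest of the
// record as the payload, which for this corpus is the (raw or annotated) HTML.
public class QuickWarcReader {

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])), "UTF-8"));

        String targetUri = null;
        boolean inHeaders = false;
        StringBuilder body = new StringBuilder();

        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("WARC/1.0")) {
                emit(targetUri, body);      // flush the previous record, if any
                targetUri = null;
                inHeaders = true;
                body.setLength(0);
            } else if (inHeaders) {
                if (line.length() == 0) {
                    inHeaders = false;      // blank line ends the headers; payload follows
                } else if (line.startsWith("WARC-Target-URI:")) {
                    targetUri = line.substring("WARC-Target-URI:".length()).trim();
                }
            } else {
                body.append(line).append('\n');
            }
        }
        emit(targetUri, body);              // flush the final record
        reader.close();
    }

    // Here we just report the URI and payload size; real code would parse the HTML.
    static void emit(String targetUri, StringBuilder body) {
        if (targetUri != null) {
            System.out.println(targetUri + " : " + body.length() + " chars");
        }
    }
}
```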