Blegging for Help: Web Scraping for Content?

I’m reduced to blegging (a lexical mashup of “blog” and “beg”). We run into this problem time and time again and have to tell customers we just don’t know the answer. If you do, please help by leaving a comment or sending me e-mail, and I’ll summarize on the blog.

How Can I Scrape Web Content?

I really don’t know a good general-purpose method of pulling the content out of arbitrary web pages and leaving the boilerplate, advertising, navigation, etc. behind.

The research papers I’ve seen look at large corpora of documents and build background models for sites to pick out boilerplate from fresh content.

They also resort to techniques like 2D rendering just to put contiguous blocks of HTML together logically. Then there’s the problem of pulling out text that’s interrupted by advertisements, photos, figures, quotes, etc.

Yes, I Can Tidy HTML

I love tools like NekoHTML that do a reasonable job of converting any old crufty HTML from the web into XHTML so I can parse it with SAX. It’s what we use in our web demos for HTML input.
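
To give a sense of that step, here’s a minimal sketch of the usual NekoHTML SAX pattern (not our actual demo code): run the crufty HTML through NekoHTML’s SAX parser and collect whatever character content comes back.

    import org.cyberneko.html.parsers.SAXParser;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    // Minimal sketch: parse crufty HTML with NekoHTML's SAX parser and
    // collect the character content.
    public class TextDump extends DefaultHandler {
        private final StringBuilder buf = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            buf.append(ch, start, length);
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = new SAXParser();      // NekoHTML, tolerant of bad HTML
            TextDump handler = new TextDump();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(args[0]));  // file name or URL
            System.out.println(handler.buf.toString());
        }
    }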

No, I Don’t Need Structured Parsing

Yes, I know there are tons of things out there that let you build custom rule-based scrapers for particular sites. It’s a pain, but we’ve done this for fixed content.

But I’m talking about making a query on a search engine and downloading the HTML (or whatever) from the top 500 hits and trying to turn it into something you can feed to a natural language processing system to do something like indexing or information extraction.

AJAX and Flash a Plus

Just thinking about getting the HTML as rendered given all the AJAX makes my head spin. I also hear that Flash allows some kind of searchability now, so there might be relevant text in there.

We’ll Pay

If there’s a service out there that does a decent job of this or software we can buy, we’ll gladly pony up. It’s not like we want to build this ourselves.

50 Responses to “Blegging for Help: Web Scraping for Content?”

  1. Michael Scharkow Says:

    I use a slightly modified version of Bodytext Extractor (http://github.com/aidanf/BTE) which works pretty well for decently structured sites. It’s graph-based and completely unsupervised. No solution for Flash or AJAX, though.

    • lingpipe Says:

      As far as I can tell, BodyTextExtractor.py is just a general text extractor. That’s easy — I can do that with NekoHTML.

      I’m even comfortable getting rid of javascript sections.

      What I want to do is get rid of the boilerplate navigation and ads and deal with putting non-decently structured content back together.

      • Michael Scharkow Says:

        Not sure if I understood the task correctly, but: BTE extracts the part of the input html which has the highest content/tag ratio and consequently removes navigation, ads and headers/footers. What you get is just the main text of the page, ideally. I thought that was the goal.
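
        For concreteness, the objective is roughly the following (a minimal sketch of the idea, not the actual BTE code): mark every token as either text or tag, then pick the window that maximizes text tokens inside plus tag tokens outside.

            // Sketch of the BTE objective, not the real implementation:
            // isWord[k] is true for a text token, false for a markup tag.
            // Return the window [i, j] maximizing words inside + tags outside.
            static int[] bestWindow(boolean[] isWord) {
                int n = isWord.length;
                int totalTags = 0;
                for (boolean w : isWord) if (!w) totalTags++;
                int bestScore = Integer.MIN_VALUE;
                int[] best = {0, -1};
                for (int i = 0; i < n; i++) {
                    int wordsInside = 0, tagsInside = 0;
                    for (int j = i; j < n; j++) {
                        if (isWord[j]) wordsInside++; else tagsInside++;
                        int score = wordsInside + (totalTags - tagsInside);
                        if (score > bestScore) {
                            bestScore = score;
                            best = new int[] { i, j };
                        }
                    }
                }
                return best;  // token indices of the presumed main-content span
            }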

      • lingpipe Says:

        Thanks for the clarification. It wasn’t at all clear from the code that that’s what they were doing, but then I’m not exactly fluent in Python.

        That is an approximation to part of what I want, which is to get rid of clear non-text content.

  2. Shion Says:

    Try our service – 80legs. It’s a web-crawling service with most of the tools you need already available.

    • lingpipe Says:

      I just looked at the 80legs API doc:

      http://80legs.pbworks.com/80legs-API

      I don’t see anything like what I was asking for. Am I missing something?

      Am I just not making myself clear in terms of what I want?

      I want something that’ll take, for example, an arbitrary web page from a newspaper and return the actual news content. For instance, Google News must be doing that, because they find the title and the content (at least most of the time).

      • Shion Says:

        You can make your own 80Apps to fetch the exact content you want: http://80legs.pbworks.com/80Apps. We currently have semantic/NLP users that do this very thing with custom 80Apps.

        Of course, you can do something less sophisticated and use regular expressions through our pre-built regex 80App.

        The API is for programmatic management of 80legs jobs. The 80App framework is for making custom apps for your extraction needs.

  3. John Says:

    Have you looked at the book “Natural Language Processing with Python” by Steven Bird et al.? I haven’t read the book yet (I just discovered it this week) but it looks like it might answer your question.

    • lingpipe Says:

      Yes, I have looked at the NLTK book:

      http://www.nltk.org/book

      Steven Bird was a classmate and Ewan Klein my Ph.D. thesis supervisor! It’s a very small world in natural language processing. I got an advance copy and gave them some feedback. What’s neat is that NLTK is designed for teaching, and the book’s suitable for a complete novice. It’s aimed at a completely different audience than LingPipe; we pretty much presuppose the knowledge in NLTK in our docs, so it’d be a great place to start.

      As far as I remember or can see on their site and in the book, I don’t think NLTK does any web scraping. Certainly not of the kind I’m looking for.

  4. Kyle Maxwell Says:

    It’s far from perfect, but the readability bookmarklet ( http://lab.arc90.com/experiments/readability/ ) does what you want.

  5. Vít Baisa Says:

    If you want, I can supply you with my thesis and a tool that deals with boilerplate removal (and text extraction). The tool (WC Cleaner) gets slightly better results than BTE, is written in Python, and is completely language-independent.

    • lingpipe Says:

      Cool. This is the kind of tool I’m looking for. I found your MS thesis on this topic:

      http://is.muni.cz/th/139654/fi_m/thesis.txt?lang=en

      Is the code for WC Cleaner open sourced under a license less restrictive than GPL so that we can use it in commercial projects?

      I love the name. Clearly you know what it means (“water closet”), because you say in your thesis “we named the whole tool WC Cleaner with regard to its unenviable job.” It does mess up search a bit, though: without your surname, all I get are toilet-bowl cleaning products.

      • Vít Baisa Says:

        Now I know why I wrote my thesis in English instead of Czech. :) I’m glad for your feedback. If you are still interested, please don’t hesitate to contact me via email (google “baisa vít muni”) and we will surely work something out. :)

    • Emre Sevinç Says:

      Vít Baisa,

      I’d definitely like a copy of your software. Is it open source?

  6. Kirk Says:

    I’ve used some block segmentation algorithms that work fairly well. I’ve implemented this:

    Christian Kohlschütter, Wolfgang Nejdl: A densitometric approach to web page segmentation. CIKM 2008: 1173-1182

    It was quite easy to implement. The trouble is, as you suggest, selecting the blocks to keep and the ones to discard. Usually ranking blocks is fairly easy as the non-content blocks are small, but the threshold seems to be very specific to the site. Are your customers more concerned with precision or recall?
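
    The core of what I implemented is just a per-block text density plus a threshold, roughly like the sketch below (my paraphrase of the idea, with made-up names, not the paper’s code): wrap the block’s tokens at a fixed line width and score the block by tokens per wrapped line.

        // Rough sketch of block text density (tokens per wrapped line); the
        // names and the wrap width are assumptions, not the paper's code.
        static double textDensity(String blockText) {
            if (blockText.trim().isEmpty()) return 0.0;
            String[] tokens = blockText.trim().split("\\s+");
            int width = 80;                 // fixed wrap width
            int lines = 0, lineLen = 0;
            for (String tok : tokens) {
                if (lineLen + tok.length() + 1 > width) {
                    lines++;
                    lineLen = 0;
                }
                lineLen += tok.length() + 1;
            }
            if (lineLen > 0) lines++;
            return (double) tokens.length / lines;
        }

        // Keep only blocks above some (site-tuned) threshold:
        //   if (textDensity(block) > threshold) keep(block);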

    It looks like the same guy has a project here that you should check out:

    http://code.google.com/p/boilerpipe/

    • lingpipe Says:

      Thanks. This is the kind of thing I’ve seen and want a general purpose version of.

      Having tunable precision/recall is ideal, so we could live with ranking on content blocks, even if they’re only normalized sensibly on a per-page basis.

      At least initially, what we want is content blocks expanded around search hits. For instance, search for “iPod” on the web, download pages, then find blocks of running text containing the match.

      • Kirk Says:

        Yeah, it seems like an obvious case for some customer-side active learning system.

      • lingpipe Says:

        It would be, but in many of the general cases we’re looking at, there’s very little consistency across the page views. They’re just results of general searches. On the other hand, the same suspects (like Wikipedia) come up on web searches, so maybe you could peel off some part of the short head of the distribution this way.

      • Christian Kohlschütter Says:

        Hello,

        I am the author of “boilerpipe” and the two mentioned papers (“Boilerplate Detection using Shallow Text Features” at WSDM 2010 and “A Densitometric Approach to Web Page Segmentation” at CIKM 2008).

        Please let me know if the Boilerpipe code works for your purpose. Moreover, as a researcher, I am particularly interested in the cases where it *fails* to work :)

        Feel free to contact me if you have any questions.

    • lingpipe Says:

      Looks like the boilerpipe paper’s going to be given here in NY in a month:

      http://www.wsdm-conference.org/2010/

    • Emre Sevinç Says:

      I’m trying Christian Kohlschütter’s boilerpipe code with Dutch and Turkish news sites, magazines and blogs. So far I’m satisfied with the results. Thanks for making it open source!

  7. Aria Haghighi Says:

    You can often remove the portions with ads and non-content text by looking at the text in a given HTML tag and checking that it has roughly the right density of stop words. Typically ads and navigation menus lack closed-class words (determiners, prepositions, etc.). I’ve done this many times and it’s generally pretty effective when the main text consists of more or less complete sentences.
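
    Something along these lines is all it takes (a rough sketch; the stop list, tokenization, and cutoffs are placeholders you’d tune):

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.Set;

        // Sketch of the stop-word density check; the stop list and the
        // cutoffs below are placeholders, not tuned values.
        static final Set<String> STOPS = new HashSet<String>(Arrays.asList(
            "the", "a", "an", "of", "in", "on", "to", "and", "or", "is", "was"));

        static boolean looksLikeRunningText(String text) {
            String[] tokens = text.toLowerCase().split("\\W+");  // crude tokenizer
            if (tokens.length < 10) return false;                // too short to judge
            int stops = 0;
            for (String tok : tokens)
                if (STOPS.contains(tok)) stops++;
            double density = (double) stops / tokens.length;
            return density > 0.2 && density < 0.7;               // "near the right density"
        }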

    • lingpipe Says:

      That’s a nice set of small, easy to implement features. We really only want to pull out text that’s more or less complete sentences anyway.

    • Emre Sevinç Says:

      > Ok. I might package up the code I’ve used and release it GPL’d on my
      > website.

      I’d definitely give it a try. Can you share your code?

  8. Dr. Jochen L. Leidner Says:

    For easy cases, “lynx -dump <url> > <file>” is your friend.

    You might also be interested in these papers:

    http://pages.cs.wisc.edu/~anhai/papers/cyclex-icde08.pdf

    http://textpro.fbk.eu/docs/wac3-htmcleaner.pdf

    Neither of these is what I’d call universal.

    Have you come across the “CLEANEVAL” evaluation that turns HTML cleanup into a competition?

    http://cleaneval.sigwac.org.uk/

    I haven’t tried outsourcing, but apparently that’s another option:

    http://www.iwebscraping.com/Web_Scraping_Service.php

    • Christian Kohlschütter Says:

      There are several nice content extraction algorithms presented around CleanEval. However, I am not so confident in the quality of the assessed data.

      I have re-analyzed the CleanEval competition data (human assessments vs. automatic cleaning) and found that not much text has actually been cleaned off the raw text. In fact, keeping all text appeared to be a good “cleaning strategy” for that dataset…

      This either means that general-purpose web page cleaning is not a big deal at all (probably not) or it means that the dataset (and its assessments) are not “representative” enough (more likely).

      I have evaluated text cleaning (template/boilerplate removal) in the domain of news articles on the Web. Here, the “boilerpipe” strategy worked extremely well.

      Link to paper (WSDM 2010), code and test dataset:

      http://www.l3s.de/~kohlschuetter/boilerplate/

  9. T Says:

    If you really want structure, it’s hard to see how you can avoid crawling with an actual browser engine like WebKit and then taking the data after it’s been rendered. This takes care of JavaScript, Flash, etc. Layout is an important data point for some applications, and parsing it from the raw HTML seems harder than rendering it via WebKit.

  10. Andraz Tori Says:

    Oh, glad you blogged about this. It was on my mind for a long time. A perfect open source project that is just missing from the landscape.

    _lots_ of people and companies need this as a basic building block of their web analysis/search/crawling/something infrastructure, but no really good open source solutions exist. It would be interesting to have resources poured into an open solution instead of everyone building their own.

    The first one I stumbled upon was WebStemmer, back in 2007 – http://www.unixuser.org/~euske/python/webstemmer/
    It works by trying to cluster pages from the same domain, using tree patterns as features. It works pretty well when it works. The problem is that for some sites it just doesn’t produce anything at all. And learning is quite slow.

    Later we wrote our own at Zemanta, using libsvm2 and a flattening of the DOM parse tree. It works well enough, but we don’t have enough time to really tune it.

    One of the major issues is also getting correctly marked-up text to use for training the classifier. That’s really, really hard to find, especially examples where advertising sits inside blocks of valid text.

    This summer I’ve been referred to an article “Extracting article text from the web with maximum subsequence segmentation” http://www.www2009.org/proceedings/pdf/p971.pdf which reports some pretty interesting (and good looking) results.

    Haven’t yet found any open source implementation. So the field is still wide open for the general solution…

    bye
    Andraz Tori, Zemanta

  11. Jeffye Says:

    I also use NekoHTML, which can be used to remove whatever tags you like from HTML (including js). See the following code.
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // class-level flag: drop the contents of <script> elements?
    static boolean scriptdelTag = true;

    /**
     * Get all the text, including text in child nodes.
     * Parse the HTML into a DOM first with org.cyberneko.html.parsers.DOMParser.
     *
     * @param sb   buffer the content is appended to
     * @param node current DOM node from the parsed document
     */
    public static void getText(StringBuffer sb, Node node) {
        if (node == null) return;

        // delete the SCRIPT text
        String nodeName = node.getNodeName();
        if (scriptdelTag && nodeName != null && nodeName.equalsIgnoreCase("SCRIPT")) {
            return;
        }

        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }

        NodeList children = node.getChildNodes();
        if (children != null) {
            int len = children.getLength();
            for (int i = 0; i < len; i++) {
                getText(sb, children.item(i));
            }
        }
    }

  12. John Herren Says:

    I think Kevin Burton at http://spinn3r.com/ was working on this a while back. Might wanna ask him.

  13. Stuart Sierra Says:

    If you find such a thing, please post about it! I searched for months with no luck.

  14. Elliot Says:

    There’s an API for that: http://www.alchemyapi.com/api/text/

  15. a Says:

    http://simile.mit.edu/wiki/Crowbar

  16. Yoav Goldberg Says:

    I don’t have any code to offer, but here are some approaches I used:

    I second Aria’s suggestion of looking at the proportion of stop words; this worked well for me too, runs fast, and is easy to implement.

    The BTE algorithm is easy to implement and also worked well a few years ago, when most of the boilerplate/advertisements were at the top and bottom of the page. That has changed, and now there are many advertisements in the middle of the page as well, which BTE will include. I believe it can be modified to handle this case too.

    If you are willing to pay some extra network usage and processing time to increase accuracy, another method that worked well for me in the past was to fetch the page several times, and also to do a small crawl and fetch neighboring pages from the same website. The multiple versions of the same page are then used to identify the advertisements. The multiple pages from the same website are used to identify the boilerplate (we did this by collecting n-grams over the site’s pages; areas with many frequent n-grams were considered boilerplate. This had the nice side effect of identifying hidden spam content as well as the boilerplate).
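
    For the n-gram part, the flavor is something like the sketch below (names and the exact counting are simplified, not our actual code): count, for each n-gram, how many of a site’s pages it appears on; regions dominated by n-grams with high document frequency get treated as boilerplate.

        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.Map;
        import java.util.Set;

        // Sketch: document frequency of token n-grams across pages of one site.
        // N-grams that show up on most pages are boilerplate candidates.
        static Map<String, Integer> ngramDocFreq(Iterable<String[]> sitePages, int n) {
            Map<String, Integer> df = new HashMap<String, Integer>();
            for (String[] tokens : sitePages) {
                Set<String> seen = new HashSet<String>();
                for (int i = 0; i + n <= tokens.length; i++) {
                    StringBuilder gram = new StringBuilder();
                    for (int j = 0; j < n; j++)
                        gram.append(tokens[i + j]).append(' ');
                    seen.add(gram.toString());
                }
                for (String g : seen)
                    df.put(g, df.containsKey(g) ? df.get(g) + 1 : 1);
            }
            return df;
        }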

    For a small-scale task that started from Google queries, going to the Google cache and then to the “text only” version handled most of the advertisements, though the navigation remained. I am not sure whether doing this on a large scale is allowed under Google’s terms of service.

    I have no idea how to efficiently deal with AJAX. If anyone knows a way which does not require rendering the page through Firefox/Safari/IE, I am very interested in hearing about it.

  17. David R. MacIver Says:

    I don’t have any massively useful answers to provide (I’d love something like this as well), but here are some leads.

    Readtwit (http://www.readtwit.com/) does something like this. I’d describe the results as OK but not outstanding. Still, it might be worth asking them what they use.

    Anand Kishore (@semanticvoid on Twitter) seems to have done something like this recently. I’m not aware of the project being open source (or commercially available), but it might be worth asking. He’s also made his training set available: http://semanticvoid.com/blog/2009/08/22/web-content-extraction-dataset/

  18. JO'N Says:

    I don’t think anyone’s mentioned Webstemmer yet. It’s in Python, and its basic algorithm is to compare all the pages on a site and remove the structures shared across all of them. It expects to be pointed at a web site to crawl, but since it’s open source, its behavior can be changed. In my experience it’s been pretty effective when you’re crawling a site with a reasonable number of pages that share a distinct structure, and when you’re willing to let it learn (or update, when necessary) its model for each site.

  19. David R. MacIver Says:

    I asked @semanticvoid about that, and he pointed me towards http://www.alchemyapi.com/api/text/ which seems to do what you need.

    • Yoav Goldberg Says:

      Hmm… when tested on itself, it doesn’t seem to work very well at filtering navigation elements. Same for some other random web pages. Works well for CNN, though!

      • David R. MacIver Says:

        I tried it on a few random pages. It seemed to do reasonably well, though I found it got confused by pages with a large amount of navigation relative to their content, and in some other cases it missed the edges of the content.

        Generally it seemed to be good enough to run analysis on, though probably not quite good enough to give perfect human readable output.

  20. David R. MacIver Says:

    Oh, that’s annoying. That comment references a previous one that’s still held in moderation. Must have hit a number-of-links filter the first time.

  21. Kevin Harris Says:

    I created a Java implementation of the algorithm outlined in the paper “Text extraction from the web via text-to-tag ratio” (http://www.uni-weimar.de/medien/webis/research/workshopseries/tir-08/proceedings/18_paper_652.pdf). It actually works quite well. We are using LingPipe heavily to extract information from the results. If you are interested, I can post it on my website, or you can email me for a copy of the code at harriskevine@ou.edu.

  22. Keyvan Says:

    Someone mentioned Readability. I ported it a while ago to PHP and have a web-service available which tries to pick out content from HTML pages: http://fivefilters.org/content-only/

    There’s a newer version released late January which is supposed to be more accurate: http://blog.arc90.com/2010/01/26/introducing-readability-1-5/

  23. Chris Says:

    I’ve spent the past nine months studying (and waiting out) the market to see what products would sell best via the Internet. I’ve come across a few price- and product-scraping services that all seemed just about the same in pricing and results… though some simply provided the data in one big dump while others provided a more readable format for an additional cost.

    The one that seems most attractive to us at this time is Mozenda. They are the least expensive for the same level of results as a number of others. If you don’t have budget issues, there are a few that can definitely provide more robust and powerful results. Those are worth considering down the road for us, just not yet.

    Good luck!

  24. Dori Stein Says:

    I think you should also read about scraping tools and how to compare them on http://www.fornova.net/blog/?p=18

  25. Web text extraction systems: How to get the main text of an arbitrary web page | FZ Blogs Says:

    [...] I have seen that the developer of the famous Lingpipe software also looked for a similar thing: Blegging for Help: Web Scraping for Content? [...]

  26. What are some ways to extract the main text from an blog entry using Python? - Quora Says:

    [...] Stenström There are lots of good suggestions in the comment to this blog post: http://lingpipe-blog.com/2010/01… [...]

  27. fminer Says:

    You can try our web scraping product, FMiner: http://www.fminer.com .
    Right now we are building a FREE extraction project for every new user.
