Archive for the ‘Reviews’ Category

Information Gain and Task-Based Costs: Biased versus Spam Annotators

November 22, 2010

Following on from Sheng et al.’s Get Another Label? paper, Panos and crew have a Human Computation Workshop paper,

The motivation for the new paper is to try to separate bias from accuracy. That is, some annotators would try hard, but the bias in their answers would give them an overall low exact accuracy. But their responses can still be useful if we correct for their bias. Luckily, the Dawid-and-Skene-type models for estimating and adjusting for annotator’s accuracies does just that.

G, PG, R, or X?

As an example, Ipeirotis et al. consider having workers on Amazon Mechanical Turk classify images into G, PG, R, and X ratings following MPAA ratings guidelines.

This is really an ordinal classification problem where the approach of Uebersax and Grove (1993) seems most appropriate, but let’s not worry about that for right now. We can imagine a purely categorical example such as classifying tokens in context based on part-of-speech or classifying documents based on a flat topic list.

Bias or Inaccuracy?

Uebersax and Grove discuss the bias versus accuracy issue. It’s easy to see in a two-category case that sensitivity and specificity (label response for 1 and 0 category items) may be reparameterized as accuracy and bias. Biased annotators have sensitivities that are lower (0 bias) or higher (1 bias) than their specificities.

What Ipeirotis et al. point out is that you can derive information from a biased annotator if their biases can be estimated. As they show, a model like Dawid and Skene’s (1979) performs just such a calculation in its “weighting” of annotations in a generative probability model. The key is that it uses the information about the likely response of an annotator given the true category of an item to estimate the category of that item.

Decision-Theoretic Framework

Ipeirotis et al. set up a decision-theoretic framework where there is a loss (aka cost, which may be negative, and thus a gain) for each decision based on the true category of the item and the classification.

One nice thing about the image classification task is that it makes it easy to think about the misclassification costs. For instance, classifying an X-rated image as G and having it land on a children’s site is probably a more costly error than rating a G image as PG and banning it from the same site. In the ordinal setting, there’s a natural scale, where rating an X-rated image as R is closer to the true result than PG or G.

Getting Another Label

Consider a binary setting where the true category of item i is c_i, the prevalence (overall population proportion) of true items is \pi, and annotator j has sensitivity (accuracy on 1 items; response to positive items) of \theta_{1,j} and a specificity (accuracy on 0 items; 1 – response to negative items) of \theta_{0,j}. If y_{i,j} is the response of annotators j \in 1{:}J, then we can calculate probabilities for the category c_i by

\mbox{Pr}[c_i = 1 | y_i] \propto \pi \times \prod_{j=1}^J \theta_{1,j}^{y_{i,j}} \times (1 - \theta_{1,j})^{1 - y_{i,j}} and

\mbox{Pr}[c_i = 0 | y_i] \propto (1 - \pi) \times \prod_{j=1}^J \theta_{0,j}^{1 - y_{i,j}} \times (1 - \theta_{0,j})^{y_{i,j}}.

The thing to take away from the above formula is that it reduces to an odds ratio. Before annotation, the odds ratio is \pi / (1-\pi). For each annotator, we multiply the odds ratio by the annotator’s ratio, which is \theta_{1,j} / (1 - \theta_{0,j}) if the label is y_{i,j} = 1, and is (1 - \theta_{1,j})/\theta_{0,j} if the label is negative.

Random Annotators Don’t Affect Category Estimates

Spammers in these settings can be characterized by having responses that do not depend on input items. That is, no matter what the true category, the response profile will be identical. For example, an annotator could always return a given label (say 1), or always choose randomly among the possible labels with the same distribution (say 1 with 20% chance and 0 with 80% chance, correspdonding, say, to just clicking some random checkboxes on an interface).

In the binary classification setting, if annotations don’t depend on the item being classified, we have specificity = 1 – sensitivity, or in symbols, \theta_{0,j} = 1 - \theta_{1,j} for annotator j. That is, there’s always a \theta_{1,j} chance of returning the label 1 no matter what the input.

Plugging this into the odds ratio formulation above, it’s clear that there’s no effect of adding such a spammy annotator. The update to the odds ratios have no effect because \theta_{1,j}/(1 - \theta_{0,j}) = 1 and (1-\theta_{1,j})/\theta_{0,j}=1.

Cooperative, Noisy and Adversarial Annotators

In the binary case, as long as the sum of sensitivity and specificity is greater than one, \theta_{0,j} + \theta_{1,j} > 1, there is positive information to be gained from an annotator’s response.

If the sum is 1, the annotator’s responses are pure noise and there is no information to be gained by their response.

If the sum is less than 1, the annotator is being adversarial. That is, they know the answers and are intentionally returning the wrong answers. In this case, their annotations will bias the category estimates.

Information Gain

The expected information gain from having an annotator label an item is easily calculated. We need to calculate the probability of true category and then probability of response and figure out the odds of each and the contribution to our overall estimates.

The random variable in question is c_i, the category of item i. We will assume a current odds ratio after zero or more annotations of \phi/(1-\phi) and consider the so-called information gain from observing y_{i,j}, the label provided for item i by annotator j,

\mbox{H}[c_i] - \mathbb{E}[\mbox{H}[c_i|y_{i,j}]],

where the expectation is with respect to our model’s posterior.

Expanding the expectation in the second term gives us

\mathbb{E}[\mbox{H}[c_i|y_{i,j}]] = \mbox{Pr}[y_{i,j} = 1] \times \mbox{H}[c_i|y_{i,j}=1] + \mbox{Pr}[y_{i,j} = 0] \times \mbox{H}[c_i|y_{i,j}=0]

The formulas for the terms inside the entropy are given above. As before, we’ll calculate the probabilities of responses using our model posteriors. For instance, carrying this through normalization, the probabilities on which the expectations are based are

\mbox{Pr}[y_{i,j} = 1] = Q_1/(Q_1+Q_2), and

\mbox{Pr}[y_{i,j} = 0] = Q_2/(Q_1 + Q_2), where

Q_1 = \phi \times \theta_{1,j} \ + \ (1-\phi) \times (1 - \theta_{0,j}), and

Q_2 = (1-\phi) \times \theta_{0,j} \ + \ \phi \times (1 - \theta_{1,j}).

so that the probability the annotation is 1 is proportional the sum of the probability that the true category is 1 (here \phi) and the response was correct (\theta_{1,j}) and the probability that the true category is 0 (1 - \phi) and the response was incorrect (1 - \theta_{0,j}).

It’s easy to see that spam annotators who have \theta_{0,j} = 1 - \theta_{1,j} provide zero information gain because as we showed in the last section, p(c_i|y_{i,j}) = p(c_i) if annotator j provides random responses, then \mbox{H}[c_i|y_{i,j}] = \mbox{H}[c_i].

Decision-Theoretic Framework

Ipeirotis et al. go one step further and consider a decision-theoretic context in which the penalities for misclassifications may be arbitrary numbers and the goal is minimizing expected loss (equivalently maximizing expected gain).

Rather than pure information gain, the computation would proceed through the calculation of expected true positives, true negatives, false positives, and false negatives, each with a weight.

The core of Bayesian decision theory is that expected rewards are always improved by improving posterior estimates. As long as an annotator isn’t spammy, their contribution is expected to tighten up our posterior estimate and hence improve our decision-making ability.

Suppose have have weights L_{k,k'}, which are the losses for classifying an item of category k as being of category k'. Returning to the binary case, consider an item whose current estimated chance of being positive is \phi. Our expected loss is

\mathbb{E}[\mbox{loss}] = L_{1,1}  \phi^2 + L_{1,0}  \phi  (1-\phi) + L_{0,1} (1-\phi)  \phi + L_{0,0} (1-\phi)^2.

We are implicitly assuming that the system operates by sampling the category from its estimated distribution. This is why \phi shows up twice after L_{1,1}, once for the probability that the category is 1 and once for the probability that’s the label chosen.

In practice, we often quantize answers to a single category. The site that wants a porn filter on an image presumably doesn’t want a soft decision — it needs to either display an image or not. In this case, the decision criterion is to return the result that minimizes expected loss. For instance, assigning category 1 if the probability the category is 1 is \phi leads to expected loss

L_{1,1} \phi + L_{0,1} (1-\phi)

and the loss for assigning category 0 is expected to be

L_{0,0} (1-\phi) + L_{1,0} \phi.

The decision is simple: return the result corresponding to the smallest loss.

After the annotation by annotator j, the positive and negative probabilities get updated and plugged in to produce a new estimate \phi', which we plug back in.

I’m running out of steam on the \LaTeX derivation front, so I’ll leave the rest as an exercise. It’s a hairy formula, especially when unfolded to the expectation. But it’s absolutely what you want to use as the basis of the decision as to which items to label.

Bayesian Posteriors and Expectations

In practice, we work with estimates of the prevalence \pi, sensitivities \theta_{1,j}, and specificities \theta_{0,j}. For full Bayesian inference, Gibbs sampling lets us easily compute the integrals required to use our uncertainty in the parameter estimates in calculating our estimate of \mbox{Pr}[c_i = 1|y_i] and its uncertainty.

Confused by Section 4

I don’t understand why Ipeirotis say, in section 4,

The main reason for this failure [in adjusting for spam] is the inability of the EM algorithm to identify the “strategic” spammers; these sophisticated spammers identify the class with the highest class prior (in our case the “G” class) and label all the pages as such.

Huh? It does in my calculations, as shown above, and in my experience with the Snow et al. NLP data and our own NE data.

One problem in practice may be that if a spammer annotates very few items, we don’t have a good estimate of their accuracies, and can’t adjust for their inaccuracy. Otherwise, I don’t see a problem.

Grammar as Science

September 29, 2010

I had an hour free time on the University of Chicago campus, so after strolling the quad and gawking at the Robie house, I toddled into the university bookstore. After marveling at the number of X and Philosophy titles*, I wandered over to the linguistics section, where I found:

I was struck by the beginning of the back-cover blurb:

This introductory text takes a novel approach to the study of syntax. Grammar as Science offers an introduction to syntax as an exercise in scientific theory construction.

If I’m not mistaken, this is implying that other introductions to syntax are exercises in something other than science. I’ll buy that. But I’m not sure I buy the next bit:

Syntax provides an excellent instrument for introducing students from a wide variety of backgrounds to the principles of scientific theorizing and scientific thought; it engages general intellectual themes present in all scientific theorizing as well as those arising specifically within the modern cognitive sciences. …

This book aside, I doubt Chomskyan linguistics is a good exemplar of science for reasons that (part of) Peggy Speas’s endorsement illuminates:

[Grammar as Science] shows students explicitly how to ‘think like a linguist.’ Students who use this book will come away with an extraordinarily strong grasp of the real underpinnings of linguistics.

The key here is “think like a linguist”. If you’ve ever tried to take Chomsky’s conception of grammar from Aspects and explain it to students (which I’m embarrassed to admit that I have done in the past), you’d have your doubts about the relation between “think like a scientist” and “think like a linguist”, too.

I was further amused by (the beginning of) Norbert Hornstein’s endorsement, also on the back cover:

What makes modern generative linguistics a science, and an interesting one at that, is a subtle interaction among several factors: the questions that it asks, the abstractions that it makes to answer these questions, the technical apparatus it uses to make these questions precise, the evidence it uses to adjudicate between different answers, and the aesthetic that animates it when putting all these ingredients together.

Silly me. Here I was thinking what makes something interesting science is predictive power.

Show, Don’t Tell

One piece of very good advice for writers in any genre is show, don’t tell.

Sci-fi writers and linguists seem to have a particularly hard time following this advice. I’m afraid the newbie/off-worlder/woken-up-from-cryosleep listener who has everything explained to him obeys the letter but not the spirt of the dictum. That’s similar to “serious” texts where thin characters have a “dialogue” that’s a proxy for the author explaining to the reader. Examples include Gödel, Escher, Bach in logic/AI (which I loved when I was in high school when it came out) and Rhyme and Reason in linguistics, which covers very similar “philosophy of science” ground as Grammar as Science.

I’ve never seen a physics or chemistry or even a biology book start off by asserting that what follows is really science. Most scientists, and most scientific texts, just do science.

Methinks the linguists do protest too much.

Been There, Done That

If you look at the first chapter of the second HPSG book, Carl Pollard (this was his axe to grind) preaches the same gospel, as in this excerpt from page 7:


* The following are real, though I’m too lazy to link past the first: Anime and Philosophy, Star Wars and Philosophy, Alice in Wonderland and Philosophy, The Matrix and Philosophy, Facebook and Philosophy, Transformers and Philosophy, Batman and Philosophy, The Undead and Philosophy and so on (and on and on). Is this some kind of works project for philosophers?

McCandless, Hatcher & Gospodnetić (2010) Lucene in Action: 2nd Ed. (Review)

August 16, 2010

I’m very happy to see that the 2nd Edition is out:

  • McCandless, Michael, Erik Hatcher, and Otis Gospodnetić. 2010. Lucene in Action. Second Edition. Manning Press.

Manning’s offering 40% off until September 30, 2010. Follow the link to the book and use code lingpipeluc40 when you check out.

You Need This Book

If you’re interested in version 3 or later of the Apache Lucene search library, you need this book. As far as I know, this book is the only way to come up to speed on the intricacies of Lucene.

What if I have the First Edition?

Keep it for your projects that require 2.x versions of Lucene. So much has changed in version 3 that is not backward compatible with previous versions that the first edition of this book will be of almost no use. Every example has changed. It’s pretty much a whole new book (other than the first page where it describes Lucene as simple — that used to be true, but is not any longer).

I’ve been coding in Lucene since version 1.2 (2002), and the javadoc’s been getting better over time. I still needed the code samples from this book to get a basic index up and running and to figure out how to enumerate the terms given an analyzer. (In the first edition of the book, I contributed a case study on n-gram-based analysis, so it’s not like I was unfamiliar with the whole process!)

Is it a Good Book?

I’m going to violate the “show, don’t tell” rule of writing and answer my own question with an unequivocal “yes”. Now let me show you why.

The authors clearly know their Lucene. Not surprising, given that they’re three of the core developers for Lucene. (Erik Hatcher was also involved in the Ant build system, and has another Manning … in Action book for that.) Following the Manning Press in Action methodology, they lay everything out with very clear examples that you can run. As such, the book’s better than good — it’s really great set of code examples presented in a coherent style and order, with easy-to-follow explanations.

The book covers a wide range of introductory and advanced topics in a way that’s about as easy to follow as possible. It starts from simple indexing and query construction and then moves into more advanced topics like spell checking and term vector more-like-this implementations. It does all of this with examples, cleverly sidestepping most of the (relatively simple) mathematics underlying search and scoring, while still conveying the basic principles.

They cover extras like the Tika document framework, where they provide a very clear intro to the SAX XML parsers and a nice overview the Jakarta Commons Digester (I dislike the config-and-reflection-based Digester so much that I built LingPipe’s delegating and delegate handlers as an in-the-code alternative).

If you follow the examples in this book and try them out you’ll be in a good position to use Lucene in a serious application. Even better, you’ll be prepared to join the project’s amazingly helpful and responsive mailing list (for years now, the first response to most newbie questions has been to read Lucene in Action!).

They cover many more advanced topics than the last book, like the range of file-level locking options and their use in a more transactional setting than basic Lucene provides itself.

Needs a Revision

Reading the book, I can’t shake the guilt arising from not sending them feedback on drafts. I had the drafts lying around in PDF form for ages and kept promising to send Otis feedback. As is, this book needs another editorial pass. (That’s not to say it’s not fantastic as is — all the glitches in the book I saw were relatively minor and didn’t get in the way of the code examples actually illustrating what’s going on with Lucene.)

The initial two lines of a paragraph are repeated on pages xxx and xxxi. There are also typos like inserted question marks, as in “book?two hundred years ago” on p. xxxii.

There are mistakes in code fragments, like a missing comma in the parse() example on p. 239.

There are also bugs in code. I only studied a few programs in the term vector section (which is great) and in the Tika section. Listing 5.18 has a bug in that it calls with a maximum number of hits of 10 (below pushpin 6), whereas the method is specified to take a configurable max (above pushpin 3).

There are dangling references like “uncomment the output lines toward the end of the docsLike method” (on p. 195), where there are no output lines toward the end of the said method (on p. 193).

I’d be in favor of getting rid of the content-free text like “It’s time to get our feet wet!” and so on, while you’re at it.

What’s Missing?

There’s a huge range of contributed packages surrounding Lucene these days, coupled with an ever-expanding range of applications to which Lucene is being put. As a result, the authors face a daunting curatorial challenge.

From my perspective, the biggest gap is around Unicode and how it behaves with respect to normalization and search. It’s more than just stripping accents from Western characters.

JUnit Examples

Don’t get me wrong. I love JUnit. We use it for unit testing in LingPipe. And I recommend reading our tests to help understand our classes. I just don’t think it’s a good way to present examples in an intro book. On top of that, they’re using the old style of JUnit rather than the newer style with annotations for tests and startup/teardown methods.

What about the Code?

Authors of textbooks trying to introduce an API are presented with the difficult problem of balancing simplicity with solid coding practice. I have to fight my own tendency to err on the latter side. The authors favor simplicity in most cases, as they explain up front. The authors realize that users are likely to start from cut-and-pasted versions of their code, and try to warn people about the shortcuts they took for the sake of clarity. (We do the same thing in our tutorials.)

Many (but not all) methods that throw exceptions are just declared to throw Exception. I think this does readers a disservice in not knowing what’s actually throwing what kind of exception. Others might argue it makes the code cleaner and less cluttered. The authors explain that’s their motivation and point out proper exception etiquette. We do the same thing in some of our tutorial code.

The use of generics is spotty, with allocations like new TreeMap() that should be new TreeMap<String,Integer> and subsequent use of non-generic Iterator instances. As a result, they have to cast their gets and can’t use auto-boxing. This just clutters up the code using the maps and makes their intended contents opaque.

I’d recommend autoboxing, or its explicit result, Integer.valueOf(int), over the older style new Integer(int) (the latter always creates a new instance, whereas valueOf() being a factory can share instances).

I’d like to see more foreach and fewer explicit iterators.

They don’t like ternary operators or min/max operators, resulting in awkward fragments like:

int size = max;
if (max > hits.scoreDocs.length) size = hits.scoreDocs.length;

rather than the simpler

int size = Math.min(max,hits.scoreDocs.length);

I don’t know if this is the different authors’ example coding style (they all know how to write real production code, like Lucene itself!).

There is a more serious issue for floating point precision they try to deal with in listing 5.22 in the last if/else block. They try to avoid having cosines take on values outside of (-1,1), which would be illegal inputs to acos(). Their approach is insufficient. The only way to guarantee that your natural cosine computations fall in this range is to do:

double boundedX = Math.max(-1.0, Math.min(1.0, x));

That’s another problem we ran into for our k-means clustering implementation in LingPipe. Once the vectors get big, it’s easy to get results greater than 1 for close matches.

Also, if you’re worried about efficiency, replace Math.sqrt(x) * Math.sqrt(y) with Math.sqrt(x * y).

Layout, Etc.

I’m not a big fan of Manning’s approach to code formatting and presentation. Whole programs are printed out, often across multiple pages, and a numbered push-pin like graphic in very heavy bold is used to pick them out and cross-reference them in the text. I prefer the interleaved style I’ve been using, and you see in something like Bloch’s Effective Java, but I can see that it’s a matter of taste.

The code formatting also bugged me. Rather than breaking before assignment equal signs, they break after. They use two characters of indent, which is hard to scan. And the code’s too light for the rest of the text (the “color” doesn’t balance in the typographical sense).

The real typographical sin this book commits is breaking class and method names and file paths and URLs across lines. A class name like IndexWriter is apt to show up with Index- at the end of one line and Writer at the start of the next. The reason this is bad is that hyphens are legal characters in paths and may be confused with underscores in class names (hyphens aren’t legal in Java identifiers).

How Much Math?

For the most part, the authors stick to concrete examples and shy away from abstract mathematics (and related algorithm analyses). If you want to understand the math behind comparison or the probability models behind compressed indexes, I’d still recommend Witten, Moffat and Bell’s Managing Gigabytes (Morgan Kaufmann text, so it’s prohibitively expensive). Mannnig, Raghavan and Schütze’s more recent Introduction to Information Retrieval (Cambridge Uni. Press let them leave an online version up for free!) has most of the math and many more recent topics related to language processing such as edit distance algorithms and their relation to spell checking, probabilistically motivated retrieval, and a range of classifiers like K-nearest neighbors and support vector machines.

In the cosine example in the “Leveraging Term Vectors” (section 5.10), they define cosine in a formula without ever defining the length or dot product functions. If someone needs cosine defined for them, they probably need the definitions for basic vector operations. In the example code, they assume that the word vector has only a single instance of each word, which is certain to be confusing for novice geometers. I would’ve just stuck with largest cosine for the sort, rather than smallest angle and skipped the call to Math.acos(), the arc (or inverse) cosine operation (which they later describe as prohibitive in terms of computation costs, though it’ll only make a difference percentage-wise if the term vectors are short).

For spell checking, it’d have been nice to get up to something like the noisy channel model. Peter Norvig provides an elegant introduction in his chapter in the Beautiful Data book from O’Reilly. Lucene’s spell checking is still pretty simple and word based (rather than frequency based), though the authors provide some suggestions on how to work with what’s there.

What about Solr?

Now that the Solr search client-server project is merging with Lucene, it’ll be interesting to see if the third edition includes information about Solr. This book only mentions it in passing in a page-long section 10.8.

Other Books on Lucene

I haven’t seen Smiley and Pugh’s 2009 Solr 1.4 Enterprise Search Server, but it has positive feedback on Amazon.

Manu Konchady’s book Building Search Applications with Lucene, LingPipe, and Gate is more introductory (more like the O’Reilly NLTK book), and needs an update for Lucene 3 and LingPipe 4.

About those Cover Illustrations

On page xxxii, they tell the story of an editor buying the book from which the cover art is drawn. They say the cover page is missing so they couldn’t track it down, but they mention the date and artists. I just did a search for the terms they mention, [ottoman empire clothing 1802 william miller] and found two copies of the book for sale at a Donald Heald Books (you may need to search again). Here’s the scoop

[DALVIMART, Octavien (illustrator) – William ALEXANDER (1767-1816)].

The Costume of Turkey, illustrated by a series of engravings; with descriptions in English and French

London: printed for William Miller by T. Bensley, 1802 [but text watermarked 1796 and 1811; plates 1811 and 1817). Folio (14 1/8 x 10 1/2 inches). Parallel text in French and English. Letterpress title with integral hand-coloured stipple-engraved vignette, 60 hand-coloured stipple-engraved plates by J. Dadley or William Poole after Octavien Dalvimart. Contemporary green straight-grained morocco, covers bordered in gilt and blind, skilfully rebacked to style with the spine in six compartments with double raised bands, the bands highlighted with gilt tooling, lettered in gilt in the second compartment, the others with repeat decoration in gilt, oatmeal endpapers, gilt edges. Provenance: William Rees Mogg (Cholwell House, Somerset, England, armorial bookplate).

A good copy of this classic work on the costume of the Ottoman Empire.

Starting with the “Kislar Aga or first black unuch [sic.]” in the “Grand Signior’s Seraglio” the subjects covered are quite wide-ranging but centered on the inhabitants of Constantinople and those who were visiting the capital city: a “Sultana”, “Officers of the Grand Signior”, “Turkish woman of Constantinople”, “Turkish woman in provincial dress”. Dalvimart does make occasional forays out into the provincial areas of the empire: included are images of an “Albanian”, an “Egyptian Arab”, a “Bedouin Arab”, a “Dervise [sic.] of Syria”, an “Armenian”, a “Bosniac” as well as a number of fine plates of the female costume of the Greek Islands (which are much admired in the text). “The Drawings, from which … [the] plates have been engraved, were made on the spot … [in about 1798] by Monsieur Dalvimart, and may be depended upon for their correctness. They have been accurately attended to in the progress of the engraving; and each impression has been carefully coloured according to the original drawing, that the fidelity of them might not be impaired” (Preface). Abbey points out that the informative text is attributed to Willliam Alexander in the British Library catalogue.

Abbey Travel II,370; Atabey 313; cf. Colas I, 782; cf. Lipperheide I, Lb37; cf. Vinet 2337.

Sierra & Bates. Head First Java 2nd Ed. (Review)

August 9, 2010

After the comments on the last post about relying more heavily on Java textbooks, I ordered the most popular ones from Amazon. I’ll start the reviews with the number one most popular book on Java (ranked by Amazon sales; it also has 269 reviews; I’m happy to see Effective Java is number 4):

  • Sierra, Kathy and Bert Bates. 2005. Head First Java. 2nd Edition. O’Reilly.

Mashup: Elementary School Workbook + Kitschy Humor

Image from Title Page.

I grew up with texts like The Little LISPer, but this book’s even stranger. It strikes me as a mashup of an elementary school math problem workbook and one of those cutesy humor books you see near the checkout counters of bookstores around the holidays.

Here’s an example of one of their quizzes, where you fill in missing pieces of a Java program; just like school writing exercises:

The Design and Layout

I think they were striving for a sense of the enthusiastic amateur. The “fun” fonts and askew alignments make it look non-linear. Code comments are added in a handwritten post-it note style.

You really need to head over to Amazon and “look inside”.

Who’s it For?

It’s aimed at Java novices. Other than that, I’m not sure. Maybe for potential Java developers who like crossword puzzles or true/false quizzes with titles like “Be the Compiler”. I’m more of a Java Puzzlers guy myself (thank my other childhood friend, Martin Gardner).

Unless you laughed at the photo included above, don’t buy it for the humor.

How Much Does it Cover?

It covers a startling amount of Java, including advanced topics like threads, sockets, RMI, Swing, audio, …

Of course, there can’t be much depth given this range. But you do get a wide range of hello-world-type programs.

Image from Page 101.

Is it Sound Java?

The authors clearly know what they’re doing. The advice and coverage are too basic if you want to program Java seriously, but it’ll get you started on the right track.

There’s even good high-level advice like telling you to write unit tests (though not using anything as complex as JUnit).

It’s an Industry

Like other popular series (i.e., …for Dummies, … in Action, … in a Nutshell), this one’s turned into an industry, with dozens of titles covering everything from HTML to statistics.

As such, they’ve created all the trappings of such an enterprise, such as “Head First learning principles”:

  • make it visual — put the words within or near graphics
  • use a conversational and personalized style
  • get the learner to think more deeply
  • get — and keep — the reader’s attention
  • touch their emotions

And they break out all sorts of “metacognitive” tips explaining why they wrote such a whacky book and how you should go about learning with it (like read it before bed and mix right-brain/left-brain activities for learning).

Many reviewers like the approach well enough to blurb about it.

Li, Ruotti, Stewart, Thomson and Dewey (2010) RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty

June 9, 2010

It was bound to happen. The model I was proposing for splice-variant (or isoform) expression was too obvious (and too good!) for it not to be independently discovered. Seeing my ideas discovered by others is on one hand a bummer, but on the other hand gives me confidence that I’m thinking about these models in the right way.

The following paper presents roughly the same model of expression I presented in a previous blog post, Inferring Splice Variant mRNA Expression with RNA-Seq.

It also incorporates some of the features I suggested for edit distance in another blog post, Sequence alignment with conditional random fields.

The Basic Expression Model

In fact, it’d almost be easier to say what’s different in Li et al.’s model. The expression model is almost identical, down to the name of the variable, \theta, used for read expression. The model in this paper assumes N reads, implicitly assumes an uniform prior for \theta \sim \mbox{\sf Dirichlet}({\bf 1}), then for each read 1 \leq n \leq N, chooses splice variants z_n \sim \mbox{\sf Discrete}(\theta). So far, identical (other than that I considered arbitrary Dirichlet priors).

[Update, June 10, 2010: I forgot to mention:

Noise Isoform

Li et al. include a so-called “noise isoform”, which they assign a simple probability distribution to. This will act as a sink for reads that do not map elsewhere. I don’t quite see given how it’s defined how anything would get mapped to it if we restricted attention to reads that mapped within a few edits with BOWTIE.]

Position and Strand-Based Reads

Things look different when Li et al. start to generate reads y_n. What I did was assume an arbitrary distribution \mbox{\sf Read}(y_n|z_n,\varphi), with parameters \varphi characterizing the likelihood of observing read y_n from reference source z_n. Li et al. decompose part of \mbox{\sf Read} in their model. First, they assume s_n is the start position of read y_n on reference sequence z_n and assume o_n is the strand, both of which are generated from the ref sequence z_n, by distributions p(o_n|z_n) and p(s_n|z_n).

This is where Li et al.’s proposal relates to my probabilistic aligner proposal. In particular, with the position, strand and reference sequence, s_n, o_n, z_n, the reads may be defined to be sensitive to location (say in relative position measured from 5′ to 3′), or to underlying sequence (such as the hexamer bias due to priming studied in [Hansen, Brenner and Dudoit 2010]). They only study the positional biases, and for their data, they were small. But groovy nonetheless. It’s building in the right kind of sensitivity to the biological sample prep and sequencing.

Non-Probabilistic Alignment

[Updated, June 10, 2010: I don’t know what I was thinking — Li et al. definitely use probabilistic alignment. In fact, they use a position-specific edit matrix w_t for substitutions (no indels). I’m not sure how the matrices are estimated.]

What they’re not using is the the quality scores on the reads themselves (as is done in Clement et al.’s GNUMap aligner).

I think there’s an opportunity to squeeze more sensitivity and specificity out of the model by using the quality scores from the sequencer (if they can be calibrated properly, that is).

The aligner I propose also normalizes the edits to proper probabilities using a sum-product algorithm; it’d be easy to extend to read quality as in GNUMap, or to compose it with a color space finite-state transducer, as in SHRiMP for SOLiD reads.

EM MLE vs. Gibbs Sampling Bayesian Posterior

The other main difference is that I was proposing using Gibbs sampling to produce samples from the posterior p(\theta|y,\alpha,\varphi) of expression given reads y and channel model \varphi and Dirichlet prior \alpha. Li et al. use EM to find the maximum likelihood estimate, \theta^* = \arg\max p(y|\theta,\varphi). As usual, the MLE is just the MAP estimate with a uniform prior, so in my model, the MLE is \theta^* = \arg\max_\theta p(\theta|y,\alpha={\bf 1},\varphi).

Bootstrap Standard Error Estimates

One of the main arguments I made for the Bayesian approach is that, like all truly Bayesian approaches, it supports posterior inference with uncertainty. This is very useful for doing things like pairwise or multiple comparisons of expression or adjusting for false discovery rates.

Li et al. use a bootstrap estimate of standard error, which is great to see. I wish more researchers provided variance estimates.

The danger with only reporting standard error (or variance) in these skewed binomials (or multinomials) is that the parameter value’s very close to the edge of the allowed values, so the 95% interval can contain illegal values. You see the same problem for normal approximations of variance for the Poisson, where a few standard deviations below the mean can result in negative counts.

[Update: June 10, 2010 Of course, you can plot plug-in quantiles such as 95% intervals with the bootstrap, too. It’s really pretty much like Gibbs sampling in terms of the application, if not philosophy.]


They run a bunch of simulation experiments to show that this kind of model makes sense. I did the same thing on a (much much) smaller scale. They use Langmead et al.’s BOWTIE aligner with up to two mismatches, which seems a rather dodgy basis for this kind of expression model. It will depend on the settings, but the defaults provide a greedy heuristic search that’s not guaranteed to be complete in the sense of finding all or even the best alignment.

[Update: June 10, 2010: BOWTIE has a --all setting that the documentation to generate all matching reads, but there’s also a maximum number of backtracks parameter that can eliminate some matches if there are 2 or more edits allowed.

Even if BOWTIE can be configured to find all the matches up to edit distance 2, there’s no reason to assign them all the same probability in the model or to assume that a read is accurately mapped at edit distance 2 and not at edit distance 3 or greater.

My understanding is that due to its best-first heuristic search, BOWTIE does not guarantee it will find all reads even up to a given edit distance.]

What we really need is some real applications. Mitzi and I are reanalyzing the human liver and kidney Illumina RNA-Seq data described in (Marioni et al. 2008), and also some new data generated by Ken Birnbaum (and Paul Scheid) at NYU using a single-cell protocol on SOLiD on Arabidopsis roots over a 5-day period. Paul Scheid, the director of the SOLiD core facilty at NYU, just presented the data at a SOLiD user’s group meeting he hosted at NYU last Friday. The short story is that Mitzi crunched the data to show that you can use a single-cell protocol on SOLiD and use SHRiMP for alignment to derive expression results similar to that estimated from parallel microarray day using Li and Wong’s D-Chip factor model for microarray expression.

20 to 25 Bases!

The conclusion of Li et al. is that if each base costs the same to read independent of read length, then according to their simulations, the optimal read length for caclulating variant expression is around 20 to 25 bases, not the 50 or longer that Illumina and SOLiD are producing these days.

Piecewise-Linear or Isotonic Regression for Calibrating Naive Bayes (or Other Predictors)

June 3, 2010

I was just reading a really nifty Bayesian approach to clickthrough rate estimation from Microsoft Research Cambridge (UK), which used as a baseline the following paper’s approach to calibrating naive Bayes classifiers:

I’ve blogged about Rennie et al.’s approach to tackling the poor assumptions of Naive Bayes for text classifiers; oddly, the Rennie et al. 2003 ICML paper I blogged about before doesn’t cite the 2001 Zadrozny and Elkan ICML paper on the same topic.

Failed Independence Assumptions in Naive Bayes

The well-known problem with naive Bayes (i.e. a discrete mixture of multinomials) derives from using features that are correlated in a model that assumes they’re not. This is, in fact, the naivete from which the name is derived, though it’s a misnomer, because the model’s right for some situations. For instance, it’s common to see naive Bayes applied to model bags of words in human language, which are highly correlated both locally (by syntax and collocation) and globally (by topic). It’s what you’ll get from LingPipe’s NaiveBayesClassifier and TradNaiveBayesClassifier classes.

In practice, what happens, especially for longer documents, is that the probabilities tend toward 0 or 1 because of redundant tokens with high correlations being treated as independent. This is especially easy to see for repeated words. If we’re classifying sports documents and it’s 90% likely to be an article about basketball if it mentions “lebron” (actually it’s probably higher, due to the celebrity of LeBron James). So what if we see “lebron” twice? In the model, the probability of being about basketball goes up to 0.9^2 / (0.9^2 + 0.1^2) \approx 0.987. The problem is that two “lebron”s don’t tell you much more about the topic than one. An article about the filmmaker Eddie Lebron or the Salsa band The Lebrón Brothers is also likely to mention “lebron” more than once. Once you’ve seen one, the effect of the next one should go down. (The fact that all three uses of “lebron” are spelled differently is an orthogonal issue I shouldn’t have confused here; the same holds for a fully capitalized name like “Smith”.)

Measuring Term Correlation

One thing to do is try to measure the correlation. Along these lines, I really liked Ken Church’s analysis of correlation in a paper titled One term or two? (but only available pay-per-view) and Martin Jansche’s careful follow up.

It’s traditional to hack calibrations by converting every document x to the same virtual document length \lambda, using p_{\lambda}(x) \propto p(x)^{\lambda/{\rm length}(x)}. This is what we do in our traditional naive Bayes class in LingPipe (it doesn’t change first-best results, but helps enormously for EM and presumably for n-best results ordering).

Binned Calibration

What Zadrozny and Elkan did for naive Bayes is to train a classifer and and then calibrate its predictions. In particular, they sort the inputs by predicted probability for the most likely category. They then sort these into ten bins of equal size. For each bin, they calculate the proportion of true answers in the bin. For new items, they are assigned probabilities based on which bin their estimate for the probability of the best category falls into.

They talk about using held-out data, but don’t. I’m not sure why — naive Bayes is pretty easy to overfit. Ideally they’d have just done cross-validation on the whole data set to get predictions, or use a representative held-out data set of appropriate size.

Isotonic Regression

The Microsoft paper referred to Zadrozny and Elkan’s approach as isotonic regression, but on my reading, Zadrozny and Elkan didn’t enforce isotonicity (i.e. upward monotonicity) on the bin probabilities. They probably didn’t need to with lots of items per bin and a reasonably ordered (if not calibrated) set of naive Bayes results. Structural constraints like isotonicity (or simiarly intentioned smoothness priors for something like loess) are especially helpful if some of the bins don’t contain much data or you’re fitting a more complex model.

Binning versus Piecewise Linear Normalization

Binning may be useful for absolute probability prediction, but it’s not very useful for sorting results because it’s not a properly isotonic function (that is, we can have p(x_1) > p(x_2) and p'(x_1) = p'(x_2)).

Instead of binning, a more reasonable solution seems to be a piecewise linear approximation. Just fit linear functions through each bin that remain isotonic. That’s what you typically see in non-parametric statistics approaches (e.g. for normalizing intensities in micro-array data or empirically adjusting for false discovery rates).

Length, Anyone?

In addition to predicted probability on new items, I’d think document length would be a huge factor here. In particular, a two-way binning by length and by prediction probability makes sense to me if there’s enough data. As documents get longer, the overconfidence due to failed independence assumptions gets worse.

Category Effects?

I’d also think you could adjust for category effects. There’s often one or more small foreground categories and a big background category. I’d expect them to calibrate differently.

Comaprative Effects

It’s common in estimating confidence for speech recognizer output to look at not only the first-best result, but the difference between the first-best and second-best result’s probability. That may also help here.

McCallum, Bellare and Pereira (2005) A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

April 27, 2010

Following up on my previous post on alignment with CRFs, I’ve found a related paper that I’d like to review here. I’ve spent the last 20 years following Fernando around the intellectual landscape, from logic programming and grammatical formalisms to statistics and biology. (So it’s perhaps not surprising that we’re both speaking next week at the Edinburgh HCRC and Cog Sci Reunion.)

I have to admit I’d read this paper in the dim and distant past (OK, it’s from 2005), so something might’ve stuck in my head. But once you grok CRFs, building new models is easy, because as I said last time, CRFs are really good at normalizing sequential labeling or classification problems.


McCallum et al. applied their model to matching restaurant names and addresses and also to matching bibliographic citations. In contrast, I was interested in matching short reads against reference genomes (for DNA sequencing) or gene models (for mRNA sequencing).

What’s Being Modeled?

McCallum et al.’s goal was to model the probability that two strings match. This reduces the problem to a binary classification problem.

In contrast, what I was modeling in the last post is the probability of a query/read sequence being generated from a reference sequence. The reads are more like the tags generated in a tagging CRF. The alignment is a local alignment of the read to the reference.

We both approached the problem by first defining a notion of an alignment sequence and its relation to the strings. They modeled the probability of a global alignment given both strings, whereas I modeled the probability of a read locally aligned to a reference sequence.

The difference is in normalization terms, both of which are easy to calculate via dynamic programming.

As they point out in their discussion, what they did is to pair HMMs (i.e. as used by [Ristad and Yianilos 1998]), as CRFs are to regular HMMs. I turned a similar crank to estimate a related probability.

Other Innovations

What I really like about McCallum et al.’s approach is that it allows chunked matching operations. Although they don’t consider affine gap models (where longer gaps aren’t that much more unlikely than shorter gaps), they do consider token-sensitive operations like deleting a token that appears in both strings or in a dictionary, and consider deleting to the end of the current token (useful for matching acronyms). The possibilities are endless.

With their CRF FST implementation in Mallet, I’m guessing it was as easy to code as the paper makes it seem. The FST coding is straightforward, with the only problem being the usual one of explosion of dependencies. Neatly (for both McCallum et al. and me), the sum-product (forward/back) and max-product (Viterbi) dynamic programming algorithms fall out of the encoding.

They don’t allow non-local transpositions (as in, say a soft TF/IDF-like approach to matching), so everything remains reasonably tractable.

They note that the estimation problem is not concave, because they use latent variables (alignments and states aren’t directly observed here). I’m pretty sure I’ll have the same problem.

They treat matches and mismatches as a mixture model, with two distinct subgraphs of the FSTs. This allows them to have separate features as in a max entropy or discrete choice (DCA) model and unlike in a logistic regression-type approach where there are no outcome-specific features.

Section 5.1 provides a hint at the kinds of features they use to set the values for the various edit operations, like is a numerical character substituted for another numerical character. Curiously, there are no features for the actual characters being matched, just classes. Maybe they didn’t have enough data.

What I added that they didn’t have was a different normalization and features for the alignment position of local alignments (useful for modeling positional or polymer biases in reads).

Direct Regression

(Tsuruoka, McNaught, Tsujii and Ananiadou 2007) introduced an approach to string similarity that was a straight-up logistic regression based on features extracted from the two strings. These features were things like n-grams.

This also sounds like a good idea. I’m guessing the error patterns would be dissimilar enough that the two classifies could be fruitfully combined.

The Tsuruoka model would be easy to implement in LingPipe. I had to do a custom implementation for my own model decoder and haven’t built the estimator yet.

Further Applications and Generalizations

I’d love to see something like this applied to the BioCreative III gene linkage task (and thank you, organizers for switching to camel case from seemingly random case). You’d have to generalize somewhat, because you’re matching against a whole set of known aliases for a gene. The discriminative training would be a huge help to sort out common phrases, the importance of numbers, the correspondence of Greek and Latin characters, etc.

McCallum et al. only had a chance to scratch the surface of the feature space that might be useful for this and related problems. I’m surprised now that I haven’t seen more work building on this.

It’d also be easy to take any of the pair HMM applications to things like phonology or morphology that are out there.


February 18, 2010

AKA, I’m going to use discrete choice analysis (DCA) to provide a probabilistic model for the learning as search optimization (LaSO) paradigm and I’ll fit it with stochastic gradient descent (SGD). Too bad LaSO isn’t one, or I’d have hit the TLA trifecta.

It occurred to me to do this after considering DCA for coref and then reading Daumé and Marcu’s paper:

(Hint to authors: anything you call “large scale” today will look tiny tomorrow.)

The paper does a really nice job of describing their LaSO technique and demonstrating its utility for within-document coreference resolution. There are other more general papers on LaSO, but this one’s nice in that it works through a clean applied example involving within-document coreference that we care about here at Alias-i.

What’s really cool is that they jointly solve the entity extraction and coref problem, and get to use arbitrary history-based features.


The LaSO paradigm is simply one of scoring various search states. You extract feature vectors from a search state and input. At a given search state, you can generate a finite number of next possible search states consistent with the input. These are then scored using the feature vectors.

The trick is in the training, where they consider a single deviation from the correct search path at every step. They train these decisions discriminatively.

It’s more general than the kind of history-based parsing:

that I used in my very first paper on stat NLP:

In the usual history-based parsing setup, there was a fixed set of choices corresponding to parser operations. For instance, Manning and I used left-corner derivations and Black et al. used top-down derivations. In the LaSO setup, there can be an arbitrary number of choices. For instance, when considering to which antecedent a pronoun refers, the number of antecedents isn’t bounded in advance.


Daumé and Marcu consider two non-probabilistic scoring approaches, a perceptron-like approach and an approximate wide-margin approach.

Given the setup, we can model the finite set of choices from any given search state using DCA. This in turn provides a proper probabilistic measure over the entire set of possible outputs.

I’m not sure it’s going to work well in practice because it’ll locally normalize all the decisions. If you look at Andrew McCallum’s work on CRF-like approaches to structured prediction problems like coref, or if you look at Daumé, Marcu and Langford’s extension of LaSO to SEARN, one might think about normalizing on a more global level. The problem, as usual, is the normalization constant.

On the other hand, considering Ratinov and Roth’s experiments with entity detection, it seems like it’s worth pursuing the simple greedy (or more generally beam-based) approach.

Fitting with SGD

It’s easy to generate maximum a posteriori (MAP) coefficient estimates for the DCA model using stochastic gradient descent (SGD). In fact, it’s just like fitting other DCA models. Not coincidentally, I finished unit-testing DCA fitting via SGD yesterday.

Look for it in the next LingPipe release. I’ll either figure out how to abstract the entire search, like LaSO, or just provide the DCA fitter and a coref-based example.

ThinkPad W510 Review: First Impressions

February 17, 2010

[Update: 15 March 2011: I finally caved in and got a 13″ MacBook Air. I just got it yesterday and am already up and running. I love the low-throw keyboard and massive multitouch touchpad. And the (lack of) weight.

Today, I got rid of the W510. That’s all it took — one day. I haven’t used a Mac since 1992, but it is, in fact, pretty easy [though I was perplexed by the extension cord, which like many of Apple’s products, is completely non-obvious.] Luckily, I do know Unix and the Mac OS is easy to learn with their online tutorials.

The Air’s hardly a desktop replacement, though I’d probably buy one even if I wanted to run Windows on it. Alternatively, I might consider a MacBook Pro, which a year after Lenovo came out with a machine with roughly similar specs to the W510.]

[Update: 25 September 2010: After six months or so, the machine’s developed a serious looseness on the lower right side of the keyboard/case that makes typing on it even worse. It rattles, and I can feel the lower right side of the case moving along with the keys. Typing, I notice it on the whole right side of the keyboard, but it’s most pronounced with keys in the lower right. Wah! My favorite part of the IBM Thinkpads was the keyboard.

Does anyone have other suggestions out there for a high-performance notebook with a good keyboard? It’s not something most reviewers even mention. At Best Buy, I hated pretty much all the notebook keyboards I tried, but they hardly had a full range, and mostly consumer models.]

I’ve had my Lenovo ThinkPad W510 for a week [update: I’ve now had it for nearly a month and my opinion hasn’t changed]. Long enough to transfer all of my work, install everything I need, and use it for a couple days for work.

My previous machine was a Lenovo ThinkPad Z61P with a Core Duo, 4GB of RAM and a 1920×1200 15.5″ monitor. I’ve had ThinkPads going back to the late 1990s.

The Specs

Here’s the most info I could find from Lenovo:

CPU and Memory

I got the cheapest CPU, an Intel Quad Core Processor i7 720QM. Intel gave it a slow 1.6GHz clock for when all four cores are engaged, but then uses “TurboBoost” to max out at 2.8GHz under less load. The step up to the 820QM at 3GHz max didn’t seem worthwhile, much less the step to the extreme version.

The W510, unlike the T series notebooks, has 4 DIMM slots. I got 8GB of DDR3 memory in 4 DIMMs, but only at 1066MHz. I figure I’ll upgrade to 16GB of DDR3 at 1333MHz when it gets a bit cheaper or when I have a job for which I need 16GB.

128MBGB Solid State Drive

I have to say that the SSD isn’t the life-changer I’d been led to believe it would be. It may just be all the over-hype for desktop use.

What it does do is make startup from sleep super-duper fast, as in almost instant. And I have to keep checking that my gzipped tar files really have unpacked. Lots of random file access is way faster.

It’s also startlingly quiet. So the notebook doesn’t thrum any more when in use. Very cool.

I think it’s also more energy efficient than traditional disks, but I’ve only been plugged in w/o a power meter so far.

The Monitor

Why don’t other manufacturers release higher definition monitors in smaller sizes? Apple’s 17″, not to mention HP and Sony’s 18″ plus monsters are just too big. But I want high res in a smaller size, which pretty much leaves me with Lenovo.

I opted for the 15.6″ FHD Display (95% Gamut) with LED Backlight. I didn’t get touch screen, and the option for FHD without touch screen seems to have gone away. They’ve already had supply issues with these machines.

It took some digging to decode FHD into “full HD”, meaning 1920 x 1080 native resolution. I think this is a huge mistake; I want my 1920 x 1200 back. I’m a programmer and I like my vertical buffer space in both the shell and emacs and web docs/PDFs.

The brightness and color balance are nothing short of amazing. From across the room, the screen with an emacs window up reminds me of my white-balanced photo light box. My old ThinkPad screens look green compared to the new one. After working for an hour or two in emacs, I find the incandescents in our house look out of balance. Photos look way better on this screen than on our previous ThinkPad screens and better than on our Samsung 24″ LCD monitors at work. Did I say “amazing” yet?

It’s so bright I can’t leave it on full brightness indoors.

Speaking of monitors, it also has a DisplayPort output, which I haven’t tried yet. That’ll mean yet another dongle — DisplayPort to HDMI. No idea if the W510 sends out sound on that or not.

The Keyboard

What Lenovo have done to the ThinkPad keyboard is nothing short of criminal. I type really quickly, and I loved the old ThinkPad keyboards. I’m using a ThinkPad USB keyboard on my workstation at work as I type this.

Over the last three ThinkPads (over a period of 6 or so years), the keyboards have just gotten worse. Basically, the keys are longer throw, less definitively clicky, and much much stiffer.

The new keyboard size for escape and delete keys was a good idea; at least they left all the keys I use in the same place as before.

Does anyone know if the T410 and T510 use the same keyboards?

Lenovo — give me my old keyboard back. I’m willing to pay in both US$ and weight!

I tried all the other brands out there, and there’s nothing out there like the old ThinkPad keyboards. And Sony’s and Apple’s are just dreadful (though I like the compact iMac keyboards more than any notebook keyboards out now).

[Update: After a month of use, the keyboard strikes me as even worse than it did on first impressions. I’m consistently dropping keystrokes because the force required is so much greater than my old ThinkPad. Argh!]

The Battery and Power Supply

WTF? There’s a 9-cell battery which sticks out of the back of the machine a good ways and is really heavy. I almost never get stuck relying on battery power, but then I don’t take many long flights any more. What’s annoying is that there’s no 6-cell option, so at least for now, I’m stuck with this unsightly thing.

The 135 watt power supply is enormous and also heavy. It’s more amps than the previous power supply. I have no idea if I could get away with a less powerful one.

Thankfully they’ve kept the same power-supply connector (what they call a “tip”). IBM used to swap them every model so I had to keep buying new secondary power supplies.

I don’t see any lighter model that’s compatible with my machine, which is a real drag.

Useless Built Ins

No, I don’t need a 2MP video camera (though who knows, I may Skype with video now), and I don’t need a fingerprint reader, and I don’t need memory card readers.

I also uninstalled the MS Office trials, but the machine was otherwise free of preinstalled crap.

Windows 7 Pro 64 bit

This machine came with Windows 7 Pro 64-bit version (it was $70 cheaper than “Ultimate”, whatever that is).

I was burned on my dual quad core workstation by not getting Windows Vista Pro (which is required to run two physical cores — I hate crippleware, of which Windows Home is a prime example), this machine came with Windows 7 Pro. Of course, 64 bit.

All I can say is “seventh verse, same as the first”. I’m a Windows user and can’t tell the difference here. But then some of the speed may be the new Windows instead of the new machine.

I put everything back to Windows Classic mode. I just can’t get used to the fat translucent title bars.

I turn off ClearType because I prefer my fonts to be crisp.

I put everything back to normal size (100% zoom), because at 125% zoom, the fonts all look yucky. It’s tiny, but then I can see a lot. As I get older, it’s tougher on my eyes. I wish Windows resized everything more gracefully so I can keep emacs and shell at native res, but crank up icons, windows, menus, etc., to something more manageable without them looking ugly and unbalanced.

But what’s with not letting me resize windows near the bottom of the screen? It just exacerbates the loss of 120 vertical pixels with the “FHD” format. If anyone knows how to turn that off, let me know. [Update: Windows 7 snaps the window to full height if you let it go in this region, which is really useful.]

The Optical Drive

Alas, “Multi Recorder Optical Drive (12.7mm)” did not mean Blu-Ray. You get Blu-Ray at this price on other mainstream computers other than the Apple. C’mon, they’re not that expensive, and I’m paying the premium to Netflix because I love 1080p.

Graphics Card

This thing comes with an “NVIDIA Quadro FX 880M Graphics with 1GB DDR3 memory”. I don’t really use it to play 3D games. It seems to have more memory assigned to it — I’ll have to look into that.

I’d probably have preferred a lower-power built-in solution.

The Design

Are you kidding me? It was like there was a contest for plain and Lenovo put their entire brain trust behind it.

I kind of like it. It sort of reminds me of something Cayce Pollard, the protagonist of William Gibson’s Pattern Recognition might like.

Performance Evaluation

I actually care more about the subjective performance than the kinds of charts they print in most online computer reviews. Having said that, here’s the “Windows Experience Index” for the machine, with components assessed on a 1.0 to 7.9 scale (I’m not making that range up):

  • Processor: 6.9
  • Memory: 7.3
  • Graphics: 6.4
  • Gaming Graphics: 6.4
  • Disk: 6.9

Two-Day Shipping

Sure, that was two day shipping. Not including the trip from Shanghai via Korea and then the days it took clearing customs in Kentucky. I’m not sure why customs takes so long. Overall, it was torture watching the shipping tracker.

Lenovo’s Call Center

I’d just ordered a T400 a week before the new models came out, and they let me cancel that order, no questions asked and no hassle at all. Their call center is awesome. From the accents, I’d guess it’s in India. It’s one of the best call centers I’ve ever called. I’ve had two interactions with them over the last few years; one to change a complicated order and one to cancel the T400. Both were extremely fast, effective, and clear.

Kohlschütter, Fankhauser, Nejdl (2010) Boilerplate Detection Using Shallow Text Features

January 11, 2010

A lead from my previous post on scraping text out of boilerplate was the following paper, scheduled to be presented next month (specifically, Saturday 6 February 2010) right here in Brooklyn at ACM’s 3rd annual conference on web search and data mining, WSDM 2010:


A java API-based implementation of the system described in the paper is available under a generous Apache 2.0 license from:

It uses our favorite normalizer, NekoHTML as a front end.

I’ll take it out for a spin in a separate review. The Java API nature and free licensing make it a very tempting option for the likes of us.

The Approach(es)

Kohlschütter et al. evaluate a range of features working directly over cleaned HTML for scraping content out of boilerplate like navigation and advertisements.


They do not consider HTML rendering solutions, which simplifies but also limits the applicability of the final system. Of course, there’s no reason you couldn’t plug in something like Crowbar (thanks for that link, too) and use its HTML instead of the direct download.

Newswire vs. Cleaneval

As he pointed out in a comment in the last post, Kohlschütter (and crew) concluded that the Cleaneval bakeoff used highly non-representative web text, and thus wasn’t a good baseline. As an alternative, they collected a corpus of 250K English news articles downloaded from Google News links from different countries (USA, Canada, UK, South Africa, India, Australia) in four topics (world. technology, sports, entertainment) from over 7500 different sources.

The paper advertises the data’s availability from the following site:

But it’s not there yet (11 January 2010).

Four-Way and Two-Way Tasks

The authors consider the two-way task of boilerplate versus content and the four-way classification task of boilerplate, full-text/comments, headline, and supplemental.

Decision Tree Learner

They evaluated decision tree learners, though claim linear SVMs (fit with sequential minimial optimization, aka SMO) had similar performance.

Micro-F Evaluation

They evaluated on a word-by-word basis. This is a kind of micro evaluation. A macro evaluation might rank by block.


The authors didn’t consider structure other than H, P, DIV, and A.
The features that seemed to work involved number of words and density of text relative to links.

The only global features they use of a collection are the global word frequencies (thus getting at something like stop list type features).

They consider all sorts of local features, such as average word length, total number of words, average sentence length (using a simple heuristic sentence detector), ratio of periods to other characters, number of capitalized words and ratio of capitalized to non-capitalized words, etc. etc.

One of their most useful features is link density, which is number of words in anchor (A) spans compared to total number of words.

They also consider position in the document, though this is tricky with only the raw HTML rather than the rendered output, as relative locations may change visually relative to the underlying HTML.

Quotients Features and Context

The authors extract these basic features from blocks, but then use the current (C), previous (P), and next (N) values in classification. Specifically, they find it useful to look at the quotient of the number of words in the previous block to the current block (their notation “P/C”).

The authors say “the text flow capturing variants of thse features (number of words quotient, text density quotient, …), which relate the value of the current block to the previous one, provide the highest information gain, indicating that intradocument context plays an important role”.

It’d be nice if these papers came with a table of features with simple definitions to make it easy for someone to browse them.

Sequence Model Decoding?

The classification in this paper is apparently done independently in a block-by-block fashion. I’d think using a sequence model like structured SVMs or CRFs would provide higher accuracy. Given the small number of categories (i.e. boilerplate/not), it’d be very speedy, too, even with second-order models.

(Greedy) Information Gain

The authors rank features by information gain. The big winners are number of words quotient, text density quotient, number of words, and the link and text densities. Clearly these are related measures, but it’s prohibitive to consider all the possible subsets of features.


They conclude word count is most important. Coupled with link density, you’re most of the way to the best results achieved using all local features plus global frequency.

Or, Ask Ryan

Ryan Nitz, enterprise programmer par excellence, just happened to be over for dinner last night. I asked him if he’d dealt with this problem and he said he just used the ratio of punctuation to letters. That seems similar to but simpler and more portable than the stopword approach Aria Haghighi suggested in a comment in the last post.