Book Review: The Numerati, by Stephen Baker

I was very excited when this book came out, because I love (well, love-hate) pop science books. Each chapter is a standalone case study, similar to a New Yorker profile, covering a different application of data mining, ranging from supermarket organization to political marketing to finding terrorists to diagnosing disease. Baker even has a blog on the same topic as the book.

I (and many readers of this blog) belong to the tribe Baker’s calling “The Numerati”. There’s a well-known problem with reading pop science in one’s own field. The mistakes and inconsistencies are glaring, and somehow seem very annoying, like we’ve been let down by the press.

The use of “-ati” in the title evokes the mysticism of the Illuminati (which, according to Wikipedia, means “enlightened” in Latin, traditionally referred to a secret society of the 1700s, and now refers to a “purported conspirational organization which acts as a shadowy power behind the throne”). And yes, Baker mentions Total Information Awareness in the section on finding terrorists, so I’m not thinking the connection’s too much of a stretch.

It’s really the mysticism that bugged me. I don’t mind the flowery language. Baker introduces Nicolas Nicolov (whom many of you may know) thusly:

“I’ve got bigger fish to fry [than sarcasm detection],” says Nicolas Nicolov, Umbria’s chief scientist. A Romanian-born computer scientist, Nicolov got his doctorate in Edinburgh before moving to America, first to IBM’s Watson Lab and then to Umbria [now part of J.D. Power]. He has an angular face and dark deep-set eyes, and he sports thick black bangs — a bit like Jim Carrey in his early movies. He works in a small, dark office down the hall from [Ted] Kremer’s [the chief technical officer’s] sunny, expansive digs. It feels like I’ve stepped into a cave.

Cool. Baker’s writing about one of us, right down to the modest digs [OK, we have a sunny loft in Brooklyn]. How will he describe the technology? Always start with an example (my mom was a writing teacher). Nicolas offers that when analyzing consumers’ feelings about a product, “big” is a bad thing for a notebook computer, but a good thing for a notebook computer’s hard drive. Ouch. What about the tech?

Imagine a vast multidimensional space, Nicolov instructs me. Remember that each document Umbria studies has dozens of markers — the strange spellings, fonts, word choices, colors, and grammar that set it apart from others. In this enormous space I’m supposed to imagine, each marker occupies its own patch of real estate.

We can tell he’s talking about features/predictors, where each feature’s “patch of real estate” is actually just a dimension. You know, like the old X, Y, and Z of 3D real space. The X dimension doesn’t occupy its own patch of real estate. After some descriptions of dimensions possibly involving punctuation (you can tell Nicolas must’ve really been working to get this much across), we get:

And each document — blog or splog — is given an assignment: it must produce a line — or vector — that intersects with each and every one of its own markers in the entire universe. It’s a little like those grade-school exercises where a child follows a series of numbers or letters with her pencil and ends up with a picture of a puppy or Christmas tree.

Puppies? Christmas trees? What’s going on?
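
For the record, what's going on is far less mystical than the puppy suggests. A document vector is just a list of numbers, one per feature, in a fixed order. Here's a minimal sketch in Python. The feature names and the helper to_vector are made up for illustration; the book never says which markers Umbria actually uses.

```python
# Hypothetical feature names, one dimension each. These are illustrative
# assumptions, not the markers Umbria actually uses.
FEATURES = ["exclamation_marks", "all_caps_words", "links", "misspellings"]


def to_vector(counts):
    """Map a dict of marker counts to a list of numbers in FEATURES order.

    Markers that do not occur in the document get the value 0.0.
    """
    return [float(counts.get(name, 0)) for name in FEATURES]


# A made-up document with 12 links and 5 exclamation marks.
doc = {"links": 12, "exclamation_marks": 5}
print(to_vector(doc))  # [5.0, 0.0, 12.0, 0.0]
```

No squiggling, no U-turns. The document is a single point (or an arrow from the origin to that point) in a space with one axis per feature.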

Nicolov tries to draw a diagram on the whiteboard. But he gives up in short order. It’s impossible, because in a world of two dimensions, or even three, each of the vectors would have to squiggle madly and perform ridiculous U-turns to meet up with each of its markers.

OK, clearly the author doesn’t understand vectors. We’re talking high school physics here, not algebraic geometry. After an anecdote about detecting spies by asking them about spitballs, we’re back to spam detection.

What next? The splog neighborhood must be cordoned off, condemned. Imagine placing a big shield between the good [ham] and bad [spam] vectors. Speaking geometrically, the shield is a plane. The spam fighters maneuver it with a mouse, up and down, this way and that. The plane defines the border between the two worlds, and as the scientists position it, the machine churns through thousands of rules and statistics that divide legitimate blogs from spam.

If you liked this, you’ll love the rest of the book. It’s all like this.
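
For contrast, here's roughly what the "shield" amounts to when nobody is steering it with a mouse. A plane is a weight vector w plus an offset b, and a document x lands on the spam side when w·x + b is positive. The sketch below fits the plane with a plain perceptron, which is my stand-in and not necessarily what Umbria runs. The two features and the toy data are invented for illustration.

```python
def score(w, b, x):
    """Signed score w.x + b: positive on one side of the plane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(data, epochs=100):
    """Fit a separating plane with the classic perceptron update.

    data is a list of (features, label) pairs, with label +1 for spam and
    -1 for ham. Training stops early once an epoch makes no mistakes.
    """
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * score(w, b, x) <= 0:  # wrong side of (or on) the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def is_spam(w, b, x):
    return score(w, b, x) > 0


# Invented toy data: [links, exclamation_marks] per document.
TOY = [
    ([1, 0], -1), ([5, 4], +1), ([0, 1], -1),
    ([6, 5], +1), ([1, 1], -1), ([4, 6], +1),
]

w, b = train_perceptron(TOY)
print(is_spam(w, b, [7, 7]), is_spam(w, b, [0, 0]))
```

The learner does the maneuvering: every misclassified document nudges the plane toward the right side of that document. Real spam filters use many more features and usually a more robust learner, but the geometry is the same.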

Your reward for reading to the end of the review is a recommendation for a book I couldn’t put down, Cory Doctorow’s Little Brother, from which you’ll learn a whole lot more technical detail about data mining than from Baker’s book (including spam filtering, gait detection, histograms, etc.), all in the context of an action-packed work of young-adult fiction. Doctorow has his own blog. Thanks to Forbidden Planet for recommending this one!

One Response to “Book Review: The Numerati, by Stephen Baker”

  1. Ken Williams Says:

    Ouch. Unfortunately I’m reading a math book with similar (though not quite so grave) shortcomings. I’ve decided that some editor told the author that adage about “every equation halves the readership”. If so, the author was a fool for believing it about a MATH book. Nevertheless, it’s highly frustrating when the only explanation we are offered of homology groups is that they’re like little army parachutes getting stuffed into backpacks.
