A PDF’n Mess

Don’t believe the marketing. I’m working on a little project to mine research articles in “portable document format,” known on the street as PDF. Maybe they know something about the word “portable” that I don’t.

I didn’t quite realize just how much the “document” in “PDF” was a visual markup. I’ve been using PDFBox, a Java API for extracting content from PDFs. It makes a noble effort and is recommended by projects such as Lucene and Nutch.

Anyway, here’s the problem. Let’s take a random PDF-formatted paper from last year. Now open the doc in Adobe Acrobat, the reference browser, and try to find the word “fulfills” (Hint: it follows the phrase “which fully”). No luck? Not surprising. The “fi” in “fulfills” is what’s known as a ligature, so the text stored in the file holds one ligature glyph where the search expects the two letters “f” and “i”.

Those who are very picky about type know that just jamming an “f” and an “i” together is suboptimal, because the overhanging hook of the “f” collides with the dot on the “i” when the two are kerned properly. So in hot-metal type days, and to this day in well-designed typefaces for the computer, any time an “f” is followed by an “i”, the pair is replaced with a special “fi” ligature “character”.

There are even Unicode code points for ligatures to further confuse the issue. At least for Unicode, we have a clear standard and IBM’s liberally licensed International Components for Unicode, which will normalize those ligatures right back to two-character sequences.
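To make the normalization step concrete, here is a minimal sketch. It assumes the extracted text really does contain the Unicode ligature code point (U+FB01 for “fi”), which is not guaranteed for every PDF. It uses the JDK’s built-in java.text.Normalizer as a stand-in for ICU, since both implement the standard Unicode normalization forms; the class and method names are made up for illustration and are not part of PDFBox or ICU.

```java
import java.text.Normalizer;

// Sketch: expand typographic ligatures in extracted text via NFKC normalization.
// LigatureNormalizer and expandLigatures are hypothetical names, not PDFBox or ICU APIs.
class LigatureNormalizer {

    // NFKC applies Unicode compatibility decomposition and then recomposes,
    // so U+FB01 (the "fi" ligature) becomes the two letters 'f' and 'i'.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // \uFB01 is the single-code-point "fi" ligature.
        String extracted = "which fully ful\uFB01lls";
        System.out.println(expandLigatures(extracted));
    }
}
```

Run as is, this should print “which fully fulfills” with plain letters, which a search for “fulfills” can then match. One caveat: NFKC also folds other compatibility characters (superscript digits, full-width forms and so on), so it may change more than ligatures.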

Don’t even get me started with anything outside of ASCII; simple Latin-1 vowels with umlauts come out as two-character sequences in Acrobat and as “currency1” in PDFBox. Japanese is lost in deep character conversion lookup exceptions, etc. etc.

And yes, I’ve already pleaded for help on PDFBox’s forums, but I can hardly blame the poor guy running this thing as a free open source project if Adobe can’t get it right in their latest Acrobat.

If anyone’s got a better idea for converting PDFs to usable text, drop me (carp) a line (at alias-i.com).
