New Book: Text Processing in Java


I’m pleased to announce the publication of Text Processing in Java!

This book teaches you how to master the subtle art of multilingual text processing and prevent text data corruption. Data corruption is the all-too-common problem of words that are garbled into strings of question marks, black diamonds, or random glyphs. In Japanese this is called mojibake (“character change”), written 文字化け, but in your browser it might look like this: ����� or this: 文字化ã‘. When this happens, pinpointing the source of the error can be surprisingly difficult and time-consuming. The information and example programs in this book make it easy.
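To make the failure mode concrete, here is a minimal sketch (not taken from the book) of two common ways this kind of garbling can arise in Java: bytes written as UTF-8 are read back with the wrong charset. The class name MojibakeDemo and its method names are made up for illustration, and ISO-8859-1 and US-ASCII are just assumed examples of a wrong charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the book.
final class MojibakeDemo {

    // Encode text as UTF-8, then wrongly decode the bytes as ISO-8859-1.
    // Each UTF-8 byte becomes one Latin-1 character ("random glyphs").
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo garble(): ISO-8859-1 maps every byte to exactly one char,
    // so the original UTF-8 bytes can be recovered and decoded correctly.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Wrongly decode UTF-8 bytes as US-ASCII. Bytes outside the ASCII range
    // are replaced with U+FFFD, the "black diamond" replacement character.
    static String toReplacementChars(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The four characters of "mojibake", written as escapes so the
        // source file's own encoding cannot interfere with the demo.
        String original = "\u6587\u5B57\u5316\u3051";

        String garbled = garble(original);
        String lost = toReplacementChars(original);

        // Each of the 4 characters takes 3 bytes in UTF-8.
        System.out.println("garbled length: " + garbled.length());   // 12
        System.out.println("repair succeeded: "
                + repair(garbled).equals(original));                  // true
        System.out.println("contains U+FFFD: "
                + (lost.indexOf('\uFFFD') >= 0));                     // true
    }
}
```

The repair step only works here because ISO-8859-1 happens to preserve every byte. Once bytes have been turned into U+FFFD, as in toReplacementChars, the original text cannot be recovered from the string alone.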

This book also provides an introduction to natural language processing using Lucene and Solr. It covers the tools and techniques necessary for managing large collections of text data, whether they come from news feeds, databases, or legacy documents. Each chapter contains executable programs that can also be used for text data forensics.

Topics covered:

  • Unicode code points
  • Character encodings from ASCII and Big5 to UTF-8 and UTF-32LE
  • Character normalization using International Components for Unicode (ICU)
  • Java I/O, including working directly with zip, gzip, and tar files
  • Regular expressions in Java
  • Transporting text data via HTTP
  • Parsing and generating XML, HTML, and JSON
  • Using Lucene 4 for natural language search and text classification
  • Search, spelling correction, and clustering with Solr 4

Other books on text processing presuppose much of the material covered in this book. They gloss over the details of transforming text from one format to another and assume perfect input data. The messy reality of raw text will have you reaching for this book again and again.

Buy Text Processing in Java on Amazon

8 Responses to “New Book: Text Processing in Java”

  1. Dave Lewis Says:

    Congratulations!   This will get used widely.  Will they be doing editions on other languages?

     Sent via a mobile device

  2. Dave Lewis Says:

    Whoops, that’s “in other languages”.

     Sent via a mobile device

  3. Ivan Says:

    Is the ebook available in a non-Kindle (or non-DRMed) version?

    • mitzimorris Says:

      The Kindle version is not DRMed. That’s the only ebook version available.

      • Kal Leblanc Fultz Says:

        Hello Mitz,

        Thank you for writing T.P. in J.
        I invented a new logic for sorting and translating individual phrases into a 5-tuple matrix. Imagine all sentences falling neatly under a five-column sortable table. Which part of Java would help best with this?

        Thanks for any advice,


  4. Amanda Stent Says:

    Congratulations, Mitzi!

  5. Lucene 4 Essentials for Text Search and Indexing | LingPipe Blog Says:

    […] Natural Language Processing and Text Analytics « New Book: Text Processing in Java […]
