TF/IDF, Scaling and Symmetry

Search engines typically use TF/IDF scaling only on the query, because it’s too hard to compute dynamically on the documents, leading to an asymmetry in scoring.

I’m building some discriminative classifiers for LingPipe. I just finished with voted kernel perceptrons, which are way cool, and though lighter than SVMs, still awfully heavy.

A simpler discriminative classifier is based on inverse document frequency weighting, as is commonly used in search engines. Under this scheme, the weight applied to a search term is inversely related to the number of documents in which it appears. For instance, a term appearing in docFrequency documents out of a total of numDocs documents is typically weighted by a damped inverse ratio like:

  idf(docFrequency,numDocs) = log(numDocs/docFrequency)

For base-2 logs and numDocs=1024, we have:

    Doc Freq   IDF
    --------   ---
       1        10
       2         9
       4         8
     ...
     512         1
    1024         0
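
To check the table, here’s a minimal Java sketch of the damped inverse ratio, taking base-2 logs via the change-of-base identity; the class and method names are just for illustration.

    public class IdfDemo {

        // damped inverse document frequency: log2(numDocs / docFrequency)
        static double idf(int docFrequency, int numDocs) {
            return Math.log((double) numDocs / docFrequency) / Math.log(2.0);
        }

        public static void main(String[] args) {
            int numDocs = 1024;
            for (int docFrequency : new int[] { 1, 2, 4, 512, 1024 })
                System.out.println(docFrequency + "\t" + idf(docFrequency, numDocs));
            // prints the IDF column above: 10, 9, 8, 1, 0 (up to floating-point rounding)
        }
    }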

Basically, this approach upweights terms that don’t show up in many documents. If we are using TF/IDF ranking for classification, IDF weighting provides a parametric form of discriminative feature weighting.

Term frequencies are typically dampened as well, often with a square root penalty:

    tf(term,doc) = sqrt(frequency(term,doc))

This has the effect of making the score rise linearly with the number of overlapping counts rather than quadratically, as it would with un-dampened raw frequency counts.
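
To spell out the linear-versus-quadratic point, here’s a tiny sketch (the names are purely illustrative): with square-root damping, a term appearing 4 times in both the query and the document contributes tf(4) * tf(4) = 4 to the dot product, not 16.

    public class TfDemo {

        // damped term frequency: square root of the raw count
        static double tf(int rawCount) {
            return Math.sqrt(rawCount);
        }

        public static void main(String[] args) {
            System.out.println(tf(4) * tf(4));   // 4.0, versus 16.0 for raw counts
        }
    }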

Scoring is typically done by cosine, that is, a length-normalized dot product of the query and document vectors.

    score(docVec,queryVec)
      = cosine(docVec,queryVec)
      = docVec * queryVec / (length(docVec) * length(queryVec))
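
As a concrete sketch of this scoring, here’s cosine over sparse term-count maps with the square-root damping applied to both sides. The class, maps, and terms are my own illustration, not Lucene’s or LingPipe’s API.

    import java.util.HashMap;
    import java.util.Map;

    public class CosineDemo {

        // damped term frequency: square root of the raw count
        static double tf(int rawCount) {
            return Math.sqrt(rawCount);
        }

        // vector length: square root of the sum of squared damped counts
        static double length(Map<String,Integer> counts) {
            double sumOfSquares = 0.0;
            for (int count : counts.values())
                sumOfSquares += tf(count) * tf(count);
            return Math.sqrt(sumOfSquares);
        }

        // cosine: dot product divided by the product of the lengths
        static double cosine(Map<String,Integer> xCounts, Map<String,Integer> yCounts) {
            double dot = 0.0;
            for (Map.Entry<String,Integer> entry : xCounts.entrySet()) {
                Integer yCount = yCounts.get(entry.getKey());
                if (yCount != null)
                    dot += tf(entry.getValue()) * tf(yCount);
            }
            return dot / (length(xCounts) * length(yCounts));
        }

        public static void main(String[] args) {
            Map<String,Integer> doc = new HashMap<>();
            doc.put("tokyo", 2);
            doc.put("ginza", 1);

            Map<String,Integer> query = new HashMap<>();
            query.put("tokyo", 1);

            System.out.println(cosine(doc, query));   // about 0.82: partial overlap
        }
    }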

What I didn’t notice until I dug into the internals of the Apache Lucene search engine and really paid attention to the scoring section in Witten et al.’s Managing Gigabytes is that IDF normalization is only applied to the query vector. That is:

    docVec[term] = tf(term,doc)
    queryVec[term] = tf(term,query) * idf(term,numDocs)

The reason for this is that if doc frequencies and the number of docs continually change, so would the lengths of document vectors that were normalized by IDF. In that case, every document in the result set would have to be renormalized for every query, requiring multiplications and additions for each term in the document. Even if we could afford the arithmetic overhead, reverse indexes don’t support looking up all of the terms in a document easily, because they map terms to documents, not the other way around.

Instead, in dynamic search engines, IDF is not applied to doc vectors, allowing their lengths to be computed as soon as they are added to the index.

The upshot of IDF-weighting only the queries is a surprising asymmetry. If we index a document by its text and then create a query out of the same text, the score will be less than 1 (recall that if all entries are non-negative counts, cosines range between 0 and 1 inclusive). To score maximally against a document, what is required is a query whose weights invert the IDF weighting applied to queries, so that the weighted query vector actually matches the document vector. That is, the best-scoring query against a document is the one built from the document’s term frequencies with inverse IDF weighting.
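
Here’s a sketch of that asymmetry under the query-only IDF weighting above, with made-up counts and document frequencies; the vectors are plain maps, not Lucene’s API. The document scored against a query built from its own text comes in below 1.0, while a query whose weights are the document’s damped counts divided by the IDFs scores 1.0.

    import java.util.HashMap;
    import java.util.Map;

    public class AsymmetryDemo {

        // damped inverse document frequency: log2(numDocs / docFrequency)
        static double idf(int docFrequency, int numDocs) {
            return Math.log((double) numDocs / docFrequency) / Math.log(2.0);
        }

        // damped term frequency: square root of the raw count
        static double tf(int rawCount) {
            return Math.sqrt(rawCount);
        }

        static double length(Map<String,Double> vec) {
            double sumOfSquares = 0.0;
            for (double weight : vec.values())
                sumOfSquares += weight * weight;
            return Math.sqrt(sumOfSquares);
        }

        static double cosine(Map<String,Double> x, Map<String,Double> y) {
            double dot = 0.0;
            for (Map.Entry<String,Double> entry : x.entrySet()) {
                Double yWeight = y.get(entry.getKey());
                if (yWeight != null)
                    dot += entry.getValue() * yWeight;
            }
            return dot / (length(x) * length(y));
        }

        public static void main(String[] args) {
            int numDocs = 1024;

            // made-up document frequencies for two terms
            Map<String,Integer> docFreq = new HashMap<>();
            docFreq.put("tokyo", 16);   // idf = 6
            docFreq.put("ginza", 2);    // idf = 9

            // raw term counts for a single document
            Map<String,Integer> docCounts = new HashMap<>();
            docCounts.put("tokyo", 4);
            docCounts.put("ginza", 1);

            // document vector: damped term frequencies only, no IDF
            Map<String,Double> docVec = new HashMap<>();
            for (Map.Entry<String,Integer> entry : docCounts.entrySet())
                docVec.put(entry.getKey(), tf(entry.getValue()));

            // query built from the same text: damped term frequency times IDF
            Map<String,Double> queryVec = new HashMap<>();
            for (Map.Entry<String,Integer> entry : docCounts.entrySet())
                queryVec.put(entry.getKey(),
                             tf(entry.getValue())
                             * idf(docFreq.get(entry.getKey()), numDocs));

            System.out.println(cosine(docVec, queryVec));   // about 0.98, below 1.0

            // the best-scoring query has pre-IDF weights equal to the document's
            // damped counts divided by IDF ...
            Map<String,Double> invertedQuery = new HashMap<>();
            for (Map.Entry<String,Double> entry : docVec.entrySet())
                invertedQuery.put(entry.getKey(),
                                  entry.getValue()
                                  / idf(docFreq.get(entry.getKey()), numDocs));

            // ... so that after the query-side IDF weighting it matches docVec exactly
            Map<String,Double> invertedQueryVec = new HashMap<>();
            for (Map.Entry<String,Double> entry : invertedQuery.entrySet())
                invertedQueryVec.put(entry.getKey(),
                                     entry.getValue()
                                     * idf(docFreq.get(entry.getKey()), numDocs));

            System.out.println(cosine(docVec, invertedQueryVec));   // 1.0 (up to rounding)
        }
    }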

For the TF/IDF classifier I’m building into LingPipe, the IDFs are computed for document vectors at compile time, thus ensuring that the only query that will score 1.0 against a document is the query constructed out of the document itself.

Of course, if you’re a major search engine, matching boolean queries and sorting by some social-network metric of influence with a bit of fiddling, none of this really matters. I actually miss Excite, which did TF/IDF search. Perhaps that’s not surprising, given that Doug Cutting, the founder and original developer of Lucene, was their lead search engine developer.