I’ve just implemented information gain for feature selection, but I’m unhappy with how poorly its selection criterion is suited to linear classifiers like logistic regression, which take real-valued inputs, or even naive Bayes, which takes count-valued inputs. I’ve been thinking about the best I can do is to quantize the input features to three categories, positive, negative, and zero.

### Information Gain (aka Mutual Information)

Information gain (closely related to mutual information) is defined intuitively as the reduction in your uncertainty about the category of an item being classified (X) provided by knowing the value of a feature (Y). Uncertainty is measured in entropy for distributions, sample entropy or estimated model entropy for data sets. Uncertainty given the value of a feature is conditional entropy. The gain is the difference between the entropy and conditional entropy.

In symbols, the entropy H(X) of a discrete random variable X with distribution p() is

H(X) = Σ_{x in X} p(x) log p(x).

Conditional entropy is just the entropy of random variable X given the value of Y,

H(X|Y) = Σ_{ y in Y} p(y) H(X|Y=y),

where H(X|Y=y) is the entropy of the variable X restricted to cases where Y=y,

H(X|Y=y) = Σ_{x in X} p(x|y) log p(x|y).

The information gain of a feature Y with respect to X is just its mutual information with respect to X:

IG_{X}(Y) = I(X; Y) = H(X) – H(X|Y).

Because H(X) is fixed, a feature Y has a higher information gain than feature Z if H(X|Y) < H(X|Z). Information gain is just mutual information as a function of Y given a fixed X. This confusing terminology is like referring to p(y|θ) as a likelihood function when considered a function of θ with y fixed and as a conditional probability function when considered as a function of y with θ fixed.

### Count and Continuous Variables

In the most common case, Y is a boolean (Bernoulli) random variable, taking on only two possible values, 0 and 1. More than two outcomes is typically coded with boolean indicator variables, exactly one of which takes on value 1 for any input. In these cases, information gain makes sense.

But we’re usually dealing with count variables (e.g. naive Bayes, language model classifiers) or continuous variables (e.g. logistic regression, K-nearest neighbors, decision trees). The definition of conditional entropy already coves the count case, because it’s countable. But it assumes that somehow naive Bayes could somehow treat the counts non-monotonically. In fact, naive Bayes probability estimates are always monotonic in the features (because they’re a simple multiplying counts by word log probability estimates).

Continuous variables are handled by replacing the sum with an integral. But with discriminative classifiers like logistic regression, we don’t even have a generative model of the input data, so what do we integrate? Logistic regression, at least without expanding the feature basis with interactions, quadratic terms, etc., is also monotonic, though it can go up and down because of the possibility of negative values.

What I’ve been toying with doing is just quantizing continuous variables and count variables into three categories: negative (Y < 0), zero (Y = 0), and positive (Y > 0). If no one has any objections or better ideas, that’s what’s going into `com.aliasi.features`

for the next release. It reduces to the usual definition for binary features (coded either way, as variables that take on only values 0 or 1, or take on only values -1 or 1).

This is pretty much what George Forman did in his 2003 *JMLR* paper on feature selection, An Extensive Empirical Study of Feature Selection Metrics for Text Classification (in their special issue on feature selection). *Spoiler Warning:* Forman found information gain to be among the top feature selection methods for text classifiers, but only considered count data quantized into two categories, zero and positive.

May 14, 2009 at 11:54 pm |

Hi,

I briefly looked into the same problem a few years ago and tried out a number of different techniques. The most appealing to me were the entropy-based dynamic methods which essentially selected a number of bins and their thresholds based on a kind of minimum description length principle.

I can’t remember the exact method I used but a quick search using terms like “entropy” and “discretization” found the following papers which look vaguely familiar: Evaluating the performance of cost-based discretization versus entropy- and error-based discretization and Error-based and entropy-based discretization of continuous features. The categories of discretization in the conclusions of the latter paper are useful.

I also stumbled across this survey which may have some more pointers: Discretization techniques: a recent survey.

Hopefully some of that may help.

May 16, 2009 at 12:47 pm |

Yes, MDL works great as an stopping criterion.

Sometimes it is easy to overfit if we try too hard when discretizing the features before applying IG (for example, allowing too many bins). Another option that can work surprisingly well is to consider a single best-split-point approach, that is, to restrict ourselves to binary splits. One looks for the min-entropy split value for each feature by sorting the instances by that feature and measuring the entropy at the interesting points. Interesting points are those where there is a change of class, as it is obvious those are the only where the maximum IG can be found.

June 2, 2009 at 9:13 am |

Why not just using L_1 regularization as a means for sparsity in Logistic Regression? :-)

June 2, 2009 at 12:48 pm |

LingPipe implements L_1 regularization, even going so far as to allow varying means and variances per feature.

The reasons to also use feature selection are (1) efficiency [stochastic gradient takes time proportional to number of non-zero features in the input], (2) accuracy [we’ve found doing info gain and then fitting has better 0/1 accuracy than relying on L_1 to do both {yes, I know that opens another can of worms on not optimizing the evaluation metric directly}], and (3) not all of our feature-based classifiers are regression based [we want to use this for perceptrons, for K-nearest neighbors, etc.].

May 17, 2010 at 9:50 am |

[…] and only classify using sentiment rich words. This is usually done using the concept of information gain, aka mutual information, to improve feature selection, which I'll also explore in a future […]