I’ve just implemented information gain for feature selection, but I’m unhappy with how poorly its selection criterion is suited to linear classifiers like logistic regression, which take real-valued inputs, or even naive Bayes, which takes count-valued inputs. I’ve been thinking about the best I can do is to quantize the input features to three categories, positive, negative, and zero.
Information Gain (aka Mutual Information)
Information gain (closely related to mutual information) is defined intuitively as the reduction in your uncertainty about the category of an item being classified (X) provided by knowing the value of a feature (Y). Uncertainty is measured in entropy for distributions, sample entropy or estimated model entropy for data sets. Uncertainty given the value of a feature is conditional entropy. The gain is the difference between the entropy and conditional entropy.
In symbols, the entropy H(X) of a discrete random variable X with distribution p() is
H(X) = Σx in X p(x) log p(x).
Conditional entropy is just the entropy of random variable X given the value of Y,
H(X|Y) = Σ y in Y p(y) H(X|Y=y),
where H(X|Y=y) is the entropy of the variable X restricted to cases where Y=y,
H(X|Y=y) = Σx in X p(x|y) log p(x|y).
The information gain of a feature Y with respect to X is just its mutual information with respect to X:
IGX(Y) = I(X; Y) = H(X) – H(X|Y).
Because H(X) is fixed, a feature Y has a higher information gain than feature Z if H(X|Y) < H(X|Z). Information gain is just mutual information as a function of Y given a fixed X. This confusing terminology is like referring to p(y|θ) as a likelihood function when considered a function of θ with y fixed and as a conditional probability function when considered as a function of y with θ fixed.
Count and Continuous Variables
In the most common case, Y is a boolean (Bernoulli) random variable, taking on only two possible values, 0 and 1. More than two outcomes is typically coded with boolean indicator variables, exactly one of which takes on value 1 for any input. In these cases, information gain makes sense.
But we’re usually dealing with count variables (e.g. naive Bayes, language model classifiers) or continuous variables (e.g. logistic regression, K-nearest neighbors, decision trees). The definition of conditional entropy already coves the count case, because it’s countable. But it assumes that somehow naive Bayes could somehow treat the counts non-monotonically. In fact, naive Bayes probability estimates are always monotonic in the features (because they’re a simple multiplying counts by word log probability estimates).
Continuous variables are handled by replacing the sum with an integral. But with discriminative classifiers like logistic regression, we don’t even have a generative model of the input data, so what do we integrate? Logistic regression, at least without expanding the feature basis with interactions, quadratic terms, etc., is also monotonic, though it can go up and down because of the possibility of negative values.
What I’ve been toying with doing is just quantizing continuous variables and count variables into three categories: negative (Y < 0), zero (Y = 0), and positive (Y > 0). If no one has any objections or better ideas, that’s what’s going into
com.aliasi.features for the next release. It reduces to the usual definition for binary features (coded either way, as variables that take on only values 0 or 1, or take on only values -1 or 1).
This is pretty much what George Forman did in his 2003 JMLR paper on feature selection, An Extensive Empirical Study of Feature Selection Metrics for Text Classification (in their special issue on feature selection). Spoiler Warning: Forman found information gain to be among the top feature selection methods for text classifiers, but only considered count data quantized into two categories, zero and positive.