Information Gain

1. Definition of Information Gain

  • Definition: Information Gain (IG) is a measure used in decision trees to determine which feature to split on at each step of the tree-building process. It quantifies the reduction in entropy (uncertainty) after a dataset is split based on a particular feature. The feature that provides the highest information gain is chosen as the splitting feature, as it best separates the data according to the target variable.

2. Information Gain Formula

[ IG(S, A) = H(S) - \sum_{t \in T} p(t) H(t) ]

where:

  • H(S) – Entropy of set S

  • T – The subsets created by splitting S by attribute A

  • p(t) – The proportion of the number of elements in t to the number of elements in S

  • H(t) – Entropy of subset t
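
To make the entropy term ( H(S) ) concrete, here is a minimal Python sketch; the function name entropy and the example labels are illustrative, not part of the original text:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a sequence of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A perfectly mixed set has entropy 1 bit; a pure set has entropy 0.
print(entropy(["Yes", "Yes", "No", "No"]))   # 1.0
print(entropy(["Yes", "Yes", "Yes", "Yes"])) # -0.0 (i.e. zero)
```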

Example:

Suppose we have a dataset for a binary classification problem and want to calculate the information gain for a specific feature ( A ):

  1. Calculate the initial entropy ( H(D) ).
  2. Split the dataset based on feature ( A ) and calculate the entropy for each subset ( H(D_v) ).
  3. Calculate the weighted average entropy of the subsets.
  4. Subtract the weighted entropy from the initial entropy to get the information gain.
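
The four steps above can be sketched in Python as follows; this is an illustrative implementation under my own naming (entropy, information_gain), repeating the entropy helper so the snippet runs on its own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, A) = H(D) - sum over values v of |D_v|/|D| * H(D_v)."""
    total = len(labels)
    parent_entropy = entropy(labels)                        # step 1: H(D)
    weighted_entropy = 0.0
    for v in set(feature_values):                           # step 2: split D by A
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        weighted_entropy += (len(subset) / total) * entropy(subset)  # step 3
    return parent_entropy - weighted_entropy                # step 4

# Hypothetical toy data: feature A and binary labels.
A = ["High", "High", "High", "Low", "Low"]
y = ["Yes", "Yes", "Yes", "No", "No"]
print(information_gain(A, y))  # A separates the classes perfectly, so IG = H(D) ≈ 0.971
```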

3. Role of Information Gain in Decision Trees

  • Feature Selection:

    • Information Gain is used to decide which feature to use to split the data at each node of the decision tree. The feature with the highest information gain is chosen, as it results in the most significant reduction in uncertainty (entropy).
  • Building the Tree:

    • The decision tree algorithm recursively splits the data using the feature that provides the highest information gain until a stopping criterion is met (e.g., all data points in a node belong to the same class or no further improvement is possible).
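
For reference, scikit-learn's decision tree can be asked to use this entropy-based splitting criterion (which corresponds to maximising information gain at each split) by passing criterion="entropy"; the Iris data below is just a convenient example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" scores candidate splits by entropy reduction
# (the default, "gini", uses Gini impurity instead).
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)
print(clf.get_depth(), clf.score(X, y))
```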

4. Intuition Behind Information Gain

  • High Information Gain:

    • Indicates that a split based on this feature results in subsets with low entropy, meaning the data points in each subset are more homogeneous or pure with respect to the target variable.
  • Low Information Gain:

    • Indicates that a split based on this feature does not significantly reduce the entropy, meaning the subsets are still mixed and uncertain with respect to the target variable.
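
As a concrete (hypothetical) illustration: if a parent node holds 8 "Yes" and 8 "No" examples, then ( H(D) = 1 ) bit. A feature that separates the two classes perfectly produces two pure subsets with zero entropy, so ( IG = 1 - 0 = 1 ) (maximal). A feature whose two subsets each still contain 4 "Yes" and 4 "No" leaves the entropy at 1 bit, so ( IG = 1 - 1 = 0 ) (the split is uninformative).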

5. Examples of Information Gain

Binary Classification Example:

Suppose we have a dataset with two classes (e.g., "Yes" and "No") and a feature ( A ) that can take two values ("High" and "Low"). The steps to calculate the information gain are:

  1. Calculate Initial Entropy: [ H(D) = -p_{\text{Yes}} \log_2 p_{\text{Yes}} - p_{\text{No}} \log_2 p_{\text{No}} ]

  2. Split the Dataset by feature ( A ) and calculate the entropy for each subset, ( D_{\text{High}} ) and ( D_{\text{Low}} ).

  3. Calculate the Weighted Entropy after the split.

  4. Calculate Information Gain: [ IG(D, A) = H(D) - \left( \frac{|D_{\text{High}}|}{|D|} H(D_{\text{High}}) + \frac{|D_{\text{Low}}|}{|D|} H(D_{\text{Low}}) \right) ]
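
As a worked (hypothetical) numeric example, suppose ( D ) has 10 samples with 6 "Yes" and 4 "No", and splitting on ( A ) gives ( D_{\text{High}} ) with 4 "Yes" / 1 "No" and ( D_{\text{Low}} ) with 2 "Yes" / 3 "No". Then:

[ H(D) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.971 ]

[ H(D_{\text{High}}) \approx 0.722, \quad H(D_{\text{Low}}) \approx 0.971 ]

[ IG(D, A) \approx 0.971 - \left( \frac{5}{10} \cdot 0.722 + \frac{5}{10} \cdot 0.971 \right) \approx 0.125 ]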

Multi-Class Example:

The same process applies to a feature with multiple categories or to a target variable with more than two classes.
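
In the general case, with classes ( c = 1, \dots, C ) and values ( v ) of feature ( A ), the formulas read:

[ H(D) = -\sum_{c=1}^{C} p_c \log_2 p_c \qquad IG(D, A) = H(D) - \sum_{v \in \text{values}(A)} \frac{|D_v|}{|D|} H(D_v) ]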

6. Application of Information Gain

  • Decision Trees:

    • Information Gain is crucial in building decision trees, helping to determine the best features to split on to minimize uncertainty and create accurate and efficient trees.
  • Feature Selection:

    • In broader machine learning contexts, Information Gain can be used as a criterion for feature selection, helping to identify the most informative features for model building.
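
As an illustrative sketch of that broader use, scikit-learn exposes an information-gain-style score (the mutual information between each feature and the target) through mutual_info_classif, which can drive feature selection; the synthetic data below is purely for demonstration:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5))   # five integer-valued features
y = (X[:, 0] > 0).astype(int)           # the target depends only on feature 0

# Score every feature by its mutual information with y, then keep the
# k most informative ones.
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print(selector.scores_)        # feature 0 should score highest
print(selector.get_support())  # boolean mask of the selected features
```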