3.3.4. Gain Ratio

C4.5, an improved version of ID3, uses the gain ratio to choose splits in the input data. The Gain Ratio corrects a bias in Information Gain, which favors attributes with a large number of distinct values. To neutralize this, the Gain Ratio divides the Information Gain by the Split Information, a quantity that grows with the number of branches the split would produce and with how evenly the instances are spread across them.

$$\text{Gain Ratio} = \frac{\text{Information Gain}}{\text{Split Information}} = \frac{\text{Entropy}(\text{before}) - \sum_{j=1}^{K} \text{Entropy}(j, \text{after})}{\sum_{j=1}^{K} -w_j \log_2 w_j} \tag{7}$$

where the index *j* runs over the *K* nonempty subsets (branches) produced by the split, and *w<sub>j</sub>* is the fraction of the node's instances that fall into subset *j*, that is, the probability of an instance following branch *j*. For a given Information Gain, the lower the Split Information, the higher the Gain Ratio. The Gain Ratio is thus the Information Gain adjusted for how many values the attribute takes and how the instances are distributed among them.
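
To make Eq. (7) concrete, the following minimal Python sketch computes the gain ratio for a categorical attribute. The function names (`entropy`, `gain_ratio`) and the toy data are illustrative assumptions, not taken from C4.5 itself. The sketch also assumes that each subset's entropy is weighted by *w<sub>j</sub>*, as is usual for information gain, and that a split with zero Split Information is given a gain ratio of 0.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the categorical attribute `values`."""
    n = len(labels)
    # Group the class labels by attribute value: one subset per branch.
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    subsets = list(branches.values())
    weights = [len(s) / n for s in subsets]  # w_j in Eq. (7)
    # Assumption: each subset's entropy is weighted by w_j.
    info_gain = entropy(labels) - sum(w * entropy(s) for w, s in zip(weights, subsets))
    split_info = -sum(w * log2(w) for w in weights)
    # Assumption: a single-branch split (zero split information) scores 0.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 8 instances, 4 per class.
labels = ["yes", "yes", "no", "no", "yes", "yes", "no", "no"]
row_id = list(range(8))                            # unique value per instance
binary = ["a", "a", "b", "b", "a", "a", "b", "b"]  # two values, perfect split

print(gain_ratio(row_id, labels))  # approximately 0.333 (gain 1 bit / split info 3 bits)
print(gain_ratio(binary, labels))  # 1.0 (gain 1 bit / split info 1 bit)
```

Both attributes separate the classes perfectly and therefore have the same Information Gain of 1 bit, but the identifier-like attribute produces eight branches and a Split Information of 3 bits, so its Gain Ratio is only one third of that of the two-valued attribute.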
