The notation of $splits(label)$ under Random Forest

Question

In the paper "Fair Forests: Regularized Tree Induction to Minimize Model Bias", the authors write:

We propose a simple regularization approach to constructing a fair decision tree induction algorithm. This is done by altering the way we measure the information gain $G(T, a)$, where $T$ is a set of training examples, and $a$ is the attribute to split on. We will denote the set of points in each of the $k$ branches of the tree as $T_{i \ldots k}$. This normally is combined with an impurity measure $I(T)$, to give us $$ G(T, a)=I(T)-\sum_{\forall T_{i} \in \text { splits }(a)} \frac{\left|T_{i}\right|}{|T|} \cdot I\left(T_{i}\right) $$ The information gain scores the quality of a splitting attribute $a$ by how much it reduces impurity compared to the current impurity. The larger the gain, the more pure the class labels have become, and thus, should improve classification performance. In the CART algorithm, the Gini impurity is normally used for categorical targets. $$ I_{\mathrm{Gini}}(T)=1-\sum_{\forall T_{i} \in \text { splits }(\text { label })}\left(\frac{\left|T_{i}\right|}{|T|}\right)^{2} $$

Two things in particular confuse me:

The definition of $\mathrm{splits}(a)$ is not given in the paper.
The paper says that $T_i$ is a branch, but to me it looks like a node.

I have already studied Random Forests in ESL, so I know the usual definition of Gini impurity: $$ I_G(p) = 1 - \sum_{i=1}^{k} p_i^2 $$ However, I can't work out how the definitions in this paper correspond to it.

score 0 · Answer 1 · answered Feb 26 '21 at 05:33

You correctly stated the definition of impurity, which is $$I_G(p) = 1 - \sum_{i=1}^k p_i^2$$ This can equivalently be written as $$I_G(p) = \sum_{i=1}^k p_i (1 - p_i)$$
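The two forms agree because the proportions $p_i$ sum to one. A short derivation, using nothing beyond that fact:

$$\sum_{i=1}^k p_i (1 - p_i) = \sum_{i=1}^k p_i - \sum_{i=1}^k p_i^2 = 1 - \sum_{i=1}^k p_i^2, \qquad \text{since } \sum_{i=1}^k p_i = 1.$$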

For any subset $T_i$ in a partition of $T$, you calculate the proportion using the classical definition, i.e.,

$$p_i = \frac{|T_i|}{|T|}$$

Using this, you can derive the impurity definition $I(T_i)$ given in the paper.
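To make that step explicit, here is one possible reading of the notation. It is an assumption, since the paper does not define $\text{splits}(\cdot)$: take $\text{splits}(\text{label})$ to be the partition of $T$ by class label, so that $T_i$ is the set of points carrying the $i$-th label. Substituting $p_i = |T_i|/|T|$ into the standard formula then reproduces the paper's expression:

$$I_{\mathrm{Gini}}(T) = 1 - \sum_{i=1}^k p_i^2 = 1 - \sum_{\forall T_i \in \text{splits}(\text{label})} \left(\frac{|T_i|}{|T|}\right)^2$$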

Once you can calculate the impurity of each branch, you can calculate the impurity of the node after the split as the weighted sum of the branch impurities, where each weight is the proportion of the node's points that fall into that branch. Hence,

$$I_{\text{node}} = \sum_{\forall T_i \in \text{splits}(a)} \frac{|T_i|}{|T|} I(T_i)$$

Subtracting this weighted impurity from the impurity of the parent, $I(T)$, gives the change in impurity, i.e. the information gain $G(T, a)$ as given in the paper.
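The calculation can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `splits(T, key)` is a hypothetical helper that returns the groups of examples in `T` sharing the same value of `key`, so `splits(T, a)` yields the branches for attribute `a` and `splits(T, "label")` yields the groups by class label. The toy data and the attribute name `"a"` are made up, and a numeric attribute would need a threshold test instead of equality.

```python
from collections import defaultdict


def splits(T, key):
    """Partition the examples in T by their value of `key` (hypothetical helper)."""
    groups = defaultdict(list)
    for example in T:
        groups[example[key]].append(example)
    return list(groups.values())


def gini(T):
    """Gini impurity: 1 minus the sum of squared class proportions |T_i| / |T|."""
    return 1.0 - sum((len(T_i) / len(T)) ** 2 for T_i in splits(T, "label"))


def gain(T, a):
    """Information gain: I(T) minus the size-weighted impurity of the branches."""
    weighted = sum(len(T_i) / len(T) * gini(T_i) for T_i in splits(T, a))
    return gini(T) - weighted


# Toy data (assumed for illustration): one categorical attribute "a" and a binary label.
T = [
    {"a": "x", "label": 0},
    {"a": "x", "label": 0},
    {"a": "y", "label": 1},
    {"a": "y", "label": 0},
]

print(gini(T))       # impurity of the parent node
print(gain(T, "a"))  # impurity reduction from splitting on "a"
```

In this reading each $T_i$ is a set of training points, and $\text{splits}(a)$ is simply the collection of those sets, one per branch.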

I hope this solves your doubt.

answered Feb 26 '21 at 05:33

Ashish Jain

I'm sorry for the late reply. – mayu Mar 02 '21 at 02:02
Hope your doubt is clear! – Ashish Jain Mar 02 '21 at 07:02
I'm sorry, I failed to post my follow-up question earlier.
I still can't understand the definition of $splits(a)$. $T_i$ seems to be a set of training points, for example $T_1 = \{x_k \mid x_k < 1\}$, $T_2 = \{x_k \mid x_k \ge 1\}$. Since $T_i \in splits(a)$, $splits(a)$ seems to be a collection that contains the sets $T_i$, but I can't work out what it means. Sorry for the poor explanation. In any case, what puzzles me most is the definition of $splits(a)$.
– mayu Mar 02 '21 at 09:26
