In the paper "Fair Forests: Regularized Tree Induction to Minimize Model Bias", the authors write:
We propose a simple regularization approach to constructing a fair decision tree induction algorithm. This is done by altering the way we measure the information gain $G(T, a)$, where $T$ is a set of training examples, and $a$ is the attribute to split on. We will denote the set of points in each of the $k$ branchs of the tree as $T_{i \ldots k}$. This normally is combined with an impurity measure $I(T)$, to give us $$ G(T, a)=I(T)-\sum_{\forall T_{i} \in \text { splits }(a)} \frac{\left|T_{i}\right|}{|T|} \cdot I\left(T_{i}\right) $$ The information gain scores the quality of a splitting attribute $a$ by how much it reduces impurity compared to the current impurity. The larger the gain, the more pure the class labels have become, and thus, should improve classification performance. In the CART algorithm, the Gini impurity is normally used for categorical targets. $$ I_{\mathrm{Gini}}(T)=1-\sum_{\forall T_{i} \in \text { splits }(\text { label })}\left(\frac{\left|T_{i}\right|}{|T|}\right)^{2} $$
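To make the quoted formulas concrete, here is a minimal Python sketch of the gain computation. It rests on an assumption the excerpt does not confirm: that $\mathrm{splits}(a)$ denotes the partition of $T$ into subsets $T_i$, one per distinct value of attribute $a$ (and likewise $\mathrm{splits}(\text{label})$ is the partition by class label). The function names `splits`, `gini`, and `information_gain` and the toy data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def splits(rows, attribute):
    """ASSUMED meaning of splits(a): partition `rows` into subsets T_i,
    one subset per distinct value of `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return list(groups.values())


def gini(rows, label="label"):
    """Gini impurity as in the quoted formula: 1 - sum over
    T_i in splits(label) of (|T_i| / |T|)^2."""
    n = len(rows)
    if n == 0:
        return 0.0
    return 1.0 - sum((len(t_i) / n) ** 2 for t_i in splits(rows, label))


def information_gain(rows, attribute, label="label"):
    """G(T, a) = I(T) - sum over T_i in splits(a) of |T_i|/|T| * I(T_i)."""
    n = len(rows)
    weighted_child_impurity = sum(
        len(t_i) / n * gini(t_i, label) for t_i in splits(rows, attribute)
    )
    return gini(rows, label) - weighted_child_impurity


# Toy data: attribute "a" separates the two classes perfectly.
rows = [
    {"a": "x", "label": 0},
    {"a": "x", "label": 0},
    {"a": "y", "label": 1},
    {"a": "y", "label": 1},
]
print(information_gain(rows, "a"))  # 0.5
```

Under this reading each $T_i$ is the set of training points sent down one branch, which is the same set of points that reaches the child node at the end of that branch.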
Two things in particular confuse me:
- The definition of $\mathrm{splits}(a)$ is not given in the paper.
- The paper says that $T_i$ is a branch, but to me it looks like a node.
I have already studied random forests in ESL, so I know the usual definition of Gini impurity: $$ I_G(p) = 1 - \sum_{i=1}^{k} p_i^2 $$ However, I can't work out how the definitions in this paper correspond to it.
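One possible reading that would reconcile the two formulas, under the assumption (not stated in the paper excerpt) that $\mathrm{splits}(\text{label})$ is the partition of $T$ by class label, so that $p_i = |T_i| / |T|$ is the fraction of points in class $i$:

$$ I_{\mathrm{Gini}}(T) = 1 - \sum_{T_i \in \mathrm{splits}(\text{label})} \left( \frac{|T_i|}{|T|} \right)^2 = 1 - \sum_{i=1}^{k} p_i^2 = I_G(p) $$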
I don't understand the definition of $\mathrm{splits}(a)$. $T_i$ seems to be a set of training points, for example $T_1 = \{x_k \mid x_k < 1\}$, $T_2 = \{x_k \mid x_k \ge 1\}$. Since $T_i \in \mathrm{splits}(a)$, $\mathrm{splits}(a)$ seems to be a collection that contains the $T_i$, but I can't work out what it means. Sorry for the poor explanation. In any case, what puzzles me most is the definition of $\mathrm{splits}(a)$.
– mayu Mar 02 '21 at 09:26