5

Several papers/books I have read, e.g. The Elements of Statistical Learning (Hastie, Tibshirani, Friedman), say that cross-entropy is used when looking for the best split in a classification tree, without even mentioning entropy in the context of classification trees.

Yet other sources mention entropy, not cross-entropy, as the measure for finding the best splits. Are both measures usable? Is only cross-entropy used? As far as I understand, the two concepts differ significantly from each other.

shenflow

2 Answers

6

Are both measures usable? Is only cross-entropy used?

They both could be used for this special case. However, I personally prefer "entropy" because it requires less mental gymnastics.

Let's first review the definitions. The most agreed-upon and consistent use of entropy and cross-entropy is that entropy is a function of only one distribution, i.e. $-\sum_x P(x)\log P(x)$, and cross-entropy is a function of two distributions, i.e. $-\sum_x P(x)\log Q(x)$ (an integral for continuous $x$).
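As a concrete illustration of these two definitions, here is a minimal Python sketch (the distributions `p` and `q` are made-up examples, not taken from any dataset):

```python
import numpy as np

def entropy(p):
    """Entropy H(P) = -sum_x P(x) log P(x) of one discrete distribution."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = -sum_x P(x) log Q(x) between two distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

# Hypothetical example distributions over three outcomes
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(entropy(p))           # function of one distribution
print(cross_entropy(p, q))  # function of two distributions; >= entropy(p)
print(cross_entropy(p, p))  # equals entropy(p)
```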

The formula used in The Elements of Statistical Learning [Page 308, 9.2.3 Classification Trees] can be written as $$-\sum_k P_{m}(k) \log P_{m}(k)$$ where $P_{m}(k)$ is the proportion of class $k$ in node $m$. This can be interpreted as a function of only one (data) distribution, i.e. an entropy, that measures the impurity of node $m$. Nonetheless, it can also be interpreted as a cross-entropy between the data distribution and the model estimation (based on @DrewN's nice explanation), i.e. $$-\sum_k P^{\text{data}}_{m}(k) \log P^{\text{model}}_{m}(k)$$ where we, hypothetically, set the model estimation to match the data distribution in node $m$, i.e. $$P^{\text{model}}_{m}(k)=P^{\text{data}}_{m}(k) = P_{m}(k),$$ to minimize the cross-entropy. Accordingly, the cross-entropy would equal both the data entropy and the model entropy in value, i.e. $$\begin{align*} \overbrace{-\sum_k P^{\text{data}}_{m}(k) \log P^{\text{model}}_{m}(k)}^{\text{data-model cross-entropy } H(P^{\text{data}}_{m}, P^{\text{model}}_{m})}&=\overbrace{-\sum_k P^{\text{data}}_{m}(k) \log P^{\text{data}}_{m}(k)}^{\text{data entropy } H(P^{\text{data}}_{m})}\\ &=\overbrace{-\sum_k P^{\text{model}}_{m}(k) \log P^{\text{model}}_{m}(k)}^{\text{model entropy } H(P^{\text{model}}_{m})} \end{align*}$$

but it is different in meaning and rightfully has a different name. I say "hypothetically" because, in practice, the classifier only chooses the class with maximum probability, i.e. $$P^{\text{classifier}}_{m}(k)=\left\{\begin{matrix} 1 & k=\underset{k'}{\text{argmax }}P_m(k')\\ 0 & \text{otherwise} \end{matrix}\right.$$

From another perspective, when the cross-entropy is equal to the entropy, the KL divergence is zero: $$\text{KL}(P^{\text{data}}_{m} \parallel P^{\text{model}}_{m}) = H(P^{\text{data}}_{m},P^{\text{model}}_{m}) - H(P^{\text{data}}_{m})=0$$
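As a quick numeric check of this equality, reusing the `entropy` and `cross_entropy` helpers from the sketch above with a made-up class distribution for node $m$:

```python
# Class proportions in a hypothetical node m (e.g. 70% class 0, 30% class 1)
p_data = np.array([0.7, 0.3])
p_model = p_data.copy()  # model estimation set to match the data distribution

h_cross = cross_entropy(p_data, p_model)
h_data = entropy(p_data)
kl = h_cross - h_data

print(h_cross, h_data)  # identical values
print(kl)               # 0.0: KL(P_data || P_model) vanishes
```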

All in all, we can still confidently use "entropy" for decision trees when we talk about node splitting and node impurity. For example, a split occurs when the entropy of the class distribution in the parent node is higher than the weighted average of the class entropies in the left and right children (i.e. positive information gain).
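Here is a minimal Python sketch of that splitting rule (the labels and the candidate split point are made up for illustration; this is not the code of any particular library):

```python
import numpy as np

def node_entropy(labels):
    """Entropy of the empirical class distribution in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(parent, left, right):
    """Parent entropy minus the weighted average of the children's entropies."""
    n = len(parent)
    weighted_children = (len(left) / n) * node_entropy(left) + \
                        (len(right) / n) * node_entropy(right)
    return node_entropy(parent) - weighted_children

# Hypothetical binary labels in a parent node and one candidate split
parent = np.array([0, 0, 0, 1, 1, 1, 1, 0])
left, right = parent[:4], parent[4:]          # candidate split
print(information_gain(parent, left, right))  # positive gain -> the split helps
```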

As an extra note, cross-entropy is mostly used as a loss function to bring one distribution (e.g. the model estimation) closer to another one (e.g. the true distribution). A well-known example is the classification cross-entropy loss (my answer). KL divergence (cross-entropy minus entropy) is used for basically the same reason.

Esmailian
  • 1
    Would you mind bringing a citation for cross entropy? – Green Falcon Mar 12 '19 at 19:11
  • 1
    Thank you @Esmailian. That was what I was thinking as well. It is kind of confusing when definitions overlap and different sources state different things. In which context is cross-entropy used though? – shenflow Mar 12 '19 at 20:25
6

The use of cross-entropy here is not incorrect; it really is the cross-entropy between two specific distributions, as shown below.

Given data $(x_1,y_1), \ldots, (x_N, y_N)$, with each $y_i$ a categorical variable over $K$ classes, we can model the conditional probability $p_k(x)$ for class $k$, which satisfies $\sum_{k=1}^K p_k(x) = 1$ for each $x$. Then the sum $$\frac{1}{N}\sum_{i = 1}^N \sum_{k = 1}^K \mathbf{1}\{y_i = k\}\log p_k(x_i) = \frac{1}{N} \sum_{i = 1}^N \sum_{k = 1}^K q_k(x_i) \log p_k(x_i) $$ is the (average conditional) log-likelihood, and its negative is the cross-entropy between $p$ and the "one-hot" distribution $q$ defined by $q_k(x_i) = \mathbf{1}\{y_i = k\}$, i.e. the distribution that puts probability 1 on the observed class. Logistic regression has the same equation, except that there $p_k(x_i)$ is modeled with a log-linear model.
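To make the correspondence concrete, here is a small Python sketch (the labels and predicted probabilities are invented) checking that the average log-likelihood above is exactly minus the average one-hot cross-entropy:

```python
import numpy as np

# Hypothetical predictions p_k(x_i) for N = 3 points and K = 3 classes
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
y = np.array([0, 1, 2])            # observed classes
q = np.eye(3)[y]                   # one-hot distributions q_k(x_i) = 1{y_i = k}

avg_log_lik = np.mean(np.log(p[np.arange(3), y]))      # (1/N) sum_i log p_{y_i}(x_i)
avg_cross_ent = -np.mean(np.sum(q * np.log(p), axis=1))

print(avg_log_lik, -avg_cross_ent)  # the same number
```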

Suppose we have $K = 2$ and code the categorical responses as 1 and 0; then this reduces to $$\frac{1}{N} \sum_{i = 1}^N \Big( y_i\log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \Big)$$

In the classification tree setting, for a binary tree with $|T|$ terminal nodes corresponding to regions $\mathcal{R}_1, \ldots, \mathcal{R}_{|T|}$, where the $m$th region contains $N_m$ points, we model $p(x_i)$ as a constant $p_m$ in each region, so the log-likelihood becomes:

$$\frac{1}{N} \sum_{m = 1}^{|T|} \sum_{x_i \in \mathcal{R}_m} \Big( y_i \log p_{m} + (1 - y_i) \log(1 - p_{m}) \Big)$$ $$ = \frac{1}{N} \sum_{m = 1}^{|T|} \Big( C_m \log p_m + (N_m - C_m) \log (1 - p_m) \Big)$$ $$ = \frac{1}{N} \sum_{m = 1}^{|T|} N_m \left( \frac{ C_m }{N_m} \log p_m + \frac{ N_m - C_m }{N_m} \log (1 - p_m) \right)$$ where $C_m$ is the number of observations with $y = 1$ in $\mathcal{R}_m$. Taking the derivative with respect to $p_m$ and setting it equal to zero shows that the MLE is $\hat{p}_m = C_m / N_m$, and so the maximized log-likelihood is $$\frac{1}{N} \sum_{m = 1}^{|T|} N_m \left( \hat{p}_m \log \hat{p}_m + (1 - \hat{p}_m) \log (1 - \hat{p}_m) \right)$$ which is just minus the $N_m$-weighted average of the binary entropies of the $\hat{p}_m$. Since $C_m$ and $N_m$ depend on the split points chosen in the tree, so does $\hat{p}_m$.
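As a quick numeric check of this derivation (with made-up 0/1 responses and regions): the per-region MLE is the class-1 proportion $C_m/N_m$, and plugging it back in reproduces minus the weighted node entropies:

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy -(p log p + (1-p) log(1-p)), with 0 log 0 := 0."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Hypothetical 0/1 responses grouped into |T| = 2 regions
regions = [np.array([1, 1, 1, 0]),       # R_1: N_1 = 4, C_1 = 3
           np.array([0, 0, 1, 0, 0])]    # R_2: N_2 = 5, C_2 = 1
N = sum(len(r) for r in regions)

log_lik = 0.0
neg_weighted_entropy = 0.0
for r in regions:
    N_m, C_m = len(r), r.sum()
    p_hat = C_m / N_m                    # per-region MLE
    log_lik += C_m * np.log(p_hat) + (N_m - C_m) * np.log(1 - p_hat)
    neg_weighted_entropy -= N_m * binary_entropy(p_hat)

print(log_lik / N, neg_weighted_entropy / N)  # identical: maximized log-likelihood
                                              # = minus the weighted node entropies
```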

So either cross-entropy or entropy is valid, depending on what you want to talk about.

I found this blog post very useful.

Drew N