Consider the following simple classification problem (Python, scikit-learn):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
def get_product_data(size):
    '''
    Given a size (int), draws max(2, log10(size)) features as uniform
    random variables Xi in [-1, 1], and a target y that is 1 if their
    product P is larger than 0.0 and 0 otherwise.
    Returns a pandas DataFrame.
    '''
    n_features = int(max(2, np.log10(size)))
    features = dict(('x%d' % i, 2 * np.random.rand(size) - 1) for i in range(n_features))
    y = np.prod(list(features.values()), axis=0)
    y = y > 0.0
    features.update({'y': y.astype(int)})
    return pd.DataFrame(features)
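# Sanity check (added for illustration, not in the original post): the classes
# should be roughly balanced, since each Xi is symmetric around 0 and hence
# their product is positive with probability 0.5.
print(get_product_data(1000)['y'].mean())  # typically close to 0.5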
# create random data
df = get_product_data(1000)
X = np.array(df.drop(df.columns[-1], axis=1))  # drop the target column 'y'
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)
def predict(clf):
    '''
    Fits the classifier on the fixed training split above and
    returns its accuracy on the test split.
    '''
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
and the following classifiers:
foo10 = RandomForestClassifier(n_estimators=10, max_features=None, bootstrap=False)
foo100 = RandomForestClassifier(n_estimators=100, max_features=None, bootstrap=False)
foo200 = RandomForestClassifier(n_estimators=200, max_features=None, bootstrap=False)
Why does
predict(foo10) # 0.906060606061
predict(foo100) # 0.933333333333
predict(foo200) # 0.915151515152
give different scores?
Specifically, with
- max_features=None, all features are considered for each tree;
- bootstrap=False, there is no bootstrapping of samples;
- max_depth=None (the default), all trees reach the maximum depth.
I would expect each tree to be exactly the same. Thus, regardless of how many trees the forest has, the predictions should be equal. Where is the trees' variability coming from in this example?
What further parameters would I have to pass to RandomForestClassifier.__init__ so that all the foo* classifiers have the same score?
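One way to narrow this down (a sketch added for illustration, reusing the train/test split above) is to fit two identically-configured, unseeded decision trees and compare their test predictions:

from sklearn.tree import DecisionTreeClassifier

# Two single trees with identical parameters, fitted on identical data.
# If DecisionTreeClassifier were deterministic, their predictions would match.
tree_a = DecisionTreeClassifier(max_features=None)
tree_b = DecisionTreeClassifier(max_features=None)
tree_a.fit(X_train, y_train)
tree_b.fit(X_train, y_train)
# Fraction of test points on which the two trees disagree; typically nonzero.
print(np.mean(tree_a.predict(X_test) != tree_b.predict(X_test)))

If the two trees disagree on some test points, the variability already arises inside a single DecisionTreeClassifier rather than from the ensemble.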
Comments:

– I had only set the random_state of the RandomForest itself, because I wrongly assumed that DecisionTreeClassifier was deterministic. This is indeed the solution to this question. I believe this question can be simplified by removing the code used to test this: replacing the foo* by a call with (10, max_features=None, bootstrap=False, random_state=1) already gives the same result for all foo*. – Jorge Leitao Feb 07 '17 at 11:57

– When max_depth is small this does not happen. The problem happens in the lower branches, where there is little data: the scores will be identical no matter what decision rule is used (on the training data, of course), therefore things such as the feature iteration order become important. – Ricardo Cruz Feb 07 '17 at 12:10
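A minimal sketch of the fix suggested in the comments (reusing the predict helper above): fixing random_state seeds the per-tree random number generators, so identically-parameterized forests become fully reproducible.

# Two forests with identical parameters and the same fixed seed:
foo_a = RandomForestClassifier(n_estimators=10, max_features=None,
                               bootstrap=False, random_state=1)
foo_b = RandomForestClassifier(n_estimators=10, max_features=None,
                               bootstrap=False, random_state=1)
print(predict(foo_a) == predict(foo_b))  # True: the forest is now deterministic

Note that forests with different n_estimators can still score differently from one another even when seeded; the seed only makes each configuration reproducible.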