Consider the following simple classification problem (Python, scikit-learn):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
def get_product_data(size):
    '''
    Given a size (int), draws max(2, log10(size)) features as uniform
    random variables Xi in [-1, 1], and a target y that is 1 if their
    product P is larger than 0.0 and 0 otherwise.
    Returns a pandas DataFrame.
    '''
    n_features = int(max(2, np.log10(size)))
    features = dict(('x%d' % i, 2 * np.random.rand(size) - 1) for i in range(n_features))
    y = np.prod(list(features.values()), axis=0)
    y = y > 0.0
    features.update({'y': y.astype(int)})
    return pd.DataFrame(features)
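# Sanity check (added for illustration, not in the original post): the classes
# should be roughly balanced, since each Xi is symmetric around 0 and hence
# their product is positive with probability 0.5.
print(get_product_data(1000)['y'].mean())  # typically close to 0.5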
# create random data
df = get_product_data(1000)
X = np.array(df.drop(df.columns[-1], axis=1))  # drop the target column 'y'
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)
def predict(clf):
    '''
    Fits the classifier on the fixed training split above and
    returns its accuracy on the test split.
    '''
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
and the following classifiers:
foo10 = RandomForestClassifier(n_estimators=10, max_features=None, bootstrap=False)
foo100 = RandomForestClassifier(n_estimators=100, max_features=None, bootstrap=False)
foo200 = RandomForestClassifier(n_estimators=200, max_features=None, bootstrap=False)
Why does
predict(foo10) # 0.906060606061
predict(foo100) # 0.933333333333
predict(foo200) # 0.915151515152
give different scores?
Specifically, with
- max_features=None, all features are considered for each tree;
- bootstrap=False, there is no bootstrapping of samples;
- max_depth=None (the default), all trees reach the maximum depth.
I would expect each tree to be exactly the same. Thus, regardless of how many trees the forest has, the predictions should be equal. Where is the trees' variability coming from in this example?
What further parameters would I have to pass to RandomForestClassifier.__init__ so that all the foo* classifiers have the same score?
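One way to narrow this down (a sketch added for illustration, reusing the train/test split above) is to fit two identically-configured, unseeded decision trees and compare their test predictions:

from sklearn.tree import DecisionTreeClassifier

# Two single trees with identical parameters, fitted on identical data.
# If DecisionTreeClassifier were deterministic, their predictions would match.
tree_a = DecisionTreeClassifier(max_features=None)
tree_b = DecisionTreeClassifier(max_features=None)
tree_a.fit(X_train, y_train)
tree_b.fit(X_train, y_train)
# Fraction of test points on which the two trees disagree; typically nonzero.
print(np.mean(tree_a.predict(X_test) != tree_b.predict(X_test)))

If the two trees disagree on some test points, the variability already arises inside a single DecisionTreeClassifier rather than from the ensemble.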
Comments:

– I had only set the random_state of the RandomForest itself, because I wrongly assumed that DecisionTreeClassifier was deterministic. This is indeed the solution to this question. I believe this question can be simplified by removing the code used to test this: replacing the foo* by a call with (10, max_features=None, bootstrap=False, random_state=1) already gives the same result for all foo*. – Jorge Leitao Feb 07 '17 at 11:57

– When max_depth is small this does not happen. The problem happens in the lower branches, where there is little data: the scores will be identical no matter what decision rule is used (on the training data, of course), therefore things such as the feature iteration order become important. – Ricardo Cruz Feb 07 '17 at 12:10
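A minimal sketch of the fix suggested in the comments (reusing the predict helper above): fixing random_state seeds the per-tree random number generators, so identically-parameterized forests become fully reproducible.

# Two forests with identical parameters and the same fixed seed:
foo_a = RandomForestClassifier(n_estimators=10, max_features=None,
                               bootstrap=False, random_state=1)
foo_b = RandomForestClassifier(n_estimators=10, max_features=None,
                               bootstrap=False, random_state=1)
print(predict(foo_a) == predict(foo_b))  # True: the forest is now deterministic

Note that forests with different n_estimators can still score differently from one another even when seeded; the seed only makes each configuration reproducible.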