
I'm using Spark with Scala to implement majority voting of decision trees and a random forest (both configured in the same way: the same depth, the same number of base classifiers, etc.). The dataset is split equally among the base classifiers for majority voting. The Nemenyi test shows that majority voting is significantly better (on 11 benchmarking datasets from KEEL).

From what I understand, the difference between these two methods is that the data used to train the random forest's base classifiers might not cover the whole dataset. Is my understanding correct? If so, what might be the reason for the observed difference?
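
For reference, here is a minimal sketch of the kind of setup I mean, assuming `train` and `test` are DataFrames with `id`, `features` and `label` columns and a binary label; the column names and the voting logic are just illustrative:

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassifier, RandomForestClassifier}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch only: k decision trees, each trained on a disjoint 1/k of the data,
// versus a random forest with the same depth and the same number of trees.
val k = 10
val maxDepth = 5

// Disjoint, equal-sized splits for the majority-voting ensemble.
val splits: Array[DataFrame] = train.randomSplit(Array.fill(k)(1.0 / k), seed = 42L)

val treeModels = splits.map { part =>
  new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxDepth(maxDepth)
    .fit(part)
}

// Each tree votes on the test set; the ensemble prediction is the per-row majority
// (binary labels assumed here, so "more than half the trees voted 1.0").
val votes = treeModels.zipWithIndex.map { case (m, i) =>
  m.transform(test).select(col("id"), col("prediction").alias(s"p$i"))
}
val joined = votes.reduce(_.join(_, "id"))
val voteSum = (0 until k).map(i => when(col(s"p$i") === 1.0, 1).otherwise(0)).reduce(_ + _)
val majorityVoted = joined.withColumn("prediction", when(voteSum * 2 > k, 1.0).otherwise(0.0))

// Random forest configured the same way (same depth, same number of base classifiers).
val rfModel = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxDepth(maxDepth)
  .setNumTrees(k)
  .fit(train)
```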

Also, could you point me to any articles comparing those two methods?

Edit: If someone is interested in this topic, here's an article comparing bagging with horizontal partitioning, in favor of the latter.

Andronicus

3 Answers


A random forest predicts the class with the highest probability estimate. The predicted class probabilities for an input sample are computed as the mean of the predicted class probabilities of the trees in the forest. The class probability from a single tree is the fraction of training samples of that class in the leaf the sample falls into.
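
As a small sketch in plain Scala (the numbers are made up for illustration), this is the averaging step:

```scala
// Class probabilities from three trees' leaves (fraction of each class among the
// training samples that landed in that leaf), for a single input sample.
val leafFractions = Seq(
  Vector(0.80, 0.20),  // tree 1
  Vector(0.30, 0.70),  // tree 2
  Vector(0.60, 0.40)   // tree 3
)

// The forest's class probabilities are the per-class means over the trees.
val forestProbs = leafFractions.transpose.map(ps => ps.sum / ps.size)
// forestProbs ~= Seq(0.5667, 0.4333)

// Predict the class with the highest mean probability.
val predicted = forestProbs.zipWithIndex.maxBy(_._1)._2  // 0
```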

Majority voting, also called hard voting, means that every individual classifier votes for a class and the majority wins. In statistical terms, the predicted target label of the ensemble is the mode of the distribution of the individually predicted labels.

Majority voting may work better in cases where there are some outliers. Consider a binary problem where four trees estimate the probability of class 1 as $\{0.51, 0.51, 0.51, 0.01\}$: soft voting averages these to $0.385$ and predicts class 0, while the corresponding hard votes $\{1, 1, 1, 0\}$ give a 3-to-1 majority for class 1.
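
The same comparison as a tiny Scala sketch, using the probabilities quoted above:

```scala
// Each value is one tree's estimated probability of class 1 for the same sample.
val probs = Seq(0.51, 0.51, 0.51, 0.01)

// Soft voting: average the probabilities, then threshold at 0.5.
val soft = if (probs.sum / probs.size > 0.5) 1 else 0   // mean = 0.385 -> class 0

// Hard voting: each tree votes first, then take the majority.
val votes = probs.map(p => if (p > 0.5) 1 else 0)       // Seq(1, 1, 1, 0)
val hard  = if (votes.sum * 2 > votes.size) 1 else 0    // 3 of 4 -> class 1
```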

aminrd

Random forests' base-learner trees use bootstrapping, by default with rate 1.0 (parameter subsamplingRate); that is, the dataset is resampled at its original size, but with replacement. So each tree learns on a dataset of the same size as the original, with some points duplicated and some left out. For large datasets, roughly a third of the points are left out of each tree's sample. With enough trees (really, just a few is enough), it becomes extremely unlikely that any data point is never used by any of the trees.
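
(The "roughly a third" figure is the usual bootstrap limit: with $n$ points drawn $n$ times with replacement, the probability that a given point never appears in one tree's sample is $\left(1 - \tfrac{1}{n}\right)^{n} \to e^{-1} \approx 0.368$ as $n \to \infty$.)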

Spark appears to use hard voting for its random forests, so that's not the difference.

It seems to me that the main difference here is that you've partitioned the data for your custom implementation, so those base learners learn on substantially less data. If that's doing well, it suggests that the random forest is overfitting in comparison. I would suggest varying the tree parameters, say by making the trees in the random forest more conservative, to see how they compare then.
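
For example, a few knobs on Spark's RandomForestClassifier that make the individual trees more conservative; the specific values below are just illustrative starting points, not recommendations:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Illustrative only: tighten the trees and/or shrink the per-tree sample
// to see whether the forest's disadvantage comes from overfitting.
val rf = new RandomForestClassifier()
  .setNumTrees(10)             // keep the ensemble size comparable to the voting ensemble
  .setMaxDepth(4)              // shallower trees than before
  .setMinInstancesPerNode(10)  // require more samples per leaf
  .setMinInfoGain(0.01)        // drop splits with negligible gain
  .setSubsamplingRate(0.5)     // each tree sees a smaller bootstrap sample
```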

Ben Reiniger

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, outputs the final prediction.

A greater number of trees in the forest generally leads to higher accuracy and helps to prevent overfitting.

A voting classifier is a machine learning model that combines an ensemble of several models and predicts an output class based on their votes (or, in the soft-voting variant, on the highest average predicted class probability). It simply aggregates the predictions of each classifier passed to it and outputs the class chosen by the majority. The idea is that, instead of building separate dedicated models and evaluating each one's accuracy on its own, we build a single ensemble that trains these models and predicts based on their combined majority vote for each output class.
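
As a tiny illustrative sketch (plain Scala, labels made up), the aggregation step is just a mode over the individual predictions:

```scala
// Predicted labels from several different classifiers for one sample.
val predictions = Seq("cat", "dog", "cat")

// The voting classifier outputs the most common label (the mode).
val ensemblePrediction = predictions
  .groupBy(identity)
  .maxBy(_._2.size)
  ._1
// ensemblePrediction == "cat"
```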