
I'm using Spark with Scala to implement majority voting of decision trees and a random forest (both configured in the same way: the same depth, the same number of base classifiers, etc.). The dataset is split equally among the base classifiers for majority voting. The Nemenyi test shows that majority voting is significantly better (on 11 benchmarking datasets from KEEL).

From what I understand, the difference between these two methods is that the data used to train the random forest's base classifiers might not cover the whole dataset. Is my understanding correct? If so, what might be the reason for the observed difference?
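
For reference, here is a minimal sketch of the kind of setup I mean, assuming `train` and `test` are DataFrames with `id`, `features` and `label` columns and a binary label; the column names and the voting logic are just illustrative:

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassifier, RandomForestClassifier}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch only: k decision trees, each trained on a disjoint 1/k of the data,
// versus a random forest with the same depth and the same number of trees.
val k = 10
val maxDepth = 5

// Disjoint, equal-sized splits for the majority-voting ensemble.
val splits: Array[DataFrame] = train.randomSplit(Array.fill(k)(1.0 / k), seed = 42L)

val treeModels = splits.map { part =>
  new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxDepth(maxDepth)
    .fit(part)
}

// Each tree votes on the test set; the ensemble prediction is the per-row majority
// (binary labels assumed here, so "more than half the trees voted 1.0").
val votes = treeModels.zipWithIndex.map { case (m, i) =>
  m.transform(test).select(col("id"), col("prediction").alias(s"p$i"))
}
val joined = votes.reduce(_.join(_, "id"))
val voteSum = (0 until k).map(i => when(col(s"p$i") === 1.0, 1).otherwise(0)).reduce(_ + _)
val majorityVoted = joined.withColumn("prediction", when(voteSum * 2 > k, 1.0).otherwise(0.0))

// Random forest configured the same way (same depth, same number of base classifiers).
val rfModel = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxDepth(maxDepth)
  .setNumTrees(k)
  .fit(train)
```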

Also, could you point me to any articles comparing those two methods?

Edit: If someone is interested in this topic, here's an article comparing bagging with horizontal partitioning, in favor of the latter.

Andronicus

3 Answers


A random forest predicts the class with the highest probability estimate. The predicted class probabilities for an input sample are computed as the mean of the predicted class probabilities of the trees in the forest. The class probability from a single tree is the fraction of training samples of that class in the leaf the sample falls into.
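
As a small sketch in plain Scala (the numbers are made up for illustration), this is the averaging step:

```scala
// Class probabilities from three trees' leaves (fraction of each class among the
// training samples that landed in that leaf), for a single input sample.
val leafFractions = Seq(
  Vector(0.80, 0.20),  // tree 1
  Vector(0.30, 0.70),  // tree 2
  Vector(0.60, 0.40)   // tree 3
)

// The forest's class probabilities are the per-class means over the trees.
val forestProbs = leafFractions.transpose.map(ps => ps.sum / ps.size)
// forestProbs ~= Seq(0.5667, 0.4333)

// Predict the class with the highest mean probability.
val predicted = forestProbs.zipWithIndex.maxBy(_._1)._2  // 0
```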

Majority voting, also called hard voting, means that every individual classifier votes for a class and the majority wins. In statistical terms, the predicted target label of the ensemble is the mode of the distribution of the individually predicted labels.

Majority voting may work better in cases where there are some outliers. Consider a binary problem where four trees estimate the probability of class 1 as $\{0.51, 0.51, 0.51, 0.01\}$: soft voting averages these to $0.385$ and predicts class 0, while the corresponding hard votes $\{1, 1, 1, 0\}$ give a 3-to-1 majority for class 1.
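
The same comparison as a tiny Scala sketch, using the probabilities quoted above:

```scala
// Each value is one tree's estimated probability of class 1 for the same sample.
val probs = Seq(0.51, 0.51, 0.51, 0.01)

// Soft voting: average the probabilities, then threshold at 0.5.
val soft = if (probs.sum / probs.size > 0.5) 1 else 0   // mean = 0.385 -> class 0

// Hard voting: each tree votes first, then take the majority.
val votes = probs.map(p => if (p > 0.5) 1 else 0)       // Seq(1, 1, 1, 0)
val hard  = if (votes.sum * 2 > votes.size) 1 else 0    // 3 of 4 -> class 1
```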

aminrd

Random forests' base-learner trees use bootstrapping, by default with rate 1.0 (parameter subsamplingRate); that is, the dataset is resampled at its original size, but with replacement. So each tree learns on a dataset of the same size as the original, with some points duplicated and some left out. For large datasets, roughly a third of the points are left out of each tree's sample. With enough trees (really, just a few is enough), it becomes extremely unlikely that any data point is never used by any of the trees.
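
(The "roughly a third" figure is the usual bootstrap limit: with $n$ points drawn $n$ times with replacement, the probability that a given point never appears in one tree's sample is $\left(1 - \tfrac{1}{n}\right)^{n} \to e^{-1} \approx 0.368$ as $n \to \infty$.)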

Spark appears to use hard voting for its random forests, so that's not the difference.

It seems to me that the main difference here is that you've partitioned the data for your custom implementation, so those base learners learn on substantially less data. If that's doing well, it suggests that the random forest is overfitting in comparison. I would suggest varying the tree parameters, say by making the trees in the random forest more conservative, to see how they compare then.
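
For example, a few knobs on Spark's RandomForestClassifier that make the individual trees more conservative; the specific values below are just illustrative starting points, not recommendations:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Illustrative only: tighten the trees and/or shrink the per-tree sample
// to see whether the forest's disadvantage comes from overfitting.
val rf = new RandomForestClassifier()
  .setNumTrees(10)             // keep the ensemble size comparable to the voting ensemble
  .setMaxDepth(4)              // shallower trees than before
  .setMinInstancesPerNode(10)  // require more samples per leaf
  .setMinInfoGain(0.01)        // drop splits with negligible gain
  .setSubsamplingRate(0.5)     // each tree sees a smaller bootstrap sample
```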

Ben Reiniger

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, outputs the final prediction.

A greater number of trees in the forest generally leads to higher accuracy and helps to prevent overfitting.

A voting classifier is a machine learning model that combines an ensemble of several models and predicts an output class based on their votes (or, in the soft-voting variant, on the highest average predicted class probability). It simply aggregates the predictions of each classifier passed to it and outputs the class chosen by the majority. The idea is that, instead of building separate dedicated models and evaluating each one's accuracy on its own, we build a single ensemble that trains these models and predicts based on their combined majority vote for each output class.
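
As a tiny illustrative sketch (plain Scala, labels made up), the aggregation step is just a mode over the individual predictions:

```scala
// Predicted labels from several different classifiers for one sample.
val predictions = Seq("cat", "dog", "cat")

// The voting classifier outputs the most common label (the mode).
val ensemblePrediction = predictions
  .groupBy(identity)
  .maxBy(_._2.size)
  ._1
// ensemblePrediction == "cat"
```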