type
Post
Created date
Jun 16, 2022 01:21 PM
category
Data Science
tags
Machine Learning
status
Published

Definition


  • Works on the same bagging principle, but differs in how features are selected in the subsetted data. Unlike bagging, where all features are considered for splitting a node, RF selects only a random subset of the features at each split (see the sketch after this list).
    • m is the number of sampled predictors; p is the total number of predictors; typically we choose m ≈ √p.
      • The number of predictors considered at each split is approximately equal to the square root of the total number of predictors, e.g. 6 out of 36.
  • Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. [RFAACG]
  • Random forest builds multiple trees and averages the results to reduce the variance.
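A minimal sketch of this subsetting knob, assuming the randomForest package and a hypothetical data frame df with a factor outcome y and 36 predictors (all object names here are illustrative, not from the original notes):

library(randomForest)

p <- ncol(df) - 1                  # total number of predictors (36 in this example)
m <- floor(sqrt(p))                # predictors sampled at each split (about 6)

rf <- randomForest(y ~ ., data = df, ntree = 500, mtry = m)
rf$mtry                            # the subset size actually used per split

Setting mtry = p instead would make every predictor available at every split, i.e. plain bagging.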
 
 

Theory


 

Mechanism


To classify a new object based on its attributes, each tree gives a classification, and the forest chooses the classification having the most votes (over all the trees in the forest); in the case of regression, it takes the average of the outputs of the different trees.
Algorithm
Step 1: In a random forest, n random records are drawn (with replacement) from a data set containing k records.
Step 2: An individual decision tree is constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on majority voting for classification or averaging for regression, as sketched below.
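As a rough illustration of the voting step, here is a hedged sketch using the randomForest package; df, y and new_df are hypothetical training data and new cases, not objects from these notes:

library(randomForest)

rf <- randomForest(y ~ ., data = df, ntree = 100)

# Per-tree class predictions for the new cases, then a manual majority vote
pred <- predict(rf, newdata = new_df, predict.all = TRUE)
per_tree <- pred$individual                      # rows = cases, columns = trees
vote <- apply(per_tree, 1, function(r) names(which.max(table(r))))

# The manual vote matches the forest's aggregated prediction (up to tie-breaking)
all(vote == as.character(pred$aggregate))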

Out of bag data

When the training set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of the sample. This oob (out-of-bag) data is used to get a running unbiased estimate of the classification error as trees are added to the forest. It is also used to get estimates of variable importance. [UCBerkeley]
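The "about one-third" figure can be checked with a quick base-R simulation (no model needed): a bootstrap sample of size n drawn with replacement misses roughly 1/e ≈ 36.8% of the cases.

set.seed(1)
n <- 10000
in_bag <- sample(seq_len(n), size = n, replace = TRUE)   # one bootstrap sample
mean(!seq_len(n) %in% in_bag)                            # about 0.368 left out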

OOB error (analyticsvidhya)

  • A validation metric to test whether the random forest model is performing well.
  • The OOB error is the proportion of OOB samples that are wrongly classified (see the snippet after this list).
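A minimal sketch of reading this metric off a fitted forest, assuming a classification forest rf fitted with the randomForest package (the object name is illustrative): err.rate tracks the running OOB error as trees are added.

oob_curve <- rf$err.rate[, "OOB"]        # one OOB error value per number of trees grown
tail(oob_curve, 1)                       # final OOB error rate of the forest
plot(oob_curve, type = "l", xlab = "number of trees", ylab = "OOB error")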

Advantages of using OOB_Score:

  1. No leakage of data: Since the model is validated on the OOB sample, which has not been used in training the model in any way, there is no leakage of data, and this ensures a better predictive model.
  2. Less variance: [More variance ~ overfitting, i.e. a high training score but a low testing score.] Since OOB_Score ensures no leakage, there is no overfitting of the data and hence the least variance.
  3. Better predictive model: OOB_Score keeps variance low and hence yields a much better predictive model than one validated with other techniques.
  4. Less computation: It requires less computation, as it allows the data to be tested while the model is being trained.

Disadvantages of using OOB_Error :

  1. Time-consuming: The method allows the data to be tested as the model is being trained, but the overall process is a bit time-consuming compared with other validation techniques.
  2. Not good for large datasets: Because the process can be a bit time-consuming compared with other techniques, training may take a lot more time if the data size is huge.
  3. Best for small and medium-sized datasets: Even though the process is time-consuming, if the dataset is small or medium-sized, OOB_Score should be preferred over other techniques for a much better predictive model.
 

Proximity

  • For every pair of observations, the proximity measure tells you the percentage of time they end up in the same leaf node. For example, if your random forest consists of 100 trees and a pair of observations end up in the same leaf node in 80 of the 100 trees, then the proximity measure is 80/100 = 0.8 (see the sketch after this list).
  • The higher the proximity measure, the more similar the pair of observations.
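A small sketch of computing proximities with the randomForest package (df and y are hypothetical, as above):

library(randomForest)

rf <- randomForest(y ~ ., data = df, ntree = 100, proximity = TRUE)

# rf$proximity[i, j] is the fraction of trees in which cases i and j fall
# into the same terminal node, e.g. 0.8 if they co-occur in 80 of 100 trees
rf$proximity[1, 2]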

Assumption


No formal distributional assumptions are made, as the method is non-parametric.
It can thus handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.

Shortcoming


Computation power
A large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite slow to create predictions once they are trained. A more accurate prediction requires more trees, which results in a slower model.
In most real-world applications, the random forest algorithm is fast enough but there can certainly be situations where run-time performance is important and other approaches would be preferred.

Advantage


Versatility
It can be used for both regression and classification tasks, and it’s also easy to view the relative importance it assigns to the input features.
Accuracy
Random forest is also a very handy algorithm because the default hyperparameters it uses often produce a good prediction result.
Avoid overfitting
One of the biggest problems in machine learning is overfitting, but most of the time this won’t happen thanks to the random forest classifier. If there are enough trees in the forest, the classifier won’t overfit the model.
 

Example

R code

library(caret)          # provides train() and trainControl()
library(randomForest)   # back-end used by method = "rf"

set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE
)
# Best tuning parameter
model$bestTune
Plot variable importance
# Plot MeanDecreaseAccuracy
varImpPlot(model$finalModel, type = 1)
# Plot MeanDecreaseGini
varImpPlot(model$finalModel, type = 2)

R interpretation

Variable importance
Further information can be found in [UCBerkeley].
  • MeanDecreaseAccuracy is the average decrease in model accuracy when predicting the outcome of the out-of-bag samples after a specific variable is excluded from the model.
  • MeanDecreaseGini is the average decrease in node impurity that results from splits over that variable. The Gini impurity index is only used for classification problems; in regression, node impurity is measured by the training-set RSS. These measures, calculated using the training set, are less reliable than a measure calculated on out-of-bag data. See Chapter @ref(decision-tree-models) for node impurity measures (Gini index and RSS). The snippet after this list shows how to extract the numeric values behind the plots.
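For the numeric values behind these plots, assuming the caret fit model from the example above (trained with importance = TRUE), something like the following should work:

library(randomForest)

importance(model$finalModel)   # columns include MeanDecreaseAccuracy and MeanDecreaseGini
varImp(model)                  # caret's scaled view of the same importances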
The out-of-bag (oob) error estimate [UCBerkeley]
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees.
At the end of the run, take j to be the class that got most of the votes every time case n was oob.
The proportion of times that j is not equal to the true class of n averaged over all cases is the OOB error estimate.
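In code, this procedure corresponds to the OOB predictions returned by the randomForest package when predict() is called without new data; rf is a hypothetical randomForest classification fit and y its training outcome:

oob_pred <- predict(rf)        # for each case n, the class j that won the OOB vote
mean(oob_pred != y)            # proportion of cases with j != true class = OOB error estimate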
 

Math


Reference


Slides

Extra Resource



FAQ

Difference Between Decision Trees and Random Forests [RFAACG]
While random forest is a collection of decision trees, there are some differences.
If you input a training dataset with features and labels into a decision tree, it will formulate some set of rules, which will be used to make the predictions.
For example, to predict whether a person will click on an online advertisement, you might collect the ads the person clicked on in the past and some features that describe their decision. If you put the features and labels into a decision tree, it will generate some rules that help predict whether the advertisement will be clicked or not. In comparison, the random forest algorithm randomly selects observations and features to build several decision trees and then averages the results.
Another difference is “deep” decision trees might suffer from overfitting. Most of the time, random forest prevents this by creating random subsets of the features and building smaller trees using those subsets. Afterwards, it combines the subtrees. It’s important to note this doesn’t work every time and it also makes the computation slower, depending on how many trees the random forest builds.
[vidhya]
Why can random forest overcome the problem of bagging?
The problem of bagging:
If we have a few strong predictors that are correlated, then most or all of the trees in the collection of bagged trees will use these strong predictors in the top split.
When all of the bagged trees look quite similar to each other, averaging them does not reduce the variance over a single tree by as much in this setting.
Solution: Random forest
By forcing each split to consider only a random subset of m of the p predictors, on average (p − m)/p of the splits will ignore the strong predictor (as sketched below).
This works like decorrelating the trees, thereby making each tree more independent and improving the ensemble prediction.
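A hedged sketch of this comparison with the randomForest package, assuming a classification task and the hypothetical df/y from earlier; bagging is just a random forest with mtry = p:

library(randomForest)

p <- ncol(df) - 1
bagged <- randomForest(y ~ ., data = df, mtry = p)               # bagging: all predictors at every split
rf     <- randomForest(y ~ ., data = df, mtry = floor(sqrt(p)))  # random forest: decorrelated trees

# Compare final OOB error rates
c(bagging = tail(bagged$err.rate[, "OOB"], 1),
  forest  = tail(rf$err.rate[, "OOB"], 1))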
 
 
Difference between cross-validation and OOB
When using the cross-validation technique, every validation set has already been seen or used in training by a few decision trees, so there is a leakage of data and therefore more variance. OOB_Score, in contrast, prevents leakage and gives a model with lower variance, which is why we use the OOB score to validate the model.
How does Random Forest prevent overfitting?
  • RF prevents overfitting by averaging many decorrelated trees; the OOB score gives a signal of whether the forest predicts unseen data well, without any leakage, and hence leads to a better model with low variance.

Bagging is an important feature of random forests (though not exclusive to them). Every tree in the random forest is fitted to only about 2/3 of the data points in the dataset, randomly selected (a bootstrap sample), and so every data point is used for fitting by only about 2/3 of the trees in the forest. The 'bag' is the set of points used for fitting a tree; the OOB points are therefore the remaining ~1/3 of points that were not used for fitting it.
OOB predictions for a specific data point are generated by using only the ~1/3 of the trees that do not contain that data point to make the predictions. This provides a more unbiased estimate of the prediction error of the random forest for that data point. The process is repeated for all data points to generate OOB predictions for every point in the dataset. For the same reasons, I would expect OOB proximity estimation to be less biased and more reliable than 'in-bag' estimation (see the sketch below).
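The 2/3 vs 1/3 split can be checked on a fitted forest by keeping the in-bag record, again assuming the randomForest package and the hypothetical df/y:

library(randomForest)

rf <- randomForest(y ~ ., data = df, ntree = 500, keep.inbag = TRUE)

# For each case: the share of trees whose bootstrap sample contained it
in_bag_fraction <- rowMeans(rf$inbag > 0)
summary(in_bag_fraction)       # centred near 2/3; the remaining trees treat the case as OOB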
 
 
 