type
Post
Created date
Jun 16, 2022 01:21 PM
category
Data Science
tags
Machine Learning
status
Published

Definition


  • Works on the same bagging principle, but differs in how features are selected in the subsetted data. Unlike bagging, where all features are considered for splitting a node, RF selects only a random subset of the features at each split (see the sketch after this list).
    • m is the number of sampled predictors; p is the total number of predictors; typically we choose m ≈ √p.
      • The number of predictors considered at each split is approximately equal to the square root of the total number of predictors, e.g. 6 out of 36.
  • Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. [RFAACG]
  • Random forest builds multiple trees and averages the results to reduce the variance.
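A minimal sketch of this subsetting knob, assuming the randomForest package and a hypothetical data frame df with a factor outcome y and 36 predictors (all object names here are illustrative, not from the original notes):

library(randomForest)

p <- ncol(df) - 1                  # total number of predictors (36 in this example)
m <- floor(sqrt(p))                # predictors sampled at each split (about 6)

rf <- randomForest(y ~ ., data = df, ntree = 500, mtry = m)
rf$mtry                            # the subset size actually used per split

Setting mtry = p instead would make every predictor available at every split, i.e. plain bagging.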
 
 

Theory


 

Mechanism


To classify a new object based on its attributes, each tree gives a classification, and the forest chooses the classification having the most votes (over all the trees in the forest); in the case of regression, it takes the average of the outputs of the different trees.
Algorithm
Step 1: In a random forest, n random records are drawn (with replacement) from a data set containing k records.
Step 2: An individual decision tree is constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on majority voting for classification or averaging for regression, as sketched below.
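As a rough illustration of the voting step, here is a hedged sketch using the randomForest package; df, y and new_df are hypothetical training data and new cases, not objects from these notes:

library(randomForest)

rf <- randomForest(y ~ ., data = df, ntree = 100)

# Per-tree class predictions for the new cases, then a manual majority vote
pred <- predict(rf, newdata = new_df, predict.all = TRUE)
per_tree <- pred$individual                      # rows = cases, columns = trees
vote <- apply(per_tree, 1, function(r) names(which.max(table(r))))

# The manual vote matches the forest's aggregated prediction (up to tie-breaking)
all(vote == as.character(pred$aggregate))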

Out of bag data

When the training set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of the sample. This oob (out-of-bag) data is used to get a running unbiased estimate of the classification error as trees are added to the forest. It is also used to get estimates of variable importance. [UCBerkeley]
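The "about one-third" figure can be checked with a quick base-R simulation (no model needed): a bootstrap sample of size n drawn with replacement misses roughly 1/e ≈ 36.8% of the cases.

set.seed(1)
n <- 10000
in_bag <- sample(seq_len(n), size = n, replace = TRUE)   # one bootstrap sample
mean(!seq_len(n) %in% in_bag)                            # about 0.368 left out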

OOB error (analyticsvidhya)

  • A validation metric to test whether the random forest model is performing well.
  • The OOB error is the proportion of OOB samples that are wrongly classified (see the snippet after this list).
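A minimal sketch of reading this metric off a fitted forest, assuming a classification forest rf fitted with the randomForest package (the object name is illustrative): err.rate tracks the running OOB error as trees are added.

oob_curve <- rf$err.rate[, "OOB"]        # one OOB error value per number of trees grown
tail(oob_curve, 1)                       # final OOB error rate of the forest
plot(oob_curve, type = "l", xlab = "number of trees", ylab = "OOB error")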

Advantages of using OOB_Score:

  1. No leakage of data: Since the model is validated on the OOB sample, which has not been used in training the model in any way, there is no leakage of data, and this ensures a better predictive model.
  2. Less variance: [More variance ~ overfitting, i.e. a high training score but a low testing score.] Since OOB_Score ensures no leakage, there is no overfitting of the data and hence the least variance.
  3. Better predictive model: OOB_Score keeps variance low and hence yields a much better predictive model than one validated with other techniques.
  4. Less computation: It requires less computation, as it allows the data to be tested while the model is being trained.

Disadvantages of using OOB_Error :

  1. Time-consuming: The method allows the data to be tested as the model is being trained, but the overall process is a bit time-consuming compared with other validation techniques.
  2. Not good for large datasets: Because the process can be a bit time-consuming compared with other techniques, training may take a lot more time if the data size is huge.
  3. Best for small and medium-sized datasets: Even though the process is time-consuming, if the dataset is small or medium-sized, OOB_Score should be preferred over other techniques for a much better predictive model.
 

Proximity

  • For every pair of observations, the proximity measure tells you the percentage of time they end up in the same leaf node. For example, if your random forest consists of 100 trees and a pair of observations end up in the same leaf node in 80 of the 100 trees, then the proximity measure is 80/100 = 0.8 (see the sketch after this list).
  • The higher the proximity measure, the more similar the pair of observations.
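A small sketch of computing proximities with the randomForest package (df and y are hypothetical, as above):

library(randomForest)

rf <- randomForest(y ~ ., data = df, ntree = 100, proximity = TRUE)

# rf$proximity[i, j] is the fraction of trees in which cases i and j fall
# into the same terminal node, e.g. 0.8 if they co-occur in 80 of 100 trees
rf$proximity[1, 2]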

Assumption


No formal distributional assumptions are made, as the method is non-parametric.
It can thus handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.

Shortcoming


Computation power
A large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite slow to create predictions once they are trained. A more accurate prediction requires more trees, which results in a slower model.
In most real-world applications, the random forest algorithm is fast enough but there can certainly be situations where run-time performance is important and other approaches would be preferred.

Advantage


Versatility
It can be used for both regression and classification tasks, and it’s also easy to view the relative importance it assigns to the input features.
Accuracy
Random forest is also a very handy algorithm because the default hyperparameters it uses often produce a good prediction result.
Avoid overfitting
One of the biggest problems in machine learning is overfitting, but most of the time this won’t happen thanks to the random forest classifier. If there are enough trees in the forest, the classifier won’t overfit the model.
 

Example

R code

library(caret)          # provides train() and trainControl()
library(randomForest)   # back-end used by method = "rf"

set.seed(123)
model <- train(
  diabetes ~ ., data = train.data, method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE
)
# Best tuning parameter
model$bestTune
Plot variable importance
# Plot MeanDecreaseAccuracy
varImpPlot(model$finalModel, type = 1)
# Plot MeanDecreaseGini
varImpPlot(model$finalModel, type = 2)

R interpretation

Variable importance
Further information can be found in [UCBerkeley].
  • MeanDecreaseAccuracy is the average decrease in model accuracy when predicting the outcome of the out-of-bag samples after a specific variable is excluded from the model.
  • MeanDecreaseGini is the average decrease in node impurity that results from splits over that variable. The Gini impurity index is only used for classification problems; in regression, node impurity is measured by the training-set RSS. These measures, calculated using the training set, are less reliable than a measure calculated on out-of-bag data. See Chapter @ref(decision-tree-models) for node impurity measures (Gini index and RSS). The snippet after this list shows how to extract the numeric values behind the plots.
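For the numeric values behind these plots, assuming the caret fit model from the example above (trained with importance = TRUE), something like the following should work:

library(randomForest)

importance(model$finalModel)   # columns include MeanDecreaseAccuracy and MeanDecreaseGini
varImp(model)                  # caret's scaled view of the same importances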
The out-of-bag (oob) error estimate [UCBerkeley]
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees.
At the end of the run, take j to be the class that got most of the votes every time case n was oob.
The proportion of times that j is not equal to the true class of n averaged over all cases is the OOB error estimate.
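In code, this procedure corresponds to the OOB predictions returned by the randomForest package when predict() is called without new data; rf is a hypothetical randomForest classification fit and y its training outcome:

oob_pred <- predict(rf)        # for each case n, the class j that won the OOB vote
mean(oob_pred != y)            # proportion of cases with j != true class = OOB error estimate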
 

Math


Reference


Slides

Extra Resource



FAQ

Difference Between Decision Trees and Random Forests [RFAACG]
While random forest is a collection of decision trees, there are some differences.
If you input a training dataset with features and labels into a decision tree, it will formulate some set of rules, which will be used to make the predictions.
For example, to predict whether a person will click on an online advertisement, you might collect the ads the person clicked on in the past and some features that describe their decision. If you put the features and labels into a decision tree, it will generate some rules that help predict whether the advertisement will be clicked or not. In comparison, the random forest algorithm randomly selects observations and features to build several decision trees and then averages the results.
Another difference is “deep” decision trees might suffer from overfitting. Most of the time, random forest prevents this by creating random subsets of the features and building smaller trees using those subsets. Afterwards, it combines the subtrees. It’s important to note this doesn’t work every time and it also makes the computation slower, depending on how many trees the random forest builds.
[vidhya]
Why can random forest overcome the problem of bagging?
The problem of bagging:
If we have a few strong predictors that are correlated, then most or all of the trees in the collection of bagged trees will use these strong predictors in the top split.
When all of the bagged trees look quite similar to each other, averaging them does not reduce the variance over a single tree by as much in this setting.
Solution: Random forest
By forcing each split to consider only a random subset of m of the p predictors, on average (p − m)/p of the splits will ignore the strong predictor (as sketched below).
This works like decorrelating the trees, thereby making each tree more independent and improving the ensemble prediction.
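A hedged sketch of this comparison with the randomForest package, assuming a classification task and the hypothetical df/y from earlier; bagging is just a random forest with mtry = p:

library(randomForest)

p <- ncol(df) - 1
bagged <- randomForest(y ~ ., data = df, mtry = p)               # bagging: all predictors at every split
rf     <- randomForest(y ~ ., data = df, mtry = floor(sqrt(p)))  # random forest: decorrelated trees

# Compare final OOB error rates
c(bagging = tail(bagged$err.rate[, "OOB"], 1),
  forest  = tail(rf$err.rate[, "OOB"], 1))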
 
 
Difference between cross-validation and OOB
When using the cross-validation technique, every validation set has already been seen or used in training by a few decision trees, so there is a leakage of data and therefore more variance. OOB_Score, in contrast, prevents leakage and gives a model with lower variance, which is why we use the OOB score to validate the model.
How does Random Forest prevent overfitting?
  • RF prevents overfitting by averaging many decorrelated trees; the OOB score gives a signal of whether the forest predicts unseen data well, without any leakage, and hence leads to a better model with low variance.

Bagging is an important feature of random forests (though not exclusive to them). Every tree in the random forest is fitted to only about 2/3 of the data points in the dataset, randomly selected (a bootstrap sample), and so every data point is used for fitting by only about 2/3 of the trees in the forest. The 'bag' is the set of points used for fitting a tree; the OOB points are therefore the remaining ~1/3 of points that were not used for fitting it.
OOB predictions for a specific data point are generated by using only the ~1/3 of the trees that do not contain that data point to make the predictions. This provides a more unbiased estimate of the prediction error of the random forest for that data point. The process is repeated for all data points to generate OOB predictions for every point in the dataset. For the same reasons, I would expect OOB proximity estimation to be less biased and more reliable than 'in-bag' estimation (see the sketch below).
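The 2/3 vs 1/3 split can be checked on a fitted forest by keeping the in-bag record, again assuming the randomForest package and the hypothetical df/y:

library(randomForest)

rf <- randomForest(y ~ ., data = df, ntree = 500, keep.inbag = TRUE)

# For each case: the share of trees whose bootstrap sample contained it
in_bag_fraction <- rowMeans(rf$inbag > 0)
summary(in_bag_fraction)       # centred near 2/3; the remaining trees treat the case as OOB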
 
 
 