type: Post
Created date: Jun 16, 2022 01:21 PM
category: Data Science
tags: Machine Learning
status: Published

Definition


  • The goal is to try to minimise the entropy to 0 (a small sketch after this list computes the entropy of a pure and an impure node).
  • Multiple algorithms exist for deciding how to split a node into two or more sub-nodes.
    • The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, the purity of the node increases with respect to the target variable.
  • It starts with a root node and ends with decisions made at the leaf nodes.
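A tiny sketch of the quantity being minimised (the helper name node_entropy and the toy label vectors are illustrative, not from the original notes):

node_entropy <- function(labels) {
  # proportion of each class in the node
  p <- table(labels) / length(labels)
  p <- p[p > 0]                    # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

node_entropy(factor(c("a", "a", "b", "b")))   # 1: maximally impure two-class node
node_entropy(factor(c("a", "a", "a", "a")))   # 0: pure node, which is the goal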

Theory


Term [KDnuggets]

notion image
  1. Root Node: It represents the entire population or sample, which then gets divided into two or more homogeneous sets.
  2. Splitting: The process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
  5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.
  6. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
  7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

Assumption


  1. Initially, the whole training set is considered as the root.
  2. Records are distributed recursively on the basis of attribute values.

Mechanism


  • Works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning. [STHDA]
The tree will stop growing when one of the following three criteria is met: [STHDA]
  1. All leaf nodes are pure, containing a single class;
  2. There remains a pre-specified minimum number of training observations that cannot be assigned to each leaf node with any splitting method (MINSPLIT):
    1. If MINSPLIT = 5, a node must contain at least 5 observations before it can be split further.
    2. In other words: the minimum number of samples required for each split. [Vidhya]
    3. For example, we can require a minimum of 10 samples to reach a decision. That means if a node has fewer than 10 samples, this parameter stops any further splitting of the node and makes it a leaf node. [Vidhya]
  3. The number of observations in a leaf node reaches the pre-specified minimum (MINBUCKET):
    1. If MINBUCKET = 5, every terminal node has to contain at least 5 observations.
The training error will decrease as we increase the max_depth value, but the test error might increase, which indicates overfitting (see the sketch below).
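A minimal sketch of this effect, assuming the rpart package and the built-in iris data (the dataset and the 70/30 split are illustrative choices only): trees of increasing maxdepth are fitted and their training and test accuracy compared.

library(rpart)

set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))   # 70/30 train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

for (d in c(1, 3, 10)) {
  fit <- rpart(Species ~ ., data = train, method = "class",
               control = rpart.control(maxdepth = d, cp = 0, minsplit = 2))
  acc <- function(dat) mean(predict(fit, dat, type = "class") == dat$Species)
  cat("maxdepth =", d,
      "| train accuracy =", round(acc(train), 3),
      "| test accuracy =",  round(acc(test), 3), "\n")
}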
 

Pros


  1. Compared to other algorithms, data preparation requires less time.
  2. It does not require the data to be normalised.
  3. Missing values, to an extent, do not affect its performance much.
  4. It is very intuitive, as it can be explained as a series of if-else conditions.

Cons


  1. Training the model takes a long time.
  2. A small change in the data can cause a considerably large change in the structure of the decision tree.
  3. Comparatively expensive to train.
  4. Not good for regression tasks.

Example

Extend the entropy formula so that it can be used to describe the impurity for a possible split of the data into two subsets. That is, it needs to be the sum of the impurity for both the left and right subsets of the data.
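One common way to write this is to weight each subset's entropy by its share of the observations (a sketch; the notation below is assumed rather than taken from the exercise sheet):

$$H_{\text{split}} = \frac{n_L}{n} H(D_L) + \frac{n_R}{n} H(D_R), \qquad H(D) = -\sum_{k} p_k \log_2 p_k$$

where $n_L$ and $n_R$ are the numbers of observations falling into the left and right subsets, $n = n_L + n_R$, and $p_k$ is the proportion of class $k$ within a subset.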
1a-entropy
1c-impurity
1b-derivatives
2a-splits
2b-classification-rule
3a-olive-split
 
 

Pruning

Problem:

A fully grown tree will overfit the training data and the resulting model might not be good for predicting the outcome of new test data.

Solution

Pruning controls this problem.
  • You trim off the branches of the tree, i.e., remove the decision nodes starting from the leaf node such that the overall accuracy is not disturbed. [KDnuggets]
  • This is done by segregating the actual training set into two sets: training data set, D and validation data set, V. Prepare the decision tree using the segregated training data set, D. Then continue trimming the tree accordingly to optimize the accuracy of the validation data set, V. [KDnuggets]
  • We have pre-pruning (stopping the tree early, e.g. via minsplit, minbucket or maxdepth) and post-pruning (growing a full tree and then cutting it back); a post-pruning sketch is shown below.
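A minimal sketch of post-pruning with rpart's complexity parameter (this uses rpart's built-in cross-validation rather than a separate validation set V, and the iris data is only an illustrative choice): grow a large tree, find the cp value with the lowest cross-validated error, then prune back to it.

library(rpart)

# Grow a deliberately large tree by setting cp = 0.
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

# Pick the cp with the lowest cross-validated error (xerror) ...
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]

# ... and prune the tree back to that complexity.
pruned_tree <- prune(full_tree, cp = best_cp)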

R code

library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(tidymodels)  # decision_tree(), set_engine(), set_mode(), fit()

olive <- read_csv("http://www.ggobi.org/book/data/olive.csv") %>%
  rename(name = `...1`) %>%
  dplyr::select(-name, -area) %>%
  mutate(region = factor(region))

tree_spec <- decision_tree() %>% set_engine("rpart")
class_tree_spec <- tree_spec %>% set_mode("classification")

olive_rp <- class_tree_spec %>% fit(region ~ ., data = olive)
olive_rp

## You can change the cp value according to your data set.
## Note that a lower cp value means a bigger tree;
## using too low a cp can also lead to overfitting.
printcp(olive_rp$fit)
Plot
library(rpart.plot)

olive_rp %>%
  extract_fit_engine() %>%
  rpart.plot()

# ======= examine the split using ggplot
ggplot(olive, aes(x = eicosenoic, y = linoleic, colour = region)) +
  geom_point() +
  scale_color_brewer("", palette = "Dark2") +
  geom_vline(xintercept = 6.5) +
  annotate("line", x = c(0, 6.5), y = c(1053.5, 1053.5)) +
  theme(aspect.ratio = 1)
Tuning params
## rpart.control(minsplit = 20, minbucket = round(minsplit/3), maxdepth = 30)
## Arguments:
##  -minsplit:  the minimum number of observations a node must contain before the algorithm performs a split
##  -minbucket: the minimum number of observations in a final node, i.e. a leaf
##  -maxdepth:  the maximum depth of any node of the final tree (the root node is treated as depth 0)

control <- rpart.control(minsplit = 4, minbucket = round(5 / 3),
                         maxdepth = 3, cp = 0)

## data_train and accuracy_tune() are assumed to be defined elsewhere
## (a training data frame with a `survived` column and a helper returning test accuracy).
tune_fit <- rpart(survived ~ ., data = data_train, method = 'class', control = control)
accuracy_tune(tune_fit)

R interpretation

We generate predictions from the fitted model and then use a confusion matrix to calculate the performance measures (see the sketch below).
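A sketch using the olive fit from the R code section above (assumes tidymodels, including yardstick, is loaded; for simplicity the model is evaluated on its own training data, whereas in practice a held-out test set should be used):

olive_preds <- predict(olive_rp, new_data = olive) %>%   # adds a .pred_class column
  bind_cols(olive)

conf_mat(olive_preds, truth = region, estimate = .pred_class)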

FAQ

Gini vs Entropy (more here)
• The Gini index takes values in the interval [0, 0.5], whereas the interval of the entropy is [0, 1]. In the following figure, both of them are represented; the Gini index is also shown multiplied by two to see concretely the differences between them, which are not very significant (a small sketch reproducing this comparison follows below).
notion image
• Computationally, entropy is more complex since it makes use of logarithms and consequently, the calculation of the Gini Index will be faster.
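A small sketch that reproduces this comparison for a two-class node (base R plotting; the grid of p values is arbitrary):

p <- seq(0.001, 0.999, by = 0.001)

gini    <- 2 * p * (1 - p)                        # 1 - p^2 - (1 - p)^2, maximum 0.5 at p = 0.5
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)   # maximum 1 at p = 0.5

plot(p, entropy, type = "l", col = "blue", ylab = "impurity")
lines(p, gini, col = "red")
lines(p, 2 * gini, col = "red", lty = 2)          # Gini multiplied by two, as in the figure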

Math


The sum is computed across the different categories or classes in the outcome variable. [KDnuggets]
For the classification tree, two measures of impurity are the Gini index and entropy; both equal 0 for a completely pure node and increase with the degree of impurity (for a two-class node the maximum is 0.5 for Gini and 1 for entropy). [KDnuggets]
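In their standard multi-class forms, with $p_k$ the proportion of observations of class $k$ in the node (the sum runs over the classes, as noted above):

$$\text{Gini} = \sum_{k} p_k (1 - p_k) = 1 - \sum_{k} p_k^2, \qquad \text{Entropy} = -\sum_{k} p_k \log_2 p_k$$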

Gini impurity

  • The worst-case split occurs at p = 0.5, where p(1 − p) reaches its maximum of 0.25.
  • Understand it as a COST FUNCTION used to evaluate splits in the dataset.
A higher value of the Gini index implies higher inequality and higher heterogeneity.
Example from ETC3250 lecture
notion image
 

Entropy (Measure of disorder)

For a two-class node, the entropy can be written as −p log2(p) − (1 − p) log2(1 − p), where p is the proportion of misclassified observations within the subpartition.
  • The worst-case split is at 50%, where the entropy value is highest (1); the best split has entropy 0.
Example from FIT3152 lecture
initial state

For the regression tree, we use the RMSE (root mean squared error) as the measure. [TTP]

The prediction error is measured by the RMSE, which corresponds to the average difference between the observed known values of the outcome and the values predicted by the model.
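Explicitly, writing $y_i$ for the observed values and $\hat{y}_i$ for the model's predictions over $n$ observations:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$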
  • The lower the RMSE, the better the model.
  • The best cp (complexity parameter) value is the one that minimizes the prediction error RMSE (root mean squared error).
notion image
  • Further maths is here.
 

Reference


Slides

Extra Resource


Video
Exercise for entropy / Gini
Lab
 
 
 