Created date
Jun 16, 2022 01:21 PM
Data Science
Machine Learning


  • The goal is to minimise the entropy of the leaf nodes towards 0.
  • Multiple algorithms exist to decide how to split a node into two or more sub-nodes.
    • Creating sub-nodes increases the homogeneity of the resulting sub-nodes. In other words, the purity of each node increases with respect to the target variable.
  • A tree starts with a root node and ends with decisions made at the leaves.


Terms [KDnuggets]

notion image
notion image
  1. Root Node: Represents the entire population or sample; it gets divided further into two or more homogeneous sets.
  2. Splitting: The process of dividing a node into two or more sub-nodes.
  3. Decision Node: A sub-node that splits into further sub-nodes.
  4. Leaf / Terminal Node: A node that does not split any further.
  5. Pruning: Removing the sub-nodes of a decision node; the opposite process of splitting.
  6. Branch / Sub-Tree: A subsection of the entire tree.
  7. Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes; the sub-nodes are its children.


  1. Initially, the whole training set is considered the root.
  2. Records are distributed recursively on the basis of attribute values.


  • Works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning. [STHDA]
The tree stops growing when any of the following three criteria is met: [STHDA]
  1. All leaf nodes are pure, containing a single class.
  2. A node contains fewer than a pre-specified minimum number of training observations, so no splitting method is applied to it. (MINSPLIT)
    1. If MINSPLIT = 5, a node needs at least 5 observations before the algorithm will attempt to split it.
    2. In other words: the minimum number of samples required for a split. [Vidhya]
    3. For example, we can require a minimum of 10 samples to attempt a split. If a node has fewer than 10 samples, this parameter stops any further splitting of the node and makes it a leaf node. [Vidhya]
  3. The number of observations in a leaf node reaches the pre-specified minimum. (MINBUCKET)
    1. If MINBUCKET = 5, every leaf node has to contain at least 5 observations.
Training error decreases as the max_depth value increases, but test error may start to increase, which indicates overfitting.


  1. Data preparation requires less time compared to other algorithms.
  2. Does not require data to be normalised.
  3. Missing values, up to an extent, do not affect its performance much.
  4. Very intuitive, since the model can be explained as a series of if-else conditions.


  1. Takes a long time to train the model.
  2. A small change in the data can cause a considerably large change in the tree structure.
  3. Comparatively expensive to train.
  4. Not well suited to regression tasks.


Extend the entropy formula so that it can be used to describe the impurity for a possible split of the data into two subsets.
That is, it needs to be the weighted sum of the impurity of both the left and right subsets of the data.
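One standard way to write this extension (notation assumed here: $n_L$ and $n_R$ are the sizes of the left and right subsets $S_L$ and $S_R$):

```latex
H_{\text{split}} = \frac{n_L}{n_L + n_R} H(S_L) + \frac{n_R}{n_L + n_R} H(S_R),
\qquad H(S) = -\sum_{i} p_i \log_2 p_i
```

where $p_i$ is the proportion of class $i$ within subset $S$. The weights ensure a split of a large, impure subset counts for more than a split of a small one.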



A fully grown tree will overfit the training data and the resulting model might not be good for predicting the outcome of new test data.


Pruning controls this problem.
  • You trim off branches of the tree, i.e., remove decision nodes starting from the leaf nodes, such that the overall accuracy is not disturbed. [KDnuggets]
  • This is done by segregating the actual training set into two sets: a training data set, D, and a validation data set, V. Build the decision tree using the training data set, D, then keep trimming the tree to optimise the accuracy on the validation data set, V. [KDnuggets]
  • There are two approaches: pre-pruning and post-pruning.
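A minimal post-pruning sketch with rpart, assuming the built-in iris data for illustration: grow a deliberately large tree with cp = 0, then prune back to the cp value with the lowest cross-validated error.

```r
library(rpart)

# Grow a deliberately large tree (cp = 0 disables the complexity penalty)
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

# Pick the cp value with the lowest cross-validated error (xerror)
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]

# Post-prune: trim branches that do not improve the fit by at least best_cp
pruned_tree <- prune(full_tree, cp = best_cp)

# The pruned tree is never larger than the fully grown tree
nrow(pruned_tree$frame) <= nrow(full_tree$frame)
```

This is the post-pruning route; pre-pruning instead stops growth early via minsplit, minbucket, or maxdepth, as in the tuning code further below.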

R code

Some good links:
```r
library(tidyverse)   # read_csv(), dplyr, ggplot2
library(tidymodels)  # decision_tree(), set_engine(), set_mode(), fit()

olive <- read_csv("http://www.ggobi.org/book/data/olive.csv") %>%
  rename(name = `...1`) %>%
  dplyr::select(-name, -area) %>%
  mutate(region = factor(region))

tree_spec <- decision_tree() %>%
  set_engine("rpart")

class_tree_spec <- tree_spec %>%
  set_mode("classification")

olive_rp <- class_tree_spec %>%
  fit(region ~ ., data = olive)

olive_rp

## You can change the cp value according to your data set.
## Note that a lower cp value means a bigger tree;
## too low a cp can also lead to overfitting.
printcp(olive_rp$fit)
```
```r
library(rpart.plot)

olive_rp %>%
  extract_fit_engine() %>%
  rpart.plot()

# ======= examine using ggplot
ggplot(olive, aes(x = eicosenoic, y = linoleic, colour = region)) +
  geom_point() +
  scale_color_brewer("", palette = "Dark2") +
  geom_vline(xintercept = 6.5) +
  annotate("line", x = c(0, 6.5), y = c(1053.5, 1053.5)) +
  theme(aspect.ratio = 1)
```
Tuning params
```r
library(rpart)

## rpart.control(minsplit = 20, minbucket = round(minsplit/3), maxdepth = 30)
## Arguments:
## - minsplit: minimum number of observations a node must have before the
##   algorithm attempts a split
## - minbucket: minimum number of observations in a final node, i.e. a leaf
## - maxdepth: maximum depth of any node of the final tree
##   (the root node is treated as depth 0)
control <- rpart.control(minsplit = 4, minbucket = round(5 / 3),
                         maxdepth = 3, cp = 0)
tune_fit <- rpart(survived ~ ., data = data_train, method = "class",
                  control = control)
accuracy_tune(tune_fit)  # accuracy_tune() is a user-defined helper, not part of rpart
```

R interpretation

We use the fitted model to predict on test data, then build a confusion matrix to calculate the performance measures.
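A minimal sketch of this, assuming the built-in iris data; the 80/20 split and the seed are illustrative assumptions, not from the source:

```r
library(rpart)

set.seed(1)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit <- rpart(Species ~ ., data = train, method = "class")

# Predict class labels on the held-out test set
pred <- predict(fit, test, type = "class")

# Confusion matrix: rows = predicted class, columns = actual class
conf <- table(Predicted = pred, Actual = test$Species)
conf

# Accuracy = correctly classified (diagonal) / total observations
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```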


Gini vs Entropy (more here)
• The Gini index takes values in the interval [0, 0.5], whereas entropy lies in [0, 1]. Both are shown in the following figure; the Gini index is also plotted multiplied by two to show the differences between the two measures concretely, which are not very significant.
notion image
• Computationally, entropy is more complex since it uses logarithms, so the Gini index is faster to calculate.


The sum is computed across the different categories or classes in the outcome variable. [KDnuggets]
For a classification tree, two measures of purity are Gini impurity and entropy. Both equal 0 at greatest purity; at maximum impurity in a two-class problem, Gini impurity reaches 0.5 and entropy reaches 1. [KDnuggets]

Gini impurity

  • The worst-case split occurs at a 50/50 mix, where each class contributes p(1-p) = 0.25.
  • Understand it as a COST FUNCTION used to evaluate splits in the dataset.
A higher value of the Gini index implies higher inequality and higher heterogeneity.
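For reference, the standard definition, with $p_i$ the proportion of class $i$ in the node:

```latex
\text{Gini} = 1 - \sum_{i=1}^{k} p_i^2
```

For two classes this reduces to $2p(1-p)$, which is maximised at 0.5 when $p = 0.5$ (each term $p(1-p)$ contributing 0.25).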
Example from ETC3250 lecture
notion image

Entropy (Measure of disorder)

Entropy = -Σᵢ pᵢ log₂(pᵢ), where pᵢ is the proportion of observations of class i within the subpartition.
  • The worst-case split is 50/50, where entropy is highest (1 for two classes). The best split has entropy 0.
Example from FIT3152 lecture
initial state
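The two impurity measures can be checked numerically; a minimal base-R sketch (the class proportions below are illustrative):

```r
# Entropy: -sum(p * log2(p)), dropping zero proportions (0 * log 0 is taken as 0)
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gini impurity: 1 - sum(p^2)
gini <- function(p) 1 - sum(p^2)

entropy(c(0.5, 0.5))  # worst 2-class split: 1
entropy(c(1))         # pure node: 0
gini(c(0.5, 0.5))     # worst 2-class split: 0.5
gini(c(1))            # pure node: 0
```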

For a regression tree, we use the RMSE (root mean squared error) to measure performance. [TTP]

The prediction error is measured by the RMSE, which corresponds to the average difference between the observed known values of the outcome and the predicted value by the model
  • The lower the RMSE, the better the model.
  • The best cp (complexity parameter) value is the one that minimizes the prediction error RMSE (root mean squared error).
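Written out, with $y_i$ the observed value and $\hat{y}_i$ the model's prediction:

```latex
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```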
notion image
  • Further maths is here.



Extra Resource

Exercise for entropy / Gini