Here we use two functions, namely:

  1. DecisionTreeClassifier
  2. DecisionTreeRegressor

The DecisionTreeClassifier function is used when the target/dependent variable is qualitative: binary class (dichotomous, say Yes or No) or multi-class (say Low, Middle, High).

The DecisionTreeRegressor function is used when the target/dependent variable is numerical (say sales volume in dollars, price in dollars, etc.).
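Both functions come from scikit-learn's tree module. A minimal sketch (just the imports and default constructors):

  from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

  clf = DecisionTreeClassifier()   # for qualitative (categorical) targets
  reg = DecisionTreeRegressor()    # for numerical targets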

Before using these decision tree functions, let us see how the root node and internal nodes are split.

Decision Trees Splitting Criteria

We follow different criteria for the classifier function and the regressor function.

  1. Classification criteria
    1. Entropy
    2. Information Gain
    3. Gini Index or Gini Impurity
  2. Regression criteria
    1. Mean Squared Error (MSE)
    2. Poisson Deviance
    3. Friedman MSE
    4. Mean Absolute Error (MAE)

We use a recursive algorithm. Metrics known as splitting criteria or Attribute Selection Measures (ASM) are used to evaluate and select the best features. For each node, the algorithm identifies the candidate threshold that will be used as a separator to split that node. The root node is identified first, then the internal nodes, and finally the leaf nodes.

For classification we use the Gini index, entropy, information gain and log_loss.

For regression we talk about variance reduction: mean squared error (MSE), mean absolute error (MAE), Friedman MSE and Poisson deviance.

Classification Criterion: Gini and Gini Index

Gini follows the impurity concept. It measures the impurity at a node.

The frequencies of the two classes determine purity and impurity (how homogeneous or non-homogeneous the node is).

  1. If the frequencies of the two classes (0 and 1) are exactly equal, the node has a high degree of impurity.
  2. If the frequencies of the two classes (0 and 1) are at the extreme ends (one class dominates), the node has a low degree of impurity.

Gini 

It measures the probability of misclassifying a randomly chosen element if it were labeled according to the distribution of classes in the dataset.

Example: suppose a dataset has 100 samples, 70 Yes (1) and 30 No (0). Then the probability of class 1 is 0.70 and the probability of class 0 is 0.30.

G = 1 - (p^2 + q^2) = 1 - (0.70^2 + 0.30^2) = 1 - (0.49 + 0.09) = 0.42

A Gini impurity value of 0.42 tells us there is a 42% chance of misclassifying a randomly chosen sample at this node.

Gini Index

It also measures the impurity at a node. For node k it is defined as

Gini(k) = ∑_j P(j|k) * (1 - P(j|k))

where P(j|k) is the proportion of category j in node k.

A smaller Gini index implies less impurity.

Gini index Calculation:

Say the number of classes is 2 (0 and 1).

Consider node k with 70 ones and 30 zeros.

Gini(k) = 0.70*(1 - 0.70) + 0.30*(1 - 0.30) = 2*0.70*0.30 = 0.42

This is close to the maximum value, so it implies high impurity.

A node has maximum impurity when 50% of the samples are 1 and 50% are 0.

In that case the Gini index equals 2*0.5*0.5 = 0.50.

For two classes, the Gini index falls between 0 and 0.5.
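As a quick check of the arithmetic above, here is a small Python sketch of the Gini calculation (the function name and inputs are just for illustration):

  def gini(counts):
      # Gini impurity from a list of class counts at a node
      total = sum(counts)
      return 1 - sum((c / total) ** 2 for c in counts)

  print(gini([70, 30]))  # 0.42 = 1 - (0.70^2 + 0.30^2)
  print(gini([50, 50]))  # 0.50 = maximum impurity for two classes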

Entropy:

Entropy is a measure from information theory that represents the level of impurity or disorder in a dataset.

It is another impurity measure that is frequently used. The entropy of node k is given by

Entropy(k) = -∑_j P(j|k) * log2(P(j|k))

Entropy Calculation

Say the number of classes is 2 (0 and 1). Consider node k with 70 zeros and 30 ones.

Entropy(k) = -0.70*log2(0.70) - 0.30*log2(0.30) = 0.88

  • This entropy value reflects the uncertainty in this distribution.
  • Higher value indicates greater disorder
  • Lower value indicates a clearer separation between classes
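The same kind of check can be done for entropy; a small sketch (illustrative only):

  from math import log2

  def entropy(counts):
      # Shannon entropy (base 2) from a list of class counts at a node
      total = sum(counts)
      return -sum((c / total) * log2(c / total) for c in counts if c > 0)

  print(round(entropy([70, 30]), 2))  # 0.88
  print(round(entropy([50, 50]), 2))  # 1.00 = maximum disorder for two classes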

Using DecisionTreeClassifier function

Binary class problem

Target Variable is Dichotomous (Yes(1) or No(0))

Criterion – GINI

We deal with binary class problems where the dependent/target variable is dichotomous: Yes or No, 1 or 0 (for example, Sex: 1 = male, 0 = female). The data relates to the passengers of the Titanic. Survived is the target variable, with categories survived (1) and not survived (0). The independent variables are Pclass (the class the passenger travelled in), Sex with two categories, male (1) and female (0), and Age, which is numerical. So the target variable and two of the independent variables are categorical, and Age is numerical.

Load and read the data from the local drive
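A sketch of the loading step; the file name and the exact column layout are assumptions (a local CSV with Survived, Pclass, Sex and Age columns, Sex already encoded as 0/1):

  import pandas as pd

  df = pd.read_csv("titanic.csv")   # hypothetical local path
  print(df[["Survived", "Pclass", "Sex", "Age"]].head())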

  • Target variable: Survived has two classes: 0 = not survived, 1 = survived
  • Independent variables: 3; Pclass contains 3 classes (first class = 1, second class = 2, third class = 3); Sex: 0 = female, 1 = male
  • Age is numerical

Separate the features and the target
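Assuming the column names above, separating the features and the target might look like this:

  X = df[["Pclass", "Sex", "Age"]]   # independent variables
  y = df["Survived"]                 # target: 0 = not survived, 1 = survived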

Using DecisionTreeClassifier function – criterion: Gini
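A sketch of fitting the classifier with the Gini criterion and drawing the tree; max_depth=3 and random_state=42 are illustrative choices, and the data is assumed to have no missing Age values:

  from sklearn.tree import DecisionTreeClassifier, plot_tree
  import matplotlib.pyplot as plt

  clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
  clf.fit(X, y)

  plt.figure(figsize=(12, 6))
  plot_tree(clf, feature_names=["Pclass", "Sex", "Age"],
            class_names=["Not survived", "Survived"], filled=True)
  plt.show()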

Predict whether the traveler survived or not by traversing the tree

  • Condition 1: Is Sex <= 0.5? The answer is male = 1.
  •       If the condition is true, move to the left.
  •       If the condition is false, move to the right.
  •       Here the condition fails (1 > 0.5), so move to the right.
  • Condition 2: Is Pclass <= 2.5? The answer is Pclass = 3.
  •       Since the answer (3) is greater than the threshold, the condition fails; move to the right again.
  • Condition 3: Is Age <= 5? The answer is 35.
  •       Since the answer (35) is greater than the threshold, the condition fails; move to the right.
  •       The predicted class is 0 – not survived.

With this tree you can predict the class of the target variable for a new set of independent variables by moving left or right until you reach the final classification.

Prediction Programmatically
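A sketch of how the prediction can be obtained programmatically for one passenger (assumed feature order Pclass, Sex, Age; here a 3rd-class male aged 35):

  sample = [[3, 1, 35]]
  print(clf.predict(sample))        # e.g. [0] -> not survived
  print(clf.predict_proba(sample))  # e.g. [[0.894, 0.106]] -> P(class 0), P(class 1)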

  • Probability of y(0) is 89.40% – not survived
  • Probability of y(1) is 10.60% – survived
  • The probability of non-survival is estimated at 89.40% (class = 0)
  • The probability of survival is estimated at 10.60% (class = 1)

Depth of the tree:

  • The depth of the tree can be reduced or increased as you wish. Remember that when you increase max_depth the tree grows very large, which may lead to overfitting; avoid that.
  • When max_depth is reduced to 2, the line becomes
  • clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=42)

GINI CALCULATION

Root

  • Samples = 891
  • Not survived = 549; % = 549*100/891 = 61.62% = p
  • Survived = 342; % = 342*100/891 = 38.38% = q
  • Gini = 1 - (p^2 + q^2)
  • Gini(root) = 1 - ((549/891)^2 + (342/891)^2) = 0.47

LEFT Child

  • Samples = 577
  • Not survived = 468; % = 468*100/577 = 81.11% = p
  • Survived = 109; % = 109*100/577 = 18.89% = q
  • Gini = 1 - (p^2 + q^2)
  • Gini(left) = 1 - ((0.8111)^2 + (0.1889)^2) = 0.31

RIGHT child

  • Samples = 314
  • Not survived = 81; % = 81*100/314 = 25.80% = p
  • Survived = 233; % = 233*100/314 = 74.20% = q
  • Gini = 1 - (p^2 + q^2)
  • Gini(right) = 1 - ((0.2580)^2 + (0.7420)^2) = 0.38

In the same way you can calculate the Gini value of any node from the information shown in its box in the tree diagram, as in the sketch below.
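If you prefer not to read the boxes by hand, the fitted tree exposes the same numbers; a small sketch (the values should roughly match the hand calculations above if the root split is on Sex):

  t = clf.tree_
  nodes = {"root": 0, "left child": t.children_left[0], "right child": t.children_right[0]}
  for name, node in nodes.items():
      print(f"{name}: samples={t.n_node_samples[node]}, gini={t.impurity[node]:.2f}")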

Condition 1: Is Sex <= 0.5? The answer is male = 1.

        If the condition is true, move to the left.

        If the condition is false, move to the right.

        Here the condition fails (1 > 0.5), so move to the right.

Condition 2: Is Pclass <= 2.5? The answer is Pclass = 1.

        Since the answer (1) is less than the threshold, the condition is true; move to the left.

        The predicted class is 1 – survived.
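The two traced examples can also be checked programmatically (assumed feature order Pclass, Sex, Age); the exact predictions depend on the fitted tree, the sketch only shows the calls:

  passengers = [[3, 1, 35],   # 3rd class, male, age 35 (first traversal)
                [1, 1, 35]]   # 1st class, male, age 35 (second traversal)
  print(clf.predict(passengers))   # predicted classes: 0 = not survived, 1 = survived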

Binary class problem

Target Variable is Dichotomous (Yes(1) or No(0))

Criterion – Entropy

Load  and read data from local drive
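The pipeline is the same as for Gini; only the criterion changes. A sketch (parameters illustrative):

  clf_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
  clf_entropy.fit(X, y)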

Entropy Calculation

Root

  • Samples = 891
  • Not survived = 549; % = 549*100/891 = 61.62% = p
  • Survived = 342; % = 342*100/891 = 38.38% = q
  • p = 0.6162
  • q = 0.3838
  • Entropy = -p*log2(p) - q*log2(q)
  • Entropy = -0.6162*log2(0.6162) - 0.3838*log2(0.3838) = 0.96

LEFT  child

  • Samples = 577
  • Not survived = 468; % = 468*100/577 = 81.11% = p
  • Survived = 109; % = 109*100/577 = 18.89% = q
  • p = 0.8111
  • q = 0.1889
  • Entropy = -p*log2(p) - q*log2(q)
  • Entropy = -0.8111*log2(0.8111) - 0.1889*log2(0.1889) = 0.70

 

RIGHT child

  • Samples = 314
  • Not survived = 81; % = 81*100/314 = 25.80% = p
  • Survived = 233; % = 233*100/314 = 74.20% = q
  • p = 0.2580
  • q = 0.7420
  • Entropy = -p*log2(p) - q*log2(q)
  • Entropy = -0.2580*log2(0.2580) - 0.7420*log2(0.7420) = 0.82
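The list of classification criteria at the top also mentions information gain; it is simply the parent entropy minus the weighted entropy of the children. A small sketch using the root and child numbers above (values are approximate):

  parent, left, right = 0.96, 0.70, 0.82        # entropies computed above
  n_parent, n_left, n_right = 891, 577, 314     # node sample counts
  info_gain = parent - (n_left / n_parent) * left - (n_right / n_parent) * right
  print(round(info_gain, 2))   # about 0.22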

Binary class problem

Target Variable is Dichotomous (Yes(1) or No(0))

Criterion – Log_loss

We use the same data that we used for Gini and entropy.
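Again only the criterion string changes; in scikit-learn, log_loss and entropy are two names for the same Shannon-information criterion. A sketch (parameters illustrative):

  clf_ll = DecisionTreeClassifier(criterion="log_loss", max_depth=3, random_state=42)
  clf_ll.fit(X, y)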

Log-loss Calculation

log_loss = -(1/N) * ∑_i [ yi*log(pi) + (1 - yi)*log(1 - pi) ]

where

  • yi is the actual class label (0 or 1)
  • pi is the predicted probability that observation i belongs to class 1
  • N is the number of observations

Root

  • Samples = 891
  • Not survived = 549; % = 549*100/891 = 61.62% = p
  • Survived = 342; % = 342*100/891 = 38.38% = q
  • p = 0.6162
  • q = 0.3838
  • log_loss = -(1/N) * ∑ [ yi*log(pi) + (1 - yi)*log(1 - pi) ]
  • log_loss = -(1/891) * [ 549*log2(0.6162) + 342*log2(0.3838) ]
  •           = 0.957 ≈ 0.96 (with the log taken to base 2, this equals the entropy computed above)
  • In the same way you can calculate log_loss for the other parent and child nodes.

MULTI-CLASS PROBLEMS

Criterion: Gini

Here let us use the iris data.

Loading iris data

Load data from local drive

In a multi-class problem the target variable has more than two categories; here they are Setosa, Versicolor and Virginica. The independent variables can be categorical or numerical; here we have sepal.length, sepal.width, petal.length and petal.width.

Separate the features and the target
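A sketch of the multi-class case using scikit-learn's built-in copy of the iris data (a local CSV would work the same way); the Gini criterion and the parameters are illustrative:

  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  iris = load_iris()
  X_iris, y_iris = iris.data, iris.target   # 4 features, 3 classes (setosa, versicolor, virginica)

  clf_iris = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
  clf_iris.fit(X_iris, y_iris)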

 

For the calculation of Gini, refer to the section above.

Criterion: Entropy

For the calculation of entropy, refer to the section above.

Criterion: log_loss

Load and read data from local drive

Separate the features and the target

The log_loss calculation remains the same as shown above.
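To close, a short sketch fitting the same iris data with each classification criterion; the training-set accuracy printed here is only for illustration:

  for crit in ["gini", "entropy", "log_loss"]:
      model = DecisionTreeClassifier(criterion=crit, max_depth=3, random_state=42)
      model.fit(X_iris, y_iris)
      print(crit, round(model.score(X_iris, y_iris), 3))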