Here we use two functions namely
- DecisionTreeClassifier
- DecisionTreeRegressor
The DecisionTreeClassifier function is used when the target/dependent variable is qualitative: binary class (dichotomous, say Yes or No) or multi-class (multiple categories, say Low, Middle, High).
The DecisionTreeRegressor function is used when the target/dependent variable is numerical (say sales volume in dollars or price in dollars).
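Both functions are available in scikit-learn's tree module; a minimal import sketch:

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    clf = DecisionTreeClassifier()   # for a qualitative (binary or multi-class) target
    reg = DecisionTreeRegressor()    # for a numerical target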
Before using these decision tree functions, let us understand how the root node and internal nodes are split.
Decision Trees Splitting Criteria
We follow different criteria for classifier function and regressor function
- Classification criteria
- Entropy
- Information Gain
- Gini Index or Gini Impurity
- Regression criteria
- Mean Squared Error (MSE)
- Poisson Deviance
- Friedman MSE
- Mean Absolute Error (MAE)
Decision trees are built with a recursive algorithm. Metrics known as splitting criteria or Attribute Selection Measures (ASM) are used to evaluate and select the best feature at each step. The algorithm identifies the candidate threshold for a node that will be used as a separator to split that node. The root node is identified first, then the internal nodes, and finally the leaf nodes.
For classification we use the Gini index, entropy/information gain, and log_loss.
For regression we talk about variance reduction: Mean Squared Error (MSE), Mean Absolute Error (MAE), Friedman MSE and Poisson deviance.
Classification Criterion: Gini and Gini Index
Gini follows the impurity concept. It is used to measure the impurity at a node:
G = 1 - ∑ pj^2 (for two classes, G = 1 - (p^2 + q^2)), where pj is the proportion of class j at the node.
The frequencies of the two classes determine whether a node is pure or impure (homogeneous or non-homogeneous):
- If the frequencies of the two classes (0 and 1) are exactly equal, the node has a high degree of impurity
- If the frequencies of the two classes (0 and 1) are at the extremes (one class dominates), the node has a low degree of impurity
Gini
It measures the probability of misclassifying a randomly chosen element if it were labeled according to the distribution of classes in the dataset.
Example: suppose a dataset has 100 samples, 70 Yes (1) and 30 No (0). Then the probability of class 1 is 0.70 and of class 0 is 0.30.
G = 1 - (p^2 + q^2) = 1 - (0.70^2 + 0.30^2) = 1 - (0.49 + 0.09) = 0.42
A Gini impurity value of 0.42 tells us that there is a 42% chance of misclassifying a randomly chosen sample at this node.
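As a quick check, the same 70/30 example computed in Python (a minimal sketch):

    # Gini impurity for a node with class proportions p and q
    p, q = 0.70, 0.30
    gini = 1 - (p**2 + q**2)
    print(round(gini, 2))   # 0.42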
Gini Index
It is used to measure the impurity at a node:
Gini(k) = ∑j P(j|k) * (1 - P(j|k))
where P(j|k) is the proportion of category j in node k.
A smaller Gini index implies less impurity.
Gini Index Calculation:
Say the number of classes is 2 (0 and 1).
Consider a node (k) with 70 ones and 30 zeros.
Gini(k) = P(1|k)*(1 - P(1|k)) + P(0|k)*(1 - P(0|k)) = 0.70*0.30 + 0.30*0.70
= 2*0.70*0.30 = 0.42
This value is close to the maximum, which implies high impurity.
A node has maximum impurity when 50% of the samples are 1 and 50% are 0.
In that case the Gini index equals 2*0.5*0.5 = 0.50.
For two classes, the Gini index falls between 0 and 0.5.
Entropy:
Entropy is a measure from information theory that represents the level of impurity or disorder in a dataset.
It is another frequently used impurity measure. The entropy of node k is given by
Entropy(k) = -∑j P(j|k) * log2(P(j|k))
Entropy Calculation
Say the number of classes is 2 (0 and 1). Consider a node (k) with 70 zeros and 30 ones.
Entropy = -p*log2(p) - q*log2(q) = -0.70*log2(0.70) - 0.30*log2(0.30) = 0.88
- This entropy value indicates the uncertainty in the distribution
- A higher value indicates greater disorder
- A lower value indicates a clearer separation between classes
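The same 70/30 node computed as a short Python sketch:

    import math

    # Entropy of a node with class proportions p and q (log base 2)
    p, q = 0.70, 0.30
    entropy = -p * math.log2(p) - q * math.log2(q)
    print(round(entropy, 2))   # 0.88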
Using DecisionTreeClassifier function
Binary class problem
Target Variable is Dichotomous( Yes(1) or No(0))
Criterion – GINI
We deal with binary class problems where the dependent/target variable is dichotomous: Yes or No, 1 or 0 (for example, Sex: 1 male, 0 female). This data relates to the passengers of the Titanic. Survived is the target variable and contains Survived (1) and Not Survived (0) categories. The independent variables are Pclass (the class in which the passenger travelled), Sex (two categories: Male (1) and Female (0)) and Age, which is numerical. So the target variable and two of the independent variables are categorical, while Age is numerical.
Load data and read from local drive
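A minimal sketch of this step, assuming the data is a local CSV file named titanic.csv with the columns Survived, Pclass, Sex and Age (the actual file name and path may differ):

    import pandas as pd

    # Read the Titanic data from a local CSV file (hypothetical file name/path)
    df = pd.read_csv("titanic.csv")
    print(df.head())
    print(df["Survived"].value_counts())   # 0 = not survived, 1 = survived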


- Target variable Survived has two classes: 0 – not survived, 1 – survived
- Independent variables: 3; Pclass containing 3 classes (first class 1, second class 2, third class 3); Sex: 0 female and 1 male
- Age is numerical
Separate the features and the target
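A sketch of this step, assuming the column names above:

    # Independent variables (features) and the dependent variable (target)
    X = df[["Pclass", "Sex", "Age"]]
    y = df["Survived"]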

Using DecisionTreeClassifier function – criterion: Gini
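A sketch of fitting the classifier with the Gini criterion and drawing the tree; random_state is fixed only to make the result reproducible:

    from sklearn.tree import DecisionTreeClassifier, plot_tree
    import matplotlib.pyplot as plt

    # Fit a decision tree using the Gini criterion
    clf = DecisionTreeClassifier(criterion="gini", random_state=42)
    clf.fit(X, y)

    # Visualise the fitted tree
    plt.figure(figsize=(12, 6))
    plot_tree(clf, feature_names=["Pclass", "Sex", "Age"], class_names=["0", "1"], filled=True)
    plt.show()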


Predict whether the traveller survived or not by traversing the tree

- Condition 1: Check if Sex <= 0.5? The passenger's Sex = 1 (male)
- If the condition is true, move to the left
- If the condition is false, move to the right
- In this case the condition fails, so move to the right
- Condition 2: Check if Pclass <= 2.5? The passenger's Pclass = 3
- Since the value is greater than the threshold,
- the condition fails; move to the right again
- Condition 3: Check if Age <= 5? The passenger's Age = 35
- Since the value is greater than the threshold,
- the condition fails, so move to the right
- Predicted class is 0 – not survived
With this tree you can predict the class of the target variable by moving left or right until you reach the final classification for a new set of independent variable values.
Prediction Programmatically
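A sketch of the programmatic prediction for the passenger traversed above (Pclass = 3, Sex = 1, Age = 35); the exact probabilities depend on the data and the fitted tree:

    # New passenger: Pclass=3, Sex=1 (male), Age=35
    new_passenger = pd.DataFrame([[3, 1, 35]], columns=["Pclass", "Sex", "Age"])

    print(clf.predict(new_passenger))         # predicted class, e.g. [0]
    print(clf.predict_proba(new_passenger))   # probabilities for class 0 and class 1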

- The probability of non-survival y(0) is estimated at 89.40%, so the predicted class is 0 (not survived)
- The probability of survival y(1) is estimated at 10.60% (class 1)
Depth of the tree:
- The depth of the tree can be reduced or increased if you want. Remember that when you increase max_depth the tree grows heavily, which may lead to overfitting; avoid that.
- When max_depth is reduced to 2 in the line
- clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=42)
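A sketch of refitting and redrawing the shallower tree:

    # Refit with a smaller maximum depth to get a simpler tree
    clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=42)
    clf.fit(X, y)

    plt.figure(figsize=(8, 5))
    plot_tree(clf, feature_names=["Pclass", "Sex", "Age"], class_names=["0", "1"], filled=True)
    plt.show()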

GINI CALCULATION
Root
- Samples = 891
- Not survived = 549 % = 549*100/891 =61.62% = p
- Survived = 342 %= 342*100/891 =38.38%= q
- Gini = 1- (p^2+q^2)
- Gini = 1 - ((549/891)^2 + (342/891)^2) = 0.47
LEFT Child
- Samples = 577
- Not survived = 468 % = 468*100/577 =81.11% = p
- Survived = 109 %= 109*100/577 =18.89%= q
- Gini = 1- (p^2+q^2)
- Gini = 1 - ((0.8111)^2 + (0.1889)^2) = 0.31
RIGHT child
- Samples = 314
- Not survived = 81 % = 81*100/314 =25.80% = p
- Survived = 233 %= 233*100/314 =74.20%= q
- Gini = 1- (p^2+q^2)
- Gini = 1 - ((0.2580)^2 + (0.7420)^2) = 0.38
In the same way, you can calculate the Gini value from the information available in each box of the tree diagram.
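A short sketch that reproduces these node-level Gini values from the counts shown in the tree boxes:

    # Gini of a node given the class counts shown in the tree box
    def node_gini(not_survived, survived):
        n = not_survived + survived
        p, q = not_survived / n, survived / n
        return 1 - (p**2 + q**2)

    print(round(node_gini(549, 342), 2))   # root        -> 0.47
    print(round(node_gini(468, 109), 2))   # left child  -> 0.31
    print(round(node_gini(81, 233), 2))    # right child -> 0.38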


- Condition 1: Check if Sex <= 0.5? The passenger's Sex = 1 (male)
- If the condition is true, move to the left
- If the condition is false, move to the right
- In this case the condition fails, so move to the right
- Condition 2: Check if Pclass <= 2.5? The passenger's Pclass = 1
- Since the value is less than the threshold,
- the condition is true; move to the left
- Predicted class is 1 – survived

Binary class problem
Target Variable is Dichotomous( Yes(1) or No(0))
Criterion – Entropy
Load and read data from local drive
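A sketch of the same fit with the entropy criterion, assuming X and y are the Titanic features and target prepared earlier:

    # Same features/target, but split quality measured with entropy
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=42)
    clf_entropy.fit(X, y)

    plt.figure(figsize=(12, 6))
    plot_tree(clf_entropy, feature_names=["Pclass", "Sex", "Age"], class_names=["0", "1"], filled=True)
    plt.show()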



Entropy Calculation
Root
- Samples = 891
- Not survived = 549 % = 549*100/891 =61.62% = p
- Survived = 342 %= 342*100/891 =38.38%= q
- p = 0.6162
- q = 0.3838
- Entropy = -p*log2(p) – q*log2(q)
- Entropy = -0.6162*log2(0.6162) - 0.3838*log2(0.3838) = 0.96
LEFT child
- Samples = 577
- Not survived = 468 % = 468*100/577 =81.11% = p
- Survived = 109 %= 109*100/577 =18.89%= q
- p = 0.8111
- q = 0.1889
- Entropy = -p*log2(p) – q*log2(q)
- Entropy = -0.8111*log2(0.8111) - 0.1889*log2(0.1889) = 0.70
RIGHT child
- Samples = 314
- Not survived = 81 % = 81*100/314 =25.80% = p
- Survived = 233 %= 233*100/314 =74.20%= q
- p = 0.2580
- q = 0.7420
- Entropy = -p*log2(p) – q*log2(q)
- Entropy = -0.2580*log2(0.2580) - 0.7420*log2(0.7420) = 0.82
Binary class problem
Target Variable is Dichotomous( Yes(1) or No(0))
Criterion – Log_loss
We use the same data as used for Gini and entropy.
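A sketch of the same Titanic fit with the log_loss criterion, assuming X and y as prepared earlier; this criterion is available in recent scikit-learn versions, where it is documented as equivalent to the entropy criterion:

    # Same features/target, split quality measured with log loss
    clf_log = DecisionTreeClassifier(criterion="log_loss", random_state=42)
    clf_log.fit(X, y)

    plt.figure(figsize=(12, 6))
    plot_tree(clf_log, feature_names=["Pclass", "Sex", "Age"], class_names=["0", "1"], filled=True)
    plt.show()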


Log-loss Calculation
Log loss = -(1/N) * ∑ [ yi*log(pi) + (1 - yi)*log(1 - pi) ]
where
- yi is the actual class label (0 or 1)
- pi is the predicted probability of class 1 for observation i
- N is the number of observations
root
- Samples = 891
- Not survived = 549 % = 549*100/891 =61.62% = p
- Survived = 342 %= 342*100/891 =38.38%= q
- p = 0.6162
- q = 0.3838
- log_loss = -(1/N) * ∑[ yi*log2(pi) + (1 - yi)*log2(1 - pi) ]
- log_loss = -(1/891) * [ 342*log2(0.3838) + 549*log2(0.6162) ]
- = 0.957 ≈ 0.96, which matches the entropy value computed above (log base 2 is used here, as in the entropy calculation)
- In the same way, you can calculate the log_loss for the other parent and child nodes
MULTI-CLASS PROBLEMS
Criterion: Gini
Here let us use the Iris data.
Loading iris data
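One way to load the Iris data is a sketch using the copy bundled with scikit-learn (as_frame requires a reasonably recent scikit-learn version):

    from sklearn.datasets import load_iris
    import pandas as pd

    # Iris data bundled with scikit-learn
    iris = load_iris(as_frame=True)
    df_iris = iris.frame            # feature columns plus a numeric 'target' column (0, 1, 2)
    print(df_iris.head())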


Load data from local drive
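Alternatively, a sketch of reading a local CSV, assuming a file named iris.csv with the columns sepal.length, sepal.width, petal.length, petal.width and variety (the actual file name and column names may differ):

    import pandas as pd

    # Read the Iris data from a local CSV file (hypothetical file name and columns)
    df = pd.read_csv("iris.csv")
    print(df.head())
    print(df["variety"].value_counts())   # Setosa, Versicolor, Virginica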

In a multi-class problem the target variable has more than 2 categories; here we have Setosa, Versicolor and Virginica. The independent variables can be categorical or numerical; here we have sepal.length, sepal.width, petal.length and petal.width.
Separate the features and the target
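A sketch of this step and of the multi-class fit with the Gini criterion, assuming the hypothetical column names from the local CSV above:

    # Features and target for the multi-class problem (hypothetical column names)
    X = df[["sepal.length", "sepal.width", "petal.length", "petal.width"]]
    y = df["variety"]

    from sklearn.tree import DecisionTreeClassifier, plot_tree
    import matplotlib.pyplot as plt

    clf = DecisionTreeClassifier(criterion="gini", random_state=42)
    clf.fit(X, y)

    plt.figure(figsize=(12, 6))
    plot_tree(clf, feature_names=list(X.columns), class_names=list(clf.classes_), filled=True)
    plt.show()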


For the calculation of Gini, refer to the section above.
Criterion: Entropy
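A sketch of the same multi-class fit with the entropy criterion, assuming the X and y prepared above:

    # Multi-class fit with entropy as the splitting criterion
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=42)
    clf_entropy.fit(X, y)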


For the calculation of entropy, refer to the section above.
Criterion: log_loss
Load and read data from local drive

Separate the features and the target
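A sketch assuming the same X and y prepared above; criterion='log_loss' is available in recent scikit-learn versions:

    # Multi-class fit with log loss as the splitting criterion
    clf_log = DecisionTreeClassifier(criterion="log_loss", random_state=42)
    clf_log.fit(X, y)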



The log loss calculation remains the same as shown above.
