Here we use two functions namely
- DecisionTreeClassifier
- DecisionTreeRegressor
The DecisionTreeClassifier function is used when the target/dependent variable is qualitative: binary class (dichotomous, say Yes or No) or multi-class (multiple categories, say Low, Middle, High).
The DecisionTreeRegressor function is used when the target/dependent variable is numerical (say sales volume in dollars or price in dollars).
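Both functions are available in scikit-learn's tree module; a minimal import sketch:

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    clf = DecisionTreeClassifier()   # for a qualitative (binary or multi-class) target
    reg = DecisionTreeRegressor()    # for a numerical target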
Before using these decision tree functions, let us understand how the root node and internal nodes are split.
Decision Trees Splitting Criteria
We follow different criteria for classifier function and regressor function
- Classification criteria
- Entropy
- Information Gain
- Gini Index or Gini Impurity
- Regression criteria
- Mean Squared Error (MSE)
- Poisson Deviance
- Friedman MSE
- Mean Absolute Error (MAE)
Decision trees are built with a recursive algorithm. Metrics known as splitting criteria or Attribute Selection Measures (ASM) are used to evaluate and select the best feature at each step. The algorithm identifies the candidate threshold for a node that will be used as a separator to split that node. The root node is identified first, then the internal nodes, and finally the leaf nodes.
For classification we use the Gini index, entropy/information gain, and log_loss.
For regression we talk about variance reduction: Mean Squared Error (MSE), Mean Absolute Error (MAE), Friedman MSE and Poisson deviance.
Classification Criterion: Gini and Gini Index
Gini follows the impurity concept. It is used to measure the impurity at a node:
G = 1 - ∑ pj^2 (for two classes, G = 1 - (p^2 + q^2)), where pj is the proportion of class j at the node.
The frequencies of the two classes determine whether a node is pure or impure (homogeneous or non-homogeneous):
- If the frequencies of the two classes (0 and 1) are exactly equal, the node has a high degree of impurity
- If the frequencies of the two classes (0 and 1) are at the extremes (one class dominates), the node has a low degree of impurity
Gini
It measures the probability of misclassifying a randomly chosen element if it were labeled according to the distribution of classes in the dataset.
Example: suppose a dataset has 100 samples, 70 Yes (1) and 30 No (0). Then the probability of class 1 is 0.70 and of class 0 is 0.30.
G = 1 - (p^2 + q^2) = 1 - (0.70^2 + 0.30^2) = 1 - (0.49 + 0.09) = 0.42
A Gini impurity value of 0.42 tells us that there is a 42% chance of misclassifying a randomly chosen sample at this node.
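As a quick check, the same 70/30 example computed in Python (a minimal sketch):

    # Gini impurity for a node with class proportions p and q
    p, q = 0.70, 0.30
    gini = 1 - (p**2 + q**2)
    print(round(gini, 2))   # 0.42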
Gini Index
It is used to measure the impurity at a node:
Gini(k) = ∑j P(j|k) * (1 - P(j|k))
where P(j|k) is the proportion of category j in node k.
A smaller Gini index implies less impurity.
Gini Index Calculation:
Say the number of classes is 2 (0 and 1).
Consider a node (k) with 70 ones and 30 zeros.
Gini(k) = P(1|k)*(1 - P(1|k)) + P(0|k)*(1 - P(0|k)) = 0.70*0.30 + 0.30*0.70
= 2*0.70*0.30 = 0.42
This value is close to the maximum, which implies high impurity.
A node has maximum impurity when 50% of the samples are 1 and 50% are 0.
In that case the Gini index equals 2*0.5*0.5 = 0.50.
For two classes, the Gini index falls between 0 and 0.5.
Entropy:
Entropy is a measure from information theory that represents the level of impurity or disorder in a dataset.
It is another frequently used impurity measure. The entropy of node k is given by
Entropy(k) = -∑j P(j|k) * log2(P(j|k))
Entropy Calculation
Say the number of classes is 2 (0 and 1). Consider a node (k) with 70 zeros and 30 ones.
Entropy = -p*log2(p) - q*log2(q) = -0.70*log2(0.70) - 0.30*log2(0.30) = 0.88
- This entropy value indicates the uncertainty in the distribution
- A higher value indicates greater disorder
- A lower value indicates a clearer separation between classes
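The same 70/30 node computed as a short Python sketch:

    import math

    # Entropy of a node with class proportions p and q (log base 2)
    p, q = 0.70, 0.30
    entropy = -p * math.log2(p) - q * math.log2(q)
    print(round(entropy, 2))   # 0.88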
Using DecisionTreeClassifier function
Binary class problem
Target Variable is Dichotomous( Yes(1) or No(0))
Criterion – GINI
We deal with binary class problems where the dependent/target variable is dichotomous: Yes or No, 1 or 0 (for example, Sex: 1 male, 0 female). This data relates to the passengers of the Titanic. Survived is the target variable and contains Survived (1) and Not Survived (0) categories. The independent variables are Pclass (the class in which the passenger travelled), Sex (two categories: Male (1) and Female (0)) and Age, which is numerical. So the target variable and two of the independent variables are categorical, while Age is numerical.
Load data and read from local drive
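A minimal sketch of this step, assuming the data is a local CSV file named titanic.csv with the columns Survived, Pclass, Sex and Age (the actual file name and path may differ):

    import pandas as pd

    # Read the Titanic data from a local CSV file (hypothetical file name/path)
    df = pd.read_csv("titanic.csv")
    print(df.head())
    print(df["Survived"].value_counts())   # 0 = not survived, 1 = survived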


- Target variable Survived has two classes: 0 – not survived, 1 – survived
- Independent variables: 3; Pclass containing 3 classes (first class 1, second class 2, third class 3); Sex: 0 female and 1 male
- Age is numerical
Separate the features and the target
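A sketch of this step, assuming the column names above:

    # Independent variables (features) and the dependent variable (target)
    X = df[["Pclass", "Sex", "Age"]]
    y = df["Survived"]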

Using DecisionTreeClassifier function – criterion: Gini
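A sketch of fitting the classifier with the Gini criterion and drawing the tree; random_state is fixed only to make the result reproducible:

    from sklearn.tree import DecisionTreeClassifier, plot_tree
    import matplotlib.pyplot as plt

    # Fit a decision tree using the Gini criterion
    clf = DecisionTreeClassifier(criterion="gini", random_state=42)
    clf.fit(X, y)

    # Visualise the fitted tree
    plt.figure(figsize=(12, 6))
    plot_tree(clf, feature_names=["Pclass", "Sex", "Age"], class_names=["0", "1"], filled=True)
    plt.show()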


Predict whether the traveller survived or not by traversing the tree

- Condition 1: Check if Sex <= 0.5? The passenger's Sex = 1 (male)
- If the condition is true, move to the left
- If the condition is false, move to the right
- In this case the condition fails, so move to the right
- Condition 2: Check if Pclass <= 2.5? The passenger's Pclass = 3
- Since the value is greater than the threshold,
- the condition fails; move to the right again
- Condition 3: Check if Age <= 5? The passenger's Age = 35
- Since the value is greater than the threshold,
- the condition fails, so move to the right
- Predicted class is 0 – not survived
With this tree you can predict the class of the target variable by moving left or right until you reach the final classification for a new set of independent variable values.
Prediction Programmatically
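A sketch of the programmatic prediction for the passenger traversed above (Pclass = 3, Sex = 1, Age = 35); the exact probabilities depend on the data and the fitted tree:

    # New passenger: Pclass=3, Sex=1 (male), Age=35
    new_passenger = pd.DataFrame([[3, 1, 35]], columns=["Pclass", "Sex", "Age"])

    print(clf.predict(new_passenger))         # predicted class, e.g. [0]
    print(clf.predict_proba(new_passenger))   # probabilities for class 0 and class 1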

- The probability of non-survival y(0) is estimated at 89.40%, so the predicted class is 0 (not survived)
- The probability of survival y(1) is estimated at 10.60% (class 1)
Depth of the tree:
- The depth of the tree can be reduced or increased if you want. Remember that when you increase max_depth the tree grows heavily, which may lead to overfitting; avoid that.
- When max_depth is reduced to 2 in the line
- clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=42)
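A sketch of refitting and redrawing the shallower tree:

    # Refit with a smaller maximum depth to get a simpler tree
    clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=42)
    clf.fit(X, y)

    plt.figure(figsize=(8, 5))
    plot_tree(clf, feature_names=["Pclass", "Sex", "Age"], class_names=["0", "1"], filled=True)
    plt.show()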

GINI CALCULATION
Root
- Samples = 891
- Not survived = 549 % = 549*100/891 =61.62% = p
- Survived = 342 %= 342*100/891 =38.38%= q
- Gini = 1- (p^2+q^2)
- Gini = 1 - ((549/891)^2 + (342/891)^2) = 0.47
LEFT Child
- Samples = 577
- Not survived = 468 % = 468*100/577 =81.11% = p
- Survived = 109 %= 109*100/577 =18.89%= q
- Gini = 1- (p^2+q^2)
- Gini = 1 - ((0.8111)^2 + (0.1889)^2) = 0.31
RIGHT child
- Samples = 314
- Not survived = 81 % = 81*100/314 =25.80% = p
- Survived = 233 %= 233*100/314 =74.20%= q
- Gini = 1- (p^2+q^2)
- Gini = 1 - ((0.2580)^2 + (0.7420)^2) = 0.38
In the same way, you can calculate the Gini value from the information available in each box of the tree diagram.
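A short sketch that reproduces these node-level Gini values from the counts shown in the tree boxes:

    # Gini of a node given the class counts shown in the tree box
    def node_gini(not_survived, survived):
        n = not_survived + survived
        p, q = not_survived / n, survived / n
        return 1 - (p**2 + q**2)

    print(round(node_gini(549, 342), 2))   # root        -> 0.47
    print(round(node_gini(468, 109), 2))   # left child  -> 0.31
    print(round(node_gini(81, 233), 2))    # right child -> 0.38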


- Condition 1: Check if Sex <= 0.5? The passenger's Sex = 1 (male)
- If the condition is true, move to the left
- If the condition is false, move to the right
- In this case the condition fails, so move to the right
- Condition 2: Check if Pclass <= 2.5? The passenger's Pclass = 1
- Since the value is less than the threshold,
- the condition is true; move to the left
- Predicted class is 1 – survived

Binary class problem
Target Variable is Dichotomous( Yes(1) or No(0))
Criterion – Entropy
Load and read data from local drive
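A sketch of the same fit with the entropy criterion, assuming X and y are the Titanic features and target prepared earlier:

    # Same features/target, but split quality measured with entropy
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=42)
    clf_entropy.fit(X, y)

    plt.figure(figsize=(12, 6))
    plot_tree(clf_entropy, feature_names=["Pclass", "Sex", "Age"], class_names=["0", "1"], filled=True)
    plt.show()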



Entropy Calculation
Root
- Samples = 891
- Not survived = 549 % = 549*100/891 =61.62% = p
- Survived = 342 %= 342*100/891 =38.38%= q
- p = 0.6162
- q = 0.3838
- Entropy = -p*log2(p) – q*log2(q)
- Entropy = -0.6162*log2(0.6162) - 0.3838*log2(0.3838) = 0.96
LEFT child
- Samples = 577
- Not survived = 468 % = 468*100/577 =81.11% = p
- Survived = 109 %= 109*100/577 =18.89%= q
- p = 0.8111
- q = 0.1889
- Entropy = -p*log2(p) – q*log2(q)
- Entropy = -0.8111*log2(0.8111) - 0.1889*log2(0.1889) = 0.70
RIGHT child
- Samples = 314
- Not survived = 81 % = 81*100/314 =25.80% = p
- Survived = 233 %= 233*100/314 =74.20%= q
- p = 0.2580
- q = 0.7420
- Entropy = -p*log2(p) – q*log2(q)
- Entropy = -0.2580*log2(0.2580) - 0.7420*log2(0.7420) = 0.82
Binary class problem
Target Variable is Dichotomous( Yes(1) or No(0))
Criterion – Log_loss
We use the same data as used for Gini and entropy.
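A sketch of the same Titanic fit with the log_loss criterion, assuming X and y as prepared earlier; this criterion is available in recent scikit-learn versions, where it is documented as equivalent to the entropy criterion:

    # Same features/target, split quality measured with log loss
    clf_log = DecisionTreeClassifier(criterion="log_loss", random_state=42)
    clf_log.fit(X, y)

    plt.figure(figsize=(12, 6))
    plot_tree(clf_log, feature_names=["Pclass", "Sex", "Age"], class_names=["0", "1"], filled=True)
    plt.show()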


Log-loss Calculation
Log loss = -(1/N) * ∑ [ yi*log(pi) + (1 - yi)*log(1 - pi) ]
where
- yi is the actual class label (0 or 1)
- pi is the predicted probability of class 1 for observation i
- N is the number of observations
root
- Samples = 891
- Not survived = 549 % = 549*100/891 =61.62% = p
- Survived = 342 %= 342*100/891 =38.38%= q
- p = 0.6162
- q = 0.3838
- log_loss = -(1/N) * ∑[ yi*log2(pi) + (1 - yi)*log2(1 - pi) ]
- log_loss = -(1/891) * [ 342*log2(0.3838) + 549*log2(0.6162) ]
- = 0.957 ≈ 0.96, which matches the entropy value computed above (log base 2 is used here, as in the entropy calculation)
- In the same way, you can calculate the log_loss for the other parent and child nodes
MULTI-CLASS PROBLEMS
Criterion: Gini
Here let us use the Iris data.
Loading iris data
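One way to load the Iris data is a sketch using the copy bundled with scikit-learn (as_frame requires a reasonably recent scikit-learn version):

    from sklearn.datasets import load_iris
    import pandas as pd

    # Iris data bundled with scikit-learn
    iris = load_iris(as_frame=True)
    df_iris = iris.frame            # feature columns plus a numeric 'target' column (0, 1, 2)
    print(df_iris.head())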


Load data from local drive
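Alternatively, a sketch of reading a local CSV, assuming a file named iris.csv with the columns sepal.length, sepal.width, petal.length, petal.width and variety (the actual file name and column names may differ):

    import pandas as pd

    # Read the Iris data from a local CSV file (hypothetical file name and columns)
    df = pd.read_csv("iris.csv")
    print(df.head())
    print(df["variety"].value_counts())   # Setosa, Versicolor, Virginica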

In a multi-class problem the target variable has more than 2 categories; here we have Setosa, Versicolor and Virginica. The independent variables can be categorical or numerical; here we have sepal.length, sepal.width, petal.length and petal.width.
Separate the features and the target
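A sketch of this step and of the multi-class fit with the Gini criterion, assuming the hypothetical column names from the local CSV above:

    # Features and target for the multi-class problem (hypothetical column names)
    X = df[["sepal.length", "sepal.width", "petal.length", "petal.width"]]
    y = df["variety"]

    from sklearn.tree import DecisionTreeClassifier, plot_tree
    import matplotlib.pyplot as plt

    clf = DecisionTreeClassifier(criterion="gini", random_state=42)
    clf.fit(X, y)

    plt.figure(figsize=(12, 6))
    plot_tree(clf, feature_names=list(X.columns), class_names=list(clf.classes_), filled=True)
    plt.show()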


For the calculation of Gini, refer to the section above.
Criterion: Entropy
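A sketch of the same multi-class fit with the entropy criterion, assuming the X and y prepared above:

    # Multi-class fit with entropy as the splitting criterion
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=42)
    clf_entropy.fit(X, y)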


For the calculation of entropy, refer to the section above.
Criterion: log_loss
Load and read data from local drive

Separate the features and the target
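A sketch assuming the same X and y prepared above; criterion='log_loss' is available in recent scikit-learn versions:

    # Multi-class fit with log loss as the splitting criterion
    clf_log = DecisionTreeClassifier(criterion="log_loss", random_state=42)
    clf_log.fit(X, y)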



The log loss calculation remains the same as shown above.
