Model Evaluation

Models:

Different types of models can be used for prediction, so which one is best for a given dataset? Let us examine this with the same data used in py-DecisioTreeRegressor-2: the Student Performance (Multiple Linear Regression) dataset, which you can download from Kaggle.

Load Libraries and Read data
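The loading step can be sketched as follows. This is a minimal sketch assuming pandas; the file name of the Kaggle download is not given in this post, so the example reads a few illustrative rows from an in-memory string instead. The column names are the short names used in the regression formula later in this post and may differ from the headers in the downloaded file.

```python
import io

import pandas as pd

# Illustrative rows only (not taken from the real dataset). The column names
# are the short names used in this post, which is an assumption about the file.
sample_csv = io.StringIO(
    "Hours,PreviScores,Extraactivities,Sleepinghrs,Practisedsamequest,PerformIndex\n"
    "6,80,1,7,3,70\n"
    "3,60,0,6,1,40\n"
    "9,90,1,8,5,95\n"
)

# With the real file, pass the path of the downloaded CSV to pd.read_csv instead.
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.head())  # first few records
```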

Separate the features and the target
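One way to separate the features from the target is sketched below. It assumes a pandas DataFrame whose target column is named PerformIndex, as in the formula later in this post; the three rows are illustrative, not real data.

```python
import pandas as pd

# Illustrative rows only; the real DataFrame comes from the downloaded file.
df = pd.DataFrame({
    "Hours": [6, 3, 9],
    "PreviScores": [80, 60, 90],
    "Extraactivities": [1, 0, 1],
    "Sleepinghrs": [7, 6, 8],
    "Practisedsamequest": [3, 1, 5],
    "PerformIndex": [70, 40, 95],
})

target_col = "PerformIndex"
X = df.drop(columns=[target_col])  # the five feature columns
y = df[target_col]                 # the target column

print(X.shape, y.shape)
```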

Ordinary Least Squares Method – Linear Regression using statsmodels

Summary

The important metrics are:

R-Squared: 0.989. Interpretation: 98.9% of the variation in the target variable PerformIndex is explained by the features. This is a good sign that the model produced by OLS fits the data well.

Predicted PerformIndex = -34.0756 + 2.8530*Hours + 1.0180*PreviScores + 0.6129*Extraactivities + 0.4806*Sleepinghrs + 0.1938*Practisedsamequest
Since the p-values of all features are less than 0.05, all features are statistically significant and each one influences PerformIndex. By the t-test, we reject the null hypothesis that a coefficient is zero for every feature.
The F-statistic is very large and its p-value is less than 0.05, so the model as a whole is statistically significant.
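To make the R-Squared figure above concrete, here is a small sketch of how the metric is computed: one minus the ratio of the residual sum of squares to the total sum of squares. The four values are made up for illustration and have nothing to do with the student dataset.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 31.0, 39.0]

print(r_squared(y_true, y_pred))  # SS_res = 10, SS_tot = 500, so R-squared = 0.98
```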

Method 2: Find R-Squared using sklearn

Here too, R-Squared works out to 98.9%, so the proportion of variance explained is close to 1.
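A sketch of the same measurement with scikit-learn is shown below, again on synthetic data rather than the student dataset. For regressors, the score method returns R-Squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration), fitted on all rows.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.5, size=200)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # R-squared on the data the model was fitted on

print(f"R-squared: {r2:.3f}")
```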

Coefficients and Intercept under sklearn
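In scikit-learn the fitted constant and coefficients are exposed as the intercept_ and coef_ attributes. The sketch below uses noiseless synthetic data with known, made-up coefficients, so the fit should recover them almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless synthetic target: intercept 1.5, coefficients 2.0 and -0.5.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

model = LinearRegression().fit(X, y)

print("Intercept:", model.intercept_)   # the constant term
print("Coefficients:", model.coef_)     # one value per feature, in column order
```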

We get the same result under both methods. With sklearn, we usually split the data into two sets, (1) a training set and (2) a test set, for both X and y.

Predict the value of PerformIndex using the formula
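The fitted formula can be applied directly, as in the sketch below. The intercept and coefficients are the ones reported in the OLS summary in this post; the student record is hypothetical, and coding Extraactivities as 1 for yes and 0 for no is an assumption.

```python
# Intercept and coefficients as reported in the OLS summary in this post.
INTERCEPT = -34.0756
COEFFICIENTS = {
    "Hours": 2.8530,
    "PreviScores": 1.0180,
    "Extraactivities": 0.6129,
    "Sleepinghrs": 0.4806,
    "Practisedsamequest": 0.1938,
}


def predict_perform_index(student):
    """Apply the linear formula to one student's feature values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * student[name] for name in COEFFICIENTS)


# Hypothetical student; Extraactivities is assumed to be coded 1 = yes, 0 = no.
student = {
    "Hours": 5,
    "PreviScores": 70,
    "Extraactivities": 1,
    "Sleepinghrs": 7,
    "Practisedsamequest": 3,
}

print(round(predict_perform_index(student), 4))  # 56.0079
```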

Use sklearn again, but split the data into two sets

Out of 10,000 records, 8,000 go to X_train and y_train and 2,000 go to X_test and y_test, by setting the test_size argument to 0.2 (20%).
We used the train_test_split function imported from sklearn.model_selection.
In the shapes of X_train and X_test, 5 is the number of features used.
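The split described above can be sketched as follows. Placeholder arrays of zeros stand in for the real features and target, since only the shapes matter here; the random_state value is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same dimensions as described: 10,000 rows, 5 features.
X = np.zeros((10000, 5))
y = np.zeros(10000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random_state is an assumed value
)

print(X_train.shape, y_train.shape)  # (8000, 5) (8000,)
print(X_test.shape, y_test.shape)    # (2000, 5) (2000,)
```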

Accuracy/R-Squared
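Each model below is scored the same way: fit on the training set, then measure R-Squared on the test set. A possible helper for that pattern is sketched here; the function name holdout_r2, the seed, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def holdout_r2(model, X, y, test_size=0.2, seed=0):
    """Fit on the training part and return R-squared on the held-out part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Synthetic data standing in for the student dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 1.0, size=300)

linear_r2 = holdout_r2(LinearRegression(), X, y)
print(f"Linear Regression test R-squared: {linear_r2:.4f}")
```

Any estimator with fit and predict methods can be passed in place of LinearRegression.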

Model 1 : Linear Regression

Model 2: Decision Tree Regressor

Model 3: Support Vector Machine

Model 4: Random Forest Regressor

Model 5: Gradient Boosting Regressor

Model 6: XGBoost Regressor

Model 7: Bagging Regressor

Model 8: k-Nearest Neighbours Regressor

Model 9: CatBoost Regressor

Consolidate
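One way to consolidate the scores is to fit every model in a loop and rank the test R-Squared values, as sketched below. This sketch covers only the seven models that ship with scikit-learn; XGBoost and CatBoost live in separate packages and are left out here. The data is synthetic and the hyperparameters are mostly defaults, so the ranking it prints says nothing about the results reported in this post.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the student dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 5.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Support Vector Regressor": SVR(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Bagging": BaggingRegressor(random_state=0),
    "k-Nearest Neighbours": KNeighborsRegressor(),
}

# Fit each model on the training set and record R-squared on the test set.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

# Rank from highest to lowest R-squared.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, r2 in ranked:
    print(f"{name:<26} {r2:.4f}")
```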

Result

Judged by R-Squared, Linear Regression is the best of the 9 models tested. Its formula for PerformIndex can be deployed for the given data for some time. The Decision Tree is the worst model because its R-Squared is the lowest among the 9 models tested.

MMK TECHNOLOGIES
