Steps Involved

  • Import data
  • See the data view and variable view
  • Provide codes to qualitative/categorical variables in variable view
  • Report Code book
  • Descriptive statistics
  • Analysis
  • Correlate
  • Regression
  • Classify
  • Graph

LIBRARY USED: STATSMODELS.API

Before importing the libraries you must have the packages installed on your system. After installing the latest Python 3.13.2, downloaded from https://www.python.org/downloads, open the command-line interface on your Windows system: type cmd in the search box, then run the Command Prompt as administrator. In the CLI, change to the directory where Python is installed, go to its Scripts subdirectory, and run the following command

D:\python313\Scripts> pip install pandas numpy statsmodels jupyter

Then open Jupyter Notebook

D:\python313>jupyter notebook

Jupyter Notebook will open in your default browser (e.g. Microsoft Edge) at the URL localhost:8888/tree

Now click New and select Python 3 (ipykernel)

The Jupyter Notebook editor opens up

Load the libraries

Read Data

We use the same data, Cellphone.csv, containing one dependent variable, 'Price', which is continuous in nature. There are 13 explanatory variables. We have to predict the price of a cellphone from an equation that may contain all 13 explanatory variables or a reduced set of them. The data is stored in a DataFrame object called df
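A sketch of the read step. Since Cellphone.csv is not distributed with this text, the snippet first writes a tiny stand-in CSV with invented column names so that it runs anywhere; in the tutorial itself you would simply call pd.read_csv("Cellphone.csv"):

```python
import pandas as pd

# Stand-in for Cellphone.csv (not distributed here): write a tiny CSV, then read it back.
csv_text = "Price,Sale,ram\n2357,10,4\n1749,10,4\n1916,11,2\n"
with open("Cellphone_demo.csv", "w") as f:
    f.write(csv_text)

df = pd.read_csv("Cellphone_demo.csv")   # in the tutorial: pd.read_csv("Cellphone.csv")
print(df.head())
```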

Information about the data we handle

To see the variable names, data types, and any null values, pass the command df.info()

In total 14 variables are used (1 dependent and 13 explanatory). The number of observations handled is 161, and the types can be int64, float64, or categorical
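A minimal illustration of df.info(); the DataFrame below is a synthetic stand-in with invented columns (the real data has 161 rows and 14 columns):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with 161 rows, mimicking the real data's size
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Price": rng.normal(2000, 500, 161),                  # float64
    "Sale": rng.integers(100, 1000, 161).astype("int64"),  # int64
    "ram": rng.integers(1, 8, 161).astype("int64"),        # int64
})

df.info()   # prints column names, non-null counts, and dtypes
```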

Describe the data

Use object.describe() – the describe function. This gives, for each feature, the number of observations, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum

If you want to see the above with features as rows, use object.describe().T (the transpose)
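For example, on a small synthetic DataFrame (invented values, not the cellphone data):

```python
import pandas as pd

df = pd.DataFrame({"Price": [1000.0, 1500.0, 2000.0, 2500.0],
                   "Sale": [10, 20, 30, 40]})

# .T transposes so each feature becomes a row:
# count, mean, std, min, 25%, 50%, 75%, max as columns
summary = df.describe().T
print(summary)
```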

Find the shape of the data
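The shape is read off the .shape attribute (for Cellphone.csv this would be (161, 14)); a small synthetic example:

```python
import pandas as pd

df = pd.DataFrame({"Price": [1000, 1500, 2000], "Sale": [10, 20, 30]})
print(df.shape)   # (number of rows, number of columns)
```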

PRE-PROCESSING THE DATA: Find if any null values exist

Use object.isnull().sum()

No null value exists here. If null or missing values were present, model fitting would fail with an error, so they must be handled (dropped or imputed) before proceeding
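A quick sketch of the check, on a synthetic DataFrame that deliberately contains one missing value; dropping rows is shown as one common fix (imputation is another):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Price": [1000.0, np.nan, 2000.0], "Sale": [10, 20, 30]})

missing = df.isnull().sum()   # count of missing values per column
print(missing)

df = df.dropna()              # drop rows with missing values
assert df.isnull().sum().sum() == 0
```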

Separate dependent and independent variables from df

The dependent variable will be called the target variable; the independent variables will be called feature variables. Use the object.drop() function with axis=1 to drop the target variable 'Price' from df and assign the result to X. Assign df['Price'] to y
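This step, on a synthetic stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Price": [1000, 1500, 2000],
                   "Sale": [10, 20, 30],
                   "ram": [2, 4, 8]})

X = df.drop("Price", axis=1)   # feature matrix: every column except the target
y = df["Price"]                # target variable
print(X.columns.tolist(), y.name)
```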

Find mean, variance, standard deviation of all features using program
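The original program is not reproduced here; pandas computes all three per-feature statistics directly (note that .var() and .std() use the sample versions, ddof=1, by default):

```python
import pandas as pd

X = pd.DataFrame({"Sale": [10.0, 20.0, 30.0], "ram": [2.0, 4.0, 6.0]})

print(X.mean())   # per-feature mean
print(X.var())    # per-feature sample variance (ddof=1)
print(X.std())    # per-feature sample standard deviation
```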

Output

 

Convert X and y to arrays of float64 type
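One way to do the conversion with NumPy:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"Sale": [10, 20, 30], "ram": [2, 4, 8]})
y = pd.Series([1000, 1500, 2000], name="Price")

# asarray with an explicit dtype converts integer columns to float64
X_arr = np.asarray(X, dtype=np.float64)
y_arr = np.asarray(y, dtype=np.float64)
```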

Ordinary Least Square using Statsmodels.api

import statsmodels.api as sm

Find metrics using statsmodels.api

Once Again Read Data:

Standard Error of Betas using Python Program

Library Used: sklearn, without splitting the data
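A sketch of the no-split sklearn fit on synthetic data: the model is trained on all rows, so the R-squared reported by .score() is measured on the same data it was fit on:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(161, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 161)

reg = LinearRegression().fit(X, y)   # fit on ALL rows: no train/test split
print(reg.intercept_, reg.coef_)     # model parameters
print(reg.score(X, y))               # R-squared on the training data itself
```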

Box Plot

Bar Plot

Linear method (sales vs price)

sns.lmplot(data=df, x='Sale', y='Price')

Residual plot

Metrics

output

Using sklearn with splitting of the given data

Once again read data

Separate features and target

Split data into training and test data sets
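A sketch of the split-and-fit steps on synthetic data; test_size and random_state are illustrative choices (the tutorial's screenshot values are not shown):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(161, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 161)

# Hold out 30% as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
print(reg.intercept_, reg.coef_)   # model parameters from the training data only
```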

Model Parameters(Coefficients and intercepts)

Metrics for split method
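With a split, the metrics are computed on the held-out test set only; a sketch using sklearn's metric functions on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(161, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 161)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
pred = LinearRegression().fit(X_train, y_train).predict(X_test)

mae = mean_absolute_error(y_test, pred)            # mean absolute error
rmse = np.sqrt(mean_squared_error(y_test, pred))   # root mean squared error
r2 = r2_score(y_test, pred)                        # R-squared on unseen data
print(mae, rmse, r2)
```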

Lazy predict – Supervised Learning

Install LazyPredict using pip install lazypredict

IDENTIFY WHICH MODEL IS BEST AND WHICH MODEL IS WORST  FOR THE GIVEN DATASET

results_df1 = pd.DataFrame(models, columns=['Model','Adjusted R-squared','R-squared','RMSE','Time Taken'])

results_format_df1 = results_df1.style.highlight_max(subset=['Adjusted R-squared','R-squared'], color='green').highlight_min(subset=['Adjusted R-squared','R-squared'], color='red')

display(results_format_df1)

Predictions

Load libraries

Accuracy  of different models:

Linear Regression

Lasso Regression (L1 Regularization)

Ridge Regression(L2 Regularization)

Decision Tree Regression Model

Random Forest Regression Model

Gradient Boosting Regression Model

XGBoost Regression Model

Bagging Regression Model

K-Nearest Neighbours Regression Model

CatBoost Regression Model

Combine all results
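The per-model code above is not reproduced here; the sketch below fits the sklearn members of that list in a loop on synthetic data and combines the test-set R-squared values into one DataFrame. XGBRegressor and CatBoostRegressor fit the same way but need pip install xgboost catboost first, so they are only noted in a comment:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              BaggingRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(161, 5))
y = X @ np.array([2.0, -1.0, 0.5, 3.0, -2.0]) + rng.normal(0, 0.5, 161)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Lasso (L1)": Lasso(alpha=0.1),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Bagging": BaggingRegressor(random_state=42),
    "KNN": KNeighborsRegressor(),
    # XGBRegressor() and CatBoostRegressor() would go here after
    # `pip install xgboost catboost`
}

# Fit each model and collect its test-set R-squared in one table
rows = [{"Model": name,
         "R-squared": r2_score(y_test, m.fit(X_train, y_train).predict(X_test))}
        for name, m in models.items()]
results = pd.DataFrame(rows).sort_values("R-squared", ascending=False)
print(results)
```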

Results:

The XGBoost regression model gives the highest R-squared (98.42%) and the KNN model gives the worst R-squared (16.32%)