Introduction

Multiple Linear Regression (MLR) is a statistical technique used to model the relationship between a dependent (response) variable and several independent (explanatory or predictor) variables, enabling you to predict the value of the dependent variable. It differs from the Simple Linear Regression model. We must make certain assumptions before constructing the relationship equation, and we must know how to interpret the results we get. Here we also deal with the concepts of partial coefficients and part coefficients, as well as multicollinearity and autocorrelation. Our interest is in how this model is used in various real-world industries.

In this article, let us learn:

  1. The difference between Simple Linear Regression and Multiple Linear Regression
  2. The various stages involved in building an MLR model
  3. The assumptions we have to make
  4. How to include categorical variables in an MLR model
  5. How to use dummy variables
  6. How to interpret the MLR coefficient estimates
  7. The concepts of partial and part regression coefficients
  8. The impact of multicollinearity and autocorrelation on the model
  9. How to apply the MLR model across several industries

When our data contains one dependent (target) variable and more than one independent (explanatory) variable, we use a Multiple Linear Regression model. Our aim is to understand the relationship between the dependent variable and each of the available independent variables, and to know how the dependent variable changes when one or more independent variables are altered. We then create a model (formula) that can predict the value of the dependent variable from the values of the independent variables.

  • When data is given, first identify the target variable, the variable you want to predict. Then identify the variables that influence changes in the dependent variable.
  • Gather the data for the independent and dependent variables and organize it in tabular form: rows represent the observations you made, columns represent the variables.
  • Clean or pre-process the data: substitute missing values, detect outliers, encode the categorical variables, and normalize/scale the observations.
  • Visualize the data to see the relationship between the dependent and independent variables. You may use box plots, scatter plots, etc.
  • Make some assumptions: a) the relationship is linear, b) the errors follow a normal distribution, c) the variance is homoscedastic (constant).
  • Conduct statistical tests such as the t-test and F-test to check whether the assumptions we made hold. All diagnostic tests must pass before the model is employed in a real-life situation.
  • Using a tool such as R, Python, Excel, the Real Statistics package, or SPSS, fit a linear regression model to the given data. Estimate the regression coefficients for the independent variables (slopes of the line) and the constant (intercept of the line).
  • For a model fitted with an intercept, the sum of the residuals (the differences between observed and predicted values) is zero: ∑(yi − ŷi) = 0
  • The Sum of Squared Residuals (SSR or SSE) should be minimal for the best-fitted line. We usually cannot make the sum of squared residuals exactly zero, but we can minimize it; the Ordinary Least Squares (OLS) method does this. SSE represents the sum of the squared differences between the observed and predicted values of the dependent variable: minimize ∑(yi − ŷi)²
  • Interpret the fitted model. Analyze the regression coefficients, standard errors, t-tests, and p-values, and check whether the p-values are less than 0.05. This is done to reject or fail to reject the null hypothesis. Under the null hypothesis H0 we assume all betas are equal to zero; under the alternative hypothesis H1 we assume that at least one beta is not zero and is statistically significant.
  • If the p-value is less than 0.05, we reject the null hypothesis H0, as we have strong evidence against it, and accept the alternative hypothesis H1.
  • If the p-value is greater than 0.05, we fail to reject the null hypothesis, as we do not have strong evidence against it.
  • Validate the model: if you have a large dataset, split it into a training set and a testing set. Fit the linear regression model to the training set, then use the model to predict the dependent variable in the testing set. Calculate R², MSE, RMSE, and other metrics to assess the predictive accuracy of the formula/model.
  • Reporting: create a summary of your findings, the limitations of the model, etc.
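The fitting steps above can be sketched in Python with NumPy. The data here is synthetic and the variable names (house price predicted from area and age) are purely illustrative; the sketch fits the coefficients by least squares and checks the residual property mentioned above:

```python
import numpy as np

# Hypothetical data: predict price (y) from area and age (X).
rng = np.random.default_rng(0)
n = 100
area = rng.uniform(50, 200, n)
age = rng.uniform(0, 40, n)
y = 30 + 0.5 * area - 0.8 * age + rng.normal(0, 5, n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), area, age])

# Ordinary Least Squares: choose beta to minimize the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residuals = y - y_hat

print(beta)  # estimated intercept and slopes
# With an intercept in the model, OLS residuals sum to (numerically) zero.
print(residuals.sum())
```

In practice a statistics package (statsmodels in Python, `lm` in R) would also report the standard errors, t-tests, and p-values needed for the interpretation step.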

The steps can be condensed into:

  1. Estimation: estimation of the regression coefficients
  2. Inference: infer the significance and reliability of the coefficients
  3. Assumptions: linearity of the data, independence of errors, homoscedasticity, normality of errors
  4. Goodness of fit: R² and Adjusted R² state how well the line fits the data
  5. Predictive accuracy: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and similar metrics assess the accuracy of the predictions
  6. Interpretation: interpret the results using statistical tools
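As a quick illustration of the goodness-of-fit and predictive-accuracy metrics above, here is a short NumPy sketch. The observed and predicted values are made up, and p = 2 predictors is an assumption for the Adjusted R² formula:

```python
import numpy as np

# Hypothetical observed values and model predictions on a test set.
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3, 10.6])

n, p = len(y_true), 2                        # p = number of predictors (assumed)
sse = np.sum((y_true - y_pred) ** 2)         # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

mse = sse / n
rmse = np.sqrt(mse)
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(mse, rmse, r2, adj_r2)
```

Adjusted R² penalizes R² for each extra predictor, which is why it is preferred over plain R² when comparing MLR models with different numbers of variables.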

Difference between SLR and MLR

When a model includes several independent variables and one response variable, it is called a Multiple Linear Regression model. We believe that there is a relationship, and the relationship is given by

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

where β0, β1, β2, …, βn are the regression coefficients,

X1, X2, X3, …, Xn are the explanatory variables, and ε is the error term.
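Once the coefficients have been estimated, prediction is just the weighted sum in the equation above. A minimal numeric sketch, using hypothetical fitted coefficients and ignoring the error term:

```python
import numpy as np

# Hypothetical fitted coefficients: beta0 (intercept), beta1, beta2.
beta = np.array([30.0, 0.5, -0.8])

# One new observation of the explanatory variables X1, X2;
# the leading 1 multiplies the intercept.
x_new = np.array([1.0, 120.0, 10.0])

# y_hat = beta0 + beta1*X1 + beta2*X2
y_hat = beta @ x_new
print(y_hat)  # 30 + 0.5*120 - 0.8*10 = 82.0
```

In contrast, Simple Linear Regression keeps only one explanatory variable, Y = β0 + β1X1 + ε, so its prediction is a single slope term plus the intercept.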