INTRODUCTION – REGRESSION

Let us learn

  1. How to construct and interpret a scatter diagram for two variables (SLR: one dependent and one independent variable) or more (MLR: one dependent and two or more independent variables)
  2. How a scatter diagram reveals the relationship between variables
  3. How the method of least squares fits the best line through a scatter diagram
  4. The terminology used in the Simple Linear Regression model
  5. The assumptions underlying that model
  6. How to determine the least squares regression line
  7. How to obtain point and interval estimates for the dependent (response) variable
  8. How to compute and interpret the coefficient of correlation
  9. The meaning of the coefficient of determination, and how to construct confidence intervals
  10. How to carry out hypothesis testing for the slope of the regression line, and how to test the significance of the correlation coefficient
  11. How to examine the appropriateness of the linear regression model using residual analysis
  12. How to verify the degree to which the underlying assumptions are met

Regression:

  1. The concept of regression was introduced by Sir Francis Galton.
  2. Regression is a statistical technique, used in finance, investing, and other disciplines, that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables).
  3. The correlation coefficient, together with a scatter diagram, shows us the relationship between two variables.
  4. We then try to draw a line that best represents the points.
  5. To fit the data, the line must pass close to all points and should have points on both sides.
  6. For this we use the technique of least squares.
  7. Using regression, we try to quantify the dependence of one variable on the other. Say X and Y are variables and Y depends on X; then the dependence is expressed in the form of an equation.

We deal with two types of variables: the dependent variable and one or more independent variables. These variables go by several different names.

We come across the following types of Regression.

  1. Linear Regression
  2. Lasso Regression
  3. Ridge Regression
  4. Polynomial regression
  5. Logistic Regression
  6. Decision Tree Regression
  7. Random Forest Regression
  8. Gradient Boosting Regression
  9. XGBoost Regression
  10. Bagging Regression
  11. K-Nearest Neighbour Regression
  12. CatBoost Regression
  13. Support Vector Regression

Usages of Regression Models:

Regression models are used in different functional areas of business management.

Finance:

  1. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering costs of capital.
  2. Probability of an account becoming an NPA (non-performing asset)
  3. Probability of Default
  4. Probability of bankruptcy
  5. Credit Risk

Marketing:

  1. Prediction of Sales in the next three months based on past sales
  2. Prediction of market share of a company
  3. Prediction of customer satisfaction
  4. Prediction of customer churn
  5. Prediction of customer lifetime value

Operations:

  1. Inventory management: prediction of inventory requirements
  2. Productivity prediction
  3. Prediction of efficiency

Human Resources:

  1. Prediction of job satisfaction among employees
  2. Prediction of attrition
  3. Probability that a given employee will leave the company

Difference between Mathematical and Statistical Relationships:

General form of the regression equation:

  Y = β0 + β1 X + e

For convenience, let us use 'a' for the intercept instead of β0 (beta-naught), 'b' for the slope instead of β1 (beta-one), and 'u' for the regression residual instead of e, so that the equation becomes

  Y = a + bX + u

Where:

Y = the variable that you are trying to predict (dependent variable)

X = the variable that you are using to predict Y (independent variable)

a = the intercept

b = the slope

u = the regression residual

Using regression, we try to find a mathematical relationship between the dependent and independent random variables. The relationship takes the form of a straight line (linear) that best approximates all the individual data points. In Simple Linear Regression we use one dependent variable and one independent variable; for example, salary as the independent variable and happiness as the dependent variable. Because there is a single predictor, this is sometimes called univariate regression. In Multiple Linear Regression we use one dependent variable and more than one independent variable.

Regression lines:

For a set of paired observations (say X and Y, where Y is the dependent random variable and X is the independent variable) we get two regression lines.

The line is drawn in such a way that

  a) the sum of the vertical deviations is zero, and
  b) the sum of their squares is a minimum.

Two lines:

  a) Regression line of Y on X: used to estimate the value of Y for a given X.
  b) Regression line of X on Y: used to estimate the value of X for a given Y.
  c) The smaller the angle between these two lines, the higher the correlation between the variables.

Regression Equations:

Regression Equation of Y on X:  Y - Ybar = byx (X - Xbar)

Regression Equation of X on Y:  X - Xbar = bxy (Y - Ybar)

Where:

Y is the observed value (Dependent variable)

Ybar is the mean of Y

X is the observed value (independent variable)

Xbar is the mean of X

byx is the regression coefficient for the line Y on X

bxy is the regression coefficient for the line X on Y

Regression line of Y on X:

Employment and output figures for 10 companies are given below. Find the regression line using output as the dependent variable and employment as the independent variable. Employment is represented by X and output by Y. Fit the regression lines using the least squares method.

Statistics – Using Least Square Method:

Let y = a + bx be the regression line of Y on X. The constants a and b can be found by solving the following normal equations:

  Σy = na + bΣx
  Σxy = aΣx + bΣx^2

For estimating y when x = 600, we use the regression line of Y on X.

Thus, you can estimate the value of the dependent variable when you know the value of X.
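The least squares fit described above can be sketched in Python. Note that the employment/output figures below are hypothetical stand-ins, since the original ten-company table is not reproduced here:

```python
# Least-squares fit of y = a + b*x by solving the normal equations:
#   sum(y)   = n*a + b*sum(x)
#   sum(x*y) = a*sum(x) + b*sum(x**2)
# The employment (x) and output (y) figures are hypothetical.

x = [300, 400, 450, 500, 550, 600, 650, 700, 750, 800]
y = [2200, 2800, 3000, 3400, 3600, 4100, 4300, 4700, 4900, 5300]

n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Solve the two normal equations for b (slope) and a (intercept).
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n

y_at_600 = a + b * 600   # point estimate of output when employment = 600
print(a, b, y_at_600)
```

Solving the normal equations this way is algebraically the same as the familiar slope formula b = Σ(x - Xbar)(y - Ybar) / Σ(x - Xbar)^2.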


Regression Line of X on Y:

Let x = a + by be the regression line of X on Y. The constants a and b can be found by solving the following normal equations:

  Σx = na + bΣy
  Σxy = aΣy + bΣy^2
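The X-on-Y line is fitted the same way with the roles of the variables swapped. A minimal sketch, using a small hypothetical data set:

```python
# Fitting the regression line of X on Y, x = a + b*y, via its normal equations:
#   sum(x)   = n*a + b*sum(y)
#   sum(x*y) = a*sum(y) + b*sum(y**2)
# The paired data below are hypothetical.

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
sx, sy = sum(x), sum(y)
syy = sum(yi * yi for yi in y)
sxy = sum(xi * yi for xi, yi in zip(x, y))

b = (n * sxy - sx * sy) / (n * syy - sy * sy)   # this slope is bxy
a = (sx - b * sy) / n

x_at_y4 = a + b * 4    # estimate of X when Y = 4
print(a, b, x_at_y4)
```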


Calculation of regression coefficients:

The regression coefficients are related to the correlation coefficient r and the standard deviations of X and Y as follows.

Regression coefficient for the regression line of Y on X:

  byx = r (σy / σx) = Σ(X - Xbar)(Y - Ybar) / Σ(X - Xbar)^2

Regression coefficient for the regression line of X on Y:

  bxy = r (σx / σy) = Σ(X - Xbar)(Y - Ybar) / Σ(Y - Ybar)^2

byx and bxy are called the regression coefficients.

Characteristics of regression coefficients:

  • When the regression is linear, the regression coefficient is given by the slope of the regression line.
  • The geometric mean of the two regression coefficients gives the correlation coefficient: r = ±sqrt(byx · bxy).
  • Regression coefficients are absolute measures (they depend on the units of measurement). If byx is negative then bxy is also negative, and r takes the same negative sign.
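The geometric-mean relationship can be verified numerically. A minimal sketch with hypothetical paired data:

```python
import math

# Verify that the product of the two regression coefficients equals r^2,
# so their geometric mean recovers |r|.  Data are hypothetical.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

byx = sxy / sxx                      # slope of Y on X
bxy = sxy / syy                      # slope of X on Y
r = sxy / math.sqrt(sxx * syy)       # correlation coefficient

# byx * bxy equals r^2; r carries the common sign of byx and bxy.
print(byx * bxy, r * r)
```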

Standard error of the estimate:

  1. We can now measure the strength of the relationship between two variables.
  2. We can estimate the value of one variable when we know the value of the other.
  3. Our estimate is based on the line of best fit.
  4. However, some error may creep in.
  5. The Standard Error of the Estimate, Se = sqrt( Σ(y - yhat)^2 / (n - 2) ), is a measure of this error, where yhat is the value predicted by the regression line.
  6. A small Se indicates that our estimates are accurate.
  7. Assuming normally distributed errors, a confidence level of about 68% is achieved when the actual value of y falls between yhat - Se and yhat + Se.
  8. A confidence level of about 95% is achieved when the actual value of y falls between yhat - 2Se and yhat + 2Se.
  9. A confidence level of about 99.7% is achieved when the actual value of y falls between yhat - 3Se and yhat + 3Se.
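A minimal sketch of computing Se for a fitted line, using hypothetical data:

```python
import math

# Standard error of the estimate: Se = sqrt(sum((y - yhat)^2) / (n - 2)).
# The paired data below are hypothetical.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))        # n - 2 degrees of freedom in SLR

yhat = a + b * 3                     # point estimate at x = 3
band_68 = (yhat - se, yhat + se)     # rough 68% band, assuming normal errors
print(se, band_68)
```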

Check Standard Error:

In the previous example we found out

Result and Conclusion:

We had estimated the value of y as 3567.84 when x = 500.

  1. Se works out to 549.27.
  2. The actual value of y may lie between 3567.84 - 549.27 = 3018.57 and 3567.84 + 549.27 = 4117.11 when x = 500.
  3. This one-Se band (3018.57, 4117.11) corresponds to roughly 68% confidence; a 95% interval would use ±2Se, giving (2469.30, 4666.38).

 

Calculation of the coefficient of determination, R^2:

R^2 is a measure of the goodness of fit of a model: a statistical measure that discloses how well the regression line approximates the actual data. We use R^2 when a statistical model is used to predict future outcomes or in testing hypotheses. It is built from the following three sum-of-squares metrics:

  SST (total sum of squares) = Σ(y - ybar)^2
  SSR (regression sum of squares) = Σ(yhat - ybar)^2
  SSE (error sum of squares) = Σ(y - yhat)^2

where SST = SSR + SSE, and R^2 = SSR / SST = 1 - SSE / SST.

R^2 takes a value from 0 to 1 and is often quoted as a percentage. In our example R^2 = 0.8242, meaning the chosen model explains 82.42% of the variation in y. The slope (b), by contrast, reflects the change in y for a unit change in x.
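The sum-of-squares decomposition behind R^2 can be checked numerically. A sketch with hypothetical data:

```python
# R^2 from the three sums of squares: SST = SSR + SSE, R^2 = SSR / SST.
# The paired data below are hypothetical.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                # total variation in y
ssr = sum((yh - ybar) ** 2 for yh in yhat)             # variation explained by the line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # unexplained variation

r2 = ssr / sst      # equivalently 1 - sse / sst
print(r2)
```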

Adjusted R-Squared:

R-squared always increases (or at least never decreases) when a new predictor variable is added to the regression model, which is a drawback when comparing models. Adjusted R^2 is a modified version of R^2 that adjusts for the number of predictors in the model:

  Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1)

where n is the number of observations and k is the number of predictors.
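A quick numerical illustration of the adjustment, with hypothetical values for n, k, and R^2:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where n is the number of observations and k the number of predictors.
# The values below are hypothetical.
n, k = 30, 3
r2 = 0.8242

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adj_r2)   # slightly below r2, reflecting the penalty for extra predictors
```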