Regression is one of the most important techniques in predictive analytics. Within regression model building, Simple Linear Regression (SLR) is the first step towards machine learning. The data for SLR contain one response/dependent variable and one explanatory (independent) variable. The functional form of SLR is

y = β0 + β1X + ε

where

y (dependent variable) = predicted value

β0 is the population intercept (a parameter)

β1 is the population slope (a parameter)

ε is the random error

X is the independent variable.

This is how the dependent variable is related to the independent variable through a linear function.
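As a rough illustration of this data-generating process, here is a minimal sketch in Python; the parameter values and the sample size are illustrative assumptions, not estimates from any dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 25.0, 9.5, 5.0    # illustrative population parameters
n = 30                                  # illustrative sample size

x = rng.uniform(1, 11, size=n)          # independent variable (e.g. years of experience)
epsilon = rng.normal(0, sigma, size=n)  # random error, assumed normal with constant variance
y = beta0 + beta1 * x + epsilon         # dependent variable generated by the SLR model
```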

Assumptions we make under SLR:

  1. The error term εi follows a normal distribution.
  2. For all values of X, the variance of the error is constant (homoscedasticity).
  3. There is no multicollinearity (no two independent variables are highly correlated). This point applies to the multiple regression model, not to SLR.
  4. There is no autocorrelation between any two εi values. This is also applicable to MLR.

Using the Ordinary Least Squares (OLS) method, we estimate the regression coefficients. We also interpret the regression parameters under different functional forms, conduct various regression model diagnostics, and validate the model. Once the basic assumptions are met for each model, we deploy the best model available.
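As one possible way to run such diagnostics, the sketch below fits an OLS model with statsmodels and checks the assumptions listed above; the data here are simulated placeholders, not the salary dataset discussed later.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
x = rng.uniform(1, 11, size=30)               # placeholder predictor
y = 25 + 9.5 * x + rng.normal(0, 5, size=30)  # placeholder response

X = sm.add_constant(x)                        # adds the intercept column
model = sm.OLS(y, X).fit()                    # ordinary least squares fit
resid = model.resid

print(shapiro(resid))                         # normality of residuals (assumption 1)
print(het_breuschpagan(resid, X))             # constant error variance (assumption 2)
print(durbin_watson(resid))                   # autocorrelation of errors (assumption 4)
```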

In order to validate the assumptions and test significance, we use simple hypothesis tests such as the Z-test, Student's t-test and the F-test.
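For instance, the coefficient t-tests and the overall F-test can be read off a fitted OLS model; the arrays below are again placeholders standing in for the actual data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 11, size=30)               # placeholder predictor
y = 25 + 9.5 * x + rng.normal(0, 5, size=30)  # placeholder response

results = sm.OLS(y, sm.add_constant(x)).fit()

print(results.tvalues, results.pvalues)  # t-tests for the intercept and slope
print(results.fvalue, results.f_pvalue)  # overall F-test for the model
print(results.summary())                 # full table, comparable to Excel's regression output
```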

We are given data containing one dependent variable and one independent variable: Salary is the target (dependent) variable and Experience in years is the explanatory (independent) variable.

We are asked to estimate the parameters β0 and β1 and to arrive at an equation that estimates the value of Salary when we know the value of Experience in years.

Let y = a + bx be the regression line of y on x. The constants a and b can be found using the following normal equations.

Standard normal equations: Σy = n·a + b·Σx and Σxy = a·Σx + b·Σx². Substituting the totals from the data (n = 30, Σx = 159.4, Σy = 2280.09, Σx² = 1080.5, Σxy = 14321.961):

2280.09 = 30a + 159.4b          … (1)

14321.961 = 159.4a + 1080.5b    … (2)

Multiplying (1) by 159.4 and (2) by 30 to equate the coefficients of a:

363446.346 = 4782a + 25408.36b  … (3)

429658.83 = 4782a + 32415b      … (4)

Subtracting (3) from (4):

66212.484 = 7006.64b

b = 66212.484 / 7006.64 = 9.449962321

Substituting b in equation (1):

2280.09 = 30a + 159.4 × 9.449962321

2280.09 = 30a + 1506.324

 

30a = 2280.09 - 1506.324 = 773.766

a = 773.766 / 30 = 25.7922

Equation:

ŷ = 25.7922 + 9.449962321X

Estimate the value of y when the value of X = 11:

ŷ = 25.7922 + 9.449962321 × 11

= 129.74
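The hand calculation can be checked programmatically; a minimal sketch using the totals implied by equations (1) and (2) (n = 30, Σx = 159.4, Σy = 2280.09, Σx² = 1080.5, Σxy = 14321.961):

```python
import numpy as np

n, sum_x, sum_y = 30, 159.4, 2280.09
sum_x2, sum_xy = 1080.5, 14321.961

# Normal equations:  sum_y = n*a + b*sum_x ;  sum_xy = a*sum_x + b*sum_x2
A = np.array([[n, sum_x],
              [sum_x, sum_x2]])
rhs = np.array([sum_y, sum_xy])

a, b = np.linalg.solve(A, rhs)
print(a, b)            # ~25.7922 and ~9.44996

y_hat_11 = a + b * 11  # estimated salary when experience = 11 years
print(y_hat_11)        # ~129.74
```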

USING EXCEL:

Descriptive Statistics using Excel:

Regression:

ŷ = 25.7922 + 9.449962321X

Find the equation using regression coefficients and the correlation coefficient

The regression coefficients byx and bxy, for the regression line of y on X and the regression line of X on y respectively, are byx = r·σy/σx and bxy = r·σx/σy.

The regression equation of y on X is given by (y − ȳ) = byx(X − x̄).

The regression equation of X on y is given by (X − x̄) = bxy(y − ȳ).

Here r is the correlation coefficient, byx is the regression coefficient of the regression line of y on X, and bxy is the regression coefficient of the regression line of X on y.

What we need to know are:

  • Correlation coefficient
  • Standard deviation of y
  • Standard deviation of X

Important Characteristics of Regression Coefficients:

  1. When the regression is linear, the regression coefficient is given by the slope of the regression line.
  2. The geometric mean of the two regression coefficients gives the correlation coefficient (r = ±√(byx·bxy), taking the common sign of the coefficients).

The regression coefficient is an absolute measure (it carries the units of the variables), whereas the correlation coefficient is a relative, unit-free measure.

If byx is negative, then bxy is also negative and r is also negative.
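A small sketch of these relationships; r is taken as √0.9570 ≈ 0.9783 (from the R² reported later), while the standard deviations are placeholder values:

```python
import math

r = 0.9783               # approx. sqrt(0.9570), the R² reported later
sd_y, sd_x = 27.4, 2.84  # placeholder standard deviations of y and X

b_yx = r * sd_y / sd_x   # regression coefficient of y on X (slope of y = a + b_yx * X)
b_xy = r * sd_x / sd_y   # regression coefficient of X on y

# The geometric mean of the two regression coefficients recovers r (with the common sign).
sign = 1 if b_yx >= 0 else -1
r_check = sign * math.sqrt(b_yx * b_xy)
print(b_yx, b_xy, r_check)   # r_check equals r
```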

Difference between  correlation coefficient and regression coefficients:

 

Formulae for mean of X, mean of y, standard deviation of X and standard deviation of y.
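A minimal sketch of these computations with numpy, on placeholder arrays:

```python
import numpy as np

x = np.array([1.2, 2.5, 3.1, 4.0, 5.4])       # placeholder X values
y = np.array([38.0, 44.5, 51.2, 58.9, 67.3])  # placeholder y values

x_bar, y_bar = x.mean(), y.mean()  # mean = sum of the values / n
sd_x = x.std(ddof=0)               # population SD: sqrt(sum((x - x_bar)**2) / n)
sd_y = y.std(ddof=0)
# Use ddof=1 for the sample standard deviation (divide by n - 1 instead of n).
print(x_bar, y_bar, sd_x, sd_y)
```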

Regression line

Standard Error of the Estimate:

  1. We can now measure the strength of the relationship between the two variables.
  2. We can estimate the value of one variable when we know the value of the other variable.
  3. Our estimate is based on the line of best fit.
  4. However, some error is inevitably involved.
  5. The Standard Error of the Estimate (Se) is the measure that quantifies this error.
  6. A small Se indicates that our estimate is accurate.
  7. At roughly 68% confidence, the actual value of y falls between ŷ − Se and ŷ + Se.
  8. Roughly 95% confidence is achieved when the interval is ŷ − 2Se to ŷ + 2Se.
  9. Roughly 99.7% confidence is achieved when the interval is ŷ − 3Se to ŷ + 3Se.

Standard error of the estimate for the regression line of y on X (ŷ = a + bX): Se = √( Σ(y − ŷ)² / (n − 2) ).

Standard error of the estimate for the regression line of X on y (x̂ = a + bY): Se(X) = √( Σ(X − x̂)² / (n − 2) ).
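A minimal sketch of this formula for the line of y on X, with placeholder data and fitted values from a least-squares line:

```python
import numpy as np

# Placeholder data; x and y stand in for experience and salary.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40.0, 50.0, 62.0, 70.0, 83.0])

b, a = np.polyfit(x, y, 1)                        # least-squares line of y on x
y_hat = a + b * x                                 # fitted values
n = len(y)

se = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # standard error of the estimate
print(se)
```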

Result and Conclusion:

We estimated the value of y as 129.74 when X = 11.

Se works out to 5.788315051

At roughly 68% confidence, the actual value of y may lie between 129.74 − 5.79 and 129.74 + 5.79, i.e. (123.95, 135.53).

At the 95% level the confidence interval would be 129.74 ± 2(5.79), i.e. approximately (118.16, 141.32).
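These bands can be reproduced directly from the reported values (ŷ = 129.74, Se = 5.788315051); a minimal sketch:

```python
y_hat = 129.74    # estimated salary at X = 11
se = 5.788315051  # standard error of the estimate reported above

for k, level in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = y_hat - k * se, y_hat + k * se
    print(f"{level}: ({low:.2f}, {high:.2f})")
# 68%:   (123.95, 135.53)
# 95%:   (118.16, 141.32)
# 99.7%: (112.38, 147.10)
```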

 

Coefficient of Determination (R-squared) and its calculation:

It is a measure that provides information about the goodness of fit of a model. When we talk about regression, we talk about R-squared: a statistical measure that discloses how well the regression line approximates the actual data. When a statistical model is used to predict future outcomes or in the testing of hypotheses, R² is an important measure. We use the following three sum-of-squares metrics: the total sum of squares (SST), the regression (explained) sum of squares (SSR) and the error (residual) sum of squares (SSE).

The coefficient of determination R² enables you to determine the goodness of fit. It takes a value from 0 to 1 and is often expressed as a percentage. Here the chosen model explains 95.70% of the variation in y. The slope (b) reflects the change in y for a unit change in x.
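A small sketch of the three sums of squares and the resulting R², using placeholder data and a least-squares fit from numpy:

```python
import numpy as np

# Placeholder data; x and y stand in for experience and salary.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([41.0, 49.0, 63.0, 71.0, 80.0])

b, a = np.polyfit(x, y, 1)           # least-squares slope and intercept
y_hat = a + b * x                    # fitted values on the regression line

y_bar = y.mean()
sst = np.sum((y - y_bar) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)   # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)       # error (residual) sum of squares

r_squared = ssr / sst                # equals 1 - sse / sst for an OLS fit with intercept
print(sst, ssr, sse, r_squared)
```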

ADJUSTED R-SQUARE:

  • R-squared will always increase when a new predictor variable is added to the regression model.
  • This is a drawback.
  • Adjusted R² is a modified version of R² that adjusts for the number of predictors in the regression model.
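The usual adjustment, sketched here with the R² reported above (0.9570), n = 30 observations and k = 1 predictor:

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r_squared(0.9570, n=30, k=1))  # ~0.9555
```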