Using IBM SPSS
Data View
Variable View
Descriptive Statistics
Press continue and press OK
Output in viewer
No. of cpu processors (cores) used in the cell phone: frequency table
Regression
Press the Statistics button, fill in the popup, and press Continue. Then press the Plots button.
Press Continue and then the Save button. We do not want to save anything here, so press Continue and then the Options button.
Instruct the system to enter a variable if its probability falls below 0.05 and to remove a variable when its probability moves above 0.10. Now press Continue and press OK. We estimate the parameters β0, β1, β2, β3, … βk using the least squares approach, i.e. by minimizing the sum of squared errors. If we have k regression parameters (slopes) in the model, then we will have k+1 normal equations. Once we estimate the regression parameters, we have to validate the model before we use it. Let us imagine that we have included fourteen variables, namely Price, Product_id, Sale, weight, resoloution, ppi, cpu core, cpu freq, internal mem, ram, RearCam, Front_Cam, battery, and thickness. Say Price is the dependent/response variable and all the others are explanatory variables.
Mean and Standard Deviation of each variables:
95% Confidence Interval for mean
The population mean may fall in the range between the Lower and Upper limits (at a 95% confidence level)
For this we need to know the mean of X and the Standard Deviation of X (for each variable/feature)
- Calculate SEmean = Standard Deviation/Sqrt(n), where n is the number of observations in the sample
- Product Id SEmean = 410.852/Sqrt(161) = 32.37967366
- Calculate MEmean = SEmean * Z value (0.95), where the Z value for 95% is 1.96
- Product Id MEmean = 32.37967366*1.96 = 63.46299471
- LCL = mean - MEmean = 675.56 - 63.46299471 = 612.0970053
- UCL = mean + MEmean = 675.56 + 63.46299471 = 739.022995 (see the Python sketch below)
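As a check on the arithmetic above, here is a minimal Python sketch; the mean, standard deviation, and n are the Product Id values quoted above, and any other variable would follow the same pattern:

import math

mean = 675.56     # sample mean of Product Id
sd = 410.852      # sample standard deviation
n = 161           # number of observations
z_95 = 1.96       # z value for a 95% confidence interval

se_mean = sd / math.sqrt(n)   # standard error of the mean
me_mean = se_mean * z_95      # margin of error
lower = mean - me_mean        # lower confidence limit (LCL)
upper = mean + me_mean        # upper confidence limit (UCL)
print(f"SE = {se_mean:.4f}, ME = {me_mean:.4f}, CI = ({lower:.4f}, {upper:.4f})")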
Correlation:
p_value using one tailed test:
Method Used: Enter
Under this method, all the variables we supply are entered into the analysis at once. This is a forced inclusion of every variable in the dataset.
Model Summary
R is equivalent to the Pearson correlation coefficient for a simple linear regression, that is a regression with only one predictor variable.
Coefficient of Determination –> R^2(R-SQUARE)
R² comes to 0.954; namely, 95.4% of the variation in the dependent variable Price is explained by the independent variables taken together. Besides that, Adjusted R² is calculated as 0.950, i.e. 95%. Whenever a new independent variable is added, R² increases, even if the variable adds little (and this can be aggravated by multi-collinearity). To counter that, Adjusted R² was introduced.
It is a measure that provides information about the goodness of fit of a model. When we talk about regression, we talk about R². It is a statistical measure that discloses how well the regression line approximates the actual data. When a statistical model is used to predict future outcomes or in hypothesis testing, R² is an important measure. We use the following three sum-of-squares metrics.
Formulae
Example:
Find mean for export(X) – 452/7 = 64.57
find mean for import (y). – 427/7 = 61.00
No.of observations are 7 .
Find predicted y using the formula 40.19+0.3222x for each individual value of X.
Find SSE (Sum of Squares of Error) using the formula SSE = SUM(y - yhat)^2 = 126.96
Find SSR (Sum of Squares due to Regression) using the formula SSR = SUM(yhat - mean(y))^2 = 289.04
Find SST (Total Sum of Squares) using the formula SST = SUM(y - mean(y))^2 = 416.00
RSquare = SSR/SST = 289.04/416 =0.6948 = 69.48%
RSquare = 1-(SSE/SST) = 1-(126.96/416.00) = 0.6948 = 69.48%
The above example is taken from a Simple Linear Regression model. The coefficient of determination R² enables you to determine the goodness of fit. It takes a value from 0 to 1 and is often expressed as a percentage. The chosen model explains 69.48% of the variation in y. The slope (b) reflects the change in y for a unit change in X.
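A minimal Python sketch of these calculations; the x and y lists below are hypothetical placeholders, since the individual export/import observations are not listed above, so the printed sums of squares will only match the hand calculation if the actual data are substituted:

# hypothetical export (x) and import (y) observations; replace with the real data
x = [50, 55, 60, 65, 70, 72, 80]
y = [45, 52, 58, 62, 66, 69, 75]

n = len(y)
mean_y = sum(y) / n
a, b = 40.19, 0.3222                              # intercept and slope of the fitted line above
y_hat = [a + b * xi for xi in x]                  # predicted y for each x

sse = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))   # error sum of squares
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)           # regression sum of squares
sst = sum((yi - mean_y) ** 2 for yi in y)               # total sum of squares
r_square = ssr / sst                              # equals 1 - sse/sst for a least-squares line
print(sse, ssr, sst, r_square)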
Multiple R-Square
This is meant for multiple linear regression
R-square shows the proportion of total variation in the dependent variable that is explained by the independent variables. A value greater than 0.5 shows that the model is effective enough to determine the relationship. In this case, the value is .6948, which is good.
R-Correlation
R = SQRT(R^2)
= SQRT(0.954)
=0.977
R-value represents the correlation between the dependent and independent variables. A value greater than 0.4 is taken for further analysis. In this case, the value is .977, which is good.
Adjusted R-Square
R-squared will always increase when a new predictor variable is added to the regression model. This is a drawback. Adjusted R² is a modified version of R² that adjusts for the number of predictors in the regression model. It is calculated as under:
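The standard formula is Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1), where n is the number of observations and k is the number of predictors. For the model above, 1 - (1 - 0.954) * 160 / 147 is approximately 0.950, which matches the value reported.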
Adjusted R-square shows the generalization of the results, i.e. how well the sample results generalize to the population in multiple regression. The difference between R-square and Adjusted R-square should be minimal. In this case, the value is .950, which is not far off from .954, so it is good. Therefore, the model summary table is satisfactory and we can proceed with the next step.
ANOVA TABLE
It determines whether the model is significant enough to determine the outcome.
Sum of Squares of Regression(SSR) =∑(yhat – ybar)^2 = 90102069.28
Sum of Squares of Error/Residual (SSE) = ∑(y – yhat)^2 = 4315775.476
Sum of squares of Total = ∑(y – ybar)^2 =94417844.76
Mean Square Regression (MSR) = SSR/df = 90102069.28/13 = 6930928.406, where df = k = 13
Mean Square Error (MSE) = SSE/df = 4315775.476/147 = 29359.017, where df = n - k - 1 = 161 - 13 - 1 = 147
F-statistic = MSR/MSE = 6930928.406/29359.017 =236.075
p-value of F-statistic = 0.000.
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the model is statistically significant, i.e. at least one of the betas is not equal to zero. The F-test is passed.
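As a cross-check, the p-value of the F-statistic can be computed from the F distribution with (13, 147) degrees of freedom, for example with SciPy (a minimal sketch):

import scipy.stats as st

f_stat = 236.075                        # MSR / MSE from the ANOVA table above
df_regression, df_residual = 13, 147    # numerator and denominator degrees of freedom
p_value = st.f.sf(f_stat, df_regression, df_residual)   # upper-tail probability
print(p_value)                          # effectively zero, which SPSS reports as .000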
P-value/Sig value: Generally, a 95% confidence level (5% level of significance) is chosen for the study. The p-value should be less than 0.05. In the above table, it is .000. Therefore, the result is significant and we can reject the null hypothesis. Based on the test statistic (here the t- or F-statistic) and the degrees of freedom, we determine the p-value. The p-value is the smallest significance level at which we can reject the null hypothesis H0.
F-ratio: It represents the improvement in prediction obtained by fitting the model, relative to the inaccuracy that remains in the model. An F-ratio greater than 1 indicates an efficient model. In the above table, the value is 236.075, which is good. Here the value of F, the ratio of MSR over MSE, is very large and the corresponding p-value is very low, essentially zero. We can conclude that the overall model is statistically significant.
Coefficients
Equation of the fitted line:
Price(yhat) = 1720.152+0.031*product_id -0.023*sale -0.563*weight-70.758*resolution +1.004*ppi +53.517*cpu core +129.577*cpu freq +6.107*internal mem
+91.549*ram + 4.458*RearCam+9.313*Front_cam+1.34*battery-73.605*thickness
The coefficients table shows the strength of the relationship and the significance of each variable in the model, i.e. how it impacts the dependent variable, and it enables you to perform hypothesis testing for further study. The Sig/p-values of the coefficients of product_id, sale, weight, resoloution, RearCam, and Front_Cam are greater than 0.05, so these are not significant. The other seven parameters have p-values less than 0.05.
The Sig.value should be below the tolerable level of significance for the study (0.05 or 95% Confidence level). Based on the significant value the null hypothesis is either rejected or accepted
If the sig. value is < 0.05, the null hypothesis is rejected; it means there is an impact.
If the sig. value is > 0.05, the null hypothesis is not rejected; there is no strong evidence against it, which means there is no demonstrated impact.
Sometimes some variables may be statistically insignificant due to multi-collinearity, and we have to take care of this issue. Say the coefficient of product_id is 0.031; this variable is like a serial number and should not have any influence on the price. The coefficient of weight is -0.563. It means that the price of the phone decreases by 0.563 units for a one-unit increase in weight, provided everything else is kept constant, i.e. we control all other variables.
Standard Errors of Coefficients/Betas
The variance of the coefficients is derived from the variance-covariance matrix of the estimated coefficients. The formula for the variance of each coefficient βj is:
Var(βj) = MSE · [(X′X)^(-1)]jj
Where:
– X is the matrix of independent variables (including a column of ones for the intercept).
– X′X is the matrix product of X transposed and X.
– [(X′X)^(-1)]jj is the j-th diagonal element of the inverse of the X′X matrix.
Calculate the standard errors of Coefficients
The standard error of each coefficient βj is the square root of its variance:
SE(βj) = √Var(βj)
Using an R Program
dfData <- as.data.frame(read.csv("https://gattonweb.uky.edu/sheather/book/docs/datasets/MichelinNY.csv", header=T))
# using direct calculations
vY <- as.matrix(dfData[, -2])[, 5] # dependent variable
mX <- cbind(constant = 1, as.matrix(dfData[, -2])[, -5]) # design matrix
vBeta <- solve(t(mX)%*%mX, t(mX)%*%vY) # coefficient estimates
dSigmaSq <- sum((vY - mX%*%vBeta)^2)/(nrow(mX)-ncol(mX)) # estimate of sigma-squared
mVarCovar <- dSigmaSq*chol2inv(chol(t(mX)%*%mX)) # variance covariance matrix
vStdErr <- sqrt(diag(mVarCovar)) # coeff. est. standard errors
print(cbind(vBeta, vStdErr)) # output
Using statsmodels.api
model.bse
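A minimal statsmodels sketch; the DataFrame and its columns below are hypothetical stand-ins (replace them with the cell-phone dataset), while the bse attribute of a fitted OLS model is the standard statsmodels way to get the coefficient standard errors:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"ppi": rng.uniform(200, 600, 50),
                   "ram": rng.uniform(0.5, 6, 50)})        # hypothetical predictors
df["Price"] = 500 + 1.0 * df["ppi"] + 90 * df["ram"] + rng.normal(0, 50, 50)

X = sm.add_constant(df[["ppi", "ram"]])    # design matrix with an intercept column
model = sm.OLS(df["Price"], X).fit()       # ordinary least squares fit
print(model.bse)                           # standard errors of the coefficients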
Using Excel:
Refer to Excel section topic Calculation of Standard Error of all Betas:
Standardized Betas
When we deal with multiple variables measured in different units (say body weight in kg, body height in cm, etc.), we cannot compare their raw coefficients. To make them comparable, we standardize the response variable and all the explanatory variables. After standardization, we get standardized beta coefficients. The variable which has the highest standardized beta coefficient (in absolute value) has the maximum influence on the response variable.
The standardized beta is nothing but the raw beta times the standard deviation of that explanatory variable divided by the standard deviation of Y:
Standardized Beta = (Unstandardized Beta * standard deviation of X) / standard deviation of Y
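A minimal sketch of the conversion; the raw beta is the ram coefficient from the fitted line above, the standard deviation of ram is a hypothetical placeholder, and the standard deviation of Price is derived from SST above:

import math

raw_beta = 91.549                                  # raw coefficient of ram from the fitted line
sd_x = 1.5                                         # hypothetical standard deviation of ram
sd_y = math.sqrt(94417844.76 / 160)                # sd of Price from SST/(n-1), about 768
standardized_beta = raw_beta * sd_x / sd_y
print(standardized_beta)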
t-Statistics calculation
t-statistics = Coefficients(beta)/Standard Error(beta)
p_value statistics
It can be calculated on the web using a p-value calculator.
The p-values for product_id, weight, resoloution, RearCam and Front_Cam are greater than 0.05, so these variables can be removed from the regression, provided they are not merely affected by multi-collinearity.
Using t-Distribution Table
You can find the p-value using the t-distribution. Here, look up the degrees of freedom: we have 161 observations and 13 predictors plus the intercept, so the residual degrees of freedom are 161 - 13 - 1 = 147.
Remember the drawback of using the t distribution table: it does not tell us the exact p-value; it only gives us a range of values.
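Alternatively, an exact p-value can be computed from the t distribution, for instance with SciPy (a minimal sketch; the t-value shown is a hypothetical placeholder):

import scipy.stats as st

t_stat = 2.10                  # hypothetical t-statistic (coefficient / its standard error)
df_residual = 147              # residual degrees of freedom, n - k - 1 = 161 - 13 - 1
p_value = 2 * st.t.sf(abs(t_stat), df_residual)   # two-tailed p-value
print(p_value)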
95% Confidence Interval of beta
Lower bound and Upper bound
Lower bound = coefficient(beta) – 1.96*standard error(beta)
Upper bound = coefficient(beta) + 1.96*Standard error(beta)
Correlations
To determine the relation between two variables we use different types of correlations
- Pearson Correlation
- Spearman Correlation
- Point Biserial Correlation
- Kendall Correlation
And the terms we come across are
- Zero-order Correlation
- Partial Correlation
- Part Correlation
These types of correlations are relevant when you have a response/dependent variable (aka outcome variable), predictor/independent variables (aka explanatory variables), and one or more control variables (confounding variables).
Partial correlation and part correlation provide another means of assessing the relative importance of independent variables in determining the value of y (the response variable). They show how much each variable uniquely contributes to R^2.
Partial Correlation: Describes each independent variable's unique contribution through the partial correlation pr and its square pr^2. The squared value pr^2 answers: "How much of the Y variance that is not explained by the other independent variables in the equation is explained by this variable?"
Zero-order correlation is nothing but the Pearson correlation. You can compute the Pearson correlation between each feature and the dependent variable Price (y).
Let us find out the correlation r(weight,price)
Calculation of Partial Correlation:
Calculates the correlation between two variables, while excluding the effect of a third variable.
Let us take the Pearson correlations between y and ppi, y and cpu core, and ppi and cpu core:
r(ppi, price) = r_xy = 0.818
r(cpu core, price) = r_zy = 0.687
r(cpu core, ppi) = r_zx = 0.488
Controlling for the variable z (cpu core), our aim is to find the partial correlation r_xy.z.
The partial correlation r_xy.z tells how strongly the variable x correlates with the variable y once the correlation of both variables with the variable z has been partialled out.
Formula
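For two variables x and y with a control variable z, a standard form of the partial correlation formula is:
r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))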
Can be done using SPSS
If you also need the zero-order correlations, tick that option; otherwise press Continue.
Now we can find the partial correlation between price and battery while controlling for all other variables.
In the same way you can find the partial correlation between price and any required feature, controlling for all other variables.
Let us restrict the above data to three features: price (y), with ppi and thickness as independent variables.
Bivariate Correlation
Partial correlation
Part Correlation:
Part correlation is similar to partial correlation. But there is a subtle difference.
Sometimes it is called the semi-partial correlation.
Like the partial correlation, the part correlation is the correlation between two variables (independent and dependent) after controlling for one or more other variables.
With semi-partial correlation, the third variable is held constant for either X or Y but not both;
with partial correlation, the third variable is held constant for both X and Y.
It explains how one specific independent variable affects the dependent variable, while other variables are controlled for to prevent them from getting in the way.
However, for the part correlation, only the influence of the control variables on the independent variable is taken into account. In other words, the part correlation does not control for the influence of the confounding variables on the dependent variable.
It indicates the “unique” contribution of an independent variable. Specifically, the squared semi-partial correlation for a variable tells us how much R^2 will decrease if that variable is removed from the regression equation.
Semipartial correlation measures the strength of linear relationship between variables X1 and X2 holding X3 constant for just X1 or just X2. It is also called part correlation.
r12 = 0.818 (r(price, ppi))
r13 = -0.717 (r(price, thickness))
r23 = -0.497 (r(ppi, thickness))
r1(2.3) = (0.818 - (-0.717)*(-0.497)) / sqrt(1 - (-0.497)^2)
= 0.532009, the relationship between price and ppi when thickness is held constant for ppi only
Here, r1(2.3) denotes the semipartial correlation between variables X1 and X2 where X3 is held constant for X2.
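A minimal Python sketch of both calculations, using the three Pearson correlations quoted above (price-ppi, price-thickness, ppi-thickness):

import math

r12 = 0.818    # r(price, ppi)
r13 = -0.717   # r(price, thickness)
r23 = -0.497   # r(ppi, thickness)

# semi-partial (part) correlation: thickness is partialled out of ppi only
semi_partial = (r12 - r13 * r23) / math.sqrt(1 - r23 ** 2)

# partial correlation: thickness is partialled out of both price and ppi
partial = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

print(semi_partial)   # about 0.532, matching the hand calculation above
print(partial)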
Multi Linear Regression Model Diagnostics
You must know how a model(formula) is validated and diagnosed
Step 1: Test for overall fitness of the model
Step 2: Test for overall model significance achieved through F-test
Step 3: Check for individual explanatory variable and their statistical significance(achieved through t-test)
Step 4: Check for multi-collinearity and auto correlation
Step 1:Test for Overall Fitness of the model
Coefficient of Determination R^2:
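R^2 = SSR / SST = 1 - (SSE / SST)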
Where:
SSR = Sum of Squares due to Regression ∑(yhat - ybar)^2
SSE = Sum of Squares of Error ∑(y – yhat)^2
SST = Total Sum of Squares ∑(y – ybar)^2
SSR (the numerator) will keep increasing as the number of explanatory variables increases.
SSE will keep decreasing when you add a new independent variable to the analysis.
So whenever you add a new independent variable, R² will increase irrespective of whether the added variable is statistically significant or not.
Under step-wise method see the model summary
The system adds or removes the features/predictor variables based on the conditions set and finally settles down with 7 terms (6 variables plus the constant). When 'ram' was added, R² was 0.804; when thickness was added, R² increased to 0.890; and finally, after including all the selected variables, R² increased to 0.945. But when we used the Enter method, where all 13 variables were added at the beginning, R² was 0.954.
To nullify the effect of this increase on every addition of an independent variable, we compute Adjusted R². Here we normalize both numerator and denominator with respect to their degrees of freedom. Say n is the number of observations and k is the number of variables used in the model.
Under Enter Method: The model was
Price(yhat) = 1720.152+0.031*product_id -0.023*sale -0.563*weight-70.758*resolution
+1.004*ppi +53.517*cpu core +129.577*cpu freq +6.107*internal mem
+91.549*ram + 4.458*RearCam+9.313*Front_cam+1.34*battery-73.605*thickness
When we add a new variable, R² will increase, whereas Adjusted R² will not necessarily increase.
Step2: Test for overall model significance
It is achieved through the F-test. For multiple linear regression we have more than one independent variable.
Hypothesis for F-test
Null H_0 (null hypothesis): beta_1 = beta_2 = beta_3 = … = beta_k = 0
Alternate H_1: at least one of beta_1, beta_2, beta_3, … beta_k is not equal to zero
(some of the betas may still be zero).
If we reject Null hypothesis, it means that we can claim that overall model is valid
Remember, some explanatory variables could be statistically insignificant
We get this ANOVA (Analysis of Variance) Table. F is related to R^2
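The standard relationship, for a model with k predictors and n observations, is F = (R^2 / k) / ((1 - R^2) / (n - k - 1)).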
Refer to Anova Table in the previous pages
Step 3: Check for individual explanatory variable and their statistical significance(achieved through t-test)
Here we use t-test
Null H_0 (null hypothesis): beta_j = 0, tested separately for each coefficient
Alternate H_1: beta_j ≠ 0
Coefficient Table
We are having 13 variables plus one constant.
The p-values for all variables are less than 0.05 except for product_id, sale, weight, resoloution, RearCam, and Front_Cam. We reject the null hypothesis for the variables whose p-values are less than .05; those are statistically significant. The variables that appear with p-values greater than 0.05 may be dropped. At the same time, we cannot remove beta_0 even if its p-value exceeds 0.05, because it is the intercept constant. Some variables may also have p-values greater than 0.05 merely because of multi-collinearity.
Check for Multi-Collinearity and Auto Correlation
Multi-collinearity is a serious issue in the MLR model. It is nothing but high correlation between explanatory variables, and it can lead to unstable coefficients. Understand the impact of multi-collinearity.
The p-value of weight increases to 0.442 as we add new variables. This is higher than 0.05, so weight has become statistically insignificant. This happens because of multi-collinearity, and we may still see a high R².
The F-test rejects the null hypothesis, meaning the overall model is fine, yet none of the t-tests rejects the null hypothesis.
This contradiction between the F-test and the t-tests arises when the correlations among the X variables are high compared with the correlations between X and Y.
Because of this, the standard errors of the regression coefficient estimates are inflated, which produces poor t-test results.
Variation Inflation Factor (VIF)
How to identify the existence of Multi-collinearity?
One of the measures used to identify the existence of Multi-Collinearity is Variance Inflation Factor(VIF). It is given by
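VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R² obtained by regressing the j-th explanatory variable on all the other explanatory variables.

A minimal statsmodels sketch; the columns below are hypothetical stand-ins for the cell-phone features, while variance_inflation_factor is the standard statsmodels helper for this diagnostic:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = pd.DataFrame({"weight": rng.normal(140, 20, 100),
                  "thickness": rng.normal(8, 1, 100)})
X["resoloution"] = 0.04 * X["weight"] + rng.normal(0, 0.3, 100)  # deliberately correlated

X_const = sm.add_constant(X)                       # include the intercept column
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != "const"}
print(vifs)                                        # large VIF values signal multi-collinearity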
Say the VIF is 25.041; then the standard error of the estimate is inflated by sqrt(VIF), i.e. by 5.004098.
When the standard error of the estimate is inflated by a factor of 5, the t-value is deflated by the same factor of 5.
When the t-value goes down, the p-value increases.
So you may decide to remove the variable weight, thinking that it is statistically insignificant.
p-Value Calculation
The result is significant as p < .05, so we cannot remove this variable.
Now the explanatory variable weight is significant. People may not like to remove variables; in that case we may go and use Principal Component Analysis.
Collinearity Diagnostics
It is applicable only for Multiple Linear Regression model
Condition Index
Dimension
The first column shows the dimension; then come the eigenvalue and the condition index.
The dimensions are similar, but not identical, to the factors of a factor analysis or PCA (principal component analysis).
Eigen Value
Eigenvalues close to 0 are an indication of multicollinearity, to be interpreted in conjunction with the condition index.
Condition Index
The condition index is calculated from the eigenvalues:
Condition index (for dimension i) = Sqrt(largest eigenvalue / eigenvalue of dimension i)
Interpretation of condition Index
Values above 15 indicate multi-collinearity issues.
Values above 30 indicate a very strong sign of multi-collinearity problems.
Variance Proportions
We also get regression coefficients variance decomposition matrix. According to Hair et al. (2013) for each row with a high Condition Index, you search for values above .90 in the Variance Proportions. If you find two or more values above .90 in one line you can assume that there is a collinearity problem between those predictors. If only one predictor in a line has a value above .90, this is not a sign for multicollinearity.
Residual Statistics
HISTOGRAM
Normal P-P Plot
Scatter plot
Partial Regression Plot
Methods Used in SPSS
- ENTER
- STEP-WISE
- REMOVE
- BACKWARD
- FORWARD
PARTIAL F-TEST and Variable Selection
- Partial F-Test is important test.
- It is done in Step-Wise Regression Model Building
- This test is needed when we want to check whether additional independent variables add value in explaining the response variable
- First, we test a model with large number of independent variables
- Second, we test another model with lesser number of independent variables
- We use Partial F-test in selecting new variables with a strategy Step-Wise Regression
- Initially we have K explanatory variables
- In the reduced model we have r explanatory variables, where r is less than k. Our null hypothesis is beta_(r+1) = beta_(r+2) = beta_(r+3) = … = beta_k = 0
- For all explanatory variables which are not present in the reduced model, the beta values are assumed to be zero
Nested Models
Full model
Reduced Model
The first equation is the full model containing 13 predictor variables and one intercept. The second equation is the reduced model containing only 6 predictor variables and one intercept. We have to determine whether the additional variables add value to the equation in explaining the changes in the dependent variable y. To determine if these two models are significantly different, we can perform a partial F-test, which calculates the following F test-statistic:
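F = ((SSE[reduced] - SSE[full]) / (k - r)) / (SSE[full] / (n - k - 1)), i.e. the numerator is the drop in error per removed variable and the denominator is the MSE of the full model.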
Where:
- SSE[full]: sum of squares of error (residuals) of the full model
- SSE[reduced]: sum of squares of error (residuals) of the reduced model
- n: total number of observations in the dataset
- k: number of explanatory variables in the full model
- r: number of explanatory variables in the reduced model
The sum of squares of residuals will always be smaller for the full model,
since adding predictor variables always leads to some reduction in error.
A partial F-test tests whether the group of predictors that you removed from the full model is actually useful and needs to be included in the full model.
Partial F-test uses the following null and alternate hypothesis
- H0 All coefficients removed from the full model are zero
- HA At least one of the coefficients removed from the full model is non-zero
If the p-value corresponding to the F test-statistic is below a certain significance level (e.g. 0.05), then we can reject the null hypothesis and conclude that at least one of the coefficients removed from the full model is significant. Example
Full model: price = 1720.152+0.031*product_id -0.023*sale -0.563*weight-70.758*resolution
+1.004*ppi +53.517*cpu core +129.577*cpu freq +6.107*internal mem
+91.549*ram + 4.458*RearCam+9.313*Front_cam+1.34*battery-73.605*thickness
n = 161
k =13
Reduced model: price = 1427.657+136.661*ram-61.433*thickness +1.162*ppi+
69.808*cpu core+6.396*internal mem+99.7606*cpu freq
r=6
SSE full = 4315775.476
SSE reduced = 5227169.063
MSE Full =29359.017
Partial F-test = ((SSE reduced - SSE full)/(13-6)) / MSE full
= ((5227169.063 - 4315775.476)/7) / 29359.017
= 4.43472
P-value for F test statistic
Since the p-value corresponding to this F test-statistic is below the significance level (e.g. 0.05), we reject the null hypothesis and conclude that at least one of the coefficients removed from the full model is significant. We can do some refinement by adding back some removed variables and testing whether they add value in explaining the changes in y.
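A minimal SciPy sketch of the test, using the figures above:

import scipy.stats as st

sse_full = 4315775.476
sse_reduced = 5227169.063
mse_full = 29359.017              # SSE_full / (n - k - 1) = 4315775.476 / 147
k, r, n = 13, 6, 161

f_stat = ((sse_reduced - sse_full) / (k - r)) / mse_full
p_value = st.f.sf(f_stat, k - r, n - k - 1)    # F distribution with (7, 147) degrees of freedom
print(f_stat, p_value)                         # F is about 4.43; the p-value is well below 0.05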
Under Backward Method
Initially all given 13 independent variables are included and by backward elimination process three variables have been removed based on criteria.
Ten variables have been selected
SSE full = 4315775.476
SSE reduced = 4397305.025
MSE Full =29359.017
Partial F-test = ((SSE reduced - SSE full)/(13-10)) / MSE full
= ((4397305.025 - 4315775.476)/3) / 29359.017
= 0.92566166
If the p-value corresponding to the F test-statistic is greater than the significance level (e.g. 0.05), then we fail to reject the null hypothesis and conclude that the removed variables do not add any value in explaining the changes in y, so the removal of variables like product_id, RearCam, and weight is justified.
2.Step-Wise Method
model summary
Under the step-wise method, the system provides the coefficients after adding and removing variables based on the conditions set (entry cut-off 0.05: if the p-value of a feature falls below 0.05, that variable is added to the equation; removal cut-off 0.10: if the p-value of a feature rises above 0.10, that variable is dropped). The system has run 6 steps and finally gives coefficients whose p-values are all less than 0.05.
Yhat = 1427.657+136.661*ram-61.433*thickness +1.162*ppi+69.808*cpu core+6.396*internal mem+99.606*cpu freq
Excluded Variables
Coefficient Correlations:
Residual Statistics
Charts:
3.Backward Method:
4:Forward Method
Coefficients
Excluded variables
Dummy Variables for Qualitative Variables in Regression Model
Can we incorporate qualitative variables in regression model building? Yes. A regression model can have qualitative variables like marital status; say the categories could be 1) Married, 2) Unmarried, and 3) Divorced. If we simply add three separate columns with different betas, it is not suitable for analysis. Whenever we find qualitative variables, we have to convert them using dummy variables or indicator variables. Dummy variables are binary variables, usually taking the value 0 or 1.
Can we create n dummy variables? No. When we have n categories, we create n-1 dummy variables plus the intercept β0; this avoids the dummy variable trap. When creating dummy variables, we have to leave one category as the base (reference) category, whose effect is absorbed into the intercept β0. The coefficients attached to the remaining dummy variables are called differential intercept coefficients, and they measure the deviation from the base category for that specific dummy variable.
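A minimal pandas sketch; the column name and categories are taken from the marital-status example above:

import pandas as pd

df = pd.DataFrame({"marital_status": ["Married", "Unmarried", "Divorced", "Married"]})
# drop_first=True leaves one category out as the base (reference) category,
# giving n-1 dummy variables and avoiding the dummy variable trap
dummies = pd.get_dummies(df["marital_status"], drop_first=True)
print(dummies)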
Interaction Variable
It plays an important role in regression model building. You can derive a new variable from the existing variables in two ways.
First approach: by taking ratios. Say X1 (Current Assets) and X2 (Current Liabilities) are variables; then you can derive a new variable X3 (Current Ratio) as X1/X2 (division).
Second approach: adopting an interaction variable. Say Y = β0 + β1X1 + β2X2 + β3X1X2, where Y is wage, X1 is Gender, X2 is Work Experience, and X1X2 is the product of X1 and X2 (the interaction term). This is the product of a qualitative variable and a quantitative variable (multiplication of variables).
You can also set Female as 1 and Male as 0. In that case, male becomes the base category and the dummy coefficient measures the deviation of females from that base.
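A minimal sketch of building the interaction term; the data are hypothetical, with Gender coded 1 for female and 0 for male as described above:

import pandas as pd

df = pd.DataFrame({"gender": [1, 0, 1, 0],         # 1 = female, 0 = male
                   "experience": [2, 5, 10, 3]})   # hypothetical work experience in years
df["gender_x_experience"] = df["gender"] * df["experience"]   # interaction variable X1*X2
print(df)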