When building machine learning models, we often assume that our data follow a normal distribution. In reality this is rarely the case: real data may contain large outliers that produce a skewed distribution, skewness degrades the performance of the model we create, and so our predictions may go wrong. Scaling-to-shape techniques enable us to reduce the skewness and enhance the performance of our model.
Before going into the details of these techniques, we should first learn about continuous probability distributions.
Continuous Probability Distribution – normal distribution
A continuous probability distribution is a theoretical distribution. A probability distribution with the following probability density function (pdf) is called the normal distribution. Its curve is bell-shaped.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty$$
Where:
- x is continuous and is called the normal variate
- µ and σ are the two parameters of the normal distribution
- π ≈ 3.1416 and e ≈ 2.7183
- Mean: E(X) = µ
- Variance: VAR(X) = σ²
- Standard deviation: SD(X) = σ
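As a quick sketch in plain Python (the function name `normal_pdf` is my own), the pdf above can be evaluated directly; the density peaks at the mean, where f(µ) = 1/(σ√(2π)):

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density function of N(mu, sigma^2)."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Evaluate at the mean, where the density is largest
print(round(normal_pdf(0, 0, 1), 4))        # 0.3989 (standard normal at 0)
print(round(normal_pdf(65, 65, 22.36), 4))  # peak height for mu=65, sigma=22.36
```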


Characteristics of Normal Distribution
- It is a continuous probability distribution
- Its probability density function is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- µ and σ are its parameters
- It is bell-shaped and symmetric about its mean
- It is symmetrical on both sides (not skewed); skewness = 0
- The mean, median and mode are equal
- The mean divides the curve into two equal parts
- The quartile deviation QD ≈ (2/3)σ
- The mean deviation MD ≈ (4/5)σ
- The X-axis is an asymptote to the curve
- An asymptote is a straight line that touches the curve only at infinity, so the curve never touches the X-axis
Standard Normal Distribution:
A normal variate X with mean µ = 0 and standard deviation σ = 1 is called the standard normal variate, denoted by Z. Its probability density function (p.d.f.) is given by

$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \qquad -\infty < z < \infty$$

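A small numeric check in plain Python (my own sketch, using the trapezoidal rule over [−8, 8], beyond which the tails are negligible) confirms that this pdf encloses a total area of 1, as every valid density must:

```python
import math

def std_normal_pdf(z):
    """pdf of the standard normal distribution N(0, 1)."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

# Trapezoidal rule over [-8, 8]; the tails beyond that contribute ~1e-15.
a, b, n = -8.0, 8.0, 10_000
h = (b - a) / n
area = (std_normal_pdf(a) + std_normal_pdf(b)) / 2.0
area += sum(std_normal_pdf(a + i * h) for i in range(1, n))
area *= h
print(round(area, 6))  # 1.0
```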
Problem: Our data – the marks of 15 students – are given below.

The normal distribution and the standard normal distribution of this data are drawn below.

Normal Distribution shows mean as 65 and standard deviation as 22.36
Standard normal distribution shows mean as 0 and standard deviation as 1
Calculation of Z

$$Z = \frac{X - \mu}{\sigma} = \frac{X - 65}{22.36}$$
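The marks table itself is not reproduced here, so the sketch below uses a hypothetical sample of 15 marks chosen to match the stated mean (65) and sample standard deviation (22.36), then standardizes each mark via Z = (X − µ)/σ:

```python
import math

# Hypothetical marks (the original table is not shown); this evenly spaced
# sample reproduces mean = 65 and sample SD = 22.36.
marks = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

n = len(marks)
mu = sum(marks) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in marks) / (n - 1))  # sample SD

# Standardize: Z = (X - mu) / sigma
z = [(x - mu) / sigma for x in marks]

z_mean = sum(z) / n
z_sd = math.sqrt(sum((v - z_mean) ** 2 for v in z) / (n - 1))
print(round(mu, 2), round(sigma, 2))           # 65.0 22.36
print(round(abs(z_mean), 4), round(z_sd, 4))   # 0.0 1.0
```

After standardization the transformed marks have mean 0 and standard deviation 1, matching the standard normal distribution described above.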
Characteristics of Standard Normal Variate
- Z is a standard Normal variate. To find any probability of X we can use standard normal variate (Z)
- Any normal distribution can be converted into a standard normal distribution
- Area can be read from the table of areas under standard normal curve
- Let X be a normal variate with mean µ and standard deviation σ
- Then Z is a standard normal variate, found using

$$Z = \frac{X - \mu}{\sigma}$$
- The standard normal distribution is denoted by N(0, 1)
- Statisticians have tabulated standard normal table values
- Z varies from −∞ to +∞
- The mean of the standard normal distribution is 0 and its SD is 1
Advantage:
After converting X to the standard normal variate Z, you can find the probability of any value of X. Let us illustrate this with an example problem.
Problem:
The weight of Halwa packed by a filling machine follows a normal distribution with a mean weight of 500 gm and a standard deviation of 10 gm. A pack is selected at random.
- What is the probability that the pack's weight will exceed 515 gm?
- What is the probability that the pack's weight lies within 480 and 520 gm?
- What proportion of packs will weigh less than 480 gm or more than 520 gm?
If 10,000 packs are supplied, how many packs will be rejected, given that 480 gm and 520 gm are the lower and upper limits for acceptance?
Solution: X is a normal variate with parameters mean µ = 500 and standard deviation σ = 10. Therefore Z, the standard normal variate, is found using

$$Z = \frac{X - \mu}{\sigma} = \frac{X - 500}{10}$$

a) What is the probability that the pack's weight exceeds 515 gm?

$$P(X > 515) = P\left(Z > \frac{515 - 500}{10}\right) = P(Z > 1.5) = 1 - 0.9332 = 0.0668$$

b) What is the probability that the pack's weight lies within 480 and 520 gm?

$$P(480 < X < 520) = P\left(\frac{480 - 500}{10} < Z < \frac{520 - 500}{10}\right) = P(-2 < Z < 2) = 0.9772 - 0.0228 = 0.9544$$
Probability of Rejection:
If the weight lies outside these limits, the pack will be rejected. The probability of rejection = 1 − 0.9544 = 0.0456. The number of packets that will be rejected is N × P = 10000 × 0.0456 = 456.
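The whole calculation can be verified in plain Python using the error function (my own sketch; note that exact values of Φ give 0.9545 and about 455 rejects, versus the table-rounded 0.9544 and 456):

```python
import math

def phi(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 500.0, 10.0  # filling machine: mean 500 gm, SD 10 gm

p_over_515 = 1.0 - phi((515 - mu) / sigma)                     # a) P(X > 515)
p_within = phi((520 - mu) / sigma) - phi((480 - mu) / sigma)   # b) P(480 < X < 520)
p_reject = 1.0 - p_within                                      # c) rejection probability
expected_rejects = 10000 * p_reject

print(round(p_over_515, 4), round(p_within, 4), round(expected_rejects))  # 0.0668 0.9545 455
```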
Symmetrical and Skewed
Normal Distribution is always symmetrical. If the data follows a normal distribution, then its mean, median and mode will be equal to each other
Mean = Median = Mode
If the distribution is asymmetrical, we call it a skewed distribution. The normal distribution has a skewness of 0. Skewness tells us where most of the values are concentrated on an ascending scale.
Thumb Rule
A common thumb rule for interpreting the skewness value:
- Between −0.5 and 0.5: the distribution is approximately symmetric
- Between −1 and −0.5, or between 0.5 and 1: moderately skewed
- Less than −1 or greater than 1: highly skewed
Left Skewed and Right Skewed
In a left-skewed distribution the longer tail lies to the left of the peak; in a right-skewed distribution the longer tail lies to the right.
Summary

We have two types of Skewness:
Negative Skewness:
- If the skewness is less than 0, the distribution is negatively skewed.
- For negatively skewed data, most of the values are concentrated above the average value and the tail on the left side of the distribution is longer or flatter.
Positive Skewness:
- If the skewness is greater than 0, the distribution is positively skewed.
- For positively skewed data, most of the values are concentrated below the average value and the tail on the right side of the distribution is longer or flatter.
What does skewness tell us?
- Skewness indicates the direction and relative magnitude of a distribution's deviation from the normal distribution.
- Skewness considers the extremes (outliers) of the dataset rather than concentrating only on the average.
- Analysts therefore need to look at the extremes (outliers).
Why is skewed data not used in creating machine learning models?
Many machine learning models assume a normal distribution, but in reality the data points may not be perfectly symmetric. If the data are skewed, such a model will consistently underestimate the skewness risk. Outliers and skewed data undermine the accuracy of the model.
Skewed Data in the real world
Real-world examples with Right Skewed Data
a) Income distribution – Right Skewed
b) Happiness distribution
Real-world examples with Left Skewed Data
a) Retirement age
b) Test scores – GPA
In right-skewed data the mean is typically higher than the median and the mode.
In left-skewed data the mean is typically lower than the median and the mode.
Skewness measures (formulae)
Commonly used measures include Karl Pearson's coefficients of skewness,

$$S_k = \frac{\text{Mean} - \text{Mode}}{\sigma} \quad \text{or} \quad S_k = \frac{3(\text{Mean} - \text{Median})}{\sigma},$$

and the moment-based (Fisher–Pearson) coefficient

$$g_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$
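A minimal sketch in plain Python (the function name `skewness` is my own) of the moment-based skewness measure, showing that symmetric data score 0 while right- and left-skewed samples score positive and negative respectively:

```python
def skewness(data):
    """Moment-based (Fisher-Pearson) sample skewness g1."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))           # 0.0 (symmetric)
print(skewness([1, 1, 2, 2, 3, 10]) > 0)   # True (right-skewed: long right tail)
print(skewness([1, 8, 9, 9, 10, 10]) < 0)  # True (left-skewed: long left tail)
```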
When our data are skewed either left or right, we can use techniques such as log transformation, square-root transformation, square transformation and exponential transformation (log and square-root for right-skewed data; square and exponential for left-skewed data). These techniques enable you to reduce the skewness and maintain the accuracy of the model.
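As an illustration (my own sketch, using simulated log-normal data as a stand-in for incomes), applying a log transformation to right-skewed data pulls the skewness close to zero:

```python
import math
import random

def skewness(data):
    """Moment-based (Fisher-Pearson) sample skewness."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Simulated right-skewed data: log-normal values behave like incomes
random.seed(42)
incomes = [random.lognormvariate(10, 0.8) for _ in range(5000)]

# Log transformation pulls in the long right tail
log_incomes = [math.log(x) for x in incomes]

print(skewness(incomes) > 1)             # True: strongly right-skewed
print(abs(skewness(log_incomes)) < 0.2)  # True: roughly symmetric after log
```

The square-root transformation works the same way but more gently; square and exponential transformations do the reverse, stretching the right tail to correct left-skewed data.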
