Normal Distribution and Machine Learning

Abhishek Barai · Published in Analytics Vidhya · Nov 19, 2020 · 6 min read


Normal Distribution is an important concept in statistics and the backbone of Machine Learning. A Data Scientist needs to know about the Normal Distribution when working with Linear Models (which perform well if the data is normally distributed), the Central Limit Theorem, and exploratory data analysis.

As discovered by Carl Friedrich Gauss, the Normal Distribution (or Gaussian Distribution) is a continuous probability distribution. It has a bell-shaped curve that is symmetric about the mean, with each half of the curve a mirror image of the other.

[Figure: the bell-shaped normal curve, symmetric about the mean]

Mathematical Definition:

A continuous random variable “x” is said to follow a normal distribution with parameters μ (mean) and σ (standard deviation) if its probability density function is given by,

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty $$

This is also called a normal variate.

Standard Normal Variate:

If “x” is a normal variate with mean (μ) and standard deviation (σ), then

$$ z = \frac{x - \mu}{\sigma} $$

where z = standard normal variate
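As a quick numeric check (my sketch, not from the original article), standardizing any sample with this formula yields a mean of approximately 0 and a standard deviation of approximately 1:

```python
# Sketch: applying z = (x - mu) / sigma to an arbitrary normal sample;
# the transformed values have mean ~0 and standard deviation ~1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=10_000)

z = (x - x.mean()) / x.std()
print(round(z.mean(), 3), round(z.std(), 3))  # ~0.0, ~1.0
```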

Standard Normal Distribution:

The simplest case of the normal distribution, known as the Standard Normal Distribution, has a mean (μ) of 0 and a standard deviation (σ) of 1, and is described by this probability density function,

$$ f(z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^2}{2}} $$

Distribution Curve Characteristics:

  1. The total area under the normal curve is equal to 1.
  2. It is a continuous distribution.
  3. It is symmetrical about the mean. Each half of the distribution is a mirror image of the other half.
  4. It is asymptotic to the horizontal axis.
  5. It is unimodal.

Area Properties:

The normal distribution can be completely specified by two parameters: the mean and the standard deviation. If the mean and standard deviation are known, the whole curve is determined, and the probability of any region under it can be computed.

The empirical rule is a handy quick estimate of the data's spread given the mean and standard deviation of a data set that follows a normal distribution. It states that:

  • 68.26% of the data falls within 1 sd of the mean (μ±1σ)
  • 95.44% of the data falls within 2 sd of the mean (μ±2σ)
  • 99.73% of the data falls within 3 sd of the mean (μ±3σ)
  • 95% of the data falls within μ±1.96σ
  • 99% of the data falls within μ±2.58σ
[Figure: areas under the normal curve at μ±1σ, μ±2σ, and μ±3σ]

Thus, almost all the data lies within 3 standard deviations of the mean. This rule enables us to check for outliers and is very helpful when determining the normality of any distribution.
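As a rough illustration (a minimal sketch on synthetic data, not code from the article), the 3-sigma rule translates directly into an outlier filter:

```python
# Minimal 3-sigma outlier check on a synthetic normal sample.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=1_000)

mu, sigma = data.mean(), data.std()
outliers = data[np.abs(data - mu) > 3 * sigma]
print(f"{outliers.size} of {data.size} points fall outside mu +/- 3 sigma")
```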

Application in Machine Learning:

In Machine Learning, data that satisfies a Normal Distribution is beneficial for model building: it makes the math easier. Models like LDA, Gaussian Naive Bayes, Logistic Regression, and Linear Regression are derived under the explicit assumption that the distribution is bivariate or multivariate normal. Also, sigmoid functions work most naturally with normally distributed data.

Many natural phenomena, such as financial and forecasting data, follow a log-normal distribution. By applying transformation techniques, we can convert such data into a normal distribution. Many other processes also follow normality, such as measurement errors in an experiment or the position of a particle undergoing diffusion.
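A minimal sketch of such a transformation, using synthetic log-normal data as a stand-in for, say, prices; the log transform pulls the right-skewed data toward normality:

```python
# Log-transforming right-skewed (log-normal-like) data toward normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
prices = rng.lognormal(mean=0.0, sigma=0.8, size=5_000)  # heavily right-skewed
logged = np.log(prices)                                  # roughly normal

print(f"skew before: {stats.skew(prices):.2f}, after: {stats.skew(logged):.2f}")
```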

So it’s better to critically explore the data and check the underlying distribution of each variable before fitting the model.

Note: Normality is an assumption for some ML models; it is not mandatory that data always follow it. ML models can work very well on non-normally distributed data too. Models like decision trees and XGBoost assume no normality and work on raw data as well. Also, linear regression is statistically effective provided only the model errors are Gaussian, not the entire dataset.

Here I have analyzed the Boston Housing Price Dataset. I explain the visualization techniques and the conversion techniques, along with plots that can validate the normality of a distribution.
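The article does not show its data-loading code; below is one hedged way to obtain the dataset, assuming it is available on OpenML (scikit-learn's built-in load_boston loader was removed in v1.2). The DataFrame df is reused in the sketches that follow:

```python
# Load the Boston Housing data from OpenML into a pandas DataFrame.
from sklearn.datasets import fetch_openml

boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame        # 506 rows: 13 features plus the target MEDV
print(df.shape)
print(df.dtypes)         # CHAS (and possibly RAD) may arrive as categorical
```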

Visualization Techniques:

13 numerical features and 1 categorical feature (chas) are present.

Histograms: A histogram is a kind of bar graph that estimates the probability distribution of a continuous variable. It divides the numerical data into uniform bins, which are consecutive, non-overlapping intervals of the variable, and counts the observations falling in each bin.

histogram of all numerical features
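A sketch that would produce such a grid of histograms, assuming the df loaded above:

```python
# Histograms of every numeric column in the Boston DataFrame.
import matplotlib.pyplot as plt

df.select_dtypes(include="number").hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()
```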

kdeplot: A Kernel Density Estimate (KDE) plot depicts the probability density function of continuous data with a smooth, non-parametric estimate; it can be drawn for a single variable or for multiple variables together.

kdeplot of all numerical features
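A corresponding sketch with seaborn, again assuming the df from the loading step:

```python
# KDE plots of each numeric feature, one panel per column.
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = df.select_dtypes(include="number").columns
fig, axes = plt.subplots(4, 4, figsize=(14, 10))
for ax, col in zip(axes.flat, numeric_cols):
    sns.kdeplot(df[col], ax=ax, fill=True)
    ax.set_title(col)
for ax in axes.flat[len(numeric_cols):]:
    ax.set_visible(False)  # hide unused panels
plt.tight_layout()
plt.show()
```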

Feature Analysis:

Let’s take the feature rm (average number of rooms per dwelling) as an example, since it closely resembles a normal distribution.

Though it has some distortion in the right tail, we need to check how closely it resembles a normal distribution. For that, we need to check the Q-Q plot.

When the quantiles of two variables are plotted against each other, the resulting plot is known as a quantile-quantile plot, or qqplot. This plot provides a summary of whether the distributions of the two variables are similar with respect to their locations.

Note: “rm” feature is standardized before plotting qqplot
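One way to reproduce this plot (my sketch; the OpenML copy of the dataset capitalizes the column as 'RM', so adjust the name if your copy differs):

```python
# Standardize RM and draw a Q-Q plot against the normal distribution.
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import StandardScaler

rm_scaled = StandardScaler().fit_transform(df[["RM"]]).ravel()
stats.probplot(rm_scaled, dist="norm", plot=plt)
plt.title("Q-Q plot of standardized RM")
plt.show()
```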

Here we can clearly see that the feature is not perfectly normally distributed, but it closely resembles one. We can conclude that standardizing this feature (with StandardScaler) before feeding it to a model can generate a good result.

Central Limit Theorem and Normal Distribution:

The CLT states that when we sum a large number of independent random variables, irrespective of those variables' original distribution, their normalized sum tends towards a Gaussian distribution.
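A quick simulation (synthetic and purely illustrative) makes the theorem concrete: normalized sums of uniform random variables, which are individually far from Gaussian, land almost exactly on the standard normal curve:

```python
# CLT demo: normalized sums of uniform random variables approach N(0, 1).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sums = rng.uniform(size=(10_000, 50)).sum(axis=1)  # 10,000 sums of 50 uniforms
normalized = (sums - sums.mean()) / sums.std()

plt.hist(normalized, bins=60, density=True, alpha=0.6, label="normalized sums")
grid = np.linspace(-4, 4, 200)
plt.plot(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi), label="N(0, 1) pdf")
plt.legend()
plt.show()
```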

Machine Learning models generally treat training data as a mix of deterministic and random parts. Let the dependent variable (Y) consist of these parts. Models aim to express the dependent variable (Y) as some function of several independent variables (X). If that function is a sum (or can be expressed as a sum of some other functions) and the number of X variables is really high, then by the CLT, Y should have a normal distribution.

Here, ML models try to express the deterministic part as a sum of functions of the deterministic independent variables (X):

deterministic + random = func(deterministic(1)) +…+ func(deterministic(n)) + model_error

If the whole deterministic part of Y is explained by X, then the model_error captures only the random part and should have a normal distribution.

So if the error distribution is normal, we may conclude that the model is successful. Otherwise, either some features that have a large influence on Y are missing from the model, or the model itself is incorrect.
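A hedged sketch of this residual check on the df loaded earlier (numeric columns only), using a Shapiro-Wilk test and a Q-Q plot; neither is from the original article:

```python
# Fit a linear model on the Boston data and test residual normality.
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

X = df.select_dtypes(include="number").drop(columns="MEDV")
y = df["MEDV"]
residuals = y - LinearRegression().fit(X, y).predict(X)

stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p:.4f}")  # small p => residuals deviate from normal
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```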
