Probability Distributions in Data Science and Machine Learning | Part 1

Abhishek Barai
Published in Analytics Vidhya
7 min read, Nov 24, 2020


Image source: analyticsindiamag.com

For an aspiring data scientist, statistics is a must-learn subject. Statistical methods help Data Scientists work through complex real-world problems, mine useful trends, changes, and data behavior, and choose an appropriate model for the data. Every time we get a new dataset, we must understand the data pattern and the underlying probability distribution for further optimization and treatment during Exploratory Data Analysis (EDA). During EDA, we try to characterize the behavior of the data using different probability distributions. If the data follows or closely resembles one of these distributions, we can treat it accordingly to get a better result.

Data Scientists deal with many kinds of data, such as categorical, numerical, text, image, voice, and many more. Each has its own methods of analysis and representation. Here we consider numerical data, which can be of two types.

  1. Discrete: It can take only specific, countable values. For example, the number of employees in a company, or the result of rolling a die, which is one of the integers from 1 to 6.
  2. Continuous: It can take any value within a range. For example, the height or weight of a person can be a value like 45.6 or 87.9.

We can plot numerical data, visualize it, and draw conclusions based on its pattern, behavior, and the type of probability distribution it follows. Before going deeper, let’s get familiar with some terminology.

Q. What is a Probability Distribution?

Ans: A probability distribution is a function that describes all the possible values a random variable can take within a range and how likely each of those values is.

Q. What is a Random Variable?

Ans: A random variable is a variable whose value is determined by the outcome of a chance experiment. Its value is unknown in advance and is observed by running the experiment. It can be discrete (when the outcomes are specific, countable values) or continuous (when the outcomes can fall anywhere within a range).

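Q. What is a Probability Mass Function (PMF)?

Ans: The distribution of a discrete random variable is described by its probability mass function (PMF). The PMF of a discrete random variable x is defined as,

p(x) = P(X = x), where p(x) ≥ 0 for every x and the values of p(x) sum to 1 over all possible x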
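As a minimal sketch of what a PMF looks like in code, assume a fair six-sided die (an illustrative choice; the names `die_pmf` and `prob` are made up for this example). The PMF is just a table of values and their probabilities, and the probabilities add up to 1.

```python
from fractions import Fraction

# PMF of a fair six-sided die: every face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event, pmf):
    """Probability that the outcome lies in `event` (a set of values)."""
    return sum(p for x, p in pmf.items() if x in event)

total = sum(die_pmf.values())        # a valid PMF sums to 1
p_even = prob({2, 4, 6}, die_pmf)    # P(X is even)
print(total, p_even)                 # prints: 1 1/2
```

Using `Fraction` keeps the arithmetic exact, so the check that the PMF sums to 1 does not depend on floating-point rounding.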

Q. What is a Probability Density Function (PDF)?

Ans: The distribution of a continuous random variable is described by its probability density function (PDF). For a variable x with PDF f(x), the probability that x falls in an interval from a to b is defined as,

P(a ≤ x ≤ b) = ∫ f(x) dx, integrated from a to b, where f(x) ≥ 0 and the total area under f(x) is 1
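For a continuous variable, probabilities come from the area under the density. The sketch below assumes an exponential density with rate 1 (chosen only because its integral has a simple closed form) and a basic midpoint rule for the integration; neither comes from the article.

```python
import math

def exp_pdf(x, rate=1.0):
    """Density of an exponential distribution (an illustrative choice)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

p_interval = integrate(exp_pdf, 0.5, 2.0)    # P(0.5 <= X <= 2) as an area
exact = math.exp(-0.5) - math.exp(-2.0)      # closed form, for comparison
print(round(p_interval, 6), round(exact, 6))
```

Note that a density value f(x) is not itself a probability; only the area over an interval is.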

Discrete Probability Distributions:

Several discrete probability distributions are commonly used in statistics and data science, such as:

  1. Bernoulli
  2. Binomial
  3. Uniform
  4. Poisson
  5. Geometric etc.

Bernoulli Distribution:

A Bernoulli distribution describes a single Bernoulli trial, which has only two possible outcomes: success or failure. For example, tossing a coin can only yield two outcomes: heads or tails.

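Let the probability of success be p; then the probability of failure is (1-p). So the function can be defined as,

p(x) = p^x (1-p)^(1-x), for x = 1 (success) or x = 0 (failure), so that p(1) = p and p(0) = 1-p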

For a single toss of an unbiased coin, the probability of getting heads is p = 0.5, since both outcomes are equally likely. Then (1-p) = 0.5. So,

P(x=1) = p(1) = p = 1/2
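The coin example can be written as a small function. This is a sketch, with `bernoulli_pmf` as a made-up helper name; it also checks the standard results that a Bernoulli variable has mean p and variance p(1-p), here with p = 0.3.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a single Bernoulli trial with success probability p."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli outcome is 0 (failure) or 1 (success)")
    return p if x == 1 else 1 - p

fair_coin = [bernoulli_pmf(x, 0.5) for x in (0, 1)]   # failure, success
biased = [bernoulli_pmf(x, 0.3) for x in (0, 1)]      # failure, success

# Mean and variance computed directly from the two-point PMF.
mean = sum(x * bernoulli_pmf(x, 0.3) for x in (0, 1))                     # p
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, 0.3) for x in (0, 1))   # p * (1 - p)
print(fair_coin, biased, mean, variance)
```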

Distribution Plot (figure): probability of success = 0.3 and failure = 0.7 for a single trial

Binomial Distribution:

As we saw, the Bernoulli distribution is based on the outcome of a single experiment. Suppose an unbiased coin is tossed 10 times. What is the probability of getting heads at least 7 times? This is where the binomial distribution comes into the picture.

A binomial distribution gives the probability of observing a given number of successes when a success-or-failure experiment or survey is repeated a fixed number of times.

Q. Under which conditions is a binomial distribution a Bernoulli distribution?

  1. The number of trials(n) should be 1.

Assumptions:

  • The experiment is performed under the same set of conditions for every trial. For example, if the probability of success (p) is 0.5, it stays 0.5 throughout the trials.
  • For each trial, there are only two possible outcomes: success or failure.
  • The probabilities of success and failure always sum to 1 (p + q = 1).
  • Each trial is independent of the others.

Definition:

A random variable x is said to follow a binomial distribution if it takes only the non-negative integer values 0, 1, ..., n and its probability mass function is given by,

P(X = x) = C(n, x) p^x q^(n-x), for x = 0, 1, ..., n, where q = 1-p and C(n, x) = n! / (x! (n-x)!)

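Now the probability of getting at least 7 heads in 10 tosses of an unbiased coin would be,

P(X ≥ 7) = [C(10, 7) + C(10, 8) + C(10, 9) + C(10, 10)] (1/2)^10 = (120 + 45 + 10 + 1) / 1024 = 176/1024 ≈ 0.172

where C(10, k) is the number of ways to choose which k of the 10 tosses are heads.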

Parameters:

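n = number of independent trials

p = probability of success in each trial (q = 1-p is the probability of failure)

The mean of a binomial distribution is np and its variance is npq.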
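The coin question from this section (at least 7 heads in 10 tosses) can be checked numerically. The sketch below uses `math.comb` from the Python standard library; `binomial_pmf` is a helper name introduced here, not something from the article.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5
p_at_least_7 = sum(binomial_pmf(k, n, p) for k in range(7, n + 1))
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))   # equals n * p

print(p_at_least_7)   # 0.171875
print(mean)
```

So there is roughly a 17% chance of seeing 7 or more heads in 10 tosses of a fair coin.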

Distribution Plot (figure): binomial distributions for different probabilities of success

Properties:

  • A binomial distribution is skewed unless p=q=1/2.
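  • The mean is np and the variance is npq, so for 0 < p < 1 the variance is always smaller than the mean.
  • The sum of independent binomial variates is, in general, not a binomial variate; it is binomial only when all the variates share the same probability of success p.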
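The behaviour of sums of binomial variates can be checked by convolving two PMFs. In this sketch (the values 3, 4, 0.2, 0.3 and 0.8 are arbitrary illustrations), the sum of Bin(3, 0.3) and Bin(4, 0.3) matches Bin(7, 0.3), while the sum of Bin(3, 0.2) and Bin(4, 0.8) does not match the only binomial that could fit it, namely the one with n = 7 and the same mean.

```python
from math import comb, isclose

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def pmf_of_sum(n1, p1, n2, p2):
    """PMF of X + Y for independent X ~ Bin(n1, p1) and Y ~ Bin(n2, p2)."""
    return [
        sum(binomial_pmf(i, n1, p1) * binomial_pmf(k - i, n2, p2)
            for i in range(max(0, k - n2), min(n1, k) + 1))
        for k in range(n1 + n2 + 1)
    ]

# Same p: the sum is again binomial, with n = 3 + 4.
same_p = pmf_of_sum(3, 0.3, 4, 0.3)
target = [binomial_pmf(k, 7, 0.3) for k in range(8)]
same_p_is_binomial = all(isclose(a, b) for a, b in zip(same_p, target))

# Different p: compare with Bin(7, p_star), where p_star matches the mean.
mixed = pmf_of_sum(3, 0.2, 4, 0.8)
p_star = (3 * 0.2 + 4 * 0.8) / 7
candidate = [binomial_pmf(k, 7, p_star) for k in range(8)]
mixed_is_binomial = all(isclose(a, b) for a, b in zip(mixed, candidate))

print(same_p_is_binomial, mixed_is_binomial)   # True False
```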

Q. Under which conditions can a binomial distribution be approximated by a normal distribution?

  1. The number of independent trials should be indefinitely large, n → ∞.
  2. Neither p nor q should be small.
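A rough numerical check of this approximation, assuming n = 100 and p = 0.5 (illustrative values). The normal curve uses the binomial mean np and standard deviation sqrt(npq), and the sketch adds a continuity correction of 0.5, a common refinement that is an assumption of this example rather than something stated above.

```python
from math import comb, erf, sqrt

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_cdf(x, mean, std):
    """P(Z <= x) for a normal variable with the given mean and std."""
    return 0.5 * (1 + erf((x - mean) / (std * sqrt(2))))

n, p = 100, 0.5
mean, std = n * p, sqrt(n * p * (1 - p))

exact = binomial_cdf(55, n, p)           # exact binomial probability
approx = normal_cdf(55.5, mean, std)     # normal approximation, +0.5 correction
print(round(exact, 4), round(approx, 4))
```

With n this large and p far from 0 and 1, the two values agree to about three decimal places.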

Uniform Distribution:

The uniform distribution for a discrete random variable is a symmetric probability distribution in which each of a finite number of values is equally likely. For example, when we roll a fair die or toss an unbiased coin, every outcome is equally likely.

For a random variable x that can take n distinct values, the uniform distribution function can be defined as,

P(X = x) = 1/n, for each of the n possible values of x

For example, rolling an unbiased die gives 6 possible values: {1,2,3,4,5,6}. Each value is equally likely.

So, P(X = x) = 1/6 (the probability of getting any one value)

Parameters:

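For a discrete uniform distribution over the consecutive integers a, a+1, ..., b, the mean is (a + b) / 2 and the variance is ((b - a + 1)^2 - 1) / 12. For a die (a = 1, b = 6), the mean is 3.5 and the variance is 35/12 ≈ 2.92.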
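The mean and variance of the die example can be computed straight from the PMF. This sketch uses exact fractions and compares the result with the standard closed-form expressions for a uniform distribution on consecutive integers a..b.

```python
from fractions import Fraction

faces = range(1, 7)                  # outcomes of a fair die
p = Fraction(1, len(faces))          # each outcome has probability 1/6

mean = sum(x * p for x in faces)
variance = sum((x - mean) ** 2 * p for x in faces)

# Closed forms for a discrete uniform distribution on a, a+1, ..., b.
a, b = 1, 6
closed_mean = Fraction(a + b, 2)
closed_variance = Fraction((b - a + 1) ** 2 - 1, 12)

print(mean, variance)                # prints: 7/2 35/12
```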


Poisson Distribution:

The Poisson distribution is a discrete distribution named after the French mathematician Siméon Denis Poisson. He developed this method in 1830 to describe the number of times a gambler would win a rarely won game of chance in a large number of tries.

Basically, it shows how many times an event is likely to occur within a specified period of time. Because the random variable is discrete, it counts whole occurrences of the event (0, 1, 2, and so on).

Definition:

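A random variable x is said to follow a Poisson distribution when it assumes only non-negative integer values and its probability function is given by,

P(X = x) = e^(-λ) λ^x / x!, for x = 0, 1, 2, ...

λ = Poisson parameter (λ > 0, the average number of occurrences in the interval)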
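As a small sketch of the Poisson probability function in use, assume a service center that receives 4 calls per hour on average (a hypothetical rate, not a figure from the article). The probability of at most 2 calls in an hour is the sum of the first three PMF values.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 4.0   # hypothetical average number of calls per hour
p_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))   # k = 0, 1, 2
print(round(p_at_most_2, 4))
```

The sum works out to 13 e^(-4), since the three terms are e^(-4) times 1, 4 and 8.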

It is a one-parameter, univariate distribution. It is also a limiting case of the binomial distribution.

Q. Under which conditions does a binomial distribution tend to a Poisson distribution?

  1. The number of trials (n) should be very large, n → ∞.
  2. The constant probability of success for each trial should be very small, p → 0.
  3. The mean should stay finite and equal to the Poisson parameter, np = λ.
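These conditions can be seen numerically. The sketch below fixes λ = 2 (an arbitrary choice), sets p = λ/n so that np stays equal to λ, and measures the largest gap between the binomial and Poisson probabilities as n grows.

```python
import math
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0
max_gap = {}
for n in (10, 100, 10_000):
    p = lam / n                      # keeps the mean n * p fixed at lam
    max_gap[n] = max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11)
    )

for n, gap in max_gap.items():
    print(n, f"{gap:.6f}")
```

The gap shrinks as n increases and p decreases, which is the limiting behaviour described by the three conditions.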

Examples:

Many real-life datasets that we encounter as data scientists approximately follow the Poisson distribution, such as:

  1. The number of fraudulent transactions that occur in a month at a particular bank.
  2. The number of insincere questions posted on Quora every day.
  3. The number of customers who call the company service center about a service problem in a given period.

It is important to know what kind of distribution our dataset follows, because that guides the choices we make when modeling the data.

Parameters:

For the Poisson distribution, the mean and the variance are the same; both equal the Poisson parameter λ.
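A quick numerical check that the mean and variance coincide, assuming λ = 3.5 (an arbitrary value). The infinite sums are truncated at k = 99, where the remaining tail is negligible for this λ.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.5
ks = range(100)   # truncation point; the tail beyond it is negligible here

mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(round(mean, 6), round(variance, 6))
```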


Geometric Distribution:

Suppose that after an election we are surveying how many votes an independent candidate received. Outside a polling booth, we start asking people whom they voted for, and each time we hear the name of another candidate. Finally, we find a person who says they voted for the independent candidate. Here, the geometric distribution describes the number of people we had to poll before finding someone who voted for our candidate.

Basically, it represents the number of failures before the first success in a series of Bernoulli trials (each of which always has two outcomes).
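The polling story can be simulated directly. This sketch assumes that 10% of voters chose the independent candidate (a hypothetical figure) and repeats the survey many times with a seeded random generator, counting the failures before the first success each time.

```python
import random

def failures_before_success(p, rng):
    """Run Bernoulli trials until the first success; return the failure count."""
    failures = 0
    while rng.random() >= p:    # a draw below p counts as a success
        failures += 1
    return failures

rng = random.Random(0)          # seeded so the run is repeatable
p = 0.1                         # hypothetical share of voters for the candidate
samples = [failures_before_success(p, rng) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)
print(round(sample_mean, 2))    # should land near (1 - p) / p = 9
```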

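We can define the function as,

P(X = k) = (1-p)^k p, for k = 0, 1, 2, ..., where p is the probability of success in each trial and k is the number of failures before the first success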

Assumptions:

  • There are two possible outcomes for each trial (success or failure).
  • The trials are independent of each other.
  • The probability of success is the same for each trial.
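Under these assumptions, the probability of exactly k failures before the first success is (1-p)^k p. The sketch below again assumes p = 0.1 (hypothetical) and counts failures, as described above; note that some references count the total number of trials instead, which shifts every value by one.

```python
def geometric_pmf(k, p):
    """P(X = k): exactly k failures before the first success."""
    return (1 - p) ** k * p

p = 0.1
p_three_failures = geometric_pmf(3, p)    # three other answers, then a success

ks = range(2000)                          # truncation; the tail is negligible
total = sum(geometric_pmf(k, p) for k in ks)
mean = sum(k * geometric_pmf(k, p) for k in ks)                      # (1 - p) / p
variance = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in ks)    # (1 - p) / p**2
print(round(p_three_failures, 4), round(mean, 4), round(variance, 4))
```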

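Parameters:

With p as the probability of success in each trial and x counting the failures before the first success, the mean is (1-p)/p and the variance is (1-p)/p^2.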


Note: There are many other discrete probability distributions, such as the multinomial, negative binomial, and hypergeometric. These distributions are also important in statistics, and it is good to know them from a data science perspective. But I will complete the discrete part here with the above 5 distributions.
