Probability and Descriptive statistics

Bharadwaj Narayanam
Published in AlmaBetter
8 min read · Jul 9, 2021

How can we start talking about probability without introducing set theory? A set is a collection of well-defined objects.

Photo by Clayton Robbins on Unsplash

What do we mean by a well-defined object? It might be anything: a name, a number, True/False, a day of the week, and so on.

Is it mandatory that a set should consist of at least one element? No. A set might also be empty. The empty set even has a name: we call it a “NULL” or “VOID” set.

We also have the concept of a “Universal set”, which is best explained with an example. Suppose we have a school that includes all grades from one to ten. If we treat the students of each grade as a separate set, then the students of grade eight form one such set, and all the students in the school form the universal set.

Complement of a set

If we consider a set A that belongs to a universal set, the set of all elements that belong to the universal set but not to A is called the complement of A.

Union of sets

The union of two sets is the set of all elements that belong to the first set OR the second set (or both).

Intersection of sets

Can a student who belongs to grade six also belong to grade seven? NO, right? So the “Intersection” of any two grades would be empty. The intersection of two sets is the set of elements common to both of them.

If A and B have no common elements, then P(A ∩ B) is zero, so P(A ∪ B) = P(A) + P(B). In this scenario, A and B are called mutually exclusive sets.
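To make the set operations above concrete, here is a minimal sketch using Python's built-in sets. The student names and the grades they belong to are invented for illustration, and `prob` is a hypothetical helper that assumes every student is equally likely to be picked.

```python
from fractions import Fraction

# Hypothetical school roster: names and grades are invented for illustration.
universal = {"Asha", "Ben", "Chen", "Dia", "Eli", "Fay"}  # every student
grade_six = {"Asha", "Ben"}
grade_seven = {"Chen", "Dia"}

complement_six = universal - grade_six   # in the school, but not in grade six
either_grade = grade_six | grade_seven   # union: grade six OR grade seven
both_grades = grade_six & grade_seven    # intersection: empty for two grades


def prob(event):
    """Probability of picking a member of `event` uniformly from the school."""
    return Fraction(len(event), len(universal))


print(sorted(complement_six))
print(sorted(either_grade))
print(both_grades == set())  # the two grades are mutually exclusive
print(prob(either_grade) == prob(grade_six) + prob(grade_seven))
```

Because the intersection is empty, the probability of the union is just the sum of the two individual probabilities.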

Permutations and Combinations

This is probably one of the more confusing topics in the theory of probability. But it needn’t be if we think about it in a simple manner.

Let me talk about combinations first, as it is easy to understand permutations if we get well versed with combinations.

Ask yourself, what is a combination? Suppose you want to choose three friends to invite to your party from a group of five. In whichever order you invite them, the same three would turn up at your party. So a combination is a selection in which order doesn’t matter. The number of ways to choose “r” elements from a group of “n” elements is given by n!/(r!(n-r)!). For the party example, that is 5!/(3!2!) = 10.

Permutations are somewhat different: here the order always matters. The number of permutations of “r” elements chosen from a group of “n” elements is given by n!/(n-r)!

For any n and r, the number of possible permutations will always be greater than or equal to the number of possible combinations, because each combination of r elements can be arranged in r! different orders.
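The two counting formulas can be checked with a short sketch. It uses `math.comb` and `math.perm` from the standard library (Python 3.8 or later) and cross-checks them by listing every selection with `itertools`. The friend labels are placeholders.

```python
import math
from itertools import combinations, permutations

friends = ["A", "B", "C", "D", "E"]  # hypothetical group of five friends
n, r = len(friends), 3

n_combinations = math.comb(n, r)   # n! / (r! (n - r)!)
n_permutations = math.perm(n, r)   # n! / (n - r)!

# Cross-check the formulas by listing every selection explicitly.
listed_combinations = len(list(combinations(friends, r)))
listed_permutations = len(list(permutations(friends, r)))

print(n_combinations, listed_combinations)  # 10 10
print(n_permutations, listed_permutations)  # 60 60
```

Each of the 10 combinations can be ordered in 3! = 6 ways, which is where the 60 permutations come from.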

The dice experiment

It is the experiment most often used to introduce the concept of probability. Consider an unbiased die. It has six possible outcomes when rolled. So what is the probability of each possible outcome?

It would be 1/(no. of possible outcomes), which is 1/6. But what is the probability that either 2 or 5 turns up? It is time that we define the formula for the probability of an event. The probability of an event is defined as (no. of favourable outcomes)/(Total no. of outcomes). Here two outcomes are favourable, so the probability that either 2 or 5 turns up is 2/6 = 1/3.

Probability always lies between 0 and 1. If the probability of occurrence of an event is 0, then the event will not occur. But if the probability of occurrence of an event is 1, the event is certain.

We have already discussed the union of sets. Let us now look at how the probability of a union is formulated mathematically.

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Since we are adding the probabilities of A and B, the region A ∩ B is counted twice. Hence it is subtracted once from the sum of the probabilities of A and B. For mutually exclusive events, P(A ∩ B) is zero.
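Here is a small sketch of the die experiment and the union formula. The events "even" and "greater than three" are chosen only for illustration, and `prob` is a hypothetical helper that counts favourable outcomes over total outcomes.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of an unbiased die


def prob(event):
    """(no. of favourable outcomes) / (total no. of outcomes)."""
    return Fraction(len(event & sample_space), len(sample_space))


two_or_five = {2, 5}
even = {2, 4, 6}                # event A, chosen for illustration
greater_than_three = {4, 5, 6}  # event B, chosen for illustration

union_direct = prob(even | greater_than_three)
union_formula = (
    prob(even) + prob(greater_than_three) - prob(even & greater_than_three)
)

print(prob(two_or_five))            # 1/3
print(union_direct, union_formula)  # 2/3 2/3
```

The outcomes 4 and 6 belong to both events, so they would be counted twice without the subtraction.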

Conditional Probability

What is the probability that it will rain today, given it is a cloudy day? This question has a condition in it, and this type of probability is called conditional probability. The probability of an event given another event has already occurred is called conditional probability.

It is given as P(A|B) = P(A ∩ B)/P(B).
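As a rough illustration, the definition can be applied to a die roll: what is the probability that the outcome is even, given that it is greater than three? The events and the `conditional` helper below are assumptions made for this sketch.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of an unbiased die


def prob(event):
    return Fraction(len(event & sample_space), len(sample_space))


def conditional(a, b):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) is non-zero."""
    if prob(b) == 0:
        raise ValueError("P(B) must be non-zero")
    return prob(a & b) / prob(b)


even = {2, 4, 6}
greater_than_three = {4, 5, 6}

print(prob(even))                             # 1/2
print(conditional(even, greater_than_three))  # 2/3
```

Knowing that the roll is greater than three raises the probability of an even outcome from 1/2 to 2/3.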

Bayes Theorem

P(A|B) = (P(B|A).P(A))/P(B)

Bayes theorem can be derived from the definition of conditional probability.

P(A | B) = P(A ∩ B) / P(B) , if P(B) ≠ 0,

P(B | A) = P(B ∩ A) / P(A) , if P(A) ≠ 0,

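where P(A ∩ B) is the joint probability of both A and B occurring. Since the intersection is symmetric,

P(A ∩ B) = P(B ∩ A)

both definitions can be rearranged to express the same joint probability:

P(A ∩ B) = P(A | B).P(B) = P(B | A).P(A)

Dividing both sides by P(B) gives Bayes theorem:

P(A | B) = P(B | A).P(A) / P(B) , if P(B) ≠ 0.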
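A minimal sketch of the theorem, reusing the rain and clouds question from the conditional probability section. All three input probabilities are hypothetical numbers picked for the example, not measured data, and `bayes` is an illustrative helper.

```python
from fractions import Fraction

# Hypothetical inputs, not measured data.
p_rain = Fraction(1, 5)                 # P(A): it rains on a given day
p_cloudy_given_rain = Fraction(9, 10)   # P(B | A): cloudy, given that it rains
p_cloudy = Fraction(2, 5)               # P(B): the day is cloudy


def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B), defined only when P(B) is non-zero."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b


p_rain_given_cloudy = bayes(p_cloudy_given_rain, p_rain, p_cloudy)
print(p_rain_given_cloudy)  # 9/20
```

Under these assumed numbers, seeing clouds raises the probability of rain from 0.2 to 0.45.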

Now that we have walked through the basic concepts of probability, let us discuss descriptive statistics.

Types of data

Let us start with the types of data available. Data comes in two broad types: structured and unstructured.

Data that is organised and can be used to draw insights directly is called structured data. Unstructured data is usually free text, such as sentences and words, that has to be converted into structured data before it can be analysed. An example of unstructured data is a review of a movie. Let us discuss unstructured data some other day and concentrate on structured data now.

Structured data

Structured data is further classified into numerical and categorical data. As the name suggests, numerical data is any data that consists of numbers. There are two types of numerical data: “Continuous” and “Discrete”.

Continuous data can take any real value within a range, whereas Discrete data is countable, i.e., it typically takes only integer values.

Categorical data can be classified into three groups: “Dichotomous”, “Nominal” and “Ordinal”.

Dichotomous data consists of only two unique values such as (0/1), (True/False), (Yes/No) etc. Data whose categories have a natural order is called Ordinal data. Ranks of students, Ratings, Education level etc. are examples of Ordinal data. Nominal data has categories with no inherent order.

Measures of Central tendency

Central tendency describes the value around which the data points tend to cluster. Here we will discuss three measures that define the centre of the data points we’ve taken.

Mean

The mean is the average of all data points. It is defined as the ratio of the sum of all data points to the total number of them.

Mean = (Sum of all observations)/(Total number of observations)

When we have a frequency distribution of points, the weighted mean would be (1/n)ΣFiXi, where Xi is the value of a data point, Fi is its frequency, and n = ΣFi is the total number of observations.
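The weighted mean formula can be sketched as follows. The frequency table is made up for illustration, and the sketch checks that weighting by frequency gives the same answer as averaging the raw observations one by one.

```python
from fractions import Fraction

# Hypothetical frequency table: value Xi -> frequency Fi.
frequencies = {1: 2, 2: 3, 3: 5}

n = sum(frequencies.values())  # total number of observations
weighted_mean = Fraction(sum(f * x for x, f in frequencies.items()), n)

# Expand the table back into raw observations and average them directly.
observations = [x for x, f in frequencies.items() for _ in range(f)]
plain_mean = Fraction(sum(observations), len(observations))

print(weighted_mean, plain_mean)  # 23/10 23/10
```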

Let us consider a set {1,2,3,4,5,6,7,8,90}, the mean is (1+2+3+4+5+6+7+8+90)/9 = 14.

In the above example, 90 is an outlier. An outlier is a value that lies far away from the rest of the data. The concept of outliers is discussed in detail here. The mean is affected by the presence of outliers: without the 90, the mean of the remaining values would be 4.5.

Median

Median is the middle value of the observed data points. It is not affected by outliers. To find the median, sort the values and take the middle observation if the number of observations is odd, and the mean of the (n/2)th and (n/2 + 1)th observations if the number of observations is even.
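To see how differently the mean and the median react to an outlier, here is a short sketch using the standard `statistics` module on the set {1,2,3,4,5,6,7,8,90} from the mean section. Dropping the last element to remove the outlier is a choice made for this example.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 90]
without_outlier = data[:-1]  # drop the 90, leaving eight observations

print(statistics.mean(data))               # 14
print(statistics.median(data))             # 5 (odd count: middle value)
print(statistics.mean(without_outlier))    # 4.5
print(statistics.median(without_outlier))  # 4.5 (even count: mean of 4 and 5)
```

Removing the outlier moves the mean from 14 to 4.5, while the median only moves from 5 to 4.5.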

Mode

The observation that is repeated most often in the data is its mode. Data can also be multi-modal, which means it has more than one mode.
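A quick sketch with made-up data: `statistics.mode` returns a single mode, while `statistics.multimode` (Python 3.8 or later) returns every value tied for the highest count.

```python
import statistics

single_mode_data = [1, 2, 2, 3, 4]     # hypothetical data: 2 appears most often
multi_modal_data = [1, 2, 2, 3, 3, 4]  # hypothetical data: 2 and 3 are tied

print(statistics.mode(single_mode_data))       # 2
print(statistics.multimode(multi_modal_data))  # [2, 3]
```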

Measures of spread

Measures of spread indicate how far the data is spread out around its central point. Variance, Standard deviation, Range and IQR (interquartile range) are some of the measures of spread.
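The four measures named above can be computed with the standard library. This sketch reuses the set from the mean section and makes two assumptions worth stating: it treats the data as a whole population (hence `pvariance` and `pstdev`), and it uses the default "exclusive" method of `statistics.quantiles`. Other quartile conventions give slightly different IQR values.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 90]

data_range = max(data) - min(data)     # largest minus smallest value
variance = statistics.pvariance(data)  # mean squared distance from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

# Quartiles with the default "exclusive" method (an assumption of this sketch).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # spread of the middle half of the data

print(data_range)  # 89
print(variance, std_dev)
print(iqr)         # 5.0
```

The range and variance are dominated by the outlier 90, while the IQR ignores it.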

Random variable

A Random variable is a variable whose values are the outcomes of a random phenomenon. For example, if we roll a die, the possible outcomes are [1,2,3,4,5,6]. So the random variable, in this case, can take values from 1 to 6.

By convention, a random variable is denoted by a capital letter, while the values it can take are denoted by the corresponding small letter.

In other words, a random variable is a function that associates a number with each element in the sample space. (The sample space is the set of all possible outcomes of an experiment.)

Discrete Random variable

If the random variable takes values that are countable, it is called a discrete random variable. We can take the same example of the die. Here the random variable can take only the six values 1 to 6, which are countable, hence it is a discrete random variable.

Continuous Random variable

If the random variable takes continuous values, i.e., any real number within a range, it is called a continuous random variable. If we measure the heights of students in a class, the variable can take any value in a particular range. Hence it is a continuous random variable.

Probability Distribution Function (PDF)

What if I ask you to tell me the probability of 4 showing up when I roll a die? It is written as P(X=4), where X is the random variable. For discrete distributions, there’s something called the probability distribution function (more commonly known as the probability mass function), which tells us the probability that X will take the value x, also written as P(X=x).

So what is the probability that 4 is going to show up? Yes, you got it right: it is 1/6, since each of the six faces is equally likely.

Let me ask you something tricky now. What is the probability that the value that shows up is less than or equal to 3?

Cumulative Distribution Function

It will be the combined probability of 1 or 2 or 3 showing up, which is 1/6 + 1/6 + 1/6 = 1/2. We call it the Cumulative probability, written P(X ≤ 3), and it is given by the Cumulative Distribution Function.
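Both questions about the die can be answered with a few lines of code. In this sketch, `pmf` and `cdf` are illustrative names for the probability distribution function and the cumulative distribution function of a fair die.

```python
from fractions import Fraction

faces = range(1, 7)
pmf = {x: Fraction(1, 6) for x in faces}  # P(X = x) for a fair die


def cdf(x):
    """P(X <= x): add up P(X = k) for every face k not larger than x."""
    return sum((p for face, p in pmf.items() if face <= x), Fraction(0))


print(pmf[4])  # 1/6
print(cdf(3))  # 1/2
print(cdf(6))  # 1
```

The cumulative probability reaches 1 at the largest face, because some face is certain to show up.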

Probability Density Function

We have understood P(X=x) for a discrete random variable. But what about the probability that X will take the value x for continuous RVs?

Let us understand this with an example. You are measuring the heights of people in your college, which has around 3000 students. Let us assume that the minimum and maximum heights are 150 and 180 cm respectively. So, your Random variable can take any value from 150cm to 180cm, including fractional values. Can you guess how many points there are between 150cm and 180cm? There are infinitely many!

So, the Probability that X takes exactly the value x for a continuous RV is always ZERO.

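In this case, we compute something called density. We take a small interval (a bin) of values and calculate the probability that X falls inside that bin. Dividing that probability by the width of the bin gives the density, and the probability density function describes this density at every value of x.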

Cumulative Distribution Function for continuous RV

The CDF for a continuous random variable is almost the same as for a discrete RV, but here we cannot sum individual probabilities, as each of them is zero. Instead, we integrate the probability density function from the minimum value in the sample space up to the required value x, which gives the area under the curve. That area is P(X ≤ x).
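Here is a rough numerical sketch of that integration for the height example. It assumes, purely for simplicity, that heights are uniformly distributed between 150cm and 180cm (real heights are not), and it approximates the integral with a midpoint rule. The names `pdf` and `cdf` are illustrative.

```python
LOW, HIGH = 150.0, 180.0  # assumed minimum and maximum heights in cm


def pdf(x):
    """Density of the assumed uniform distribution on [LOW, HIGH]."""
    return 1.0 / (HIGH - LOW) if LOW <= x <= HIGH else 0.0


def cdf(x, steps=10_000):
    """Approximate P(X <= x) by integrating pdf from LOW to x (midpoint rule)."""
    upper = min(max(x, LOW), HIGH)
    width = (upper - LOW) / steps
    return sum(pdf(LOW + (i + 0.5) * width) for i in range(steps)) * width


print(round(cdf(165.0), 6))               # 0.5
print(round(cdf(170.0) - cdf(160.0), 6))  # 0.333333, i.e. P(160 <= X <= 170)
print(cdf(165.0) - cdf(165.0))            # 0.0, a single point has no area
```

Probabilities for a continuous RV always come from an interval: the area under the density between two values. A single point has zero width and therefore zero probability.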

If you want to refer to the PDFs and CDFs of frequently used distributions, kindly follow this link.

