Z Score in detail

Bharadwaj Narayanam
Published in Analytics Vidhya
6 min read · Aug 30, 2020


Let us understand what a Z score is, its applications, and how it is used to compare observations that are on different scales. (Scale is closely related to the range of the data, though it is not exactly the same thing. Two observations with the same raw value on different scales will generally end up with different values once both are brought to a common scale.)

Photo by Carlos Muza on Unsplash

We are going to discuss the Z score in depth. But right before that, we need to understand what a normal distribution and a standard normal distribution are.

What is a distribution?

Okay, for the people new to statistics, I want you to know what exactly a distribution is. Think about it in a simple manner.

Today is your birthday! And you’re ‘distributing’ candies in your class. How would you do that? Let me tell you how I would distribute them. I would give three candies to two of my best friends, two to 10 of my friends and the remaining guys get only one candy.

Well, that’s what a distribution is!

A distribution in statistics is a function that shows the possible values for a variable and how often they occur.
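To make the definition concrete, here is a minimal sketch that tallies the candy example as a frequency distribution. The 2 best friends and the 10 friends come from the story; the number of remaining classmates is not given, so the 18 used here is purely an assumption.

```python
from collections import Counter

# Candies handed to each classmate. The 2 best friends and the 10 friends come
# from the story; the 18 "remaining" classmates are an assumed number.
candies_per_person = [3] * 2 + [2] * 10 + [1] * 18

# Frequency distribution: each possible value and how often it occurs
frequency = Counter(candies_per_person)
total = len(candies_per_person)

for value in sorted(frequency):
    share = frequency[value] / total
    print(f"{value} candies: {frequency[value]} people ({share:.0%})")
```

Each line of the output pairs a possible value (candies received) with how often it occurs, which is exactly what a distribution describes.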

All these distributions have “probability density functions” (for continuous variables) or “probability mass functions” (for discrete variables), but for now, that’s not our cup of tea. We’re here to learn the basics of distributions.

Great! You’ve learnt what a distribution is. So now let’s try to understand the normal distribution.

What is a Normal distribution?

The normal distribution is a bell-shaped distribution which is symmetric about the mean (well, the mean is nothing but the average of all the observations). Most of the observations in a normal distribution cluster around the mean, and they become rarer the farther we move away from it.
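Both properties can be checked with `statistics.NormalDist` from Python's standard library (Python 3.8+). This is only a sketch: the mean of 50 and standard deviation of 10 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 50, standard deviation 10
dist = NormalDist(mu=50, sigma=10)

# Symmetry: the density is the same at equal distances on either side of the mean
left = dist.pdf(50 - 7)
right = dist.pdf(50 + 7)
print(left, right)

# Clustering: share of observations within one standard deviation of the mean
within_one_sd = dist.cdf(60) - dist.cdf(40)
print(round(within_one_sd, 4))
```

The two densities match, and roughly 68% of the observations fall within one standard deviation of the mean.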

I hope you got a brief idea about distributions and normal distribution. Let us have a look at standard normal distribution, which is pretty simple. It is just a special case of normal distribution.

What is a standard normal distribution?

The standard normal distribution is a normal distribution whose mean is 0 and whose standard deviation is 1.
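Any normally distributed data can be brought to this standard form by subtracting the mean and dividing by the standard deviation. A small sketch, assuming a hypothetical normal distribution with mean 50 and standard deviation 10 and a fixed seed so the run is repeatable:

```python
from statistics import NormalDist, fmean, pstdev

# Hypothetical normal distribution with mean 50 and standard deviation 10
samples = NormalDist(mu=50, sigma=10).samples(1000, seed=42)

# Shift by the sample mean and divide by the sample's population standard deviation
m = fmean(samples)
s = pstdev(samples)
standardized = [(x - m) / s for x in samples]

print(round(fmean(standardized), 6))   # very close to 0
print(round(pstdev(standardized), 6))  # very close to 1
```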

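A Z score can be calculated for any set of observations with a non-zero standard deviation. However, translating a Z score into a probability or percentile using the standard normal distribution is only valid when the observations follow a normal distribution, at least approximately.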

What is a Z score?

A Z-score is a numerical measurement that describes a value’s relationship to the mean of a group of values.

Z-score is measured in terms of standard deviations from the mean.

Example

Suppose there are three students whose marks in their English examination are 12, 16 and 23. The mean is 17.

Apart from the mean, I’ve used another term in the definition of the Z score: the standard deviation.

What is Standard deviation?

Standard deviation is a quantity expressing how much the members of a group differ from the mean value of the group. In the above example, the mean is 17 and the observations are 12, 16 and 23. How do we calculate the standard deviation? Let me write some simple Python code for that!

import math

marks = [12, 16, 23]
mean = sum(marks) / len(marks)
print('Mean:', mean)

# Here comes the standard deviation
# First let us calculate the squared deviations of the observations from their mean
marks_dev = [(x - mean) ** 2 for x in marks]
# Population standard deviation: square root of the average squared deviation
st_dev = math.sqrt(sum(marks_dev) / len(marks_dev))
print('Standard deviation:', round(st_dev, 2))
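The calculation above divides by the number of observations, which gives the population standard deviation. As a cross-check, the standard library's `statistics` module offers both that version (`pstdev`) and the sample version (`stdev`), which divides by n - 1 instead:

```python
from statistics import pstdev, stdev

marks = [12, 16, 23]

# Population standard deviation: divides the squared deviations by n
print(round(pstdev(marks), 2))  # 4.55

# Sample standard deviation: divides by n - 1, so it is larger
print(round(stdev(marks), 2))   # 5.57
```

Which one you use changes the Z scores slightly, so it is worth being consistent within one analysis.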

How to calculate the Z score?

z = (data point - mean) / standard deviation

What are we doing here?

We’re just shifting the data so that the mean becomes zero, and rescaling it so that the standard deviation becomes one. So let us calculate the Z scores for the marks in the above example so that I can explain what a Z score is, in extreme detail.

# Calculating Z scores
Z_scores = [round((x - mean) / st_dev, 2) for x in marks]
print(Z_scores)

We’ve got the Z scores for 12, 16 and 23 as -1.10, -0.22 and 1.32 respectively.
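A quick way to see what the transformation does is to check the mean and standard deviation of the Z scores themselves. This sketch recomputes them without rounding, using the population standard deviation as in the marks example:

```python
from statistics import fmean, pstdev

marks = [12, 16, 23]
mean = fmean(marks)
st_dev = pstdev(marks)  # population standard deviation, as in the marks example

z_scores = [(x - mean) / st_dev for x in marks]

print(round(fmean(z_scores), 10))   # mean of the Z scores
print(round(pstdev(z_scores), 10))  # standard deviation of the Z scores
```

Up to floating-point rounding, the Z scores have mean 0 and standard deviation 1, whatever the original scale of the marks was.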

What actually does this Z score mean?

Let us consider the Z score of 23. It is 1.32, which means that 23 lies 1.32 standard deviations above the mean! That is, as the mean is 17 and the standard deviation is 4.55, 23 - 17 is 6, which is approximately equal to 1.32 * 4.55. That’s the whole point!
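The same arithmetic can be run in reverse: given a Z score, the mean and the standard deviation, we can recover the original observation. A small sketch using the rounded values from the example (so the result is only approximate before rounding):

```python
mean = 17
st_dev = 4.55   # rounded value from the example
z = 1.32        # rounded Z score of 23

# Distance from the mean, in marks
distance = z * st_dev
print(round(distance, 2))

# Original mark recovered from its Z score
print(round(mean + distance))
```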

Why do I calculate Z scores? I mean, we can just compare the scores as they are, right?

NO! You can’t do that. Let me give you an example of two students trying to enter a university for M.Tech, but through different exams.

Suresh appeared for the GATE exam and wants to use his score for admission, whereas Archana didn’t appear for GATE but did well in her PGECET exam. (Note that GATE and PGECET are two different entrance exams for post-graduate admission.)

Suresh’s score was 73, while the average GATE score that year was 87 and the standard deviation was 23.

Archana’s score was 345, while the average PGECET score was 374 and the standard deviation was 115.

Can you tell me who did comparatively well just by looking at their scores? Well, I don’t have that superpower. So here comes our next question.

How is Z score used to compare multiple scores on a different scale?

I am eager to know who’s gonna make it to the university. Are you too? Let’s find out then!

Suresh_score = 73
Archana_score = 345
avg_gate_score = 87
avg_pgecet_score = 374
gate_stdev = 23
pgecet_stdev = 115

Suresh_z_score = (Suresh_score - avg_gate_score) / gate_stdev
print("Suresh's Z score is:", Suresh_z_score)
Archana_z_score = (Archana_score - avg_pgecet_score) / pgecet_stdev
print("Archana's Z score is:", Archana_z_score)

if Suresh_z_score > Archana_z_score:
    print("Suresh made it to the university!")
elif Suresh_z_score == Archana_z_score:
    print("That's some great news! Both of them made it!")
else:
    print("Archana made it to the university!")

Output:

Suresh's Z score is: -0.6086956521739131
Archana's Z score is: -0.25217391304347825
Archana made it to the university!

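So now, why did Archana make it to the university? Compare Archana’s and Suresh’s Z scores. Which one is higher?

Archana’s Z score, which is around -0.2522, is greater than Suresh’s, which is around -0.6087. Both of them scored below the average of their exams, but Archana was closer to her exam’s average when measured in standard deviations.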
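If we are willing to assume that the scores in both exams are approximately normally distributed (an assumption, since the example does not say so), each Z score can also be turned into an approximate percentile with the standard normal CDF. A hedged sketch:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

suresh_z = (73 - 87) / 23
archana_z = (345 - 374) / 115

# Share of candidates expected to score below each student,
# ASSUMING both exams' scores are approximately normal
suresh_pct = standard_normal.cdf(suresh_z)
archana_pct = standard_normal.cdf(archana_z)

print(f"Suresh: {suresh_pct:.1%}")
print(f"Archana: {archana_pct:.1%}")
```

Under that assumption, Suresh sits at roughly the 27th percentile of his exam and Archana at roughly the 40th percentile of hers, which matches the ordering of their Z scores.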

Is a higher Z score better or worse?

I would say that it completely depends upon the problem statement and the situation. In the above example, we were looking for the candidate who did better compared to the others. So, we chose the one with the higher Z score.

If we instead count how many times each coastal city was hit by a tsunami and compare the cities, the city with the highest Z score would be the worst affected.

Alright! We’ve come to know that Z scores are helpful in comparing data which are not on the same scale. What are the other uses of this?

Outlier detection

Yes, Z scores can also be used for outlier detection. A common rule of thumb is that if the Z score of an observation is less than -3 or greater than 3, that observation might be considered an outlier.
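Where does the cutoff of 3 come from? For normally distributed data, values that far from the mean are very rare. A short sketch with the standard normal distribution shows how rare:

```python
from statistics import NormalDist

standard_normal = NormalDist()

# Probability that a normally distributed observation has |Z| > 3
tail = 2 * standard_normal.cdf(-3)
print(f"{tail:.2%}")
```

Only about 0.27% of normally distributed observations fall outside the range from -3 to 3, which is why a value beyond it is worth a second look. The cutoff is a convention, not a strict rule.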

What is an outlier?

An outlier is a value which differs significantly from the other values in the data.

Okay! So let’s have a look at a problem.

There is a sample of 15 observations given below for the areas of houses in Greater Hyderabad.

Areas of houses are given in square yards.

Observations = [200,234,523,1255,623,324,65,123,192,4332,433,235,543,720,239]

Now let us detect outliers in the data using a Z score.

from statistics import mean
from statistics import stdev

Observations = [200,234,523,1255,623,324,65,123,192,4332,433,235,543,720,239]
avg = mean(Observations)
st_dev = stdev(Observations)  # sample standard deviation (divides by n - 1)

# Now that we've calculated mean and standard deviation, it's time for outlier detection.
outliers = list()
for i in Observations:
    z = (i - avg) / st_dev
    if z <= -3 or z >= 3:
        outliers.append(i)
print(outliers)

So as we can see, 4332 is the only outlier in the data. That makes sense, as it is very rare for a person to have a house built on such a large area.
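To see how far 4332 stands apart from the rest, it can help to look at the Z scores themselves rather than only the flagged values. In this sketch, `z_scores` is a hypothetical helper written for illustration; it uses the sample standard deviation.

```python
from statistics import mean, stdev

observations = [200, 234, 523, 1255, 623, 324, 65, 123, 192, 4332, 433, 235, 543, 720, 239]

def z_scores(data):
    """Return (value, Z score) pairs using the sample standard deviation."""
    avg = mean(data)
    sd = stdev(data)
    return [(x, (x - avg) / sd) for x in data]

# Sort by absolute Z score, largest first, and show the three most extreme values
ranked = sorted(z_scores(observations), key=lambda pair: abs(pair[1]), reverse=True)
for value, z in ranked[:3]:
    print(value, round(z, 2))
```

The largest house has a Z score of about 3.46, while every other observation stays within one standard deviation of the mean.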
