Z-score for anomaly detection

Small-bites data science

Towards Data Science

Sep 3, 2020

Most of the time I write longer articles on data science topics but recently I’ve been thinking about writing small, bite-sized pieces around specific concepts, algorithms and applications. This is my first attempt in that direction, hoping people will like these pieces.

In today’s “small-bite” I’m writing about Z-score in the context of anomaly detection.

Anomaly detection is the process of identifying unexpected data, events or behavior that require some examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on data type and business context. Z-score is probably the simplest of them: it can rapidly screen candidates for further examination to determine whether they are suspicious or not.

What is Z-score

Simply speaking, Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, Z-score tells how many standard deviations away a given observation is from the mean.

For example, a Z-score of 2.5 means that the data point is 2.5 standard deviations away from the mean. Because it lies far from the center, it is flagged as an outlier/anomaly.

How it works

Z-score is a parametric measure and it takes two parameters — mean and standard deviation.

Once you calculate these two parameters, finding the Z-score of a data point is easy.
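
Written out, the calculation is the standard Z-score formula. Here x is a single data point; the symbols \mu (mean) and \sigma (standard deviation) are notation assumed for this sketch:

```latex
z = \frac{x - \mu}{\sigma}
```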

Note that mean and standard deviation are calculated for the whole dataset, whereas x represents a single data point. That means every data point has its own Z-score, whereas the mean and standard deviation remain the same everywhere.

Example

Below is a Python implementation of Z-score with a few sample data points. I’m adding notes in each line of code to explain what’s going on.

# import numpy
import numpy as np

# random data points to calculate z-score
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

# calculate mean
mean = np.mean(data)

# calculate standard…
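
The listing above breaks off before the standard deviation step. The following is a minimal sketch of how the calculation could be completed on the same sample data; it is not the author's original code. The use of the population standard deviation (NumPy's default) and the cut-off of 2.0 are assumptions made for illustration.

```python
import numpy as np

# same sample data as in the listing above
data = np.array([5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5])

# mean and standard deviation of the whole dataset
mean = np.mean(data)
std = np.std(data)  # population standard deviation (ddof=0), an assumption

# one z-score per data point
z_scores = (data - mean) / std

# flag points whose absolute z-score exceeds the assumed cut-off
threshold = 2.0
outliers = data[np.abs(z_scores) > threshold]
print(outliers)
```

With this data the two extreme values, -99 and 88, are the only points whose absolute Z-score exceeds 2. Neither reaches 3, because the extreme values themselves inflate the standard deviation, so the choice of cut-off matters on small samples.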

The article, subtitled "Small-bites data science", was written by Mahbub Alam and published on September 3, 2020, in Towards Data Science. It aims to present concise pieces around specific data science concepts, algorithms, and applications. In this particular "small-bite," the author discusses the Z-score in the context of anomaly detection.

The article defines anomaly detection as the process of identifying unexpected data, events, or behavior that require further examination, emphasizing its significance in the field of data science. Furthermore, it highlights that there are various algorithms for anomaly detection depending on the data type and business context.

The central concept explored in this piece is the Z-score, described as a statistical measure that quantifies how far a data point deviates from the rest of the dataset. The author offers a clear and straightforward explanation, stating that the Z-score indicates how many standard deviations a given observation is from the mean. This measure becomes crucial in identifying outliers or anomalies in the data.

To elucidate the functioning of the Z-score, the article explains that it is a parametric measure requiring two parameters: mean and standard deviation. Once these parameters are calculated for the entire dataset, determining the Z-score for a specific data point becomes a straightforward process. Importantly, the mean and standard deviation remain constant for the entire dataset, while each data point is assigned its own Z-score.

The author provides a Python implementation of the Z-score with a few sample data points, showcasing a practical application of the discussed concept. The code includes the use of the NumPy library for efficient numerical operations and demonstrates how to calculate the mean and subsequently determine the Z-score for each data point.

In summary, this "small-bite" offers a compact overview of the Z-score in the context of anomaly detection, combining theoretical understanding with practical implementation through Python code. It is a useful starting point for readers who want to grasp fundamental data science concepts in a concise form.

FAQs

Z-score for anomaly detection? ›

Determine an appropriate threshold for anomaly detection. Commonly used thresholds are absolute Z-scores greater than 2 or 3, which correspond to data points that are more than two or three standard deviations away from the mean. Adjust the threshold based on the specific requirements of your application.

What is a good z-score threshold? ›

Discussion: When the variance of the Z-scores is close to that of the standard normal distribution (1.0), the optimal threshold is 2.0 or less. In contrast, when the variance of the Z-scores is greater than 1.0, the optimal threshold is above 2.0, and the ordinary threshold of 2.0 fails to single out abnormal values.

Does the calculated z-score indicate an unusual outcome? ›

A positive z-score says the data point is above average. A negative z-score says the data point is below average. A z-score close to 0 says the data point is close to average. A data point can be considered unusual if its z-score is above 3 or below -3.

What is the threshold for z-score outlier detection? ›

The standard cut-off value for finding outliers is a Z-score of +/- 3 or further from zero.

Is 1.5 an unusual z-score? ›

A standard normal curve is bell-shaped, with about 95% of values lying within 1.96 standard deviations of the mean. Scores lower than -1.96 or higher than 1.96 are therefore considered unusual z-scores, so a z-score of 1.5 is not unusual by this criterion.

What is an acceptable z-score? ›

What counts as acceptable comes down to preference when evaluating an investment or opportunity. For example, some investors use a z-score range of -3.0 to 3.0 because 99.7% of normally distributed data falls in this range, while others might use -1.5 to 1.5 because they prefer scores closer to the mean.

What z-score is considered abnormal? ›

If another data value displays a z score of -2, one can conclude that the data value is two standard deviations below the mean. Most values in any distribution have z scores ranging from -2 to +2. The values with z scores beyond this range are considered unusual or outliers.

Does higher z-score mean more likely? ›

A high z-score means a very low probability of observing data above that z-score. For example, under the standard normal distribution the probability of a z-score above 2.6 is about 0.47%, which is less than half a percent. As the z-score rises further, the area under the curve beyond it shrinks and the probability falls further.
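
The tail probability quoted above can be checked with a short calculation. This is a minimal sketch that assumes the data follow a standard normal distribution; the helper name `upper_tail_probability` is made up for illustration.

```python
import math

def upper_tail_probability(z):
    """Return P(Z > z) for a standard normal variable Z."""
    # complementary error function gives the upper tail directly
    return 0.5 * math.erfc(z / math.sqrt(2))

p = upper_tail_probability(2.6)
print(f"P(Z > 2.6) = {p:.2%}")  # about 0.47%
```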

What is a low z-score? ›

In bone density testing, a Z-score compares your bone density to the average values for a person of your same age and gender. A low Z-score (below -2.0) is a warning sign that you have less bone mass (and/or may be losing bone more rapidly) than expected for someone your age.

What is the threshold in anomaly detection? ›

The threshold value for anomaly detection controls the sensitivity of the alert condition for tolerating how far off the actual value is from the predicted value. The threshold is the number of standard deviations your signal value is away from the value that was predicted.
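
As a rough illustration of this idea, the check below compares the gap between an actual and a predicted value against a multiple of a standard deviation. It is a sketch under assumptions: the function name `is_anomalous`, the sample numbers, and the idea that `sigma` has already been estimated from past prediction errors are all hypothetical.

```python
def is_anomalous(actual, predicted, sigma, threshold=3.0):
    """Flag a value that is more than `threshold` standard deviations
    away from its predicted value."""
    return abs(actual - predicted) > threshold * sigma

# hypothetical signal values and predictions
predicted = [10.0, 10.0, 10.0, 10.0]
actual = [10.5, 9.0, 17.0, 10.2]
sigma = 2.0  # assumed standard deviation of past prediction errors

flags = [is_anomalous(a, p, sigma) for a, p in zip(actual, predicted)]
print(flags)  # [False, False, True, False]
```

A lower threshold makes the alert more sensitive; a higher one tolerates larger deviations before firing.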

What is the z-score to remove outliers? ›

First, to remove outliers using z-scores, calculate each data point's z-score, indicating how many standard deviations it is from the mean. Generally, data points with z-scores above +3 or below -3 are outliers. You can then filter these out from your dataset.
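
The filtering step might look like the following sketch. The function name `remove_outliers` and the sample readings are invented for illustration, and the population standard deviation is assumed.

```python
import numpy as np

def remove_outliers(values, threshold=3.0):
    """Keep only the values whose absolute z-score is within the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) <= threshold]

# hypothetical readings with one obvious outlier (100)
readings = [10, 12, 11, 9, 10, 11, 10, 9, 12, 10,
            11, 10, 9, 10, 11, 10, 12, 9, 10, 100]
cleaned = remove_outliers(readings)
print(len(readings), len(cleaned))  # 20 19
```

On very small samples a single extreme value may never reach a Z-score of 3, because it inflates the standard deviation it is measured against, so a lower threshold or a robust alternative may be needed there.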

What z-score is considered an outlier? ›

Iglewicz and Hoaglin recommend that modified Z-scores with an absolute value of greater than 3.5 be labeled as potential outliers. A number of formal outlier tests have also been proposed in the literature. These can be grouped by characteristics such as the distributional model assumed for the data.
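
The modified Z-score is not defined in the text above. The sketch below assumes the commonly used definition, 0.6745 times the deviation from the median divided by the median absolute deviation (MAD); the function name and sample data are made up for illustration.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-scores based on the median and the median absolute deviation."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # happens when more than half the values are identical
        raise ValueError("MAD is zero; modified z-score is undefined")
    return 0.6745 * (values - median) / mad

# hypothetical sample with one extreme value
sample = np.array([10, 12, 11, 9, 10, 11, 10, 9, 12, 100])
scores = modified_z_scores(sample)
outliers = sample[np.abs(scores) > 3.5]
print(outliers)  # [100]
```

Because the median and MAD are barely affected by the extreme value, the outlier stands out clearly even in a sample of ten points.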

Which z-score would be considered rare? ›

Typically z-scores will range between -3 and +3, so values that are at or are more extreme than -3 or +3 standard deviations are considered extremely rare.

Is 2 a good z-score? ›

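If the number of elements in the set is large and the data are approximately normally distributed, about 68% of the elements have a z-score between -1 and 1, about 95% have a z-score between -2 and 2, and about 99.7% have a z-score between -3 and 3. A z-score of 2 therefore sits at the edge of the range that covers roughly 95% of the data.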
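
These percentages can be reproduced from the standard normal distribution. The sketch below assumes normally distributed data and uses Python's standard-library `statistics.NormalDist`; the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def coverage(k):
    """Share of a standard normal distribution with z-score between -k and k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"|z| < {k}: {coverage(k):.1%}")
```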

What does a Z score of 2.5 indicate? ›

A Z-score of 2.5 means your observed value is 2.5 standard deviations from the mean and so on. The closer your Z-score is to zero, the closer your value is to the mean. The further away your Z-score is from zero, the further away your value is from the mean.

What is a good z-score cutoff? ›

The critical z-score values when using a 95 percent confidence level are -1.96 and +1.96 standard deviations.
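
The critical values quoted above come from the inverse of the standard normal cumulative distribution. A minimal sketch, assuming a two-sided interval and using the standard-library `statistics.NormalDist` (the helper name `critical_z` is made up):

```python
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical z-score for a given confidence level, e.g. 0.95."""
    alpha = 1.0 - confidence
    # split alpha evenly between the two tails
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

print(round(critical_z(0.95), 2))  # 1.96
```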

What is a healthy z-score? ›

Bone density Z-score chart

Z-score	Meaning
0	Bone density is the same as in others of the same age, sex, and body size.
-1	Bone density is lower than in others of the same age, sex, and body size.
-2	Doctors consider scores higher than this to be normal.
-2.5	This score or lower indicates secondary osteoporosis.

What is a safe z-score? ›

How to Interpret Altman Z-Score (Safe, Grey and Distress)

Z-Score	Interpretation
> 2.99	Safe Zone – Low Likelihood of Bankruptcy
1.81 to 2.99	Grey Zone – Moderate Risk of Bankruptcy
< 1.81	Distress Zone – High Likelihood of Bankruptcy
