Most of the time I write longer articles on data science topics but recently I’ve been thinking about writing small, bite-sized pieces around specific concepts, algorithms and applications. This is my first attempt in that direction, hoping people will like these pieces.
In today’s “small-bite” I’m writing about Z-score in the context of anomaly detection.
Anomaly detection is the process of identifying unexpected data, events, or behavior that require some examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on data type and business context. Z-score is probably the simplest of them: it can rapidly screen candidates for further examination to determine whether they are suspicious or not.
What is Z-score
Simply speaking, Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, Z-score tells you how many standard deviations a given observation is away from the mean.
For example, a Z-score of 2.5 means that the data point is 2.5 standard deviations away from the mean. If that is farther from the center than the cut-off you have chosen, the point is flagged as an outlier/anomaly.
How does it work?
Z-score is a parametric measure and it takes two parameters — mean and standard deviation.
Once you calculate these two parameters, finding the Z-score of a data point is easy.
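Written out, the calculation looks like this. The symbols \mu (mean) and \sigma (standard deviation) are notation assumed here for illustration; x is a single data point:

```latex
z = \frac{x - \mu}{\sigma}
```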
Note that mean and standard deviation are calculated for the whole dataset, whereas x represents every single data point. That means, every data point will have its own z-score, whereas mean/standard deviation remains the same everywhere.
Example
Below is a Python implementation of Z-score with a few sample data points. I’m adding notes in each line of code to explain what’s going on.
# import numpy
import numpy as np

# random data points to calculate z-score
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

# calculate mean
mean = np.mean(data)

# calculate standard…
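Since the listing above is cut off, here is a minimal sketch of how the full calculation might look on the same sample data. It is not the original code: the names `std`, `z_scores`, `threshold` and `outliers`, and the cut-off of 2.0, are assumptions made for illustration.

```python
import numpy as np

# Sample data from the article: mostly 5s with two extreme values
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

# Mean and standard deviation are computed once for the whole dataset.
# Assumption: np.std with its default ddof=0 (population standard deviation).
mean = np.mean(data)
std = np.std(data)

# Every data point gets its own z-score
z_scores = [(x - mean) / std for x in data]

# Hypothetical cut-off: flag points more than 2 standard deviations from the mean
threshold = 2.0
outliers = [x for x, z in zip(data, z_scores) if abs(z) > threshold]

print(outliers)  # [-99, 88]
```

Here the mean is 3.5 and the standard deviation is roughly 35.5, so -99 and 88 get z-scores of about -2.9 and 2.4, while every 5 sits very close to zero.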
The article summarized here is "Small-bites data science" by Mahbub Alam, published on September 3, 2020, in Towards Data Science. It presents concise pieces around specific data science concepts, algorithms, and applications; in this particular "small-bite," the author discusses the Z-score in the context of anomaly detection.
The article defines anomaly detection as the process of identifying unexpected data, events, or behavior that require further examination, emphasizing its significance in the field of data science. Furthermore, it highlights that there are various algorithms for anomaly detection depending on the data type and business context.
The central concept explored in this piece is the Z-score, described as a statistical measure that quantifies how far a data point deviates from the rest of the dataset. The author offers a clear and straightforward explanation, stating that the Z-score indicates how many standard deviations a given observation is from the mean. This measure becomes crucial in identifying outliers or anomalies in the data.
To elucidate the functioning of the Z-score, the article explains that it is a parametric measure requiring two parameters: mean and standard deviation. Once these parameters are calculated for the entire dataset, determining the Z-score for a specific data point becomes a straightforward process. Importantly, the mean and standard deviation remain constant for the entire dataset, while each data point is assigned its own Z-score.
The author provides a Python implementation of the Z-score with a few sample data points, showcasing a practical application of the discussed concept. The code includes the use of the NumPy library for efficient numerical operations and demonstrates how to calculate the mean and subsequently determine the Z-score for each data point.
In summary, this "small-bite" offers a comprehensive overview of the Z-score in the context of anomaly detection, combining theoretical understanding with practical implementation through Python code. It serves as a valuable resource for individuals looking to grasp fundamental concepts in data science in a concise manner.
Determine an appropriate threshold for anomaly detection. Commonly used thresholds are Z-scores greater than 2 or 3, which correspond to data points that are two or three standard deviations away from the mean. Adjust the threshold based on the specific requirements of your application.
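To see how much the choice of threshold matters, the sketch below applies two cut-offs to the article's sample data. The helper name `flag_anomalies` is hypothetical, and the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np

def flag_anomalies(values, threshold):
    """Return the values whose absolute z-score exceeds the threshold."""
    arr = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0) is assumed here
    z = (arr - arr.mean()) / arr.std()
    return arr[np.abs(z) > threshold].tolist()

data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

flagged_at_2 = flag_anomalies(data, 2.0)
flagged_at_3 = flag_anomalies(data, 3.0)

print(flagged_at_2)  # [-99.0, 88.0]
print(flagged_at_3)  # []
```

With a threshold of 3 nothing is flagged, because the two extreme values inflate the standard deviation that they are being measured against. This is one reason the threshold needs to be tuned to the data at hand.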
Discussion: The optimal threshold is 2.0 or less when the variance of the Z-scores is close to that of the standard normal distribution, that is, close to 1.0. In contrast, when the variance of the Z-scores is greater than 1.0, the optimal threshold is above 2.0, and the ordinary threshold of 2.0 cannot reliably point out abnormality.
A positive z-score says the data point is above average. A negative z-score says the data point is below average. A z-score close to 0 says the data point is close to average. A data point can be considered unusual if its z-score is far above or far below 0.
The standard normal curve is bell-shaped, with about 95% of its area lying between -1.96 and 1.96. Scores lower than -1.96 or higher than 1.96 therefore fall in the outer 5% and are considered unusual z-scores.
This means it comes down to preference when evaluating an investment or opportunity. For example, some investors use a z-score range of -3.0 to 3.0 because 99.7% of normally distributed data falls in this range, while others might use -1.5 to 1.5 because they prefer scores closer to the mean.
If another data value displays a z score of -2, one can conclude that the data value is two standard deviations below the mean. Most values in any distribution have z scores ranging from -2 to +2. The values with z scores beyond this range are considered unusual or outliers.
A high z-score means a very low probability of observing data above that z-score. For example, under the standard normal curve, the probability of a z-score above 2.6 is 0.47%, which is less than half a percent. If the z-score rises further, the area under the curve to its right shrinks and the probability falls further.
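The tail probability quoted above can be checked with the standard library alone. This is a small sketch that assumes the data follow a standard normal distribution; the function name `upper_tail_probability` is made up for illustration.

```python
import math

def upper_tail_probability(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p = upper_tail_probability(2.6)
print(f"{p:.2%}")  # 0.47%
```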
What is a Z-score and what does it mean? A Z-score compares your bone density to the average values for a person of your same age and gender. A low Z-score (below -2.0) is a warning sign that you have less bone mass (and/or may be losing bone more rapidly) than expected for someone your age.
The threshold value for anomaly detection controls the sensitivity of the alert condition for tolerating how far off the actual value is from the predicted value. The threshold is the number of standard deviations your signal value is away from the value that was predicted.
First, to remove outliers using z-scores, calculate each data point's z-score, indicating how many standard deviations it is from the mean. Generally, data points with z-scores above +3 or below -3 are outliers. You can then filter these out from your dataset.
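The filtering step described above can be sketched as follows. The `readings` array is made-up illustrative data, and the population standard deviation is an assumption.

```python
import numpy as np

# Hypothetical readings: 19 values near 10 plus one extreme value
readings = np.array(
    [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
     11, 9, 10, 12, 10, 9, 11, 10, 10, 100],
    dtype=float,
)

# z-score of every reading (population standard deviation, ddof=0)
z = (readings - readings.mean()) / readings.std()

# Keep only the readings within 3 standard deviations of the mean
cleaned = readings[np.abs(z) <= 3]

print(len(readings), len(cleaned))  # 20 19
```

Note that in a small sample a single point can only reach a limited z-score, so a cut-off of 3 may never trigger when there are very few observations.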
These authors recommend that modified Z-scores with an absolute value greater than 3.5 be labeled as potential outliers. A number of formal outlier tests have been proposed in the literature. These can be grouped by the following characteristics: What is the distributional model for the data?
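For reference, the modified Z-score is commonly defined with the median and the median absolute deviation (MAD) in place of the mean and standard deviation, scaled by the constant 0.6745. The sketch below assumes that definition; the function name and the sample values are hypothetical.

```python
import statistics

def modified_z_scores(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        # More than half the values equal the median, so the score is undefined
        raise ValueError("MAD is zero; modified z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

# Made-up sample: values near 11 with one extreme point
values = [10, 12, 11, 13, 12, 11, 10, 95, 12, 11]
scores = modified_z_scores(values)

# Label points whose absolute modified z-score exceeds 3.5
potential_outliers = [x for x, m in zip(values, scores) if abs(m) > 3.5]
print(potential_outliers)  # [95]
```

Because the median and MAD are barely affected by the extreme point, the outlier does not mask itself the way it can with the mean and standard deviation.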
Typically z-scores will range between -3 and +3, so values that are at or are more extreme than -3 or +3 standard deviations are considered extremely rare.
If the number of elements in the set is large and the data are roughly normally distributed, about 68% of the elements have a z-score between -1 and 1, about 95% have a z-score between -2 and 2, and about 99.7% have a z-score between -3 and 3.
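These percentages follow from the standard normal distribution and can be reproduced with the error function. A minimal sketch, assuming normally distributed data; the function name `coverage` is made up here.

```python
import math

def coverage(k):
    """Fraction of a standard normal distribution lying within k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{coverage(k):.1%}")
# 1 68.3%
# 2 95.4%
# 3 99.7%
```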
A Z-score of 2.5 means your observed value is 2.5 standard deviations from the mean and so on. The closer your Z-score is to zero, the closer your value is to the mean. The further away your Z-score is from zero, the further away your value is from the mean.