Simple Anomaly Detection Using Plain SQL


SOURCE: https://hakibenita.com/sql-anomaly-detection
Summary

In the previous section, our acceptable range was one standard deviation from the mean, or in other words, a z-score in the range ±1. Like before, we can detect anomalies by searching for values that fall outside the acceptable range, this time using the z-score. Using the z-score, we also identified 12 as an anomaly in this series.

So far we used one standard deviation from the mean, or a z-score of ±1, to identify anomalies. This is the type of anomaly we want to identify early on. It's entirely possible that there were other problems during that time that we just can't spot with the naked eye.

In anomaly detection systems, we usually want to identify whether we have an anomaly right now, and send an alert. To identify whether the last datapoint is an anomaly, we start by calculating the mean and standard deviation for each status code in the past hour. To get the last value in a GROUP BY in addition to the mean and standard deviation, we used a little array trick.

Next, we calculate the z-score for the last value of each status code. We calculated the z-score by finding the number of standard deviations between the last value and the mean. In the past minute we returned a 400 status code 24 times, which is significantly higher than the average of 0.73 in the past hour. Looking at the raw data, it does seem that in the last couple of minutes we are getting more errors than expected.

What our naked eye missed in the chart and in the raw data was found by the query and classified as an anomaly. In our case, every once in a while we get a 400 status code, but because it doesn't happen very often, the standard deviation is very low, so even a single error can be considered way above the acceptable value. We don't really want to receive an alert in the middle of the night just because of one 400 status code.

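
The z-score check described above can be sketched outside the database as well. The following is a minimal Python illustration, not the article's SQL: the function names `zscores` and `find_anomalies` and the sample series are assumptions made for this example, and the sample standard deviation is used on the assumption that this matches what an SQL `stddev` aggregate typically computes.

```python
from statistics import mean, stdev


def zscores(values):
    """Return the z-score of every value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def find_anomalies(values, threshold=1.0):
    """Return the values whose z-score lies outside the range ±threshold."""
    return [v for v, z in zip(values, zscores(values)) if abs(z) > threshold]


# Illustrative series (an assumption for this sketch).
series = [2, 3, 5, 2, 3, 12, 5, 3, 4]
print(find_anomalies(series))  # → [12]

# z-score of the most recent datapoint, relative to the whole window.
print(round(zscores(series)[-1], 2))
```

With a threshold of ±1, only the value 12 falls outside the acceptable range. Checking whether the last datapoint is an anomaly then amounts to looking at the final z-score for each group (for example, each status code).
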
Using thresholds, we were able to remove some uninteresting anomalies. Looking at the data for status code 400 after applying the threshold, the first alert happened at 17:59. A minute later the z-score was still high with a large number of entries, so we classified the next rows at 18:00 as an anomaly as well.

If you think of an alerting system, we want to send an alert only when an anomaly first happens. Looking at the results, we can see which anomalies we would have discovered. The query can now be used to fire alerts when it encounters an anomaly.

In the process so far we've used several constants in our calculations. Now that we have a working query to backtest, we can experiment with different values, starting from a chart of the alerts our system identified in the past 12 hours. To get a sense of each parameter, let's adjust the values and see how they affect the number and quality of alerts we get. If we decrease the value of the z-score threshold from 3 to 1, we should get more alerts. Using the backtesting query, you can experiment with different values until you find ones that produce quality alerts you can act on.

Using a z-score for detecting anomalies is an easy way to get started with anomaly detection and see results right away.
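
To make the backtesting idea concrete, here is a rough Python sketch of the same logic rather than the article's query. The function name `backtest_alerts`, the parameter names, the `lookback` and `min_entries` defaults, and the sample counts are all assumptions for illustration; only the z-score threshold of 3 comes from the text. The sketch also compares each value against the preceding window only, which is a simplification chosen here.

```python
import math
from statistics import mean, stdev


def backtest_alerts(counts, lookback=60, zscore_threshold=3.0, min_entries=10):
    """Return the indexes at which an alert would have fired.

    counts: per-interval entry counts for one status code, oldest first.
    lookback: number of preceding intervals used for mean and stddev (>= 2).
    min_entries: values below this count are never treated as anomalies.
    """
    alerts = []
    was_anomaly = False
    for i in range(lookback, len(counts)):
        window = counts[i - lookback:i]  # preceding intervals only
        mu = mean(window)
        sigma = stdev(window)
        value = counts[i]
        if sigma == 0:
            # Flat window: any deviation is infinitely many deviations away.
            zscore = 0.0 if value == mu else math.copysign(math.inf, value - mu)
        else:
            zscore = (value - mu) / sigma
        is_anomaly = zscore > zscore_threshold and value >= min_entries
        # Alert only when an anomaly first happens, not while it persists.
        if is_anomaly and not was_anomaly:
            alerts.append(i)
        was_anomaly = is_anomaly
    return alerts


# Hypothetical per-minute counts: quiet traffic, then a two-minute burst.
counts = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 24, 30, 1, 0]
print(backtest_alerts(counts, lookback=10))  # → [10]
```

Both burst minutes exceed the thresholds, but only the first one fires an alert. Rerunning the function with different values for `lookback`, `zscore_threshold`, and `min_entries` is one way to see how each parameter changes the number of alerts; for instance, a single stray error after a flat window is flagged when `min_entries=0` and suppressed when `min_entries=10`.
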

As said here by Haki Benita