Unsupervised, Multivariate Statistical Anomaly Detection for Data with Complex Distributions
Anomaly detection is a field within machine learning that has been thoroughly explored, with methods from k-nearest neighbor, to isolation forests, to clustering. But what if we want to limit our anomaly detection to statistical methods only? Furthermore, what if our data doesn’t follow parametric distributions?
At Okaya, we deal with an extremely wide myriad of clients, from firefighters, to students, to veterans. It’s important to capture the differences between these many groups. For example, with respect to something like sleep time, a firefighter working the night shift will have drastically different sleep and wake times as compared to someone like a student. How can we capture these relationships and make statistical inferences on those relationships?
Parametric Statistical Modeling
Why might we want to limit our methods to statistical methods only? There are two main reasons:
- The ability to calculate a probability for a test point as to whether it is an anomaly or not. This allows us to easily set anomaly thresholds (for example, flag anything below 5% probability) as opposed to spending time trying to figure out what a good threshold is as you would have to do for something like distance-based anomaly detection.
- Superior computation speed. Since statistical modeling relies on distributions for which libraries for fast, easy calculation already exist, it is relatively simple to implement an algorithm which runs quite quickly.
One of the most commonly used and commonly taught statistical methods makes use of the normal curve.
As a review, recall that the normal curve has mean μ and standard deviation σ, for which μ defines where the curve is centered, and σ defines how spread apart our distribution is.
Essentially, what we are going to do is fit our data to a normal curve, resulting in parameters μ and σ which fit our input dataset. Once we have these values, we can easily calculate a probability density value from any given test input. This represents a “probability” of that test input being an anomaly or not (that is, the lower the probability, the higher the likelihood of it being an anomaly). It is important to remember, however, that this is not an absolute probability, it is a probability density, since the probability of a single point is 0 (since the normal curve is a continuous distribution).
Non-Parametric Statistical Modeling – Motivation
However, we are making a dangerous assumption when we assume our data fits the normal curve (or any other statistical distribution, for that matter). It is true that we can use the central limit theorem (CLT) in many situations, and thus approximate a distribution to be normal (you can read more about the CLT here). But what if we are actually relying on data to not fit to a normal curve? CLT only works across large populations – however, at Okaya, we are aiming to personalize data analysis instead.
Back to the example of sleep data amongst the many different groups that Okaya encounters:
We can see this data clearly has spikes at certain areas. However, if we apply to a normal curve:
If we try to predict the likelihood of a certain test point, we will end up getting a very high probability density, when in fact it should be classified as an outlier.
The Kernel Density Estimator
Essentially, what we’d like is a statistical distribution that follows whatever shape the input data represents. Intuitively, wherever there is more data, we would expect the distribution to be more dense at that point.
We will use an example dataset of sleep lengths (in hours):
firefighter_sleep_lengths = [5, 5.5, 4.7, 5.6, 6.2, 5.4, 5.7, 4.3, 4.9] student_sleep_lengths = [8.3, 8.1, 7.9, 8.5, 8.6, 8.4, 8.9, 9]
The most effective way to achieve this goal is to somehow “add” together data points. We then want to stitch all these together in some way to create a continuous distribution. We will achieve this by using a kernel, which, coincidentally, takes the form of a normal curve.
As mentioned before, with statistical modeling, we can easily make use of pre-existing libraries to speed up computation of our statistical distributions. We’ll use SciPy’s implementation of the normal curve in this case (the use of SciPy’s multivariate normal as opposed to univariate normal will be covered later). We will then construct a normal-curve kernel for each data point we have.
from scipy.stats import multivariate_normal input_data = firefighter_sleep_lengths + student_sleep_lengths kernels =  for point in input_data: kernels.append(multivariate_normal(point))
Now that we have a concrete curve for each point, we can add them up by simply adding all the areas. Since every kernel is continuous, this automatically makes the resulting distribution continuous as well:
There’s only one problem left. We know that the area under any statistical distribution must equal 1, but since we added up all the kernels (17 in total), the total area is now 17, instead of 1! There is an easy fix – just divide everything by the number of points we have, and the area will once again be 1.
There’s no easy way to add up all the kernels and store them in a single distribution, so we’ll perform this step whenever we want to calculate a likelihood probability:
def predict(kernels, test_point): probs =  for kernel in kernels: probs.append(kernel.pdf(test_point)) return sum(probs) / len(probs)
In which we use kernel.pdf to calculate a probability density at a certain point (see SciPy documentation here), and divide by len(probs) so that the resulting number has been scaled properly.
test_point = 5.5 pdf = predict(kernels, test_point) >>> .183821 test_point = 10 pdf = predict(kernels, test_point) >>> .062001
We can see the probability density at 5.5 is much higher than it is at 10. This makes sense – we have a spike at ~5.5, but the distribution is quite low at ~10. From then on, anomaly detection logic is simple:
threshold = .1 flag = False if pdf < threshold: flag = True
Multivariate Non-Parametric Modeling
In today’s data-centric world, we can gain even more information by using multivariate data. When we use statistical distributions that are solely based on the magnitude of input data (for example, sleep length), we are assuming these numbers are random variables. However, data is hardly ever truly random. We would expect a variety of factors to have an influence on what our data looks like. For example, we would expect many users to get more sleep on the weekends than they do on weekdays:
As you can see, there is a high density of longer sleep times near the end of the week, but the highest density of sleep times at earlier days of the week seems to be centered around a shorter sleep time.
This is where multivariate_normal comes in. We can specify any number of features and this will create a normal curve centered around those values (across however many dimensions that there are features).
The methodology for calculating probabilities is exactly the same as before, except we put in 2 numbers for the input mean of the multivariate normal, instead of just 1 like before (in which the first number is the sleep length, and the second is day of the week):
from scipy.stats import multivariate_normal input_data = [[5.5, 0], [5.7, 1], [6.5, 2], [6.4, 3], [6.8, 4], [8.6, 5], [8.7, 6]] kernels =  for point in input_data: kernels.append(multivariate_normal(point))
def predict(kernels, test_point): probs =  for kernel in kernels: probs.append(kernel.pdf(test_point)) return sum(probs) / len(probs)
monday_test1 = [6.8, 0] monday_test2 = [8.3, 0] saturday_test = [8.3, 5] predict(kernels, monday_test1) predict(kernels, monday_test2) >>> .020480 >>> .001574 predict(kernels, saturday_test) >>> .039499
As we can see, getting a lower amount of sleep on a Monday results in a higher probability than getting a higher amount of sleep. On the weekend, however, getting a higher amount of sleep results in a larger probability density. This aligns with our observation that people get more sleep on weekends than they do on weekdays.
Making Use of Features
Though we can make use of featured data to make more nuanced anomaly detections, we can also use features to make more powerful predictions. We might be interested in calculating an expected value, in which we input a number of features and return an expected value that has been calculated based on those features. For example:
- Given any day of the week, how much sleep would we expect any one person to get that night?
For simplicity, we will be using the same application as before (sleep length), as it is only 2-dimensional and is thus easier to visualize.
From a statistics standpoint, you might recall expected value, E[X], is just the mean of the distribution. For a standard normal curve, we would just say the expected value of random variable X is μ. Why can’t we just say the expected value of our multivariate, nonparametric distribution is just the average of all the points that make up the distribution? Remember, we are using featured data – that is, depending on the features that we get, we expect different averages.
To get the expected value based on a given input feature set, we will take a “slice” of the overall distribution at that certain feature input. What does this look like?
The overall distribution
Taking a slice
This idea of “slicing” works for not just this 2-dimensional data, but also for higher dimensional data (in which we have more than just 1 feature). It’s just a little harder to visualize (since we can’t visualize anything higher than 2 dimensions + the magnitude dimension).
Essentially, what we have done is reduced the dimensionality of the overall distribution, and created a 1-dimensional curve (similar to what we looked at when we weren’t using multivariate distributions). It is much easier to calculate the expected value now that we only have 1 dimension to worry about.
You might say, great! The expected value is just the mean of this 1-dimensional distribution, right?
Recall the idea of non-parametric modeling. We are often dealing with spikes at different points (remember the firefighter vs. student problem), so giving the mean as the expected value is not necessarily the best idea. Instead, the best approach is to give the point at which there are the most occurences, or the mode (recall that in a set of numbers, the mode is the number that occurs most often in the set).
def expected(self, val): pdfs =  pdfs_sums =  expected = 0.0 # We want to take slices at a certain feature input. Define features here. features_list = val[1:len(val)] # Get the maximum value we want to evaluate at, so we have a range at which to take slices from means =  for kernel in kernels: means.append(kernel.mean) max_mean = max(means) if val > max_mean: max_range = val else: max_range = max_mean loop_range = math.ceil(max_range) * resolution iterations = 1 # For data types that have large ranges, we don't want the calculation to take too long # So if iterations is over 100, just set to max_iterations if loop_range > max_iterations * resolution: loop_range = int(math.ceil(max_range)) iterations = int(loop_range / max_iterations) # Get a range of values we can take slices from, in this case, the range between min and max of data for i in range(0, loop_range, iterations): for kernel in kernels: # For each kernel, take a PDF at each point (determined by resolution) # An approximation of taking the PDF at each point in the continuous probability distribution if loop_range > max_iterations * resolution: pdfs.append(kernel.pdf([i] + features_list)) else: pdfs.append(kernel.pdf([i / resolution] + features_list)) pdfs_sums.append(sum(pdfs) / len(kernels)) pdfs =  # The index is where the greatest PDF was. This is the expected. index = min(range(len(pdfs_sums)), key = lambda i: abs(pdfs_sums[i] - max(pdfs_sums))) if loop_range > max_iterations * resolution: # Since we are not iterating by 1 if we are using max range, we have to multiply by the iterator expected = index * iterations else: expected = index / resolution return expected
Note that since we are using a continuous distribution, we are not pulling numbers directly from the input dataset. Instead, we sample at various points throughout the distribution (from min to max, evenly spaced out at a specified resolution), and return the point which had the highest pdf.
monday_test = [0, 0] # What we really care about is the second number (the day of the week)expected(monday_test)>>> 5.6saturday_test = [0, 5]expected(saturday_test)>>> 8.5
The importance of non-parametric modeling is not to be ignored. In fact, this is essentially what deep learning is based around – modeling complex distributions which ordinary functions would have no hope of recreating. We are essentially just recreating this idea, but in a statistical context (distributions created from a neural network don’t have a probabilistic “area” of 1).
To achieve this goal, we use kernel density estimation to capture more complex relationships which parametric curves such as the normal distribution cannot model. For example, this could be bimodal in the case of the sleep lengths example examined earlier. Oftentimes, the resulting distribution is much more complex and can only be described by that specific dataset.
Lastly, we discussed the importance of using featured data and how changing feature inputs affect the resulting distribution. We leverage this idea to create an expected value function, in which we can calculate an expected value based off a set of features we input.
Normal distribution – https://www.itl.nist.gov/div898/handbook/eda/section3/eda3661.htm
Kernel density estimation – https://en.wikipedia.org/wiki/Kernel_density_estimation
Central limit theorem – https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Book%3A_Introductory_Statistics_(OpenStax)/07%3A_The_Central_Limit_Theorem/7.02%3A_The_Central_Limit_Theorem_for_Sample_Means_(Averages)