Cyber threats evading signatures: Outlier, anomaly or both?

Data, averages and outliers

We do a lot of data-driven analysis at Senseon. And though some cyber threats can be efficiently discovered using signatures, rules and event-based analysis, many cannot. One problem that crops up again and again is trying to find and quantify abnormalities in a sea of complex data.

Naively, we might look for deviations from an average. Unfortunately, the concept of an average isn’t the same for every dataset. Sometimes an average makes statistical sense, but doesn’t capture what we want. What we usually mean by “average” is something along the lines of “typical”, but the trouble comes when we try to put a number on “typical”. In some situations we get lucky and a statistical average works quite well, but as cyber threats become increasingly complex, simple statistics often aren’t enough.

The idea of an outlier is a statistical concept, and a genuinely useful one. In many areas of statistics, outliers are defined as data points that make it difficult for models to accurately characterise effects and predict outcomes. Given enough data, this difficulty can be quantified by scoring each data point on how much it reduces a model’s predictive accuracy on new data.
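To make that concrete, here is a minimal sketch of that scoring idea in Python. The toy dataset, the linear model, and the leave-one-out loop are all our own choices for illustration, not anything from Senseon’s pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: y depends linearly on x, plus noise, plus one injected gross outlier.
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1, size=50)
y[10] += 30.0                       # the point we hope to catch
X = x.reshape(-1, 1)

# Held-out data drawn from the same clean relationship.
x_test = rng.uniform(0, 10, size=200)
y_test = 2.0 * x_test + rng.normal(0, 1, size=200)
X_test = x_test.reshape(-1, 1)

def heldout_mse(X_train, y_train):
    model = LinearRegression().fit(X_train, y_train)
    return np.mean((model.predict(X_test) - y_test) ** 2)

baseline = heldout_mse(X, y)

# Score each point by how much dropping it improves held-out error;
# the most damaging points (the outliers, in this sense) score highest.
scores = np.array([
    baseline - heldout_mse(np.delete(X, i, axis=0), np.delete(y, i))
    for i in range(len(y))
])
print("most damaging point:", scores.argmax(), "score:", round(scores.max(), 3))
```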

Outliers don’t just harm models’ predictive accuracy; sometimes they are the very thing we’re looking for, and they can form the basis of a security observation. Take, for example, network file transfers. It turns out we see a similar pattern on many networks: a bunch of small transfers (< 1 MB), not many medium-sized transfers (between 1 MB and 100 MB), but almost always a few large ones (> 100 MB). What is likely happening is that emails and web pages make up most of the small transfers, while things like videos and compressed archives are transferred sporadically to make up the larger ones. Simplified (and generated here from random data to preserve our customers’ privacy), the distribution of transfer sizes often looks a bit like this:

An average doesn’t work here. What we mean by that is that the arithmetic mean (the usual kind of mean) isn’t a good estimate of the expected value for this distribution. That’s just a fancy way of saying the mean is rather meaningless. The problem with the mean is that pesky dense pack of small transfers. With the large counts we see there, the mean actually sits just past the blue histogram (small transfers), below the centre of the orange (medium-sized transfers), and nowhere near the green (large transfers). If we were to raise security observations for all transfers above the mean, users would be alerted to nearly every medium- and large-sized transfer.
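A quick toy simulation shows the effect. The mixture weights and size ranges below are invented, but the shape is similar to the figure above, and thresholding at the mean flags almost all medium and large transfers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy transfer sizes in MB: lots of small ones, some medium, a few large.
small  = rng.lognormal(mean=np.log(0.1), sigma=1.0, size=10_000)  # emails, web pages
medium = rng.uniform(1, 100, size=300)                            # archives, videos
large  = rng.uniform(100, 2_000, size=30)                         # bulk transfers
sizes  = np.concatenate([small, medium, large])

mean = sizes.mean()
print(f"mean transfer size: {mean:.1f} MB")   # sits just past the small transfers

# Alerting on "above the mean" flags almost every medium and large transfer.
print(f"medium/large transfers flagged: "
      f"{(np.concatenate([medium, large]) > mean).mean():.0%}")
print(f"all transfers flagged: {(sizes > mean).mean():.1%}")
```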

Ok, you say, let’s define the threshold relative to the mean. How about 2 standard deviations above? But why 2? Why not 10? Or maybe we do a little machine learning and figure out what users consider alert-worthily high and adjust the threshold accordingly? We try to do that sometimes, but not here, because we’d need to pester (and annoy) users by constantly asking for rather esoteric feedback. While we might get a couple of users who really enjoy investing that kind of time, they aren’t your average users. What we really want is a statistic that isolates the outliers. Something that finds file transfers in these regions:

One way to deal with this is to cluster the transfer sizes, verify the clustering solution is coherent, and look for transfer sizes that land between clusters. Indeed, there are a lot of ways to identify outliers in data like this, but the data tends to dictate the analysis.
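As a rough sketch of that clustering idea, assuming scikit-learn’s KMeans on log-transformed sizes; the cluster count, the coherence check, and the 99th-percentile threshold are our own illustrative choices, not Senseon’s actual analytic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)

# The same kind of toy transfer sizes (MB), log-transformed so the size bands
# become clusters of comparable width.
sizes = np.concatenate([
    rng.lognormal(np.log(0.1), 1.0, 10_000),  # small
    rng.uniform(1, 100, 300),                 # medium
    rng.uniform(100, 2_000, 30),              # large
])
log_sizes = np.log(sizes).reshape(-1, 1)

# Cluster into the three rough size bands and sanity-check coherence.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(log_sizes)
print("silhouette:", round(silhouette_score(log_sizes, km.labels_,
                                            sample_size=2_000, random_state=0), 2))

# Points that land far from every cluster centre sit "between" the usual bands.
dist_to_centre = np.abs(log_sizes - km.cluster_centers_.T).min(axis=1)
threshold = np.quantile(dist_to_centre, 0.99)
print(f"{int((dist_to_centre > threshold).sum())} transfers flagged as outliers")
```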

We have another analytic that looks for user logins at odd times. Instead of flagging login times by a rule (outside of 9 to 5? on the weekend? at night?) or by an average (more than 5 logins an hour? what about weekends?), we combine a rolling average over entire weeks with a local measure of when users on certain devices tend to log in. It’s a hybrid approach to outlier detection: measure deviation from a rolling average as well as distance from a historical baseline grouped by users and devices.
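A simplified sketch of that hybrid idea, using pandas on synthetic logins; the user and device names, windows, and scoring formula are all invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic login history for one user/device: weekday-ish logins around 9am and 1pm.
n = 2_000
logins = pd.DataFrame({
    "user": "alice",
    "device": "laptop-01",
    "timestamp": pd.Timestamp("2024-01-01")
                 + pd.to_timedelta(rng.integers(0, 90, n), unit="D")
                 + pd.to_timedelta(rng.choice([9, 13], n) + rng.normal(0, 1, n), unit="h"),
})

# Local baseline: for each user/device, the share of logins at each hour of the week.
logins["hour_of_week"] = (logins["timestamp"].dt.dayofweek * 24
                          + logins["timestamp"].dt.hour)
baseline = logins.groupby(["user", "device", "hour_of_week"]).size()
baseline = baseline / baseline.groupby(level=["user", "device"]).transform("sum")

# Rolling average: logins per week, smoothed over the previous four weeks.
weekly = (logins.set_index("timestamp")
                .groupby(["user", "device"])
                .resample("W")
                .size())
rolling = weekly.groupby(level=["user", "device"]).transform(
    lambda s: s.rolling(4, min_periods=1).mean())

def unusualness(user, device, ts, logins_this_week):
    """Toy hybrid score: a rare hour-of-week and an unusual weekly volume both raise it."""
    key = (user, device, ts.dayofweek * 24 + ts.hour)
    hour_score = 1.0 - (baseline[key] if key in baseline.index else 0.0)
    typical = rolling.loc[(user, device)].iloc[-1]
    volume_score = abs(logins_this_week - typical) / max(typical, 1.0)
    return hour_score + volume_score

# A 3am Sunday login during an unusually busy week scores high.
print(round(unusualness("alice", "laptop-01", pd.Timestamp("2024-03-17 03:00"), 400), 2))
```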

Anomalies: Adding a dimension

But there are certainly other dimensions to file transfers and user logins not accounted for in these strategies. What protocol was used to transfer a file? At what time, and over what period? What were the source and destination? So let’s think about what would make a user login or a file transfer anomalous. Maybe outlier and anomaly are the same thing? If they are, great: job done! But… they’re not. At least not for us.

The idea of an outlier is well understood in statistical research, but anomalies, less so. Here, we’ll briefly explain how (and why) Senseon differentiates outliers and anomalies. The example worked through in the rest of this post is simplified for a clear explanation. In most real situations, our datasets represent points in a high-dimensional space (many more than two dimensions), and the data types aren’t always continuous or even numeric. And to make matters even more interesting, we analyse a lot of data, fast. These challenges are addressed with a mixture of concurrency, scalable algorithms, and opportunistic computing. In this post, we’ll focus on our conceptualisation of anomaly detection and how it differs from outlier detection.

Let’s start with a small cluster of data around the origin of a 2-D plane. In a real-world situation, the axes would be meaningful, but here the meaning of the axes doesn’t matter.

The data depicted above is random, sampled from a bivariate normal (Gaussian) distribution with its mean at the origin. Now, say we want to find outliers in this data. Remember outliers are defined as points in a distribution that are far from the expected value(s). Luckily, for this distribution the mean is the expected value and the mean is the origin, so distance from the origin is our definition of outlier. We can see from the overlaid concentric circles that there are a couple of points which might justifiably be considered outliers, but none of them really stand out as such.
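In code, generating such a cluster and scoring each point by its distance from the origin might look like this (a toy numpy sketch, not our production scoring):

```python
import numpy as np

rng = np.random.default_rng(4)

# A tight cluster around the origin, like the first plot described above.
points = rng.multivariate_normal(mean=[0, 0], cov=np.eye(2), size=500)

# Outlier score: distance from the expected value, which here is the origin.
dist = np.linalg.norm(points, axis=1)
print("furthest point:", points[dist.argmax()].round(2), "at distance", dist.max().round(2))
```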

Let’s make things more complicated. Say we have a similar cluster around the origin, but with a fair few data points distributed along the x axis.

Now it’s pretty clear that we have some outliers: points relatively far from the origin. In this dataset, they’re still centred around the x axis (spoiler: their y values are normally distributed around y = 0). Our conception of an outlier doesn’t really need to change to cope with this new dataset; it’s still distance from the origin.

Getting more interesting, let’s say we also have a similar pattern running up the y axis:

We’ve got a bunch of points around the origin and a few trailing out along each axis. It’s not hard to justify calling anything beyond, say, 7 circles from the origin an outlier. Great! (Why 7? Let’s say just because fewer than 2% of the points land beyond that distance.)
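In practice, a “fewer than 2% of points beyond it” rule is just an empirical quantile of the distances. A small sketch with toy data shaped like the plots above (the shapes and counts are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# A cluster at the origin plus trails along the x and y axes, like the plots above.
cluster = rng.normal(0, 1, size=(500, 2))
x_trail = np.column_stack([rng.uniform(2, 12, 60), rng.normal(0, 1, 60)])
y_trail = np.column_stack([rng.normal(0, 1, 60), rng.uniform(2, 12, 60)])
points = np.vstack([cluster, x_trail, y_trail])

# Pick the outlier radius empirically: the distance exceeded by fewer than 2% of points.
dist = np.linalg.norm(points, axis=1)
radius = np.quantile(dist, 0.98)
print(f"outlier radius ≈ {radius:.1f}; {np.mean(dist > radius):.1%} of points lie beyond it")
```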

Outliers and anomalies

Is there anything different about outliers along the x axis and the y axis? Well, that’s a question for our nameless metrics represented on the axes. They’ll remain nameless here because it’s really up to you and your data to answer that kind of conceptual question. For this example, there’s nothing statistically different. But what if we were to add some off-axis points in the northeast quadrant?

It’s not just the colour that makes these new points weird. In terms of our conception of outlier, they’re nothing special: most of the new points are beyond our threshold of 7 circles from the origin and a couple are not. So we have a couple more outliers. Great. But they’re distinctive for reasons that have nothing to do with their distance from the origin. They are anomalies. They are anomalies because they lie far from what are called the principal components represented by the rest of the data. The idea of a principal component is a topic for another post, but the distinction you can hopefully see is that the new points are rather distant from the axes, where most other points are found.
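To put a number on that intuition, here is a toy sketch in which the principal directions of the data lie along the x and y axes by construction, so “distance from the nearest principal component” reduces to min(|x|, |y|). In real data you would estimate those directions rather than assume them, and this is not Senseon’s actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(6)

# Background data: the origin cluster plus trails along each axis, as plotted above.
cluster = rng.normal(0, 1, size=(500, 2))
x_trail = np.column_stack([rng.uniform(2, 12, 60), rng.normal(0, 1, 60)])
y_trail = np.column_stack([rng.normal(0, 1, 60), rng.uniform(2, 12, 60)])
background = np.vstack([cluster, x_trail, y_trail])

# The new "northeast" points, far from either axis.
off_axis = rng.normal(loc=[8.0, 8.0], scale=0.5, size=(10, 2))
points = np.vstack([background, off_axis])

# Outlier score: distance from the origin. Anomaly score: distance from the nearest
# principal axis, which in this constructed example is just min(|x|, |y|).
outlier_score = np.linalg.norm(points, axis=1)
anomaly_score = np.abs(points).min(axis=1)

# The off-axis points have outlier scores comparable to the trails,
# but their anomaly scores stand well clear of the background.
print("mean anomaly score, background:", anomaly_score[:-10].mean().round(2))
print("mean anomaly score, northeast :", anomaly_score[-10:].mean().round(2))
```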

So are anomalies necessarily outliers? Consider the following addition to our dataset:

Those new red points in the southwest are weird, right? They’re certainly separate from the big cluster around the origin. But most of them are well within 2 circles of the origin. Are they outliers? Certainly not by our first definition. And if we update our definition to say, ok, outliers are anything beyond the first circle, then we’ll have to include quite a few points along the positive portions of the x and y axes. That doesn’t quite seem right. So they’re not outliers. But they are definitely anomalies, for the same reason the more distant northeast points are anomalies: they don’t fall along the principal components of our dataset.

The main message is that outliers are not the same as anomalies: you can have one without the other. To wrap this up visually, this whole post boils down to the following figure:

Many components in Senseon’s defence framework look for outliers to raise or enrich security observations. The distinction between outlier and anomaly gives us more information. We use this additional information to reduce false-positive rates and filter out unimportant or redundant noise. In our example, distance from the origin provides a measure of “outlierness”. Sometimes this isn’t easy, but it’s usually possible. Scoring anomalies is also possible, and it’s here that we spend a lot of cycles implementing robust, scalable solutions. Consider the outlier-anomaly cluster in the northeast. The points are nearly equidistant (centred at about 45 degrees) from each of the principal components lying along the x and y axes. If these points were closer to one or the other axis, our anomaly detection strategy would score them lower on a scale of “anomalousness”. These two kinds of scores, outlier and anomaly, allow Senseon to provide sensitive, discerning analysis of complex situations.
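A tiny illustration of how the two scores separate the cases in this post; the specific points and formulas are ours, chosen for clarity rather than taken from Senseon’s implementation:

```python
import numpy as np

def outlier_score(p):
    """Distance from the origin (the expected value in our toy example)."""
    return float(np.linalg.norm(p))

def anomaly_score(p):
    """Distance from the nearest principal axis (x or y in the toy example)."""
    return float(np.abs(np.asarray(p)).min())

# A point at ~45 degrees is an outlier and a strong anomaly; an equally distant
# point hugging the x axis is an outlier but barely anomalous; a nearby southwest
# point is no outlier at all, yet still anomalous relative to the dense cluster.
for p in [(8.0, 8.0), (11.3, 0.5), (-2.0, -2.0)]:
    print(p, "outlier:", round(outlier_score(p), 1), "anomaly:", round(anomaly_score(p), 1))
```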

This example was designed to illustrate our strategies for outlier and anomaly detection, but things are usually more complicated. Our data is often multi-dimensional, with different types (scales, categories, counts, text, etc.) to which our methods can adapt, often automatically. You can probably guess that our anomaly detection algorithms have something to do with those principal components. Along with some clever tricks to extract why an anomaly is anomalous, this lets Senseon explain and justify the results of complex data-driven analytics to our users. While AI, machine learning, and data-driven strategies more generally may not address every cyber defence problem, once the easy ones are solved you have to get creative to cover the final mile of malicious behaviour.


About the author

Dr Aaron Gerow, Software Engineer, Senseon

Aaron “Gerow” got his start studying Computer Science at Pacific Lutheran University, where he later worked as a programmer and Unix admin. Then it was on to postgraduate degrees in cognitive science and computational linguistics. After a pair of postdocs in computational social science and big data analytics, he moved to London for a lectureship in Data Science at Goldsmiths, University of London. Last summer, he took his first job in the private sector, joining Senseon as a Software Engineer. As you can imagine, he spends a fair bit of time on data science work in addition to backend development. When he’s not busy accumulating as much knowledge as he can from Senseon’s security analysts, he’s trying to learn Rust and assembly.