Monday, December 26, 2011

Diary of an Anomaly Detection System

I recently finished Stanford's Machine Learning course, offered as part of SEE. It was one of the best courses I've ever taken. Not only did I learn a lot about ML algorithms, but I also learned a lot about their applications, and picked up some useful math along the way.

One of the algorithms that grabbed my attention was "anomaly detection"; this is the algorithm credit card companies use to flag possibly fraudulent activity. It can also be used for monitoring computers and, I believe, web pages, but more on that later.

Which brings us to this series of blog posts. Since I'm on winter break, I decided to spend part of my time designing, coding, and blogging about building an anomaly detection system.

The Motivation


My current and ongoing project at work is called Sentry, a system that processes URLs. The system involves 23 (and counting) virtual machines and four physical machines, so system administration is a not-insubstantial part of the project.

Currently, I measure the overall health of the system by throughput, using collectd and RRDtool to monitor it. If the throughput is too low, I know something is wrong with one or more of the machines, but that doesn't tell me which machine(s) is having a problem. So I look at the CPU load graph for each machine to see if anything looks odd; if not, I look at the network transfer graph for each machine to see if anything looks off; if not, then I... Since there are 27 machines, each with over a dozen metrics, this is a chore, as you can imagine.

Anomaly detection will reduce the dozen-plus metrics for each computer down to a range of numbers that we consider "normal behavior" and, by extension, the health of all 27-plus machines down to a range of numbers. So how do we define that range?


All of my metrics measure different things: load average, memory use, etc. If I plot each metric's values over time I'll get different-looking graphs, but for now, assume they're all Gaussian. What I'm going to do is take each metric's numbers and figure out which Gaussian curve (in other words, which values of μ and σ) best fits the data. I'll then take the latest reading for, say, memory use, and see where on the curve it sits (since I know the equation for the Gaussian density function); call that value p(memory). For any value of p(memory) less than some small threshold ε, I'll tell the computer to flag that reading as an anomaly: a low probability means the reading is far from what the fitted curve considers normal.
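As a sketch of that idea in numpy (the readings, the ε value, and the function names here are all made up for illustration; this isn't the collectd pipeline itself):

```python
import numpy as np

def fit_gaussian(samples):
    """Estimate the Gaussian parameters (mu, sigma) from one metric's history."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(), samples.std()

def gaussian_p(x, mu, sigma):
    """Probability density of reading x under the fitted Gaussian."""
    coeff = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical memory-use readings (fraction of RAM in use):
history = [0.41, 0.43, 0.40, 0.44, 0.42, 0.45, 0.41]
mu, sigma = fit_gaussian(history)

epsilon = 0.05
latest = 0.97                      # a reading far from the fitted mean
if gaussian_p(latest, mu, sigma) < epsilon:
    print("anomaly: memory use")
```

The only tunable here is ε: lower it and fewer readings get flagged, raise it and the detector gets more paranoid.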

I'll then take all the latest metric readings (the x's), calculate the p(x)'s, and multiply them together; that gives me a single number which, compared against a machine-level threshold (a "super-ε", if you like), tells me if my computer is acting "weird". If I multiply the p(x)'s for all 27+ computers together, I get another number telling me if my cluster of computers is acting "weird". Neat, huh?
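Extending the sketch to a whole machine, assuming the history is an m × n matrix of m readings over n metrics (again, the data and thresholds below are invented for illustration):

```python
import numpy as np

def fit_gaussians(X):
    """X: m x n matrix, m readings over time of n metrics.
    Returns per-metric (mu, sigma) as two length-n vectors."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), X.std(axis=0)

def machine_p(x, mu, sigma):
    """Product of the per-metric Gaussian densities for one reading vector x."""
    p = (1.0 / (np.sqrt(2.0 * np.pi) * sigma)) \
        * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return p.prod()

# Hypothetical history: rows are readings, columns are [load avg, mem use]
X = np.array([[0.9, 0.41], [1.1, 0.43], [1.0, 0.40], [0.8, 0.44], [1.2, 0.42]])
mu, sigma = fit_gaussians(X)

super_epsilon = 1e-4
latest = np.array([1.0, 0.42])     # looks normal
weird = np.array([9.0, 0.42])      # load average way off
print(machine_p(latest, mu, sigma) < super_epsilon)   # False: healthy
print(machine_p(weird, mu, sigma) < super_epsilon)    # True: flag it
```

One practical caveat: multiplying dozens of densities underflows quickly, so real implementations usually sum the log of each p(x) instead of taking a raw product.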

First Things First


But first, I need to gather the data and put it into a matrix form that the algorithm can handle. The raw data resides in RRDtool's round-robin databases. I'll be using Python as my main language, with numpy and scipy as the vectorization libraries. Even though my main machine at home is a MacBook Pro, installing numpy and scipy on an MBP is a PITA, so I'll be doing all of my work on an Ubuntu Linux box.
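To give a taste of the massaging step: RRDtool's fetch command hands back one tuple per time step, with None marking missing samples, and those rows need to become a clean numeric matrix before any Gaussian fitting can happen. A minimal sketch, with the rows hard-coded rather than actually fetched from an .rrd file:

```python
import numpy as np

# Hypothetical rows in the shape rrdtool's fetch produces: one tuple per
# time step, one entry per data source, None where a sample is missing.
raw_rows = [
    (0.91, 0.41),
    (1.05, None),    # RRDtool reports gaps as None
    (0.98, 0.43),
    (None, 0.40),
]

def rows_to_matrix(rows):
    """Turn fetch rows into an m x n float matrix, dropping rows with gaps."""
    clean = [r for r in rows if None not in r]
    return np.array(clean, dtype=float)

X = rows_to_matrix(raw_rows)
print(X.shape)    # (2, 2): two complete readings of two metrics
```

Dropping incomplete rows is the crudest possible gap policy; interpolating across gaps is another option, and which one is right depends on how sparse the data turns out to be.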

Massaging the data will be the topic of my next blog post.
