Wednesday, December 28, 2011

DADS: Massaging Data

I've decided that "Diary of an Anomaly Detection System" is too wordy to keep writing in the title of the posts in this series, so I'm shortening it to "DADS" hence the title of this post "DADS: Massaging Data".

Anywho, as I said in the previous post, I'm going to talk a bit about what I needed to do to get my data ready for the anomaly detection algorithm. This post has nothing to do with machine learning, per se, but is an important part of designing an ML algorithm.

I'm going to use seven metrics ("features" in ML parlance) to start with: short-, medium-, and long-term load averages; memory use; number of processes; and the number of zombies. You can argue whether or not these are useful metrics but I'm not interested in that argument at this point. I'm currently building the framework for the ML algo; I'll be adding, subtracting, and inventing metrics once I have something to manipulate them with.

I'm using Python since that is one of the scripting languages of choice at my day job; Perl, unfortunately, is frowned upon and the consensus is Ruby can't do scientific programming just yet. Don't even get me started with Java.

Let's read some data

The data originally resides in RRDtool and needs to be put into a standard matrix form. Shouldn't be that difficult, right? RRDtool has a Python interface, so it's just a matter of reading the data in, right?  I wish! The RRDtool Python API is essentially a wrapper around the command-line tool but the output is "Python-esque".  For example, the CLI output for the load average looks like this:

[faber@fabers-desktop data] rrdtool fetch load/load.rrd  AVERAGE -s 1321160400

                      shortterm midterm longterm

1321164000: 5.3888888889e-03 1.2805555556e-02 5.0000000000e-02
1321167600: 3.0555555556e-03 1.1388888889e-02 5.0000000000e-02
1321171200: 3.7500000000e-03 1.1861111111e-02 5.0000000000e-02
...


where the first column is the number of seconds from the epoch and the three remaining columns are short-, medium- and long-term load averages; a very handy format. Unfortunately the Python output looks like this:

>>> mydata = rrdtool.fetch('load/load.rrd', 'AVERAGE', '--start=1321160400')
>>> mydata
((1321160400, 1325098800, 3600), ('shortterm', 'midterm', 'longterm'), [(0.005388888888888891, 0.012805555555555468, 0.05000000000000019), (0.0030555555555555557, 0.011388888888888818, 0.05000000000000019), (0.0037500000000000016, 0.011861111111111041, 0.0500000000000002), ...])


which is not a very handy format. For reasons which I'll get into later, I want the format to be this:

shortterm = ((1321164000, 5.3888888889e-03),
             (1321167600, 3.0555555556e-03),
             (1321171200, 3.7500000000e-03),
             ...
            )

midterm = ((1321164000, 1.2805555556e-02), (1321167600, 1.1388888889e-02), (1321171200, 1.1861111111e-02),...)

longterm = ((1321164000, 5.0000000000e-02), (1321167600, 5.0000000000e-02), (1321171200, 5.0000000000e-02),...)


So the next step is to format the data.

List Comprehensions to the Rescue

I've always thought Python was just an okay language but its list comprehensions are kinda cute.  It wasn't until this project that I found out just how useful they are.  Here's the blow-by-blow action:

# mydata[0] = timestamp begin, end, and interval
# mydata[1] = labels
# mydata[2] = list of 3-tuples
mydata = rrdtool.fetch('load/load.rrd', 'AVERAGE', '--start=1321160400')

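# create a list of timestamps at the appropriate intervals
# (the first row belongs to start + step, as the CLI output above shows)
tses = [ i for i in range(mydata[0][0] + mydata[0][2], mydata[0][1] + mydata[0][2], mydata[0][2]) ]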

# create three lists from the 3-tuple list
st, mt, lt = zip(*mydata[2])

mydict = {}
mydict['shortterm'] = zip(tses, st)
mydict['midterm'] = zip(tses, mt)
mydict['longterm'] = zip(tses, lt)


Seven lines of code. I don't know about you, but I'm impressed when a language allows me to do that with native functions.
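If you want to play with the reshaping without an RRD file handy, here's a minimal sketch that runs on a hand-copied fetch result. The `fetched` tuple (trimmed to the three rows shown above, with the end timestamp shortened to match) and the `split_metrics` helper are my own stand-ins for illustration, not part of the RRDtool API. It assumes the first row belongs to start + step, which is what the CLI output shows, and it wraps `zip` in `list` so it also behaves under Python 3.

```python
# Hypothetical stand-in for an rrdtool.fetch() result: only the first three rows,
# with the end timestamp shortened to match them.
fetched = (
    (1321160400, 1321171200, 3600),
    ('shortterm', 'midterm', 'longterm'),
    [
        (0.005388888888888891, 0.012805555555555468, 0.05000000000000019),
        (0.0030555555555555557, 0.011388888888888818, 0.05000000000000019),
        (0.0037500000000000016, 0.011861111111111041, 0.0500000000000002),
    ],
)


def split_metrics(fetched):
    """Turn one fetch-style result into {label: [(timestamp, value), ...]}."""
    (start, end, step), labels, rows = fetched
    # Assumption: the first row is stamped start + step and the last one end.
    timestamps = range(start + step, end + step, step)
    # zip(*rows) transposes the list of row tuples into one tuple per column.
    columns = zip(*rows)
    return {label: list(zip(timestamps, column))
            for label, column in zip(labels, columns)}


series = split_metrics(fetched)
```

Looping over the labels instead of naming `st`, `mt`, and `lt` by hand means the same helper should work for RRD files with a different number of columns.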

So what's with the key/value format?

There's a subtle problem with the raw data that's not obvious until you start reading in other RRDtool files and try munging them together: you don't always have data for all the same timestamps. memory.rrd might have data for timestamps t1 and t2 while load.rrd might have data for t2 and t3. How do you manage your lists so that you don't duplicate timestamps (two t2s in the above case) AND fill in values for data you don't have and don't know you don't have? Easy:

SQL.

I'm going to store my data into an SQLite3 database then generate a matrix from the database table. If I do my SQL correctly (and I will :-), SQLite3 will fill in missing data, order by timestamp and I don't have to keep track of values or timestamps across rrd files! This is why I break every metric.rrd file into a (timestamp, value) data structure and put it into a dictionary called mydict['metric']; so I can easily insert and update the metric column in the database!
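Just to make the idea concrete, here's a rough sketch using Python's built-in `sqlite3` module. The table name `metrics`, the sample values, and the tiny timestamps 1, 2, 3 (standing in for t1, t2, t3) are all made up for illustration, and this isn't necessarily the schema I'll end up with. Note that "filling in" here just means the missing values come back as NULL (`None` in Python).

```python
import sqlite3

# Hypothetical sample data in the mydict['metric'] shape.
# memory has t1 and t2; shortterm has t2 and t3.
mydict = {
    'memory': [(1, 512.0), (2, 520.0)],
    'shortterm': [(2, 0.003), (3, 0.004)],
}

conn = sqlite3.connect(':memory:')

# Column names come from our own dictionary keys, never from user input,
# so building the SQL text with string formatting is acceptable here.
columns = sorted(mydict)
conn.execute('CREATE TABLE metrics (ts INTEGER PRIMARY KEY, %s)'
             % ', '.join('%s REAL' % c for c in columns))

for metric, pairs in mydict.items():
    for ts, value in pairs:
        # Create the row only if this timestamp is new, then fill in one column.
        conn.execute('INSERT OR IGNORE INTO metrics (ts) VALUES (?)', (ts,))
        conn.execute('UPDATE metrics SET %s = ? WHERE ts = ?' % metric,
                     (value, ts))

# One row per timestamp, ordered, with None wherever a metric had no reading.
matrix = conn.execute('SELECT ts, %s FROM metrics ORDER BY ts'
                      % ', '.join(columns)).fetchall()
```

With the sample data, `matrix` has exactly three rows: t2 appears once with both values, while t1 and t3 each carry a `None` for the metric that had no reading.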

How that is actually done, I'll talk about in the next post since it's late.


Monday, December 26, 2011

Diary of an Anomaly Detection System

I recently finished Stanford's Machine Learning course offered as part of SEE. It was one of the best courses I've ever taken. Not only did I learn a lot about ML algorithms, but I learned a lot about the applications thereof as well as some math applications.

One of the algorithms that grabbed my attention was "anomaly detection"; this is the algorithm credit card companies use to flag possibly fraudulent activity. It can also be used for monitoring computers and, I believe, web pages but more about that later.

Which brings us to this series of blog posts. Since I'm on winter break, I decided to spend part of my time designing, coding, and blogging about building an anomaly detection system.

The Motivation


My current and on-going project at work is called Sentry, a system that processes URLs. The system involves 23 (and counting) virtual machines and four physical machines so obviously, system administration is a not-insubstantial part of the project.

Currently, the overall health of the system can be measured by throughput; I use collectd and RRDtool to monitor the system. If the throughput is too low, I know something is wrong with one or more of the machines but that doesn't tell me which machine(s) is having a problem, so I look at the CPU load graph for each machine to see if anything looks odd, if not, I look at the network transfers graph for each machine to see if anything looks off, if not, then I... Since there are 27 machines, each with over a dozen metrics, this is a chore, as you can imagine.

Anomaly detection will reduce the dozen-plus metrics for each computer down to a range of numbers that we consider "normal behavior" and, by extension, the health of all 27-plus machines down to a range of numbers. So how do we define the range of numbers?


All of my metrics measure different things: load average, memory use, etc. If I plot the range of numbers over time I'll get different looking graphs, but for now, assume all the graphs are Gaussian. What I'm going to do is take each metric's numbers and figure out which Gaussian curve (in other words, the values for μ and σ) best fits the data. I'll then take the latest reading for, say, memory use, and see where on the curve it sits (since I know the equation for the Gaussian distribution function); call that value p(memory). For any value of p(memory) less than, say, ε, I'll tell the computer to flag that reading as an anomaly, since a low p means the reading sits far out in the tails of the curve.

I'll then take all the latest metric readings (the x-s), calculate the p(x)s and multiply them together; that gives me a single number, super-ε, that tells me if my computer is acting "weird".  If I multiply all the p(x)s for all 27+ computers, I get another super-ε number telling me if my cluster of computers is acting "weird". Neat, huh?
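Here's a minimal pure-Python sketch of that idea, before numpy gets involved. The function names, the sample readings, and the `EPSILON` threshold are all invented for illustration; picking a real threshold is a separate problem. It treats each metric as an independent Gaussian and flags a reading when the product of the densities is small.

```python
import math


def fit_gaussian(values):
    """Return (mu, sigma) of the Gaussian that best fits the readings."""
    mu = sum(values) / len(values)
    variance = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(variance)


def density(x, mu, sigma):
    """Gaussian density p(x; mu, sigma). Assumes sigma > 0."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


def machine_score(history, latest):
    """Multiply the per-metric densities of the latest readings together."""
    score = 1.0
    for metric, values in history.items():
        mu, sigma = fit_gaussian(values)
        score *= density(latest[metric], mu, sigma)
    return score


# Made-up readings for one machine.
history = {
    'load': [0.10, 0.12, 0.11, 0.09, 0.13],
    'memory': [510.0, 515.0, 505.0, 512.0, 508.0],
}
EPSILON = 1e-3  # hypothetical threshold, chosen only for this example

normal = machine_score(history, {'load': 0.11, 'memory': 510.0})
weird = machine_score(history, {'load': 0.90, 'memory': 510.0})
```

With these numbers, `normal` comes out well above `EPSILON` and `weird` far below it, so only the second reading would get flagged. A metric that never changes would give σ = 0 and blow up the division, so real code needs a guard for that.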

First Things First


But first, I need to gather the data and put it into a matrix form that the algorithm can handle. The raw data resides in RRDtool tables. I'll be using Python as my main language. The vectorization libraries will be numpy and scipy. Even though my main machine at home is a Macbook Pro, installing numpy and scipy on a MBP is a PITA, so I'll be doing all of my work on a (Ubuntu) Linux box.

Massaging the data will be the topic of my next blog post.







Sunday, February 6, 2011

The Budget Game

I've been doing Tim Ferriss's "Slow Carb Diet" from his book "The 4-Hour Body". To help out, I've been tracking calories using the MyFitnessPal Calorie Counter app on my Android phone. So far, so good; I'm down five pounds in two weeks and this is the first time I've been below 190 in a decade.

So I got to thinking, why not do with dollars what I'm doing with calories, IOW, set up a budget? Now, I've done that in the past and it's been BORING! so I thought about how to make it more fun and came up with the idea of making it a game.

Apparently, the idea of making budgeting a game isn't a new idea but it's not that widespread either. So I'm thinking how to turn budgeting into a game. Here's what I've gotten so far:

1. A "game" is a month long and you play 30 day-long "sessions".

2. You allocate a specific amount of health points for the game (your monthly budget) and session (your daily budget). The daily allotment shows up on the screen as your avatar (think the Marine from DOOM; I am ;-).

3. As you spend money, your avatar's health deteriorates. If you go over budget, your avatar dies and the game is over until the next day.

4. If you do get through the day without killing your avatar, you win points (?) and the remaining health points go into a vault. At the end of the game, you see how well you've scored by seeing how many health points you have in the vault.

The idea here is to track day-to-day spending, not the big monthly outlays like rent and heating.
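To check that the rules hang together, here's a quick sketch of the session logic. It's in Python rather than the HTML5/JavaScript the real app would use, and the class name, method names, and dollar amounts are placeholders. It leaves out the "points (?)" part since I haven't decided how that works.

```python
class BudgetGame:
    """One month-long game made of day-long sessions."""

    def __init__(self, daily_budget):
        self.daily_budget = daily_budget
        self.health = daily_budget  # today's remaining health points
        self.alive = True
        self.vault = 0.0            # health points saved from earlier days

    def spend(self, amount):
        """Spending money costs health; going over budget kills the avatar."""
        if not self.alive:
            return  # game over until the next day
        self.health -= amount
        if self.health < 0:
            self.alive = False

    def end_session(self):
        """Bank whatever is left (if the avatar survived) and start a new day."""
        if self.alive:
            self.vault += self.health
        self.health = self.daily_budget
        self.alive = True


game = BudgetGame(daily_budget=20.0)

# Day 1: stay under budget, so the leftover goes into the vault.
game.spend(5.0)
game.spend(3.5)
game.end_session()
vault_after_day_one = game.vault

# Day 2: blow the budget, so nothing is banked.
game.spend(25.0)
dead_on_day_two = not game.alive
game.end_session()
```

After the two days, the vault still holds only the 11.5 points left over from day one.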

Since this is going to run on my Android phone and I really dislike Java, I'm going to do this as an HTML5 app. If I understand correctly, I can write the app using vi and Chrome, then wrap it in an Android "package" and I'm done. I assume something similar can be done for the iPhone, but, come on, who uses an iPhone these days? Really?

The game idea still needs work, but that's what development is for. :-) If you have any suggestions, comment below or tweet me using the hashtag #TheBudgetGame.



Monday, June 22, 2009

Okay, I have no idea what this means, but a crocheting friend of mine thinks this is the bee's knees for crocheters everywhere, so I'm passing it along:

With most afghans that run back-and-forth (in contrast to the concentric type), the base row can be inflexible compared to the rest of the afghan, which results in one end (usually one of the shorter sides) being a consistent length and the opposite side getting longer as the afghan is stretched or pulled as it ages. I had about 10 rows done when I realized the base row was too tight, and I experimented and figured out a way to make the base row and the first row at the same time, and make it as flexible/stretchy as the rest of the crocheting. One small step for me, one giant leap for crocheters everywhere. I should publish this technique somewhere.

[Here's the technique]
chain 4, yarn over, insert hook in first chain stitch, yarn over, pull through a loop. Yarn over and pull through the last loop, then yarn over and pull through the last loop (two chain stitches on the end of the hook), then yarn over and finish like a normal double crochet: pull through two loops, yarn over and pull through two loops. For the next stitch, yarn over, insert hook in the second chain stitch at the bottom of the previous eccentric double crochet, pull through a loop, do the two chain stitches at the end of the hook, then complete the double crochet.
Got that? Good! Now someone explain it to me! On second thought, don't bother. :-)

Monday, May 25, 2009

Thoughts about the Panopticon

I finally got around to reading Joshua-Michéle Ross's three articles over on O'Reilly's Radar. Nice overview of the topics but I was hoping for something more in depth. The last one, The Digital Panopticon, gave me an idea.

While I do love the idea of location based services (I'm even writing one of my own), I'm beginning to wonder if there is a way to anonymize such a service so the end-user can have the benefits of LBS without giving up information to the Watchers. Ideas are welcome.

As an aside, Joshua-Michéle states:
In the age of social networks we find ourselves coming under a vast grid of surveillance - of permanent visibility. The routine self-reporting of what we are doing, reading, thinking via status updates makes our every action and location visible to the crowd. This visibility has a normative effect on behavior (in other words we conform our behavior and/or our speech about that behavior when we know we are being observed).
He doesn't take into account that we (some of us, at least) are not reporting all of our activities and locations. True, we may be few and far between, but we do exist.


Wednesday, May 20, 2009

I loved this TED Talk about 10 things you didn't know about orgasm, mostly because I think Mary Roach is hot! :-)

Saturday, May 16, 2009

Last year's MP3 Experiment

With the latest NYC MP3 Experiment coming up next week, I thought this might be a good time to reminisce about my first experience with the MP3 Experiment from last year. So here's a write-up of last year's activities. I'm really looking forward to this year's!

Here are some pictures of last year's event.

------------------------------------------------------------------------------
Okay, if you followed my Twitter feed last Saturday (20080927), you have a pretty good idea how my Day In The city went and my thoughts on Governor's Island, dinner and girl-watching in Little Italy, and the Museum of Sex. What you don't know is what actually happened during the MP3 Experiment:NYC because I was busy doing the experiment. ImprovEverywhere will have film and commentary up on their site in a few weeks. Here's my experience with it all.

Background: We were told to show up at Governor's Island, just south of Manhattan, wearing a red, green, yellow, or blue t-shirt, bring an umbrella and a balloon. At precisely 3:15 PM, we were to push the play button on our MP3 players and do what the Voice of Steve told us.

I showed up about an hour early. The weather was warm, if misty/raining. After walking around awhile, I wasn't feeling too happy. Anyway, at 3 PM I headed for the large field in the middle of the island, put in my earphones, checked the time on my cell phone, sent one last tweet, and waited. Fortunately, the weather started to cooperate; the sun hadn't come out but it did stop raining/misting. There were maybe 50 to 75 people scattered about the field which is a couple of acres large.

At precisely 3:15 PM, I pushed "play" and music played. And played. And played. I heard someone mention maybe they were just screwing with us and there was nothing but music on the MP3. After a minute, we heard The Voice Of Steve.

The Voice of Steve (which was obviously computer modulated but not computer generated) welcomed us to Governor's Island. After a few minutes of talk, Steve told us to stand and stretch. It was interesting watching the others; some were ten or even twenty seconds later than others. Along with not synchronizing properly, it turns out that MP3 players don't all play at the exact same rate.

After we stood and stretched, Steve had us point to NYC, then to our homes, then to Nicaragua. At this point, I saw some people point at the sky! Steve commented on our lack of geographical knowledge.

Steve then told us to look around for a person wearing a different color shirt (I was wearing green) and give them a big hug. The first person I saw was a large woman wearing blue, so I walked up to her and gave her a hug. And she hugged back. Hard!

Steve said to hug an inanimate object. Not being near a tree or anything, I hugged my umbrella. Then He said to hug an animal. Any animal would do: squirrel, goose, an ant. Unfortunately, I couldn't even find an ant to hug. :-(

Steve then declared we would have thumb wars! This cute little chick (she barely came up to my shoulders) in a blue denim jacket and jeans was walking by so I grabbed her and we got into position. "One, two, three, four, I declare a thumb war!" boomed in our headsets and we started thumb wrestling! I easily won the first round. My opponent decided I wasn't going to win a second time; she put down her umbrella, got into a fighting stance and "One, two, three, four I declare a thumb war" was heard and she fought hard! And she won! So with a tied score, we fought one last battle which I handily won. She bowed to my impressive thumb-fighting skills and we went our different ways.

Steve then told us to walk to the field in the middle of the island where there would be an "epic battle" later. (Every time Steve mentioned the words "epic battle", a deep baritone would say "EPIC! BATTLE!" behind him.) This is when I noticed that many people were not on the field to begin with! He told us to take out our umbrellas, hold them high over our heads and walk around the field. We did this for several minutes; it actually got kinda boring after a while but I got to see some interesting ppl.

One interesting person I saw was a young woman with The Most Elegant Equation in Mathematics tattooed across her shoulder blades in letters a hand high! When she walked past me a few minutes later, I tapped her on her shoulder and said "Cool tat!". She looked at me, confused, and said "Burn" and walked away. That's when I noticed that it wasn't a tattoo; the equation was burned into her skin! Kids these days!

We played a couple of "motion games" as I call them: "Equilateral Triangle" and "Attacker Defender". In the former, you choose two other people on the field and you move in such a way as to form an equilateral triangle. It sounds easy until you realize they're doing the same thing with two *other* people. "Attacker Defender" is similar except you keep the person you've chosen as the Defender between you and the Attacker. Again, they're doing the same thing with two other people. Then it got interesting.

Steve told us to find three other people with the same color shirt and form a square with them, so there I was standing shoulder to shoulder forming a tight square with an Irish woman, a swarthy fellow and an Asian fellow. It's just weird standing that close to strangers, what with them in your personal space, you know?

At this point, Steve tells us we need to learn how to do a "fife and drum" shtick for the epic battle (EPIC! BATTLE!). He instructed the reds and the blues to tap out a rhythm on their thighs. He instructed the greens and the yellows to play the fife part and we whistled the tune he gave us. Imagine the scene: over 200 people standing on a field, forming tight little squares, half of them drumming on their thighs and the other half whistling.

After we practiced that a few times, Steve congratulated us on a job well done. Then he said to find three other squares of the same color and form a larger geometric object, you know, like the shapes in Tetris. I'm sure you can guess what comes next. After we formed a larger object, we were told to move around the field and find a shape we could fit in with. Yes, we were playing human Tetris. So those squares of people were now scrunched together even tighter! There was, literally, no room to move.

Steve then told us to take out our umbrellas and lift them over our heads. We did, and we blotted out the sky! One minute we're in an open field and the next minute we're under a canopy of plastic. That felt really weird! And then we started humming, cuz Steve said to.

After a few minutes of this, Steve congratulated us again, told us to put our umbrellas away and to find people wearing different colored shirts. That was easy, I just turned around and I was standing next to another little cutie in blue (I refer to her as Smurfette), a tall guy in red, etc. I noticed Smurfette wasn't wearing headphones and I asked her if she was hearing The Voice Of Steve. She shook her head "no" so I took out one of my earplugs and put it on her. I was rather surprised she didn't pull away or anything when I did it, but where was she going to pull away to? We were all packed tightly together.

Steve mentioned that we were now going to play "Human Twister"; he would mention a color and either "head", "elbow", "shoulder" or "left/right foot" and you had to put your hand on the head/elbow/shoulder or put your foot next to the foot of someone with that color shirt. So when Steve said "Red head" we all put our hands on the head of the tall guy wearing red. With "blue shoulder" I put my hand on Smurfette's shoulder, etc. That was fun, but at one point I thought some people were going to fall over and take us (and everyone else?) with them. Fortunately, that was averted by Steve telling us to prepare for the "epic battle" (EPIC! BATTLE!).

To do so, the blues and greens went to one end of the field and the reds and the yellows to the other end. We got out our "weapons", the balloons, and blew them up. Smurfette was still with me, so I was echoing Steve's instructions since she wasn't wearing my other headphone anymore. The two groups started yelling at each other: "Red rules!" "Get them all!", "Death to the Yellows!".

Steve told us to get our "weapons" ready and for each group to walk towards the other and stop when we were twenty yards apart while we did the fife and drum shtick. So there we were: two groups of people walking across the field of battle towards one another, whooping and hollering, brandishing our "weapons", some whistling and some tapping their thighs. That was so much fun and exciting. I actually thought back to earlier battles in history and wondered if this wasn't just a little like them.

Anyway, we stopped twenty yards from each other and commenced with more whooping and hollering. Steve said "Fight" and each group rushed the other one, balloons flailing! Oh, it was an epic battle (EPIC! BATTLE!) to be sure! A Yellow had broken through the lines and I was attacking him when four Reds came around our left flank and attacked me. As they beat me with their balloons, I fell to my knees, crying for help, swinging my weapon uselessly around me, and then I fell over "dead". Cries of "Medic!" were heard all across the battlefield.

I just lay on my back and watched the whole thing. :-)

Eventually, Steve called a halt to the battle. By this time over half of the people were lying on the ground "dead". Steve then had us close our eyes and meditate for a few moments. Then He said goodbye and we all waved as He left.

And with that, the MP3 Experiment:NYC was over.

As I and 200 of my closest friends walked over to the docks, I saw my thumb-war opponent. She saw me and stuck her thumb up in the air. As I passed, we did a quick thumb war which, alas, was a draw.

I was wondering how they were going to handle having 200-some people trying to get back to Manhattan all at the same time, especially when the ferries only run once every half hour. It turns out that they got some really big ferries and they were loading two up at the same time.

So we formed two queues each about six abreast and we were allowed onto the ferries in groups of ten or so. While we were waiting, someone led us in various activities like singing "If you're happy and you know it" and "Row, row, row your boat". Eventually, I did get on the first boat out, had a nice time talking to some other people, and eventually got into a cab and headed for dinner in Little Italy.