I've decided that "Diary of an Anomaly Detection System" is too wordy to keep writing in the title of the posts in this series, so I'm shortening it to "DADS" hence the title of this post "DADS: Massaging Data".
Anywho, as I said in the previous post, I'm going to talk a bit about what I needed to do to get my data ready for the anomaly detection algorithm. This post has nothing to do with machine learning, per se, but is an important part of designing an ML algorithm.
I'm going to use seven metrics ("features" in ML parlance) to start with: short-, medium-, and long-term load averages; memory use; number of processes; and the number of zombies. You can argue whether or not these are useful metrics but I'm not interested in that argument at this point. I'm currently building the framework for the ML algo; I'll be adding, subtracting, and inventing metrics once I have something to manipulate them with.
I'm using Python since that is one of the scripting languages of choice at my day job; Perl, unfortunately, is frowned upon and the consensus is Ruby can't do scientific programming just yet. Don't even get me started with Java.
where the first column is the number of seconds from the epoch and the three remaining colums are short-, medium- and long-term load averages; a very handy format. Unfortunately the Python output looks like this:
which is not a very handy format. For reasons which I'll get into later, I want the format to be this:
So the next step is to format the data.
Seven lines of code. I don't know about you, but I'm impressed when a language allows me to do that with native functions.
SQL.
I'm going to store my data into an SQLite3 database then generate a matrix from the database table. If I do my SQL correctly (and I will :-), SQLite3 will fill in missing data, order by timestamp and I don't have to keep track of values or timestamps across rrd files! This is why I break every
How that is actually done, I'll talk about in the next post since it's late.
Anywho, as I said in the previous post, I'm going to talk a bit about what I needed to do to get my data ready for the anomaly detection algorithm. This post has nothing to do with machine learning, per se, but is an important part of designing an ML algorithm.
I'm going to use seven metrics ("features" in ML parlance) to start with: short-, medium-, and long-term load averages; memory use; number of processes; and the number of zombies. You can argue whether or not these are useful metrics but I'm not interested in that argument at this point. I'm currently building the framework for the ML algo; I'll be adding, subtracting, and inventing metrics once I have something to manipulate them with.
I'm using Python since that is one of the scripting languages of choice at my day job; Perl, unfortunately, is frowned upon and the consensus is Ruby can't do scientific programming just yet. Don't even get me started with Java.
Let's read some data
The data originally resides in RRDtool and needs to be put into a standard matrix form. Shouldn't be that difficult, right? RRDtool has a Python interface, so it's just a matter of reading the data in, right? I wish! The RRDTool Python API is essentially a wrapper around the command-line tool but the output is "Python-esque". For example, the CLI ouput for the load average looks like this: [faber@fabers-desktop data] rrdtool fetch load/load.rrd AVERAGE -s 1321160400
shortterm midterm longterm
1321164000: 5.3888888889e-03 1.2805555556e-02 5.0000000000e-02
1321167600: 3.0555555556e-03 1.1388888889e-02 5.0000000000e-02
1321171200: 3.7500000000e-03 1.1861111111e-02 5.0000000000e-02
...
where the first column is the number of seconds from the epoch and the three remaining colums are short-, medium- and long-term load averages; a very handy format. Unfortunately the Python output looks like this:
>>> mydata = rrdtool.fetch('load/load.rrd', 'AVERAGE', '--start=1321160400')
>>> mydata
((1321160400, 1325098800, 3600), ('shortterm', 'midterm', 'longterm'), [(0.005388888888888891, 0.012805555555555468, 0.05000000000000019), (0.0030555555555555557, 0.011388888888888818, 0.05000000000000019), (0.0037500000000000016, 0.011861111111111041, 0.0500000000000002), ...]
which is not a very handy format. For reasons which I'll get into later, I want the format to be this:
shortterm = ((1321164000, 5.3888888889e-03),
(1321167600, 3.0555555556e-03),
(1321171200, 3.7500000000e-03),
...
)
mediumterm = ((1321164000, 1.2805555556e-02), (1321167600, 1.1388888889e-02), (1321171200, 1.1861111111e-02_,...)
longterm = ((1321164000, 5.0000000000e-02), (1321167600, 5.0000000000e-02), (1321171200, 5.0000000000e-02),...)
So the next step is to format the data.
List Comprehensions to the Rescue
I've always thought Python was just an okay language but its list comprehensions are kinda cute. It wasn't until this project that I found out just how useful they are. Here's the blow-by-blow action:
# mydata[0] = timestamp begin, end, and interval
# mydata[1] = labels
# mydata[2] = list of 3-tuples
mydata = rrdtool.fetch('load/load.rrd', 'AVERAGE', '--start=1321160400')
# create a list of timestamps at the appropriate intervals
tses = [ i for i in range(mydata[0][0], mydata[0][1], mydata[0][2]) ]
# create three lists from the 3-tuple list
st, mt, lt = zip(*mydata[2])
mydict = {}
mydict['shortterm'] = zip(tses, st)
mydict['midterm'] = zip(tses, mt)
mydict['longterm'] = zip(tses, lt)
Seven lines of code. I don't know about you, but I'm impressed when a language allows me to do that with native functions.
So what's with the key/value format?
There's a subtle problem with the raw data that's not obvious until you start reading in other RRDtool files and try munging them together: you don't always have data for all the same timestamps.memory.rrd
might have data for timestamps t1
and t2
while load.rrd
might have data for t2
and t3
. How do you manage your lists so that you don't duplicate timestamps (two t2
s in the above case) AND fill in values for data you don't have and don't know you don't have? Easy:SQL.
I'm going to store my data into an SQLite3 database then generate a matrix from the database table. If I do my SQL correctly (and I will :-), SQLite3 will fill in missing data, order by timestamp and I don't have to keep track of values or timestamps across rrd files! This is why I break every
metric.rrd
file into a (timestamp, value)
data structure and put it into a dictionary called mydict['metric']
; so I can easily insert and update the metric
column in the database!How that is actually done, I'll talk about in the next post since it's late.