(C) 2015 Mike Maul -- CC-BY-SA 3.0
This document is the first in a series of tutorials illustrating the use of the CLML.time-series package. In fact, it is the first in a series of tutorial series illustrating the use of CLML.
Anyone wishing to run this notebook or the code contained therein must take note of the following requirements:
- CLML: https://github.com/mmaul/clml.git or a Quicklisp dist of CLML newer than 20150805
- clml.extras: https://github.com/mmaul/clml.extras.git, which is not currently in Quicklisp
Both can be made available to ql:quickload by placing the repositories in $HOME/quicklisp/local-projects.
A time series is a set of data points collected over a given period of time. Examples of time series are stock ticker data, sensor data and netflow data. Generally one wants to perform some sort of analysis on a time series, use previous performance to forecast future performance, or use past performance to detect anomalies in new data.
The CLML.time-series system contains functionality to manipulate and analyze time series data. CLML.time-series has a definite opinion on what a time series is. We will see that after we load some data.
Let's get started by loading the systems necessary for this tutorial and creating a namespace to work in.
(ql:quickload '(:clml.utility ; Need clml.utility.data to get data from the net
:clml.hjs ; Need clml.hjs.read-data to poke around the raw dataset
:clml.time-series ; Need Time Series package obviously
:iolib
:clml.extras.eazy-gnuplot
:eazy-gnuplot
))
(defpackage #:time-series-part-2
(:use #:cl
#:cl-jupyter-user ; Not needed unless using iPython notebook
#:clml.time-series.read-data
#:clml.time-series.anomaly-detection
#:clml.time-series.exponential-smoothing
#:clml.extras.eazy-gnuplot)
(:import-from #:clml.hjs.read-data #:head-points #:!! #:dataset-dimensions)
(:import-from #:clml.time-series.util #:predict)
(:import-from #:clml.hjs.read-data #:read-data-from-file)
)
(in-package :time-series-part-2)
We are going to look at how time series data can be used by CLML. First, let's get some data.
(defparameter dataset (read-data-from-file
(clml.utility.data:fetch
"https://mmaul.github.io/clml.data/sample/msi-access-stat/access-log-stat.sexp")))
CLML's main unit of currency in working with data is the dataset. Datasets are a hierarchy of classes that contain datapoints and metadata. They are similar to data frames in R or data tables in Python.
dataset
CLML has a number of different specializations of dataset, such as:
- unspecialized-dataset: untyped and unspecialized data
- numeric-dataset: dataset containing numeric (double-float) data
- category-dataset: dataset for categorical (string) data
- numeric-and-category-dataset: dataset containing a mixture of numeric and categorical data
Most relevant to this tutorial:
- time-series-dataset: dataset containing time-series data
Datasets can be created directly or by reading them from a file. Supported data formats are CSV and SEXP.
In this case the read-data-from-file function is reading a dataset from a file. The file is obtained with the fetch function, which downloads and caches a file from a location on a local file system or a URL.
Let's take a look at the data; it apparently comes from a hit counter. head-points gives us the first 5 rows of a dataset (if we wanted all the rows in a dataset we would have used dataset-points).
(head-points dataset)
Examining the data, it looks like hits collected hourly. A time-series-dataset can be created with the time-series-data function.
Now let's turn this into a time-series-dataset so we can do stuff with it.
(defparameter msi-access (time-series-data dataset :range '(1) :time-label 0 :frequency 24 :start '(18 3)))
msi-access
This is the point where we will talk about CLML.time-series's definite opinions about time series. Time series in CLML.time-series are discrete. In CLML.time-series's opinion, time series have a regular frequency. (This implies that time series data must have a reading at each period. However, CLML.time-series does support missing values, which will be covered in a later part of this series.) The representation of frequency is important, especially when comparing time-series points at regular intervals. The FREQUENCY slot specifies the number of datapoints per cycle. The START slot indicates the starting time index and frequency interval. The measurements are contained in the points slot and are represented as a vector of ts-point objects. Another useful thing to know is that the slot accessor prefix is ts-.
In fact, if you look at the raw dataset above, you will see that the dataset we just created contains no time specifiers (there are labels, but they are not used in computations). This can be very important if your time series has literally astronomical ranges. Some time-series libraries and databases encode the index as seconds or milliseconds from some fixed point in time, which restricts the representable times to the range of the datatype used to encode the index. To be fair, CLML.time-series is in effect doing the same thing; however, the time index is relative, and time indices can range from 0 to most-positive-fixnum (~4.6e18 in 64-bit SBCL). Given that a datapoint is identified by its time and frequency interval (which also ranges from 0 to most-positive-fixnum), the number of theoretically possible datapoints in a time series is most-positive-fixnum squared (about 2.1e37 in 64-bit SBCL).
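The figures quoted above can be checked at the REPL. The following is a minimal sketch in plain Common Lisp (no CLML needed); the function name index-capacity is hypothetical and introduced only for illustration, and the values printed depend on the Lisp implementation and word size.

```lisp
;; Pure Common Lisp; no CLML required.
;; INDEX-CAPACITY is a hypothetical helper, not part of CLML.
(defun index-capacity ()
  "Return most-positive-fixnum squared, the upper bound on the number of
(time, frequency-interval) index pairs discussed in the text."
  (expt most-positive-fixnum 2))

;; Print both quantities as double-floats so the magnitudes are easy to read.
;; The exact digits are implementation-dependent.
(format t "most-positive-fixnum: ~e~%" (float most-positive-fixnum 1d0))
(format t "squared:              ~e~%" (float (index-capacity) 1d0))
```

Because Common Lisp integers are arbitrary precision, the squared value is computed exactly as a bignum before being converted to a float for display.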
Let's look at the points in the dataset to see how they are represented.
(subseq (ts-points msi-access) 0 5)
The ts-point class encodes each measurement, maintaining the time and frequency interval, a label (which is just a string), and the actual measurements. The measurements in the pos slot are stored in a vector of arbitrary length. Looking back at the call to time-series-data, where we gave a start time of 18 and a start frequency interval of 3, we can see by examining the ts-points how this is actually represented. Another useful thing to know is that the accessor prefix of ts-point is ts-p-.
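To make the time and frequency-interval pair concrete, here is a minimal sketch in plain Common Lisp of how such an index could advance from one point to the next. It is not CLML's implementation: the functions next-index and index-sequence are hypothetical, and the sketch assumes the frequency interval is 1-based and wraps back to 1 after reaching the frequency, at which point the time index increases by one.

```lisp
;; Illustration only; NEXT-INDEX and INDEX-SEQUENCE are hypothetical helpers,
;; not CLML functions. Assumes a 1-based frequency interval.
(defun next-index (time freq frequency)
  "Return the (time freq) pair that follows TIME/FREQ.
FREQ wraps to 1 and TIME advances once FREQ reaches FREQUENCY."
  (if (< freq frequency)
      (list time (1+ freq))
      (list (1+ time) 1)))

(defun index-sequence (start frequency n)
  "Return a list of the first N (time freq) pairs beginning at START."
  (let ((index start)
        (result '()))
    (loop repeat n
          do (push index result)
             (setf index (next-index (first index) (second index) frequency)))
    (nreverse result)))

;; First five indices for :start '(18 3) and :frequency 24.
(print (index-sequence '(18 3) 24 5))
```

Under these assumptions, an hourly series with a frequency of 24 uses the time index as a day counter and the frequency interval as the position within the day.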
A time-series-dataset can also be created programmatically.
Some examples are:
(make-constant-time-series-data '("a") (vector (clml.hjs.meta:make-dvec 1)))
(make-constant-time-series-data '("price") (vector (v2dvec #(43.2d0)) (v2dvec #(44.0d0)) (v2dvec #(1049.0d0))))
Let's plot our data with clml.extras.eazy-gnuplot:plot-dataset.
Some quick things to note about the plot-dataset method:
- The dataset and y-col arguments come first.
- The default terminal is :wxt :persist, so for use in a notebook we specify a PNG terminal.
- The display-png function is used to render the plot in the notebook.
- eazy-gnuplot is used as the plotting library; all plotting arguments follow gnuplot's and eazy-gnuplot's conventions.
- The :range argument specifies the start and end of the points to display.
- The :frequencies argument is a list of frequencies to plot, handy for observing behavior over specific intervals.
(progn
(plot-dataset msi-access "hits" :terminal '(:png)
:range '(0 40) :title "MSI Access Log - first 40 points" :ytics-font ",8" :xtics-font ",8"
:xlabel-font ",15" :ylabel-font ",15" :output "msi_access_log_40.png")
(display-png (png-from-file "msi_access_log_40.png")))
Now let's look at the whole dataset. Since each ts-point has a label, our x axis would get overwhelmed with labels, so we use the :xtic-interval argument to specify that we only want labels displayed every 500 points.
(progn
(plot-dataset msi-access "hits" :terminal '(:png )
:title "MSI Access Log" :ytics-font ",8" :xtics-font ",8"
:xlabel-font ",15" :ylabel-font ",15" :xtic-interval 500 :output "msi_access_log.png")
(display-png (png-from-file "msi_access_log.png")))
Sometimes data may have missing values or outliers. It is not unusual to have a broken or malfunctioning sensor generating your data. We have a way of dealing with that.
The time-series-dataset class has a ts-cleaning method which can clean missing values and outliers. Let's look at the documentation:
TS-CLEANING names a generic function:
Lambda-list: (D &KEY)
Derived type: (FUNCTION (T &KEY) *)
Documentation:
- return: <time-series-dataset>
- arguments:
- d : <time-series-dataset>
- interp-types-alist:
a-list (key: column name, datum: interpolation(:zero :min :max :mean :median :mode :spline)) | nil
- outlier-types-alist:
a-list (key: column name, datum: outlier-verification(:std-dev :mean-dev :user :smirnov-grubbs
:freq)) | nil
- outlier-values-alist :
a-list (key: outlier-verification datum: the value according to outlier-verification) | nil
- comment:
Same as /dataset-cleaning/ in read-data package.
Let's give it a try. In particular, let's set the threshold for outliers to 5 standard deviations and set the interpolation method to mean.
(defparameter c-msi-access
(ts-cleaning msi-access :outlier-types-alist '(("hits" . :std-dev))
:outlier-values-alist '((:std-dev . 5))
:interp-types-alist '(("hits" . :mean))))
(let ((png-file "clean-msi-access-log"))
(plot-dataset c-msi-access "hits" :terminal '(:png)
:title "Cleaned MSI Access Log" :ytics-font ",8" :xtics-font ",8"
:xlabel-font ",15" :ylabel-font ",15" :xtic-interval 500
:yrange '(0 8000)
:output png-file)
(display-png (png-from-file png-file)))
Notice that the datapoint near 13 July 2008 15:00 to 15:59, which previously spiked to over 7000, is now more reasonable.
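The idea behind a standard-deviation outlier rule with mean interpolation can be sketched in a few lines of plain Common Lisp. This is an illustration of the general technique, not CLML's implementation: the function names are hypothetical, the sketch uses the population standard deviation, and it replaces flagged values with the mean of the remaining values, all of which may differ from what ts-cleaning actually does.

```lisp
;; Illustration only; these helpers are hypothetical and not part of CLML.
(defun list-mean (xs)
  "Arithmetic mean of the non-empty list XS."
  (/ (reduce #'+ xs) (length xs)))

(defun list-std-dev (xs)
  "Population standard deviation of the non-empty list XS."
  (let ((m (list-mean xs)))
    (sqrt (/ (reduce #'+ (mapcar (lambda (x) (expt (- x m) 2)) xs))
             (length xs)))))

(defun clean-outliers (xs threshold)
  "Return a copy of XS in which every value lying more than THRESHOLD
standard deviations from the mean is replaced by the mean of the
values that were not flagged."
  (let* ((m (list-mean xs))
         (s (list-std-dev xs))
         (outlier-p (lambda (x) (> (abs (- x m)) (* threshold s))))
         (kept (remove-if outlier-p xs))
         ;; Fall back to the overall mean if every value was flagged.
         (fill (if kept (list-mean kept) m)))
    (mapcar (lambda (x) (if (funcall outlier-p x) fill x)) xs)))

;; A short series with one obvious spike, cleaned at 2 standard deviations.
(print (clean-outliers '(10 12 11 9 10 11 500 10 12 9) 2))
```

The example uses a threshold of 2 rather than 5 because a single spike in a series of only ten values cannot lie 5 standard deviations from the mean; in a series as long as the access log, a threshold of 5 flags only extreme spikes. Here the value 500 is replaced by the mean of the other nine values.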
I would like to thank Frederic Peschanski, the creator of fishbowl, which provides Common Lisp support for iPython. I would also like to thank Masataro Asai, the creator of eazy-gnuplot. I would like to thank the creators of iPython and Project Jupyter, a truly cross-platform mechanism for the presentation of code and content. Finally, I would like to thank GitHub for [providing the ability to view notebooks inside GitHub repositories](http://blog.jupyter.org/2015/05/07/rendering-notebooks-on-github/).
The iPython notebook and source for this tutorial can be found in the clml.tutorials GitHub repository: https://github.com/mmaul/clml.tutorials.git