(C) 2015 Mike Maul -- CC-BY-SA 3.0

Anyone wishing to run this notebook or the code contained therein must take note of the following:

- The time series cleaning section of this tutorial relies on the GitHub version of `CLML` (https://github.com/mmaul/clml.git) or a quicklisp dist of `CLML` newer than 20150805.
- The plotting portion of this code requires the system `clml.extras` (https://github.com/mmaul/clml.extras.git), which is not currently in quicklisp.
- While the above git repositories are not in quicklisp, they can be loaded by `quickload` by placing the repositories in $HOME/quicklisp/local-projects.

A time series is a set of data points collected over a given period of time. Examples of time series are stock ticker data, sensor data and netflow data. Generally one wants to perform some sort of analysis on a time series, use previous performance to forecast future performance, or use past performance to detect anomalies in new data.

The CLML.time-series system contains functionality to manipulate and analyze time series data. CLML.time-series has a definite opinion on what a time series is. We will see that after we load some data.

Let's get started by loading the systems necessary for this tutorial and creating a namespace to work in.

In [1]:

```
(ql:quickload '(:clml.utility             ; Need clml.utility.data to get data from the net
                :clml.hjs                 ; Need clml.hjs.read-data to poke around the raw dataset
                :clml.time-series         ; Need Time Series package obviously
                :iolib
                :clml.extras.eazy-gnuplot
                :eazy-gnuplot))
```

Out[1]:

In [2]:

```
(defpackage #:time-series-part-2
  (:use #:cl
        #:cl-jupyter-user ; Not needed unless using iPython notebook
        #:clml.time-series.read-data
        #:clml.time-series.anomaly-detection
        #:clml.time-series.exponential-smoothing
        #:clml.extras.eazy-gnuplot)
  (:import-from #:clml.hjs.read-data #:head-points #:!! #:dataset-dimensions)
  (:import-from #:clml.time-series.util #:predict)
  (:import-from #:clml.hjs.read-data #:read-data-from-file))
```

Out[2]:

In [3]:

```
(in-package :time-series-part-2)
```

Out[3]:

We are going to look at how time series data can be used by CLML. First let's get some data...

In [4]:

```
(defparameter dataset
  (read-data-from-file
   (clml.utility.data:fetch
    "https://mmaul.github.io/clml.data/sample/msi-access-stat/access-log-stat.sexp")))
```

Out[4]:

In [5]:

```
dataset
```

Out[5]:

CLML has a number of different specializations of dataset, such as:

- `unspecialized-dataset`: untyped and unspecialized data
- `numeric-dataset`: dataset containing numeric (`double-float`) data
- `category-dataset`: dataset for categorical (`string`) data
- `numeric-and-category-dataset`: dataset containing a mixture of numeric and categorical data

Most relevant to this tutorial:

- `time-series-dataset`: dataset containing time-series data

Datasets can be created directly or by reading them from a file. Supported data formats are CSV and SEXP.
In this case the `read-data-from-file` function is reading a dataset from a file. The file is obtained with the `fetch` function, which downloads and caches a file from a location on a local file system or a URL.

Let's take a look at the data; it apparently is from a hit counter.
`head-points` gives us the first 5 rows of a dataset (if we wanted all the rows in a dataset we would have used `dataset-points`).

In [6]:

```
(head-points dataset)
```

Out[6]:

Examining the data, it looks like hits collected hourly. A `time-series-dataset` can be created with the `time-series-data` function.

Now let's turn this into a `time-series-dataset` so we can do stuff with it.

In [7]:

```
(defparameter msi-access (time-series-data dataset :range '(1) :time-label 0 :frequency 24 :start '(18 3)))
```

Out[7]:

In [8]:

```
msi-access
```

Out[8]:

This is the point where we talk about `CLML.time-series`'s definite opinions about time series. Time series in `CLML.time-series` are discrete, and in its opinion they have a regular frequency. (This implies that time series data must have a reading at each period. However, `CLML.time-series` does support missing values, which will be covered in a later part of this series.) The representation of frequency is important, especially when comparing time-series points at regular intervals. The `FREQUENCY` slot specifies the number of datapoints per cycle. The `START` slot indicates the starting time index and frequency interval. The measurements are contained in the `points` slot and are represented as a vector of `ts-point` objects. Another useful thing to know is that the slot accessor prefix is `ts-`.

In fact, if you look at the raw dataset above, you will see there are no time specifiers in the data (there are labels, but they are not used in computations). This can actually be very important if your time series has literally astronomical ranges. Some time-series libraries and databases encode the index as seconds or milliseconds from some fixed point in time, which restricts the representable times to the range of the datatype used to encode the time index. To be fair, `CLML.time-series` is in effect doing the same thing; however, the time index is relative, and time indices can range from 0 to `most-positive-fixnum` (~4.6e18 in SBCL). Given that a datapoint is defined by the time and the frequency interval (which also ranges from 0 to `most-positive-fixnum`), the number of theoretically possible datapoints in a time series is `most-positive-fixnum` squared (in SBCL this would be greater than 2.0e35).
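The size of that index space can be checked at the REPL. This is only a sketch of the arithmetic: `max-theoretical-datapoints` is a hypothetical helper name, not part of CLML, and the printed values depend on the Lisp implementation, since `most-positive-fixnum` is implementation-defined.

```lisp
;; Hypothetical helper (not part of CLML): the number of distinct
;; (time, frequency-interval) pairs when both indices range over fixnums.
(defun max-theoretical-datapoints ()
  ;; Integers in Common Lisp are arbitrary precision, so squaring a
  ;; fixnum promotes to a bignum instead of overflowing.
  (expt most-positive-fixnum 2))

;; Print both numbers in exponent notation for easy comparison.
(format t "most-positive-fixnum: ~e~%" (float most-positive-fixnum 1d0))
(format t "squared:              ~e~%" (float (max-theoretical-datapoints) 1d0))
```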

Let's look at the points in the dataset to see how they are represented.

In [9]:

```
(subseq (ts-points msi-access) 0 5)
```

Out[9]:

The `ts-point` class encodes each measurement, maintaining the time and frequency interval, a label (which is just a string), and the actual measurements. The measurements in the `pos` slot are stored in a vector of arbitrary length. Looking back to **In [7]**, where we gave `time-series-data` a start time of 18 and a start frequency interval of 3, we can see by examining the `ts-point`s how this is actually represented. Another useful thing to know is that the accessor prefix of `ts-point` is `ts-p-`.
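To make the time and frequency interval indexing concrete, here is a minimal sketch of how consecutive indices could be generated from a start of `(18 3)` with a frequency of 24. It assumes the frequency interval is 1-based and wraps back to 1 after reaching the frequency, at which point the time advances by one. That convention is an assumption made for illustration, and `next-index` and `index-sequence` are hypothetical helpers, not CLML functions.

```lisp
;; Hypothetical helpers, not part of CLML. Assumption: the frequency
;; interval runs from 1 to FREQUENCY and then wraps, advancing the time.
(defun next-index (time freq frequency)
  "Return the (time freq) pair that follows TIME/FREQ, as a list."
  (if (< freq frequency)
      (list time (1+ freq))
      (list (1+ time) 1)))

(defun index-sequence (start frequency n)
  "Return N consecutive (time freq) pairs beginning at START."
  (loop for i from 0 below n
        for (time freq) = start then (next-index time freq frequency)
        collect (list time freq)))

;; Start at time 18, frequency interval 3, with 24 points per cycle.
(print (index-sequence '(18 3) 24 5))

;; Near the end of a cycle the frequency interval wraps and time advances.
(print (index-sequence '(18 23) 24 3))
```

Under this assumed convention, an hourly series with a frequency of 24 treats the time as a day counter and the frequency interval as the hour within that day.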

`time-series-dataset`s can also be created programmatically.
Some examples are:

```
(make-constant-time-series-data '("a") (vector (clml.hjs.meta:make-dvec 1)))
(make-constant-time-series-data '("price") (vector (v2dvec #(43.2d0)) (v2dvec #(44.0d0)) (v2dvec #(1049.0d0))))
```

Let's plot our data with the `clml.extras.eazy-gnuplot:plot-dataset` method.

Some quick things to note about the `plot-dataset` method:

- The only required arguments are `dataset` and `y-col`
- The terminal defaults to `:wxt :persist`, so for use in a notebook we specify a **PNG** terminal
- The `display-png` function is used to render the plot in the notebook
- `eazy-gnuplot` is used as the plotting library; all plotting arguments follow gnuplot and `eazy-gnuplot`'s conventions
- The `:range` argument specifies the start and end of the points to display
- The `:frequencies` argument is a list of frequencies to plot, handy for observing behavior over specific intervals.

In [10]:

```
(progn
  (plot-dataset msi-access "hits" :terminal '(:png)
                :range '(0 40) :title "MSI Access Log - first 40 points"
                :ytics-font ",8" :xtics-font ",8"
                :xlabel-font ",15" :ylabel-font ",15"
                :output "msi_access_log_40.png")
  (display-png (png-from-file "msi_access_log_40.png")))
```

Out[10]:

Because each `ts-point` has a label, the x axis would get overwhelmed with labels when we plot the full series, so we use the `:xtic-interval` argument to specify that we only want labels displayed every 500 points.

In [13]:

```
(progn
  (plot-dataset msi-access "hits" :terminal '(:png)
                :title "MSI Access Log" :ytics-font ",8" :xtics-font ",8"
                :xlabel-font ",15" :ylabel-font ",15" :xtic-interval 500
                :output "msi_access_log.png")
  (display-png (png-from-file "msi_access_log.png")))
```

Out[13]:

Sometimes data may have missing values or outliers. It is not unusual to have a broken or malfunctioning sensor generating your data. We have a way of dealing with that.

The `time-series-dataset` class has a `ts-cleaning` method which can clean missing values and outliers. Let's look at the documentation:

```
TS-CLEANING names a generic function:
Lambda-list: (D &KEY)
Derived type: (FUNCTION (T &KEY) *)
Documentation:
- return: <time-series-dataset>
- arguments:
- d : <time-series-dataset>
- interp-types-alist:
a-list (key: column name, datum: interpolation(:zero :min :max :mean :median :mode :spline)) | nil
- outlier-types-alist:
a-list (key: column name, datum: outlier-verification(:std-dev :mean-dev :user :smirnov-grubbs
:freq)) | nil
- outlier-values-alist :
a-list (key: outlier-verification datum: the value according to outlier-verification) | nil
- comment:
Same as /dataset-cleaning/ in read-data package.
```

Let's give it a try. In particular, let's set the threshold for outliers to 5 standard deviations and set the interpolation method to mean.

In [15]:

```
(defparameter c-msi-access
  (ts-cleaning msi-access
               :outlier-types-alist '(("hits" . :std-dev))
               :outlier-values-alist '((:std-dev . 5))
               :interp-types-alist '(("hits" . :mean))))
```

Out[15]:
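To give a feel for what a standard-deviation outlier test combined with mean interpolation does, here is a minimal sketch in plain Common Lisp. It is not CLML's implementation: the helper names (`sketch-mean`, `sketch-std-dev`, `clean-outliers`) are hypothetical, and the choices of the sample standard deviation and of filling with the mean of the non-outlier values are assumptions made for illustration.

```lisp
;; Illustrative only: hypothetical helpers, not CLML's implementation.
(defun sketch-mean (values)
  "Arithmetic mean of a non-empty sequence of numbers."
  (/ (reduce #'+ values) (length values)))

(defun sketch-std-dev (values)
  "Sample standard deviation (n - 1 denominator) of VALUES."
  (let ((m (sketch-mean values)))
    (sqrt (/ (reduce #'+ (map 'list (lambda (x) (expt (- x m) 2)) values))
             (1- (length values))))))

(defun clean-outliers (values threshold)
  "Return a vector in which every value lying more than THRESHOLD standard
deviations from the mean is replaced by the mean of the remaining values."
  (let* ((m (sketch-mean values))
         (limit (* threshold (sketch-std-dev values)))
         (outlier-p (lambda (x) (> (abs (- x m)) limit)))
         ;; Assumption: the fill value ignores the outliers themselves.
         (fill (sketch-mean (remove-if outlier-p values))))
    (map 'vector (lambda (x) (if (funcall outlier-p x) fill x)) values)))

;; A toy hit counter with one obvious spike at index 6.
(print (clean-outliers
        #(10d0 12d0 11d0 9d0 10d0 11d0 500d0 10d0 12d0 9d0)
        2))
```

The toy example uses a threshold of 2 rather than 5 because, with only ten samples, no single value can lie 5 sample standard deviations from the mean; the much longer access log does not have that limitation.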

In [16]:

```
(let ((png-file "clean-msi-access-log"))
  (plot-dataset c-msi-access "hits" :terminal '(:png)
                :title "Cleaned MSI Access Log" :ytics-font ",8" :xtics-font ",8"
                :xlabel-font ",15" :ylabel-font ",15" :xtic-interval 500
                :yrange '(0 8000)
                :output png-file)
  (display-png (png-from-file png-file)))
```

Out[16]:

Notice that the datapoint near 13 July 2008 15:00 to 15:59, which previously spiked to over 7000, is now more reasonable.

I would like to thank Frederic Peschanski, the creator of `fishbowl`, which provides Common Lisp support for iPython. I would also like to thank Masataro Asai, the creator of `eazy-gnuplot`. I would like to thank the creators of iPython and Project Jupyter, a truly cross-platform mechanism for the presentation of code and content. Finally, I would like to thank GitHub for [providing the ability to view notebooks inside GitHub repositories](http://blog.jupyter.org/2015/05/07/rendering-notebooks-on-github/).

The iPython notebook and source for this tutorial can be found in the clml.tutorials GitHub repository: https://github.com/mmaul/clml.tutorials.git