(C) 2015 Mike Maul -- CC-BY-SA 3.0
This document is part of a series of tutorials illustrating the use of CLML.
CLML datasets are two-dimensional tabular data structures. In CLML, datasets are used for (not to sound recursive) storing datasets. Datasets may contain numerical and categorical data. Datasets also contain column metadata (dimensions) and provide facilities for extracting columns, cleaning and splitting. Datasets in CLML are similar to data frames in R or DataFrames in pandas for Python.
Let's get started by loading the systems necessary for this tutorial and creating a namespace to work in.
(ql:quickload '(:clml.utility ; Need clml.utility.data to get data from the net
:clml.hjs ; Need clml.hjs.read-data for dataset
:iolib
:clml.extras.eazy-gnuplot
:eazy-gnuplot
))
(defpackage #:datasets-tutorial
(:use #:cl
#:cl-jupyter-user ; Not needed unless using iPython notebook
#:clml.hjs.read-data
#:clml.hjs.meta ; util function
#:clml.extras.eazy-gnuplot))
(in-package :datasets-tutorial)
Let's load some data that we will use as we learn about datasets.
(defparameter dataset (read-data-from-file
(clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/cars.csv")
:type :csv :csv-type-spec '(integer integer)))
CLML has a number of different specializations of datasets, such as:
unspecialized-dataset: untyped and unspecialized data
numeric-dataset: dataset containing numeric (double-float) data
category-dataset: dataset for categorical (string) data
numeric-and-category-dataset: dataset containing a mixture of numeric and categorical data
numeric-matrix-dataset: dataset where numeric values are stored as a matrix
numeric-matrix-and-category-dataset: dataset where numeric values are stored as a matrix as well as having categorical data
All datasets except the matrix datasets represent data as a vector of vectors. Each inner vector holds the column values of one row. For datasets with categories, numeric and category data are stored in separate vectors.
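The vector-of-vectors layout described above can be pictured with plain Common Lisp vectors. This is only a sketch of the idea, not a CLML dataset: the variable *points* and the function column-of are hypothetical names introduced for illustration, and the sample values are made up.

```lisp
;; Illustration only: plain Common Lisp vectors, not a CLML dataset.
;; *POINTS* and COLUMN-OF are hypothetical names.
(defparameter *points*
  (vector (vector 4.0d0 2.0d0)
          (vector 7.0d0 4.0d0)
          (vector 8.0d0 16.0d0))
  "Outer vector holds rows; each inner vector holds one row's columns.")

(defun column-of (points index)
  "Collect the value at INDEX from every row of POINTS into a new vector."
  (map 'vector (lambda (row) (aref row index)) points))

(column-of *points* 1) ; => #(2.0d0 4.0d0 16.0d0)
```

Storing rows as the inner vectors makes row access cheap, while extracting a column means visiting every row, as column-of does here.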
We can see below how the data is represented.
(dataset-points dataset)
It may not be convenient to display the whole dataset just to take a look at it. We could have used subseq, but there is a helper method called head-points.
(head-points dataset)
All datasets have a dimensions slot, which contains the column metadata. The dimensions slot contains a list of dimension instances. Each dimension instance contains the following slots (accessor prefix is dimension):
name: column name
type: type of data in the column (e.g. :category, :numeric, :unknown)
index: index of the column in the column vectors
metadata: alist that can contain useful information, such as equality tests for category data
(dataset-dimensions dataset)
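To make the shape of this column metadata concrete, here is a small stand-in written with defstruct. It is not CLML's dimension class: the structure name column-info, the variable *speed* and the column name "speed" are all hypothetical, chosen only to mirror the four slots listed above.

```lisp
;; Illustration only: a hypothetical stand-in, not CLML's DIMENSION class.
(defstruct column-info
  name       ; column name, a string
  type       ; :category, :numeric or :unknown
  index      ; position of the column
  metadata)  ; alist of optional extra information

(defparameter *speed*
  (make-column-info :name "speed" :type :numeric :index 0 :metadata '()))

(column-info-type *speed*) ; => :NUMERIC
```

As with the dimension accessors, defstruct generates readers that share a common prefix (here column-info-).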
Datasets can be created directly or can be created by reading them from a file. Supported data formats are CSV and SEXP.
Earlier we used the read-data-from-file function to read a dataset from a CSV file. The file in that case was obtained with the fetch function from the clml.utility system, which retrieves a file from a location on the local file system or from a URL, downloading and caching it. Datasets can also be created programmatically.
(make-numeric-and-category-dataset
'("cat 1" "num 1") ; <-- Column names
(vector (v2dvec #(1.0d0)) (v2dvec #(2.0d0))) ; <-- Numeric data
'(1) ; <-- Indexes of numeric column
#(#("a") #("b")) ; <-- Category Data
'(0) ; <-- Indexes of category data
)
The dataset we loaded is currently unspecialized; we haven't told CLML much about it yet. We can use the pick-and-specialize-data method to fill in the details.
(pick-and-specialize-data dataset :data-types '(:numeric :numeric))
We can see that pick-and-specialize-data returned a numeric dataset based on the supplied :data-types specification. pick-and-specialize-data also accepts two column-selection parameters, :range and :except. :range specifies a range of columns (as a list) to use in the new dataset, while :except specifies a list of columns to exclude from it. We also mentioned the matrix datasets: pick-and-specialize-data can change the representation from a vector of vectors to a matrix.
(let ((ds (pick-and-specialize-data dataset :data-types '(:numeric :numeric)
:store-numeric-data-as-matrix t)))
(print ds)
(dataset-numeric-points ds))
We should also show an example of a dataset with categories.
(pick-and-specialize-data (read-data-from-file
(clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/UKgas.sexp"))
:data-types '(:category :numeric))
Datasets can be created and combined. Generally the dataset creation methods are named make-<dataset type>; they take either vectors containing data or other datasets and create a new dataset.
CLML datasets support missing values. Missing values are represented as follows in the dataset-points:
There are also the following predicates available to detect missing values:
CLML.HJS.MISSING-VALUE:C-NAN-P
CLML.HJS.MISSING-VALUE:NAN-P
The read-data-from-file function also supports mapping the representations of missing values in data files to missing values in datasets.
The :missing-values-list keyword argument specifies the character sequences that will be recognized as missing values.
To illustrate missing value support, let's read in a CSV file containing the following:
a, b, c
1.0, 2.0, x
NA, 3.0, NA
Here missing values are represented in the CSV file by NA. For the read-data-from-file function to recognize the missing values we must set the :missing-values-list parameter as shown below:
(let ((ds (read-data-from-file
(clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/simple1.csv")
:type :csv
:csv-type-spec '(double-float double-float string)
:missing-values-list '("NA")
)))
(format nil "~A~%~A~%" ds (dataset-points ds)))
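The idea behind :missing-values-list can be sketched in plain Common Lisp: each raw field is compared against the list of missing-value tokens and replaced by a marker when it matches. This does not reproduce CLML's reader or its internal missing-value representation. The function parse-field and the :na marker are hypothetical, used only for illustration.

```lisp
;; Illustration only: plain Common Lisp, not CLML's reader.
;; PARSE-FIELD and the :NA marker are hypothetical.
(defun parse-field (token missing-values-list)
  "Return :NA when TOKEN (trimmed of spaces) is in MISSING-VALUES-LIST,
otherwise return the trimmed token."
  (let ((trimmed (string-trim " " token)))
    (if (member trimmed missing-values-list :test #'string=)
        :na
        trimmed)))

;; The fields of the row "NA, 3.0, NA" after splitting on commas:
(mapcar (lambda (token) (parse-field token '("NA")))
        '("NA" " 3.0" " NA"))
;; => (:NA "3.0" :NA)
```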
We can also see how missing values are represented in a specialized dataset:
(let ((ds (pick-and-specialize-data
(read-data-from-file
(clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/simple1.csv")
:type :csv
:csv-type-spec '(double-float double-float string)
:missing-values-list '("NA")
)
:data-types '(:numeric :numeric :category)
)))
(format nil "~A~%~A~%~A~%" ds (dataset-numeric-points ds) (dataset-category-points ds))
)
The following operations can be performed on datasets:
We will use the UK Gas dataset to illustrate these operations.
(defparameter ukgas (pick-and-specialize-data (read-data-from-file
(clml.utility.data:fetch "https://mmaul.github.io/clml.data/sample/UKgas.sexp"))
:data-types '(:category :numeric)))
The simplest operation is copying. copy-dataset
makes a deep copy of the contents of a dataset.
(copy-dataset ukgas)
Datasets can be subdivided by two similar methods, make-bootstrap-sample-datasets and divide-dataset. divide-dataset returns a dataset split into two parts based upon the :divide-ratio. Like pick-and-specialize-data, divide-dataset can limit the values accessed with the :range and :except parameters. With the :random parameter it can also assign values to the new datasets in a pseudo-random manner.
(multiple-value-list (divide-dataset ukgas :divide-ratio '(3 1) :random t))
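A ratio-based split like the one above can be sketched on a plain vector of rows. This is not CLML's divide-dataset: the function split-by-ratio is a hypothetical name, it keeps rows in their original order (no pseudo-random selection), and rounding the cut point down is an assumption made here.

```lisp
;; Illustration only: plain Common Lisp, not CLML's DIVIDE-DATASET.
;; SPLIT-BY-RATIO is a hypothetical name; rows stay in their original order.
(defun split-by-ratio (rows first-part second-part)
  "Split the vector ROWS into two vectors sized roughly FIRST-PART : SECOND-PART.
The cut point is rounded down."
  (let ((cut (floor (* (length rows) first-part)
                    (+ first-part second-part))))
    (values (subseq rows 0 cut)
            (subseq rows cut))))

(split-by-ratio #(1 2 3 4 5 6 7 8) 3 1)
;; => #(1 2 3 4 5 6), #(7 8)
```

Like the divide-dataset call above, the two parts come back as multiple values, so multiple-value-bind or multiple-value-list is needed to capture both.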
make-bootstrap-sample-datasets, on the other hand, shuffles a dataset into a specified number of datasets, each of equal length to the original dataset. The :number-of-datasets parameter defaults to 10.
(make-bootstrap-sample-datasets ukgas :number-of-datasets 3)
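For intuition, here is a sketch of conventional bootstrap resampling on a plain vector: each new sample has the same length as the original and is drawn with replacement. Whether CLML draws its samples in exactly this way is an assumption. The functions bootstrap-sample and bootstrap-samples are hypothetical names, not CLML functions.

```lisp
;; Illustration only: plain Common Lisp, not CLML's implementation.
;; BOOTSTRAP-SAMPLE and BOOTSTRAP-SAMPLES are hypothetical names.
(defun bootstrap-sample (rows)
  "Return a vector the same length as ROWS, drawn from ROWS with replacement."
  (let* ((n (length rows))
         (sample (make-array n)))
    (dotimes (i n sample)
      (setf (aref sample i) (aref rows (random n))))))

(defun bootstrap-samples (rows &key (number-of-datasets 10))
  "Return a list of NUMBER-OF-DATASETS bootstrap samples of ROWS."
  (loop repeat number-of-datasets
        collect (bootstrap-sample rows)))

;; Three samples of four rows each; contents vary from run to run.
(bootstrap-samples #(10 20 30 40) :number-of-datasets 3)
```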
One nice feature of CLML is its dataset cleaning capabilities. The dataset-cleaning method provides the following:
To illustrate, we will perform dataset cleaning where outliers are points that exceed 1 standard deviation and are replaced by zero.
(dataset-cleaning ukgas :outlier-types-alist '(("UKgas" . :std-dev))
:outlier-values-alist '((:std-dev . 1))
:interp-types-alist '(("UKgas" . :zero)))
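The cleaning rule used above (flag values beyond a number of standard deviations, then replace them with zero) can be sketched on a plain vector of numbers. This is not CLML's dataset-cleaning. The function zero-outliers is a hypothetical name, and measuring distance from the mean with the population standard deviation is an assumption that may differ from what CLML computes.

```lisp
;; Illustration only: plain Common Lisp, not CLML's DATASET-CLEANING.
;; ZERO-OUTLIERS is a hypothetical name; the population standard deviation
;; is used here, which may differ from what CLML computes.
(defun zero-outliers (numbers threshold)
  "Replace each element of NUMBERS farther than THRESHOLD standard deviations
from the mean with 0.0d0."
  (let* ((n (length numbers))
         (mean (/ (reduce #'+ numbers) n))
         (variance (/ (reduce #'+ (map 'vector
                                       (lambda (x) (expt (- x mean) 2))
                                       numbers))
                      n))
         (std-dev (sqrt variance)))
    (map 'vector
         (lambda (x)
           (if (> (abs (- x mean)) (* threshold std-dev))
               0.0d0
               x))
         numbers)))

(zero-outliers #(1.0d0 2.0d0 3.0d0 4.0d0 100.0d0) 1)
;; => #(1.0d0 2.0d0 3.0d0 4.0d0 0.0d0)
```

In this example the mean is 22 and the standard deviation is about 39, so only the value 100 lies more than one standard deviation from the mean.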
In some cases you may want to add a computed column to a dataset, or add a column to hold the result of a computation on the dataset. The add-dim method can accomplish this easily. It can add an existing column of points with the :points parameter, or create a column filled with an initial value with the :initial-value parameter. The three mandatory parameters are the dataset to add the dimension to, the name of the new dimension, and its type. If the dataset is a category-only or numeric-only dataset, add-dim will create a numeric-and-category-dataset when a column of a different type is added.
(add-dim ukgas "mpg" :numeric :initial-value 0.0d0)
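On a plain vector of row vectors, adding a column filled with an initial value amounts to extending every row by one element. The sketch below shows only that idea. It is not how add-dim is implemented, and the function add-column is a hypothetical name.

```lisp
;; Illustration only: plain Common Lisp, not CLML's ADD-DIM.
;; ADD-COLUMN is a hypothetical name.
(defun add-column (rows initial-value)
  "Return a new vector of rows, each extended with INITIAL-VALUE as a last column."
  (map 'vector
       (lambda (row) (concatenate 'vector row (vector initial-value)))
       rows))

(add-column (vector (vector 1 2) (vector 3 4)) 0)
;; => #(#(1 2 0) #(3 4 0))
```

The original rows are left untouched; a fresh vector of fresh rows is returned.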
Two datasets with equal numbers of columns can be concatenated, or glued together vertically. concatenate-datasets takes two datasets as parameters and returns a dataset with the points of the first dataset stacked on top of the points of the second dataset. The dimension names of the first dataset are retained in the new dataset.
(concatenate-datasets ukgas ukgas)
(write-dataset ukgas "gasgas.csv")
Columns and values can be accessed and extracted from datasets using the !! macro. This macro returns the named column as a single vector if a single column name is specified, or a vector of vectors if multiple column names are specified.
(!! ukgas "UKgas")
Dataset points can also be accessed with the slot accessors. Since category and numeric data are stored separately in heterogeneous datasets, separate accessors are used to access the points.
The list below shows which accessors are applicable to each dataset type.
dataset-points: unspecialized-dataset
dataset-numeric-points: numeric-dataset, numeric-and-category-dataset, numeric-matrix-dataset, numeric-matrix-and-category-dataset
dataset-category-points: category-dataset, numeric-and-category-dataset, numeric-matrix-dataset, numeric-matrix-and-category-dataset
(dataset-numeric-points ukgas)
R-datasets
One thing that I've always found handy in R is a standard, curated, extensive and documented series of datasets. Wouldn't it be nice to have access to these directly as datasets in CLML? The R-datasets system in clml.extras provides this capability. A particularly good use case for these datasets is following along, in CLML, with examples and tutorials written for R.
The clml.extras systems are not currently part of Quicklisp, so if you are following along with this tutorial you cannot simply (quickload :clml.extras.Rdatasets) until you clone the clml.extras repository http://github.com/mmaul/clml.extras.git into your quicklisp/local-projects directory.
The Rdatasets package makes datasets included with the R language distribution available as CLML datasets. The R datasets are obtained from CSV files in Vincent Arel-Bundock's GitHub repository. More information on these datasets can be found at http://vincentarelbundock.github.com/Rdatasets
Because type information is not included, it may be necessary to provide a csv-type-spec for the columns in the CSV file.
(ql:quickload :clml.r-datasets)
(use-package :clml.r-datasets)
(defparameter dd (get-r-dataset-directory))
(subseq (inventory dd :stream nil) 0 505)
(subseq (dataset-documentation dd "datasets" "BOD" :stream nil) 0 200)
(defparameter bod (get-dataset dd "datasets" "BOD" :csv-type-spec '(double-float double-float double-float)))
bod
(pick-and-specialize-data bod :data-types '(:numeric :numeric :numeric))
The iPython notebook and source for this tutorial can be found in the clml.tutorials GitHub repository: https://github.com/mmaul/clml.tutorials.git