\documentclass{article}
%include polycode.fmt
\usepackage{amsmath,amsthm,amsfonts,amssymb,graphicx,color,hyperref}
\usepackage{chngpage,array}
\usepackage{titling}

\newcommand{\subtitle}[1]{%
    \posttitle{%
        \par\end{center}
        \begin{center}\large#1\end{center}
        \vskip0.5em}%
}

\long\def\ignore#1{}

\title{Homomorphic Learning: Examples}
\author{Anonymous}

\begin{document}
\maketitle

\section{Introduction}

This file contains two short examples that empirically validate the theoretical claims of the main paper and demonstrate usage of the HLearn library.\footnote{We developed the library in Haskell because of Haskell's speed and strong support for algebraic programming.} First, we run all three cross-validation algorithms side by side and show that they return exactly the same results. Second, we time these computations. The timing results empirically verify the equations from Table 2 in the main text; this is also how we generated Figure 4.

This document is also valid literate Haskell source code. Compiling and running this file runs the experiments, so by modifying this file you can modify these experiments or create your own. To begin, follow these directions:
\begin{enumerate}
\item Download and install the latest version of the Haskell Platform from:
\begin{center}
\url{http://www.haskell.org/platform/}
\end{center}
\item Unzip the file hlearn.tgz included with the supplemental material.
\item Install the HLearn library by running the command:
\begin{verbatim}
./install.sh
\end{verbatim}
\item Compile this document by running:
\begin{verbatim}
ghc -threaded Examples.lhs
\end{verbatim}
(The \verb|-threaded| flag is required for the parallel runtime options in the next step.)
\item Run the newly generated executable:
\begin{verbatim}
./Examples +RTS -N4
\end{verbatim}
where 4 is the number of cores you want to use for the parallelization.
\end{enumerate}

\section{Demo 1: All cross-validation functions get the same results}

\ignore{
\begin{code}
import Criterion.Main
import Criterion.Config

import HLearn.Algebra
import HLearn.DataContainers
import HLearn.DataContainers.DS_List
import HLearn.Evaluation.CrossValidation
import HLearn.Models.Classification
import HLearn.Models.Classifiers.NBayes
import HLearn.Models.Distributions
\end{code}
}

The function |cv_same| runs all three versions of cross-validation and prints their results side by side. This makes it easy to verify that all three return exactly the same value, no matter the number of folds; this holds for every fold count in |k_list|. To change the values of $k$ the test runs on, modify |k_list| here:

\begin{code}
k_list = [2..20]
\end{code}

\noindent The output will look something like:
\begin{verbatim}
k=2
crossValidation:        mean=0.7549019607843137
crossValidation_monoid: mean=0.7549019607843137
crossValidation_group:  mean=0.7549019607843137
k=3
crossValidation:        mean=0.7450980392156863
crossValidation_monoid: mean=0.7450980392156863
crossValidation_group:  mean=0.7450980392156863
k=4
crossValidation:        mean=0.7466369065216105
crossValidation_monoid: mean=0.7466369065216105
crossValidation_group:  mean=0.7466369065216105
k=5
crossValidation:        mean=0.7485596320106407
crossValidation_monoid: mean=0.7485596320106407
crossValidation_group:  mean=0.7485596320106407
...
\end{verbatim}
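The agreement is no coincidence: merging models is associative and has an identity element, so evaluating the folds one at a time and merging the results gives exactly the same answer as evaluating everything in a single pass. The following self-contained sketch illustrates the idea for a toy ``model'' that just counts correct predictions. (It is an illustration only; |Score|, |mergeScore|, and the other names below are our own and are not part of the HLearn API.)

\begin{code}
-- Illustration only: a toy model that counts the correct and total
-- predictions made by a fixed classifier.
data Score = Score Int Int deriving (Eq, Show)

-- Merging is associative with identity (Score 0 0); that is,
-- Score forms a monoid.
mergeScore :: Score -> Score -> Score
mergeScore (Score c1 t1) (Score c2 t2) = Score (c1+c2) (t1+t2)

-- Score a single labelled sample.
score1 :: (a -> Bool) -> (a,Bool) -> Score
score1 f (x,y) = Score (if f x == y then 1 else 0) 1

-- Score a whole list of labelled samples.
scoreAll :: (a -> Bool) -> [(a,Bool)] -> Score
scoreAll f = foldr (mergeScore . score1 f) (Score 0 0)

-- Scoring each fold separately and merging gives exactly the same
-- answer as scoring the concatenated folds in one pass.  Because the
-- counts are integers, the equality is exact, not approximate.
prop_sameResult :: (a -> Bool) -> [[(a,Bool)]] -> Bool
prop_sameResult f folds =
    scoreAll f (concat folds)
        == foldr mergeScore (Score 0 0) (map (scoreAll f) folds)
\end{code}

\noindent The monoid and group variants of cross-validation rest on the same property, applied to the models themselves rather than just to their scores.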
\noindent The code for |cv_same| itself is:

\begin{code}
cv_same dataset = do
    let modelparams = ClassificationParams
            (NBayesParams :: NBayesParams Int)
            (dsDesc dataset)
    sequence_
        [ do
            putStrLn $ "k="++show k
            putStr "crossValidation:        "
            putStrLn $ "mean=" ++ (show $ m1 $ crossValidation modelparams dataset accuracy k)
            putStr "crossValidation_monoid: "
            putStrLn $ "mean=" ++ (show $ m1 $ crossValidation_monoid modelparams dataset accuracy k)
            putStr "crossValidation_group:  "
            putStrLn $ "mean=" ++ (show $ m1 $ crossValidation_group modelparams dataset accuracy k)
        | k <- k_list
        ]
\end{code}

\section{Demo 2: Measuring the run times}

Now we use the Criterion package\footnote{\url{http://hackage.haskell.org/package/criterion}} to measure the run times of our algorithms. Criterion runs each benchmark several times to reduce error in the timings, so this may take a while. On the screen, we get output that looks something like:
\begin{verbatim}
warming up
estimating clock resolution...
mean is 2.117738 us (320001 iterations)
found 20705 outliers among 319999 samples (6.5%)
  11861 (3.7%) low severe
  8844 (2.8%) high severe
estimating cost of a clock call...
mean is 67.71065 ns (20 iterations)

benchmarking cv-plain     ; 30
collecting 10 samples, 1 iterations each, in estimated 5.493829 s
mean: 477.5195 ms, lb 473.4553 ms, ub 482.8627 ms, ci 0.950
std dev: 7.914094 ms, lb 5.782302 ms, ub 10.16983 ms, ci 0.950
variance introduced by outliers: 9.000%
variance is slightly inflated by outliers
...
\end{verbatim}

\noindent The output of these tests is also stored in the file ``summary.csv''. Plotting these results produces a graph like Figure 4 in the main text.

You can modify the test parameters by adjusting these options:

\begin{code}
numtrials = 30 -- determines the sample rate along the x-axis;
               -- larger is more accurate but will take longer

testconfig = defaultConfig
    { cfgPerformGC   = ljust True
    , cfgSummaryFile = ljust "summary.csv"
    , cfgSamples     = ljust 10
    }
\end{code}

The code is:

\begin{code}
cv_runtimes dataset = defaultMainWith testconfig (return ()) $ mconcat
    [ testL "cv-plain     ; " crossValidation
    , testL "cv-plain-par ; " crossValidation_par
    , testL "cv-monoid    ; " crossValidation_monoid
    , testL "cv-monoid-par; " crossValidation_monoid_par
    , testL "cv-group     ; " crossValidation_group
    , testL "cv-group-par ; " crossValidation_group_par
    ]
    where
        modelparams = ClassificationParams
            (NBayesParams :: NBayesParams Int)
            (dsDesc dataset)

        testL str cv =
            [ bench (str++show n) $ nf (cv modelparams dataset accuracy) n
            | n <- numList dataset
            ]

        numList ds = map (\x -> floor $ (fi x)*((fi $ dsLen ds)/(fi numtrials))) [1..numtrials]
        fi = fromIntegral
\end{code}

\section{Main function}

We glue the demos together with this main function:

\begin{code}
main = do
    dataset <- datasetIOint
    cv_same dataset
    cv_runtimes dataset
\end{code}

\noindent You can run the demos on different data files by changing the |datafile| definition below. Note that on smaller data sets (e.g.\ haberman) the overhead of parallelization is too great, and there is no performance gain; on larger data sets, however, the speedup from parallelization becomes significant.
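For example, to run the demos on a CSV file of your own, you can add a descriptor in the same style as the ones below and point |datafile| at it. (The descriptor here is a hypothetical illustration; the path does not ship with the library.)

\begin{code}
-- Hypothetical example: a descriptor for your own comma-separated
-- data file, with the class label in the last column and "?"
-- marking missing values.  Adjust the fields to match your data.
datafile_custom = DatafileDesc
    { datafilePath        = "examples/datasets/my-data.csv" -- hypothetical path
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Just "?"
    }
\end{code}

\noindent The descriptors used by the demos are defined here: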
\begin{code}
datafile = datafile_audiology -- adjust this line to select the data file

datasetIO :: IO (DS_List String (LDPS String))
datasetIO = loadDataCSV datafile

datasetIOint = fmap ds2intds datasetIO

datafile_haberman = DatafileDesc
    { datafilePath        = "examples/datasets/haberman.data"
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Nothing
    }

datafile_pima = DatafileDesc
    { datafilePath        = "examples/datasets/pima-indians-diabetes.data"
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Nothing
    }

datafile_krvskp = DatafileDesc
    { datafilePath        = "examples/datasets/kr-vs-kp.data"
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Nothing
    }

datafile_audiology = DatafileDesc -- multiclass, and missing labels
    { datafilePath        = "examples/datasets/audiology.standardized.data"
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Just "?"
    }
\end{code}

\end{document}