\documentclass{article}
%include polycode.fmt
\usepackage{amsmath,amsthm,amsfonts,amssymb,graphicx,color,hyperref}
\usepackage{chngpage,array}
\usepackage{titling}

\newcommand{\subtitle}[1]{%
    \posttitle{%
        \par\end{center}
        \begin{center}\large#1\end{center}
        \vskip0.5em}%
}

\long\def\ignore#1{}

\title{Homomorphic Learning: Examples}
\author{Anonymous}

\begin{document}
\maketitle

\section{Introduction}

This file contains two short examples that empirically validate the theoretical claims of the main paper and demonstrate usage of the HLearn library.\footnote{We developed the library in Haskell because of Haskell's speed and strong support for algebraic programming.} First, we run all three cross-validation algorithms side by side and show that they return exactly the same results. Second, we time these computations. The timing results empirically verify the equations from Table 2 in the main text; this is also how we generated Figure 4.

This document is also valid literate Haskell source code. Compiling and running this file runs the experiments, so by modifying this file you can modify these experiments or create your own. To begin, follow these directions:
\begin{enumerate}
\item Download and install the latest version of the Haskell Platform from:
\begin{center}
\url{http://www.haskell.org/platform/}
\end{center}
\item Unzip the file hlearn.tgz included with the supplemental material.
\item Install the HLearn library by running the command:
\begin{verbatim}
./install.sh
\end{verbatim}
\item Compile this document by running:
\begin{verbatim}
ghc -threaded Examples.lhs
\end{verbatim}
(The \verb|-threaded| flag is required for the parallel runtime options in the next step.)
\item Run the newly generated executable:
\begin{verbatim}
./Examples +RTS -N4
\end{verbatim}
where 4 is the number of cores you want to use for the parallelization.
\end{enumerate}

\section{Demo 1: All cross-validation functions get the same results}

\ignore{
\begin{code}
import Criterion.Main
import Criterion.Config

import HLearn.Algebra
import HLearn.DataContainers
import HLearn.DataContainers.DS_List
import HLearn.Evaluation.CrossValidation
import HLearn.Models.Classification
import HLearn.Models.Classifiers.NBayes
import HLearn.Models.Distributions
\end{code}
}

The function |cv_same| runs all three versions of cross-validation and prints their results side by side. This makes it easy to verify that all three return exactly the same value, no matter the number of folds; this holds for every fold count in |k_list|. To change the values of $k$ the test runs on, modify |k_list| here:

\begin{code}
k_list = [2..20]
\end{code}

\noindent The output will look something like:
\begin{verbatim}
k=2
crossValidation:        mean=0.7549019607843137
crossValidation_monoid: mean=0.7549019607843137
crossValidation_group:  mean=0.7549019607843137
k=3
crossValidation:        mean=0.7450980392156863
crossValidation_monoid: mean=0.7450980392156863
crossValidation_group:  mean=0.7450980392156863
k=4
crossValidation:        mean=0.7466369065216105
crossValidation_monoid: mean=0.7466369065216105
crossValidation_group:  mean=0.7466369065216105
k=5
crossValidation:        mean=0.7485596320106407
crossValidation_monoid: mean=0.7485596320106407
crossValidation_group:  mean=0.7485596320106407
...
\end{verbatim}
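The agreement is no coincidence: merging models is associative and has an identity element, so evaluating the folds one at a time and merging the results gives exactly the same answer as evaluating everything in a single pass. The following self-contained sketch illustrates the idea for a toy ``model'' that just counts correct predictions. (It is an illustration only; |Score|, |mergeScore|, and the other names below are our own and are not part of the HLearn API.)

\begin{code}
-- Illustration only: a toy model that counts the correct and total
-- predictions made by a fixed classifier.
data Score = Score Int Int deriving (Eq, Show)

-- Merging is associative with identity (Score 0 0); that is,
-- Score forms a monoid.
mergeScore :: Score -> Score -> Score
mergeScore (Score c1 t1) (Score c2 t2) = Score (c1+c2) (t1+t2)

-- Score a single labelled sample.
score1 :: (a -> Bool) -> (a,Bool) -> Score
score1 f (x,y) = Score (if f x == y then 1 else 0) 1

-- Score a whole list of labelled samples.
scoreAll :: (a -> Bool) -> [(a,Bool)] -> Score
scoreAll f = foldr (mergeScore . score1 f) (Score 0 0)

-- Scoring each fold separately and merging gives exactly the same
-- answer as scoring the concatenated folds in one pass.  Because the
-- counts are integers, the equality is exact, not approximate.
prop_sameResult :: (a -> Bool) -> [[(a,Bool)]] -> Bool
prop_sameResult f folds =
    scoreAll f (concat folds)
        == foldr mergeScore (Score 0 0) (map (scoreAll f) folds)
\end{code}

\noindent The monoid and group variants of cross-validation rest on the same property, applied to the models themselves rather than just to their scores.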
\noindent The code for |cv_same| itself is:

\begin{code}
cv_same dataset = do
    let modelparams = ClassificationParams
            (NBayesParams :: NBayesParams Int)
            (dsDesc dataset)
    sequence_
        [ do
            putStrLn $ "k="++show k
            putStr "crossValidation:        "
            putStrLn $ "mean=" ++ (show $ m1 $ crossValidation modelparams dataset accuracy k)
            putStr "crossValidation_monoid: "
            putStrLn $ "mean=" ++ (show $ m1 $ crossValidation_monoid modelparams dataset accuracy k)
            putStr "crossValidation_group:  "
            putStrLn $ "mean=" ++ (show $ m1 $ crossValidation_group modelparams dataset accuracy k)
        | k <- k_list
        ]
\end{code}

\section{Demo 2: Measuring the run times}

Now we use the Criterion package\footnote{\url{http://hackage.haskell.org/package/criterion}} to measure the run times of our algorithms. Criterion runs each benchmark several times to reduce error in the timings, so this may take a while. On the screen, we get output that looks something like:
\begin{verbatim}
warming up
estimating clock resolution...
mean is 2.117738 us (320001 iterations)
found 20705 outliers among 319999 samples (6.5%)
  11861 (3.7%) low severe
  8844 (2.8%) high severe
estimating cost of a clock call...
mean is 67.71065 ns (20 iterations)

benchmarking cv-plain     ; 30
collecting 10 samples, 1 iterations each, in estimated 5.493829 s
mean: 477.5195 ms, lb 473.4553 ms, ub 482.8627 ms, ci 0.950
std dev: 7.914094 ms, lb 5.782302 ms, ub 10.16983 ms, ci 0.950
variance introduced by outliers: 9.000%
variance is slightly inflated by outliers
...
\end{verbatim}

\noindent The output of these tests is also stored in the file ``summary.csv''. Plotting these results produces a graph like Figure 4 in the main text.

You can modify the test parameters by adjusting these options:

\begin{code}
numtrials = 30 -- determines the sample rate along the x-axis;
               -- larger is more accurate but will take longer

testconfig = defaultConfig
    { cfgPerformGC   = ljust True
    , cfgSummaryFile = ljust "summary.csv"
    , cfgSamples     = ljust 10
    }
\end{code}

The code is:

\begin{code}
cv_runtimes dataset = defaultMainWith testconfig (return ()) $ mconcat
    [ testL "cv-plain     ; " crossValidation
    , testL "cv-plain-par ; " crossValidation_par
    , testL "cv-monoid    ; " crossValidation_monoid
    , testL "cv-monoid-par; " crossValidation_monoid_par
    , testL "cv-group     ; " crossValidation_group
    , testL "cv-group-par ; " crossValidation_group_par
    ]
    where
        modelparams = ClassificationParams
            (NBayesParams :: NBayesParams Int)
            (dsDesc dataset)

        testL str cv =
            [ bench (str++show n) $ nf (cv modelparams dataset accuracy) n
            | n <- numList dataset
            ]

        numList ds = map (\x -> floor $ (fi x)*((fi $ dsLen ds)/(fi numtrials))) [1..numtrials]
        fi = fromIntegral
\end{code}

\section{Main function}

We glue the demos together with this main function:

\begin{code}
main = do
    dataset <- datasetIOint
    cv_same dataset
    cv_runtimes dataset
\end{code}

\noindent You can run the demos on different data files by changing the |datafile| definition below. Note that on smaller data sets (e.g.\ haberman) the overhead of parallelization is too great, and there is no performance gain; on larger data sets, however, the speedup from parallelization becomes significant.
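For example, to run the demos on a CSV file of your own, you can add a descriptor in the same style as the ones below and point |datafile| at it. (The descriptor here is a hypothetical illustration; the path does not ship with the library.)

\begin{code}
-- Hypothetical example: a descriptor for your own comma-separated
-- data file, with the class label in the last column and "?"
-- marking missing values.  Adjust the fields to match your data.
datafile_custom = DatafileDesc
    { datafilePath        = "examples/datasets/my-data.csv" -- hypothetical path
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Just "?"
    }
\end{code}

\noindent The descriptors used by the demos are defined here: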
\begin{code}
datafile = datafile_audiology -- adjust this line to select the data file

datasetIO :: IO (DS_List String (LDPS String))
datasetIO = loadDataCSV datafile

datasetIOint = fmap ds2intds datasetIO

datafile_haberman = DatafileDesc
    { datafilePath        = "examples/datasets/haberman.data"
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Nothing
    }

datafile_pima = DatafileDesc
    { datafilePath        = "examples/datasets/pima-indians-diabetes.data"
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Nothing
    }

datafile_krvskp = DatafileDesc
    { datafilePath        = "examples/datasets/kr-vs-kp.data"
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Nothing
    }

datafile_audiology = DatafileDesc -- multiclass, and missing labels
    { datafilePath        = "examples/datasets/audiology.standardized.data"
    , datafileLabelColumn = LastC
    , datafileMissingStr  = Just "?"
    }
\end{code}

\end{document}