# cassava-records

A library extension for ```cassava``` (the Haskell CSV parsing library) that automatically creates a ```Record``` data type from a columnar input file, e.g. a CSV file.

# What is this tool for?

Say you are working on a project that involves processing a number of comma-separated or tab-separated files. Assuming you are using ```cassava``` to load the input files, a typical workflow would be:

a. Inspect the file that contains ```Salaries``` of ```Employees```.
b. Create a ```Record``` data type called ```Salaries``` to reflect the columns and types found in the file.
c. Create the instances of the ```Salaries``` data type that are required to load the file with ```cassava```.

Now imagine that the file you are inspecting contains tens or hundreds of columns. As a good Haskeller you will want to automate steps (a) and (b) to the extent possible. That is precisely what this library does: given an input file (comma- or tab-separated, for example), ```cassava-records``` reads the whole file, infers a basic data type for each column, and automatically creates a ```Record``` with the inferred field types using ```Template Haskell```.

# Quick Start

## Example 1: Using data/salaries_simple.csv

```
emp_no,name,salary,status,years
1,John Doe,100.0,True,1
2,Jill Doe, 200.10,False,2
3,John Doe Sr,101.0,T,3
4,Jill Doe Sr, 10101.10,f,4.2
5,John Doe Jr,1010101.0,true,5.1
6,Jill Doe Jr, 10101.10,false,6
```

``` haskell
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT

$(makeCsvRecord "Salaries" "data/salaries_simple.csv" "_" commaOptions)
```

The ```makeCsvRecord``` function takes 4 arguments:

1. ```"Salaries"```: a ```String``` that will be used as the name of the ```Record```.
2. ```"data/salaries_simple.csv"```: the path to the input file.
3. ```"_"```: a string used to prefix each field name. Useful if we need to build lenses for this record.
4. ```commaOptions```: the ```defaultDecodeOptions``` defined in the ```cassava``` library.

If you load this code in ```GHCi```, you will see

``` haskell
1> :info Salaries
data Salaries = Salaries {_emp_no :: Integer,
                          _name :: Text,
                          _salary :: Double,
                          _status :: Bool,
                          _years :: Double}
```

Note that all field names are lower case and prefixed with "_". For consistency, ```cassava-records``` converts all column headers to lower case before creating the corresponding field names. Therefore, if the column headers in the file are upper case, we need to provide a field modifier when creating the ```ToNamedRecord``` and ```FromNamedRecord``` instances for ```cassava```. Mixed-case column headers are trickier; the current version of the library does not handle them very well.

There is a convenience function called ```makeInstance``` that can create the instances required by ```cassava```. The instances it creates use the default ```fieldModifierOptions``` settings shown below.

``` haskell
fieldModifierOptions :: Options
fieldModifierOptions = defaultOptions { fieldLabelModifier = rmUnderscore }
  where
    rmUnderscore ('_':str) = DT.unpack . DT.toUpper . DT.pack $ str
    rmUnderscore str       = str

-- the ToNamedRecord and FromNamedRecord instances are needed by cassava;
-- the fieldLabelModifier maps the prefixed field names back to column headers
instance ToNamedRecord Salaries where
  toNamedRecord = genericToNamedRecord fieldModifierOptions

instance FromNamedRecord Salaries where
  parseNamedRecord = genericParseNamedRecord fieldModifierOptions

main :: IO ()
main = do
  v <- loadData "data/salaries_simple.csv" :: IO (V.Vector Salaries)
  putStrLn . show $ v
```

In ```GHCi``` we see (formatted for clarity)

``` haskell
2> loadData "data/salaries_simple.csv"
[Salaries {_emp_no = 1, _name = "John Doe", _salary = 100.0, _status = False, _years = 1.0},
 Salaries {_emp_no = 2, _name = "Jill Doe", _salary = 200.1, _status = False, _years = 2.0},
 Salaries {_emp_no = 3, _name = "John Doe Sr", _salary = 101.0, _status = True, _years = 3.0},
 Salaries {_emp_no = 4, _name = "Jill Doe Sr", _salary = 10101.1, _status = False, _years = 4.2},
 Salaries {_emp_no = 5, _name = "John Doe Jr", _salary = 1010101.0, _status = False, _years = 5.1},
 Salaries {_emp_no = 6, _name = "Jill Doe Jr", _salary = 10101.1, _status = False, _years = 6.0}]
```

The type inference in the example above works as follows:

1. If a column only has values from the set {```true```, ```t```, ```false```, ```f```} (ignoring case), then the inferred type is ```Bool```.
2. If a column has values that are all numeric, then ```Integer``` is attempted first; otherwise ```Double``` is inferred. For example, the inferred type for ```emp_no``` is ```Integer```, whereas for ```years``` it is ```Double```.
3. In all other cases, ```Text``` is inferred.
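To make these rules concrete, here is a minimal, self-contained sketch of how a single column's type could be inferred under the rules above. This is only an illustration, not the code that ```cassava-records``` actually uses; the ```ColType```, ```inferValue``` and ```inferColumn``` names are made up for this example.

``` haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as DT
import qualified Data.Text.Read as DTR

-- Hypothetical column types, mirroring the types the library can infer.
data ColType = TBool | TInteger | TDouble | TText deriving (Eq, Show)

-- Rules 1-3 applied to a single cell.
inferValue :: DT.Text -> ColType
inferValue raw
  | DT.toLower v `elem` ["true", "t", "false", "f"] = TBool
  | isInteger v                                     = TInteger
  | isDouble v                                      = TDouble
  | otherwise                                       = TText
  where
    v = DT.strip raw
    isInteger t = case DTR.signed DTR.decimal t :: Either String (Integer, DT.Text) of
                    Right (_, rest) -> DT.null rest
                    _               -> False
    isDouble t  = case DTR.signed DTR.double t of
                    Right (_, rest) -> DT.null rest
                    _               -> False

-- A column gets the most general type that fits every cell:
-- Integer and Double combine to Double; any other mix falls back to Text.
inferColumn :: [DT.Text] -> ColType
inferColumn []     = TText
inferColumn (c:cs) = foldr (combine . inferValue) (inferValue c) cs
  where
    combine a b
      | a == b                                  = a
      | all (`elem` [TInteger, TDouble]) [a, b] = TDouble
      | otherwise                               = TText
```

In the same spirit, a column that also contains empty cells has its inferred type wrapped in ```Maybe```, as the next example shows.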
## Example 2: Missing Values

The library also supports type inference when values are missing. For example, in ```data/salaries_mixed_input.csv```

```
emp_no,name,salary,status,years
1,John Doe,100.0,True,1
2,Jill Doe, 200.10,False,2
3,John Doe Sr,101.0,T,
4,Jill Doe Sr,10101.10,,4.2
5,John Doe Jr,,true,5.1
6,, 10101.10,false,6
```

the ```years``` value for John Doe Sr, the ```status``` for Jill Doe Sr, the ```salary``` for John Doe Jr and the ```name``` in the last row are missing. In such cases, the inferred type of the column is wrapped in ```Maybe```, and the generated record looks as follows:

``` haskell
3> :info SalariesMixed
data SalariesMixed = SalariesMixed {_emp_no :: Integer,
                                    _name :: Maybe Text,
                                    _salary :: Maybe Double,
                                    _status :: Maybe Bool,
                                    _years :: Maybe Double}
```

The code to load this data is:

``` haskell
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT

$(makeCsvRecord "SalariesMixed" "data/salaries_mixed_input.csv" "_" commaOptions)
$(makeInstance "SalariesMixed")
-- note that we can use makeInstance instead of manually defining
-- the instances required by cassava, as we did in Example 1

main :: IO ()
main = do
  v <- loadData "data/salaries_mixed_input.csv" :: IO (V.Vector SalariesMixed)
  putStrLn . show $ v
```

The output will be as follows:

```
[SalariesMixed {_emp_no = 1, _name = Just "John Doe", _salary = Just 100.0, _status = Just False, _years = Just 1.0},
 SalariesMixed {_emp_no = 2, _name = Just "Jill Doe", _salary = Just 200.1, _status = Just False, _years = Just 2.0},
 SalariesMixed {_emp_no = 3, _name = Just "John Doe Sr", _salary = Just 101.0, _status = Just True, _years = Nothing},
 SalariesMixed {_emp_no = 4, _name = Just "Jill Doe Sr", _salary = Just 10101.1, _status = Nothing, _years = Just 4.2},
 SalariesMixed {_emp_no = 5, _name = Just "John Doe Jr", _salary = Nothing, _status = Just False, _years = Just 5.1},
 SalariesMixed {_emp_no = 6, _name = Nothing, _salary = Just 10101.1, _status = Just False, _years = Just 6.0}]
```
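Because the generated record is an ordinary Haskell data type, the loaded vector can be processed with the usual functions from ```Data.Vector```. Here is a small usage sketch that assumes the ```SalariesMixed``` module above; the function names are made up for illustration:

``` haskell
import Data.Maybe (isNothing)

-- Total of the salaries that are actually present; missing values are skipped.
totalKnownSalaries :: V.Vector SalariesMixed -> Double
totalKnownSalaries = V.sum . V.mapMaybe _salary

-- Number of rows with no salary recorded.
missingSalaryCount :: V.Vector SalariesMixed -> Int
missingSalaryCount = V.length . V.filter (isNothing . _salary)
```

```V.mapMaybe``` drops the ```Nothing``` entries, so rows with a missing salary are simply skipped in the total.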
Here is a complete working example that uses both input files:

``` haskell
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE DuplicateRecordFields #-}
{-# LANGUAGE DeriveDataTypeable #-}

module Main where

import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT
import qualified Text.PrettyPrint.Tabulate as T
import Language.Haskell.TH
-- import Control.Lens hiding (element)

$(makeCsvRecord "Salaries" "data/salaries_simple.csv" "_" commaOptions)
-- $(makeInstance "Salaries")
$(makeCsvRecord "SalariesMixed" "data/salaries_mixed_input.csv" "_" commaOptions)
$(makeInstance "SalariesMixed")

-- the following instances are not required if the $(makeInstance "Salaries")
-- splice above is uncommented
myOptions :: Options
myOptions = defaultOptions { fieldLabelModifier = rmUnderscore }
  where
    rmUnderscore ('_':str) = DT.unpack . DT.toUpper . DT.pack $ str
    rmUnderscore str       = str

instance ToNamedRecord Salaries where
  toNamedRecord = genericToNamedRecord myOptions

instance FromNamedRecord Salaries where
  parseNamedRecord = genericParseNamedRecord myOptions

main :: IO ()
main = do
  v  <- loadData "data/salaries_simple.csv" :: IO (V.Vector Salaries)
  v1 <- loadData "data/salaries_mixed_input.csv" :: IO (V.Vector SalariesMixed)
  putStrLn . show $ v
  putStrLn . show $ v1
```

# Caveats (or a list of future enhancements)

1. The column names, together with the prefix ("_"), must form valid Haskell field names. For example, column names cannot contain spaces or other characters that are not allowed in the ```field``` names of a ```Record``` data type.
2. The library loads the whole file at compile time to infer the types. Depending on the size of the file, this can increase compile time noticeably. Alternative workflows, such as pointing the splice at a stripped-down copy of the file, are recommended (see the sketch below). In the future, ```makeCsvRecord``` may take a parameter specifying the maximum number of rows used to infer the types.
3. The inferred types are limited to ```Text```, ```Bool```, ```Integer```, ```Double``` and the ```Maybe``` variants of those. Future versions may add support for ```DateTime```.
4. Mixed-case column headers are not automatically supported; a more complex form of ```fieldModifierOptions``` needs to be provided.
5. Currently there is no option to provide custom types.
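As a possible workaround for caveat 2 until such a parameter exists, you can generate a small sample file once (the header plus the first few data rows), point ```makeCsvRecord``` at the sample, and still pass the full file to ```loadData``` at run time. The sketch below uses a made-up helper name and standard ```bytestring``` functions; it is not part of the library:

``` haskell
import qualified Data.ByteString.Lazy.Char8 as BLC

-- Write the header line plus the first n data rows of `input` to `output`.
-- The generated sample can then be used as the second argument of makeCsvRecord.
writeSample :: FilePath -> FilePath -> Int -> IO ()
writeSample input output n = do
  contents <- BLC.readFile input
  BLC.writeFile output . BLC.unlines . take (n + 1) . BLC.lines $ contents
```

Note that type inference then only sees the sampled rows, so the sample must be representative of the full file (in particular, columns with missing values should contain at least one empty cell in the sample).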