datasets-0.3.0: Classical data sets for statistics and machine learning

Safe Haskell: None
Language: Haskell2010

Numeric.Datasets

Description

The datasets package defines three kinds of datasets:

  • Tiny datasets (up to a few tens of rows) are embedded directly in the library source code, as lists of values.
  • Small data sets are embedded indirectly (via file-embed) in the package as pure values; no IO is needed to load them, since the data is read and parsed at compile time.
  • Larger data sets are fetched over the network and cached in a local temporary directory for subsequent use.

This module defines the getDataset function for fetching datasets, as well as utilities for defining new data sets and modifying their options. It is only necessary to import this module when using fetched data sets; embedded data sets can be used directly.

Please refer to the dataset modules for examples.

Documentation

getDataset :: Dataset h a -> IO [a] Source #

Load a dataset, using the system temporary directory as a cache
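
For example, a data set can be declared from a remote CSV source and then loaded with getDataset. This is a minimal sketch: the Point record and the example.com URL are placeholders, and it assumes the Url h in Source is the URL type from Network.HTTP.Req (built with https and /:).

{-# LANGUAGE DataKinds         #-}
{-# LANGUAGE DeriveGeneric     #-}
{-# LANGUAGE OverloadedStrings #-}

import Numeric.Datasets (Dataset, Source (..), csvDataset, getDataset)
import Network.HTTP.Req (Scheme (..), https, (/:))
import Data.Csv (FromRecord)
import GHC.Generics (Generic)

-- Hypothetical two-column CSV data set; the record type and URL are placeholders.
data Point = Point { px :: Double, py :: Double }
  deriving (Show, Generic)

instance FromRecord Point

points :: Dataset 'Https Point
points = csvDataset (URL (https "example.com" /: "points.csv"))

main :: IO ()
main = do
  rows <- getDataset points   -- downloads on first use, then reads from the cache
  print (take 5 rows)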

data Dataset h a Source #

A Dataset contains metadata for loading, caching, preprocessing and parsing data.

Constructors

Dataset 

data Source h Source #

A Dataset source can be either a URL (for remotely-hosted datasets) or the filepath of a local file.

Constructors

URL (Url h) 
File FilePath 

Parsing datasets

readDataset Source #

Arguments

:: ReadAs a

How to parse the raw data string

-> ByteString

The data string

-> [a] 

Parse a ByteString into a list of Haskell values
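
A raw ByteString can also be parsed directly, without going through a Dataset. A minimal sketch: the Point record is made up, and the input is assumed to be headerless, comma-separated CSV (as csvRecord expects below).

{-# LANGUAGE DeriveGeneric     #-}
{-# LANGUAGE OverloadedStrings #-}

import Numeric.Datasets (readDataset, csvRecord)
import Data.Csv (FromRecord)
import GHC.Generics (Generic)

-- Hypothetical record for a two-column, headerless CSV input.
data Point = Point { px :: Double, py :: Double }
  deriving (Show, Generic)

instance FromRecord Point

main :: IO ()
main = print (readDataset csvRecord "1.0,2.0\n3.0,4.0\n" :: [Point])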

data ReadAs a where Source #

ReadAs is a datatype describing the format in which a data set is stored

csvRecord :: FromRecord a => ReadAs a Source #

A ReadAs value for CSV records using the default decoding options (i.e. columns separated by commas)

Defining datasets

csvDataset :: FromRecord a => Source h -> Dataset h a Source #

Define a dataset from a source for a CSV file

csvHdrDataset :: FromNamedRecord a => Source h -> Dataset h a Source #

Define a dataset from a source for a CSV file with a known header

csvHdrDatasetSep :: FromNamedRecord a => Char -> Source h -> Dataset h a Source #

Define a dataset from a source for a CSV file with a known header and separator

csvDatasetSkipHdr :: FromRecord a => Source h -> Dataset h a Source #

Define a dataset from a source for a CSV file, skipping the header line

jsonDataset :: FromJSON a => Source h -> Dataset h a Source #

Define a dataset from a source for a JSON file
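
Putting these together, a new data set is typically one declaration per source. A rough sketch with made-up record types and URLs, again assuming the URL combinators from Network.HTTP.Req:

{-# LANGUAGE DataKinds         #-}
{-# LANGUAGE DeriveGeneric     #-}
{-# LANGUAGE OverloadedStrings #-}

import Numeric.Datasets (Dataset, Source (..), csvHdrDataset, csvHdrDatasetSep)
import Network.HTTP.Req (Scheme (..), https, (/:))
import Data.Csv (FromNamedRecord)
import GHC.Generics (Generic)

-- Hypothetical record matching a CSV file with a "name,score" header line.
data Score = Score { name :: String, score :: Double }
  deriving (Show, Generic)

instance FromNamedRecord Score

-- Comma-separated file with a header row.
scores :: Dataset 'Https Score
scores = csvHdrDataset (URL (https "example.com" /: "scores.csv"))

-- The same data as a tab-separated file.
scoresTsv :: Dataset 'Https Score
scoresTsv = csvHdrDatasetSep '\t' (URL (https "example.com" /: "scores.tsv"))

jsonDataset is used in the same way for a type with a FromJSON instance, and a File source can replace the URL for locally stored data.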

Dataset options

withPreprocess :: (ByteString -> ByteString) -> Dataset h a -> Dataset h a Source #

Add a preprocessing stage to a Dataset: the raw data will be transformed by the given function before it is parsed.

withTempDir :: FilePath -> Dataset h a -> Dataset h a Source #

Specify the temporary directory used to cache the dataset after it has first been downloaded.
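
Both options compose by ordinary function application. Continuing the hypothetical scores dataset from the sketch above:

-- Drop a one-line preamble before parsing, and cache the download under
-- ./cache instead of the system temporary directory ('scores' and the Score
-- type are the hypothetical definitions from the earlier sketch).
scoresLocal :: Dataset 'Https Score
scoresLocal = withTempDir "./cache" (withPreprocess (dropLines 1) scores)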

Preprocessing functions

These functions are intended to be used as the first argument of withPreprocess, in order to clean up the raw data before it reaches the parser.

dropLines :: Int -> ByteString -> ByteString Source #

Drop a given number of lines from the start of a ByteString

fixedWidthToCSV :: ByteString -> ByteString Source #

Convert fixed-width formatted data to CSV

removeEscQuotes :: ByteString -> ByteString Source #

Filter out escaped double quotes from a field

fixAmericanDecimals :: ByteString -> ByteString Source #

Turn US-style decimals starting with a period (e.g. .2) into something cassava can parse (e.g. 0.2)
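
Since these are plain ByteString-to-ByteString functions, they can also be composed and applied by hand together with readDataset. A sketch with made-up input, assuming csvRecord treats its input as headerless CSV:

{-# LANGUAGE OverloadedStrings #-}

import Numeric.Datasets (readDataset, csvRecord, dropLines, fixAmericanDecimals)

-- Drop the header line and normalise ".5"-style decimals, then parse the
-- remaining rows into pairs of Doubles (cassava has FromRecord instances for tuples).
parsed :: [(Double, Double)]
parsed = readDataset csvRecord
       . fixAmericanDecimals
       . dropLines 1
       $ "x,y\n1.0,.5\n2.0,.25\n"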

Helper functions

parseReadField :: Read a => Field -> Parser a Source #

Parse a CSV field using its Read instance

parseDashToCamelField :: Read a => Field -> Parser a Source #

Parse a field, first converting dash-separated words to CamelCase so that the Read instance can be used
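
These helpers are typically used to write FromField instances for enumeration-like columns. A minimal sketch with a made-up Colour type:

import Numeric.Datasets (parseReadField)
import Data.Csv (FromField (..))

-- Hypothetical column stored as the literal strings "Red", "Green" or "Blue".
data Colour = Red | Green | Blue
  deriving (Show, Read)

instance FromField Colour where
  parseField = parseReadField
  -- For a column stored as e.g. "light-blue", parseDashToCamelField could be
  -- used instead (assuming a matching LightBlue constructor).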

yearToUTCTime :: Double -> UTCTime Source #

Convert a fractional year to a UTCTime, with second-level precision (leap seconds are not taken into account)
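
For example (a sketch; the exact timestamps depend on the year length used in the conversion):

import Numeric.Datasets (yearToUTCTime)
import Data.Time (UTCTime)

-- Fractional years, as found in some time-series data sets, mapped to UTCTime.
timestamps :: [UTCTime]
timestamps = map yearToUTCTime [1998.0, 1998.25, 1998.5, 1998.75]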

Dataset source URLs