tokenize-0.1.0: Simple tokenizer for English text.




newtype EitherList a b

The EitherList is a newtype-wrapped list of Eithers.




unE :: [Either a b]


type Tokenizer = String -> EitherList String String

A Tokenizer is a function which takes a string and returns a list of Eithers (wrapped in a newtype). Right Strings will be passed on for processing to tokenizers down the pipeline. Left Strings will be passed through the pipeline unchanged. Use a Left String in a tokenizer to protect certain tokens from further processing (e.g. see the uris tokenizer).
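The Left/Right convention can be illustrated with a small self-contained sketch. It redeclares the two documented types locally; the constructor name E, the andThen combinator and the three toy tokenizers are hypothetical names introduced only for this illustration and are not taken from the package.

```haskell
import Data.List (isPrefixOf)

-- Local stand-ins for the documented types (constructor name E is assumed).
newtype EitherList a b = E { unE :: [Either a b] }

type Tokenizer = String -> EitherList String String

-- Hypothetical combinator: run g on every Right token produced by f,
-- and pass every Left (frozen) token through untouched.
andThen :: Tokenizer -> Tokenizer -> Tokenizer
andThen f g input = E (concatMap step (unE (f input)))
  where
    step (Left frozen) = [Left frozen]
    step (Right tok)   = unE (g tok)

-- Toy tokenizer: split on whitespace, leaving every piece open (Right).
splitWords :: Tokenizer
splitWords = E . map Right . words

-- Toy tokenizer: freeze anything that starts with "http://".
freezeHttp :: Tokenizer
freezeHttp tok
  | "http://" `isPrefixOf` tok = E [Left tok]
  | otherwise                  = E [Right tok]

-- Toy tokenizer: split a trailing period off a longer token.
splitFinalDot :: Tokenizer
splitFinalDot tok
  | length tok > 1 && last tok == '.' = E [Right (init tok), Right "."]
  | otherwise                         = E [Right tok]

-- Drop the Left/Right tags to obtain plain tokens.
flatten :: EitherList String String -> [String]
flatten = map (either id id) . unE

demo :: [String]
demo = flatten ((splitWords `andThen` freezeHttp `andThen` splitFinalDot)
                  "Read http://example.org/. Done.")
```

In this sketch the frozen token "http://example.org/." keeps its trailing period, while "Done." is split into "Done" and ".", because only Right tokens reach splitFinalDot.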

tokenize :: String -> [String]

Split string into words using the default tokenizer pipeline

run :: Tokenizer -> String -> [String]

Run a tokenizer

whitespace :: Tokenizer

Split string on whitespace. This is just a wrapper for the Prelude function words.
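As a rough sketch of how run and whitespace might fit together, the following redeclares the documented types locally and assumes that running a tokenizer simply unwraps the newtype and drops the Left/Right tags. The names runSketch and whitespaceSketch, and the constructor E, are hypothetical and chosen so as not to suggest they are the package's own definitions.

```haskell
-- Local stand-ins for the documented types (constructor name E is assumed).
newtype EitherList a b = E { unE :: [Either a b] }

type Tokenizer = String -> EitherList String String

-- Assumed behaviour of run: apply the tokenizer, unwrap the newtype,
-- and forget whether each token was frozen (Left) or open (Right).
runSketch :: Tokenizer -> String -> [String]
runSketch tokenizer = map (either id id) . unE . tokenizer

-- A wrapper for words: every whitespace-separated piece becomes a Right
-- token, so later tokenizers in a pipeline may still process it.
whitespaceSketch :: Tokenizer
whitespaceSketch = E . map Right . words
```

Under these assumptions, runSketch whitespaceSketch behaves exactly like words on any input string.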

uris :: Tokenizer

Detect common URIs and freeze them (return them as Left so that later tokenizers leave them unchanged).

punctuation :: Tokenizer

Split off initial and final punctuation

finalPunctuation :: Tokenizer

Split off word-final punctuation
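One possible reading of word-final splitting is sketched below. It assumes that "punctuation" means Data.Char.isPunctuation, that a run of trailing punctuation is kept together as one token, and that a token made up entirely of punctuation is left whole; none of these details are stated above. The name finalPunctuationSketch and the constructor E are hypothetical.

```haskell
import Data.Char (isPunctuation)

-- Local stand-ins for the documented types (constructor name E is assumed).
newtype EitherList a b = E { unE :: [Either a b] }

type Tokenizer = String -> EitherList String String

-- Split a trailing run of punctuation off a word. Both pieces stay Right,
-- so later tokenizers may still process them.
finalPunctuationSketch :: Tokenizer
finalPunctuationSketch tok =
  case span isPunctuation (reverse tok) of
    ([], _)    -> E [Right tok]                 -- no trailing punctuation
    (_, [])    -> E [Right tok]                 -- token is all punctuation
    (ps, rest) -> E [Right (reverse rest), Right (reverse ps)]
```

A word-initial variant could be written the same way by applying span to the token itself instead of to its reverse.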

initialPunctuation :: Tokenizer

Split off word-initial punctuation

negatives :: Tokenizer

Split words ending in n't, and freeze n't
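The description of negatives can be sketched as follows, assuming the stem stays a Right token (open to further processing) while the n't suffix becomes a Left token (frozen). Leaving a bare "n't" unsplit is an additional assumption. The name negativesSketch and the constructor E are hypothetical.

```haskell
import Data.List (isSuffixOf)

-- Local stand-ins for the documented types (constructor name E is assumed).
newtype EitherList a b = E { unE :: [Either a b] }

type Tokenizer = String -> EitherList String String

-- Split "don't" into an open stem "do" and a frozen suffix "n't".
negativesSketch :: Tokenizer
negativesSketch tok
  | "n't" `isSuffixOf` tok && length tok > 3 =
      E [Right (take (length tok - 3) tok), Left "n't"]
  | otherwise = E [Right tok]
```

Freezing the suffix matters when a punctuation tokenizer runs later in the pipeline, since the apostrophe in n't would otherwise be a candidate for splitting.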