| Safe Haskell | Safe-Inferred | 
|---|---|
| Language | Haskell2010 | 
NLP.Tokenize.String
- newtype EitherList a b = E {}
 - type Tokenizer = String -> EitherList String String
 - tokenize :: String -> [String]
 - run :: Tokenizer -> String -> [String]
 - defaultTokenizer :: Tokenizer
 - whitespace :: Tokenizer
 - uris :: Tokenizer
 - punctuation :: Tokenizer
 - finalPunctuation :: Tokenizer
 - initialPunctuation :: Tokenizer
 - allPunctuation :: Tokenizer
 - contractions :: Tokenizer
 - negatives :: Tokenizer
 
Documentation
newtype EitherList a b
The EitherList is a newtype-wrapped list of Eithers.
Instances
 - Monad (EitherList a)
 - Functor (EitherList a)
 - Applicative (EitherList a)
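
To make the shape of the type concrete, here is a small local model of it. This is a sketch, not the library's source: the field name `getEithers` and the helper `flatten` are placeholders chosen for illustration, since the synopsis does not show the real accessor.

```haskell
-- Local model for illustration only; not the library's definition.
-- The field name getEithers is hypothetical.
newtype EitherList a b = E { getEithers :: [Either a b] }

-- A Left marks a token that later stages must not touch;
-- a Right marks a token that is still open to further splitting.
sample :: EitherList String String
sample = E [Left "http://example.com/a,b", Right "hello", Right ","]

-- Drop the Left/Right tags to recover a plain list of tokens.
flatten :: EitherList a a -> [a]
flatten = map (either id id) . getEithers
```

Applying `flatten` to `sample` yields the three strings in their original order, regardless of how each one was tagged.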
type Tokenizer = String -> EitherList String String
A Tokenizer is a function that takes a String and returns a list of Eithers
  (wrapped in a newtype). Right Strings are passed on for processing to
  tokenizers further down the pipeline. Left Strings pass through the
  pipeline unchanged. Use a Left String in a tokenizer to protect certain
  tokens from further processing (e.g. see the uris tokenizer).
  You can define your own custom tokenizer pipelines by chaining tokenizers together.
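
One way to picture the chaining is Kleisli composition (`>=>` from Control.Monad) over a Monad instance that leaves Left values alone and feeds Right values to the next stage. The self-contained sketch below models that behaviour under stated assumptions: the instances are a plausible reading of the description above rather than the library's code, and `getEithers`, `onSpaces`, `freezeHttp`, `trailingComma`, `myTokenizer` and `tokens` are all hypothetical names.

```haskell
import Control.Monad (ap, (>=>))
import Data.List (isPrefixOf)

-- Local model; the field name getEithers is hypothetical.
newtype EitherList a b = E { getEithers :: [Either a b] }

instance Functor (EitherList a) where
  fmap f (E xs) = E (map (fmap f) xs)

instance Applicative (EitherList a) where
  pure x = E [Right x]
  (<*>)  = ap

instance Monad (EitherList a) where
  -- Left values are kept as they are; Right values go to the next stage.
  E xs >>= f = E (concatMap step xs)
    where
      step (Left l)  = [Left l]
      step (Right r) = getEithers (f r)

type Tokenizer = String -> EitherList String String

-- Hypothetical stage: split on whitespace, leaving every piece open.
onSpaces :: Tokenizer
onSpaces = E . map Right . words

-- Hypothetical stage: freeze anything that starts like an http URI.
freezeHttp :: Tokenizer
freezeHttp w
  | "http://" `isPrefixOf` w = E [Left w]
  | otherwise                = E [Right w]

-- Hypothetical stage: split off a single trailing comma.
trailingComma :: Tokenizer
trailingComma w
  | length w > 1 && last w == ',' = E [Right (init w), Right ","]
  | otherwise                     = E [Right w]

-- Stages run left to right; frozen tokens skip the later ones.
myTokenizer :: Tokenizer
myTokenizer = onSpaces >=> freezeHttp >=> trailingComma

-- Strip the Left/Right tags to get plain tokens.
tokens :: String -> [String]
tokens = map (either id id) . getEithers . myTokenizer
```

In this model the URI keeps its trailing comma because it was frozen before `trailingComma` ran, while an ordinary word loses it. With the real module, a pipeline of the same shape would presumably read `myTokenizer = whitespace >=> allPunctuation`.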
whitespace :: Tokenizer
Split the string on whitespace. This is just a wrapper for Data.List.words.
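
As a rough illustration of what such a wrapper amounts to, the sketch below tags each piece returned by `words` as a Right. It works on a bare list of Eithers rather than the newtype, and the name `splitWs` is hypothetical.

```haskell
-- Hypothetical stand-in for a whitespace stage, without the newtype wrapper.
-- words collapses runs of whitespace and ignores leading and trailing blanks.
splitWs :: String -> [Either String String]
splitWs = map Right . words
```

Because every piece is tagged Right, later stages in a pipeline remain free to split it further.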
punctuation :: Tokenizer
Split off initial and final punctuation
finalPunctuation :: Tokenizer
Split off word-final punctuation
initialPunctuation :: Tokenizer
Split off word-initial punctuation
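
The three punctuation tokenizers can be approximated with `span` and `Data.Char.isPunctuation`. The sketch below uses plain strings instead of the EitherList wrapper and assumes that a run of punctuation stays together as one token; the names `splitFinal`, `splitInitial` and `splitBoth` are hypothetical, and the library's actual behaviour may differ in edge cases.

```haskell
import Data.Char (isPunctuation)

-- Peel trailing punctuation off a word, dropping empty pieces.
splitFinal :: String -> [String]
splitFinal w = filter (not . null) [reverse core, reverse ps]
  where (ps, core) = span isPunctuation (reverse w)

-- Peel leading punctuation off a word, dropping empty pieces.
splitInitial :: String -> [String]
splitInitial w = filter (not . null) [ps, rest]
  where (ps, rest) = span isPunctuation w

-- Final punctuation first, then initial punctuation on each piece.
splitBoth :: String -> [String]
splitBoth = concatMap splitInitial . splitFinal
```

Word-internal punctuation, as in `e-mail`, is left alone by all three functions because only the edges of the word are inspected.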
allPunctuation :: Tokenizer
Split tokens on transitions between punctuation and non-punctuation characters. This tokenizer is not included in the defaultTokenizer pipeline because dealing with word-internal punctuation is quite application-specific.
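
Splitting on transitions can be sketched with `Data.List.groupBy`, grouping adjacent characters that agree on `isPunctuation`. The function name `splitTransitions` is hypothetical and the sketch returns plain strings rather than an EitherList.

```haskell
import Data.Char (isPunctuation)
import Data.List (groupBy)

-- Start a new token whenever the text switches between
-- punctuation and non-punctuation characters.
splitTransitions :: String -> [String]
splitTransitions = groupBy (\a b -> isPunctuation a == isPunctuation b)
```

Applied to `can't` this produces three pieces (`can`, `'`, `t`), which shows why word-internal splitting is often too aggressive for a default pipeline.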
contractions :: Tokenizer
Split common contractions off and freeze them. Currently deals with: 'm, 's, 'd, 've, 'll
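
A suffix-based reading of this description is sketched below: the stem stays a Right so later stages can still process it, and the contraction becomes a Left so it is frozen. The names `suffixes` and `splitContraction` are hypothetical, the sketch uses a bare list of Eithers, and the guard against splitting a word that consists only of the suffix is an assumption of this example.

```haskell
import Data.List (isSuffixOf)
import Data.Maybe (mapMaybe)

-- The contraction suffixes listed in the description above.
suffixes :: [String]
suffixes = ["'m", "'s", "'d", "'ve", "'ll"]

-- Split the first matching suffix off a word and freeze it as a Left.
splitContraction :: String -> [Either String String]
splitContraction w =
  case mapMaybe try suffixes of
    (pieces : _) -> pieces
    []           -> [Right w]
  where
    try sfx
      | sfx `isSuffixOf` w && length w > length sfx =
          Just [Right (take (length w - length sfx) w), Left sfx]
      | otherwise = Nothing
```

Words without a listed suffix come back unchanged as a single Right.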