NLP.Tokenize
- newtype EitherList a b = E {}
- type Tokenizer = String -> EitherList String String
- tokenize :: String -> [String]
- run :: Tokenizer -> String -> [String]
- defaultTokenizer :: Tokenizer
- whitespace :: Tokenizer
- uris :: Tokenizer
- punctuation :: Tokenizer
- finalPunctuation :: Tokenizer
- initialPunctuation :: Tokenizer
- contractions :: Tokenizer
- negatives :: Tokenizer
Documentation
newtype EitherList a b
The EitherList is a newtype-wrapped list of Eithers.
Instances
Monad (EitherList a)
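The Monad instance is what allows tokenizers to be chained. The following is a minimal sketch of how such an instance could behave, not the library's source. The field name unwrap is hypothetical (the synopsis elides the record contents), and the bind semantics are an assumption based on the Left/Right behaviour documented for Tokenizer: Left values pass through, Right values are fed to the next function.

```haskell
import Control.Monad (ap)

-- Stand-in for the documented newtype; `unwrap` is a hypothetical field name.
newtype EitherList a b = E { unwrap :: [Either a b] }

instance Functor (EitherList a) where
  -- Map over Right values only; Left values are kept as they are.
  fmap f (E xs) = E (map (fmap f) xs)

instance Applicative (EitherList a) where
  pure x = E [Right x]
  (<*>)  = ap

instance Monad (EitherList a) where
  -- Assumed semantics: Left elements are preserved,
  -- Right elements are expanded by the continuation.
  E xs >>= f = E (concatMap step xs)
    where
      step (Left l)  = [Left l]
      step (Right r) = unwrap (f r)
```

With this sketch, unwrap (E [Left "a", Right "b c"] >>= \s -> E (map Right (words s))) evaluates to [Left "a", Right "b", Right "c"].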
type Tokenizer = String -> EitherList String String
A Tokenizer is a function which takes a String and returns a list of Eithers
(wrapped in a newtype). Right Strings will be passed on for processing
to tokenizers down the pipeline. Left Strings will be passed through the
pipeline unchanged. Use a Left String in a tokenizer to protect certain
tokens from further processing (e.g. see the uris tokenizer).
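The Left/Right convention described above can be illustrated with a simplified model. This sketch uses a plain list of Eithers instead of the EitherList newtype to keep it short. All names in it (Tok, andThen, splitWords, protectUrls, splitFinalDot, runTok) are hypothetical and are not part of NLP.Tokenize, and the URL check is a deliberately crude assumption.

```haskell
import Data.List (isPrefixOf)

-- Simplified stand-in for Tokenizer: a plain list instead of the newtype.
type Tok = String -> [Either String String]

-- Run the second stage on Right tokens only; Left tokens pass through unchanged.
andThen :: Tok -> Tok -> Tok
andThen f g s = concatMap step (f s)
  where
    step (Left frozen) = [Left frozen]
    step (Right live)  = g live

-- Split on whitespace; every piece stays open to later stages.
splitWords :: Tok
splitWords = map Right . words

-- Hypothetical guard: freeze anything that starts like an http(s) URL.
protectUrls :: Tok
protectUrls w
  | any (`isPrefixOf` w) ["http://", "https://"] = [Left w]
  | otherwise                                     = [Right w]

-- Hypothetical later stage: split one trailing period into its own token.
splitFinalDot :: Tok
splitFinalDot w
  | length w > 1 && last w == '.' = [Right (init w), Right "."]
  | otherwise                     = [Right w]

-- Collapse the result to plain strings once the pipeline has finished.
runTok :: Tok -> String -> [String]
runTok t = map (either id id) . t

pipeline :: Tok
pipeline = splitWords `andThen` protectUrls `andThen` splitFinalDot
```

Here runTok pipeline "See http://example.com/a. now." yields ["See", "http://example.com/a.", "now", "."]: the URL keeps its trailing period because it was frozen as a Left before the period-splitting stage ran.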
whitespace :: Tokenizer
Split string on whitespace. This is just a wrapper for Data.List.words
punctuation :: Tokenizer
Split off initial and final punctuation
finalPunctuation :: Tokenizer
Split off word-final punctuation
initialPunctuation :: Tokenizer
Split off word-initial punctuation
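One way the three punctuation splitters could work is sketched below on plain Strings. This is an illustration under assumptions, not the library's code: it treats punctuation as whatever Data.Char.isPunctuation accepts, and the helper names splitInitial, splitFinal and splitBoth are hypothetical.

```haskell
import Data.Char (isPunctuation)

-- Peel leading punctuation off a word, dropping empty pieces.
splitInitial :: String -> [String]
splitInitial w = filter (not . null) [ps, rest]
  where
    (ps, rest) = span isPunctuation w

-- Peel trailing punctuation off a word by working on the reversed string.
splitFinal :: String -> [String]
splitFinal w = filter (not . null) [reverse rest, reverse ps]
  where
    (ps, rest) = span isPunctuation (reverse w)

-- Split trailing punctuation first, then leading punctuation on each piece.
splitBoth :: String -> [String]
splitBoth w = concatMap splitInitial (splitFinal w)
```

For example, splitBoth "(hello!)" gives ["(", "hello", "!)"], and a word without punctuation is returned unchanged as a single-element list.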
contractions :: Tokenizer
Split common contractions off and freeze them. Currently deals with: 's, 'd, 've, 'll
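A rough sketch of contraction splitting follows. It assumes that "freeze" means emitting the split-off suffix as a Left so that later stages leave it alone, while the stem stays a Right. The names suffixes and splitContraction are hypothetical, and a plain list of Eithers stands in for EitherList.

```haskell
import Data.List (isSuffixOf)

-- The suffixes listed in the documentation above.
suffixes :: [String]
suffixes = ["'s", "'d", "'ve", "'ll"]

-- Split the first matching suffix off a word. The suffix is frozen as a Left;
-- the stem remains a Right and can still be processed by later stages.
-- A word that consists only of a suffix is left untouched.
splitContraction :: String -> [Either String String]
splitContraction w =
  case [ s | s <- suffixes, s `isSuffixOf` w, length s < length w ] of
    (s : _) -> [Right (take (length w - length s) w), Left s]
    []      -> [Right w]
```

Under these assumptions, splitContraction "they'll" returns [Right "they", Left "'ll"], and splitContraction "dog" returns [Right "dog"].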