tokenize-0.3.0: Simple tokenizer for English text.

Safe Haskell: Safe-Inferred
Language: Haskell2010

NLP.Tokenize.Text

Description

NLP tokenizer, adapted from the String-based tokenizer in the tokenize package to use Text instead.

Documentation

newtype EitherList a b

The EitherList is a newtype-wrapped list of Eithers.

Constructors

E 

Fields

unE :: [Either a b]
 

type Tokenizer = Text -> EitherList Text Text

A Tokenizer is a function which takes a Text and returns a list of Eithers (wrapped in a newtype). Right Texts are passed on to tokenizers further down the pipeline for processing. Left Texts pass through the pipeline unchanged. Use a Left Text in a tokenizer to protect certain tokens from further processing (e.g. see the uris tokenizer). You can define your own custom tokenizer pipelines by chaining tokenizers together.
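A minimal, self-contained sketch of such chaining. It assumes EitherList has a Monad instance that feeds Right values to the next stage and leaves Left values untouched; this page does not list instances, so that is an assumption. The newtype is redefined locally to mirror the one documented above, and splitWs, freezeTags, shout, pipeline and runSketch are hypothetical names, not part of the package.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad ((>=>))
import Data.Text (Text)
import qualified Data.Text as T

-- Local stand-in for the documented newtype.
newtype EitherList a b = E { unE :: [Either a b] }

instance Functor (EitherList a) where
  fmap f (E xs) = E (map (fmap f) xs)

instance Applicative (EitherList a) where
  pure x = E [Right x]
  ef <*> ex = ef >>= \f -> fmap f ex

-- Assumed semantics: Right values go to the next stage,
-- Left values are kept exactly as they are.
instance Monad (EitherList a) where
  E xs >>= f = E (concatMap (either (\l -> [Left l]) (unE . f)) xs)

type Tokenizer = Text -> EitherList Text Text

-- Hypothetical stage: split on whitespace; every piece stays processable.
splitWs :: Tokenizer
splitWs = E . map Right . T.words

-- Hypothetical stage: freeze tokens that start with '#'.
freezeTags :: Tokenizer
freezeTags t
  | "#" `T.isPrefixOf` t = E [Left t]
  | otherwise            = E [Right t]

-- Hypothetical stage: upper-case whatever is still unfrozen.
shout :: Tokenizer
shout t = E [Right (T.toUpper t)]

-- Stages are chained with Kleisli composition.
pipeline :: Tokenizer
pipeline = splitWs >=> freezeTags >=> shout

-- Flatten the result by dropping the Left/Right tags.
runSketch :: Tokenizer -> Text -> [Text]
runSketch tok = map (either id id) . unE . tok

example :: [Text]
example = runSketch pipeline "keep #tag as is"
```

Under these assumptions, example evaluates to ["KEEP","#tag","AS","IS"]: the frozen token skips the upper-casing stage while its neighbours do not.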

tokenize :: Text -> [Text]

Split text into words using the default tokenizer pipeline.

run :: Tokenizer -> Text -> [Text]

Run a tokenizer on a Text and return the resulting tokens.

whitespace :: Tokenizer

Split text on whitespace. This is just a wrapper for Data.Text.words.

uris :: Tokenizer

Detect common URIs and freeze them (emit them as Left so that later tokenizers leave them unchanged).

punctuation :: Tokenizer

Split off initial and final punctuation

finalPunctuation :: Tokenizer

Split off word-final punctuation

initialPunctuation :: Tokenizer

Split off word-initial punctuation
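As a rough illustration of the word-initial and word-final splitting described above, the sketch below works on plain Text and returns a plain list. It is one plausible way to get this behaviour and is not taken from the package source; splitInitial, splitFinal and splitBoth are hypothetical names, and "punctuation" is assumed to mean Data.Char.isPunctuation.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isPunctuation)
import Data.Text (Text)
import qualified Data.Text as T

-- Split leading punctuation off a token; empty pieces are dropped.
splitInitial :: Text -> [Text]
splitInitial t = filter (not . T.null) [ps, rest]
  where
    (ps, rest) = T.span isPunctuation t

-- Split trailing punctuation off a token; empty pieces are dropped.
splitFinal :: Text -> [Text]
splitFinal t =
  filter (not . T.null)
    [ T.dropWhileEnd isPunctuation t
    , T.takeWhileEnd isPunctuation t
    ]

-- Trailing punctuation first, then leading punctuation on each piece.
splitBoth :: Text -> [Text]
splitBoth = concatMap splitInitial . splitFinal

example :: [Text]
example = splitBoth "(hello),"
```

Here example evaluates to ["(","hello","),"]. Word-internal punctuation, as in "well-known", is left alone by this approach.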

allPunctuation :: Tokenizer

Split tokens on transitions between punctuation and non-punctuation characters. This tokenizer is not included in the defaultTokenizer pipeline because dealing with word-internal punctuation is quite application-specific.
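To make the "transitions" idea concrete, here is a small sketch that groups adjacent characters by whether they are punctuation. It assumes Data.Char.isPunctuation as the test and returns a plain list of Text; splitAllPunct is a hypothetical name, and the package's own definition may differ.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isPunctuation)
import Data.Text (Text)
import qualified Data.Text as T

-- Start a new token whenever the punctuation status of the character changes.
splitAllPunct :: Text -> [Text]
splitAllPunct = T.groupBy (\a b -> isPunctuation a == isPunctuation b)

hyphenated :: [Text]
hyphenated = splitAllPunct "well-known!"

abbreviation :: [Text]
abbreviation = splitAllPunct "e.g."
```

hyphenated evaluates to ["well","-","known","!"], while abbreviation evaluates to ["e",".","g","."]. The second result shows why this kind of splitting suits some applications and not others.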

contractions :: Tokenizer

Split common contractions off and freeze them. Currently deals with: 'm, 's, 'd, 've, 'll

negatives :: Tokenizer

Split words ending in n't, and freeze n't
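The suffix handling described for contractions and negatives can be sketched as follows. This is an illustration under stated assumptions, not the package's code: it returns a bare list of Eithers instead of an EitherList, keeps the stem as Right (still processable) and the suffix as Left (frozen), and uses the hypothetical names splitSuffixes, contractionsSketch and negativesSketch.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Maybe (mapMaybe)
import Data.Text (Text)
import qualified Data.Text as T

-- Split the first matching suffix off a token.
-- The stem stays processable (Right); the suffix is frozen (Left).
splitSuffixes :: [Text] -> Text -> [Either Text Text]
splitSuffixes sfxs t =
  case mapMaybe try sfxs of
    (stem, sfx) : _ -> [Right stem, Left sfx]
    []              -> [Right t]
  where
    try sfx = fmap (\stem -> (stem, sfx)) (T.stripSuffix sfx t)

-- Suffix list taken from the contractions description above.
contractionsSketch :: Text -> [Either Text Text]
contractionsSketch = splitSuffixes ["'m", "'s", "'d", "'ve", "'ll"]

negativesSketch :: Text -> [Either Text Text]
negativesSketch = splitSuffixes ["n't"]
```

With these definitions, contractionsSketch "I'll" gives [Right "I", Left "'ll"], negativesSketch "don't" gives [Right "do", Left "n't"], and a token without a matching suffix, such as "cat", comes back as [Right "cat"].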