chatter-0.5.2.0: A library of simple NLP algorithms.

Safe Haskell: None
Language: Haskell2010

NLP.Tokenize.Annotations

Documentation

protectTerms :: [Text] -> CaseSensitive -> RawToken -> [RawToken]

Create a tokenizer that protects the provided terms, so that multi-word terms are kept together as single tokens.
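The idea of protecting terms can be sketched as follows. This is a minimal illustration, not the library's implementation: the `Tok` type is a hypothetical stand-in for `RawToken`, the helper `protect` works on `String` rather than `Text`, and it is always case-sensitive (the `CaseSensitive` argument is left out).

```haskell
import Data.List (isPrefixOf)

-- Hypothetical stand-in for RawToken (an assumption, not the library type):
-- a token is either frozen (never split again) or open to further splitting.
data Tok = Frozen String | Open String
  deriving (Eq, Show)

-- Scan left to right; wherever a protected term starts, emit it as a
-- frozen token and continue after it. The first matching term wins.
protect :: [String] -> Tok -> [Tok]
protect _ t@(Frozen _) = [t]
protect terms (Open s) = go "" s
  where
    -- acc holds the pending unprotected text in reverse order.
    go acc "" = flush acc
    go acc str@(c : cs) =
      case filter (`isPrefixOf` str) (filter (not . null) terms) of
        (t : _) -> flush acc ++ Frozen t : go "" (drop (length t) str)
        []      -> go (c : acc) cs
    flush ""  = []
    flush acc = [Open (reverse acc)]

main :: IO ()
main = print (protect ["New York"] (Open "I love New York pizza"))
-- prints [Open "I love ",Frozen "New York",Open " pizza"]
```

The sketch does no word-boundary checking, so the term "New York" would also match inside "New Yorker"; a real tokenizer would likely need to handle that case.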

whitespace :: RawToken -> [RawToken]

Tokenize on whitespace, as defined by '\ch -> Char.isSeparator ch || Char.isSpace ch'

contractions :: RawToken -> [RawToken]

Split common contractions off and freeze them. Currently handles: 'm, 's, 'd, 've, 'll, and negations (n't).
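A rough sketch of suffix-based contraction splitting, using the suffixes listed above. The `Tok` type and the `splitContraction` helper are assumptions made for illustration; the library operates on `RawToken` values, and its matching rules (for example, case handling) may differ.

```haskell
import Data.List (isSuffixOf)

-- Hypothetical stand-in for RawToken (an assumption, not the library type).
data Tok = Frozen String | Open String
  deriving (Eq, Show)

-- The contraction suffixes named in the documentation, lowercase only.
suffixes :: [String]
suffixes = ["n't", "'m", "'s", "'d", "'ve", "'ll"]

-- Split a trailing contraction off an open token and freeze it, so that
-- later tokenizers leave it alone. Frozen tokens pass through unchanged.
splitContraction :: Tok -> [Tok]
splitContraction t@(Frozen _) = [t]
splitContraction t@(Open w) =
  case [s | s <- suffixes, s `isSuffixOf` w, length s < length w] of
    (s : _) -> [Open (take (length w - length s) w), Frozen s]
    []      -> [t]

main :: IO ()
main = do
  print (splitContraction (Open "don't"))    -- [Open "do",Frozen "n't"]
  print (splitContraction (Open "they'll"))  -- [Open "they",Frozen "'ll"]
  print (splitContraction (Open "cat"))      -- [Open "cat"]
```

Freezing the suffix is what keeps a later punctuation-based tokenizer from breaking "n't" apart at the apostrophe.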

tokenizeOn :: (Char -> Bool) -> RawToken -> [RawToken]

Tokenize on characters that satisfy the provided predicate.
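Predicate-based splitting can be sketched as below, with whitespace tokenization expressed as one instance of it. The `Tok` type, `splitWhere`, and `onWhitespace` are hypothetical names for this illustration; the sketch also assumes that matching characters are dropped and that frozen tokens are never split, which the documentation does not state explicitly.

```haskell
import Data.Char (isSeparator, isSpace)

-- Hypothetical stand-in for RawToken (an assumption, not the library type).
data Tok = Frozen String | Open String
  deriving (Eq, Show)

-- Split an open token wherever the predicate holds, dropping the matching
-- characters. Frozen tokens pass through untouched.
splitWhere :: (Char -> Bool) -> Tok -> [Tok]
splitWhere _ t@(Frozen _) = [t]
splitWhere p (Open s) = go s
  where
    go str = case dropWhile p str of
      ""   -> []
      rest -> let (tok, more) = break p rest
              in Open tok : go more

-- Whitespace splitting, using the predicate quoted in the documentation.
onWhitespace :: Tok -> [Tok]
onWhitespace = splitWhere (\ch -> isSeparator ch || isSpace ch)

main :: IO ()
main = do
  print (onWhitespace (Open "a  b\tc"))            -- [Open "a",Open "b",Open "c"]
  print (splitWhere (== '-') (Open "e-mail"))      -- [Open "e",Open "mail"]
  print (splitWhere (== '-') (Frozen "e-mail"))    -- [Frozen "e-mail"]
```

Because each step has the shape token-to-list-of-tokens, steps of this kind can be chained with `concatMap`, for example `concatMap onWhitespace . splitWhere (== '-')`.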