glider-nlp-0.4: Natural Language Processing library

Copyright: Copyright (C) 2013-2014 Krzysztof Langner
License: BSD3
Maintainer: Krzysztof Langner <klangner@gmail.com>
Stability: alpha
Portability: portable
Safe Haskell: Safe
Language: Haskell2010

Glider.NLP.Tokenizer

Description

This module contains functions that parse text into tokens. Tokens are not normalized. If you need all tokens from a document, use the function "tokenize". If you need only words (no dots, numbers, etc.), use the function "getWords" on the resulting tokens.

Documentation

data Token

Token type

Instances

Eq Token

Methods

(==) :: Token -> Token -> Bool

(/=) :: Token -> Token -> Bool

Show Token

Methods

showsPrec :: Int -> Token -> ShowS

show :: Token -> String

showList :: [Token] -> ShowS

foldCase :: [Text] -> [Text]

Convert all words to the same case
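
The description does not say which case the words are folded to. As a minimal sketch, assuming lower case is the target, the behaviour could look like the following. The name foldCaseSketch is hypothetical and is not part of the library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical stand-in for foldCase.
-- Assumption: "the same case" means lower case.
foldCaseSketch :: [Text] -> [Text]
foldCaseSketch = map T.toLower

main :: IO ()
main = print (foldCaseSketch ["One", "TWO", "three"])
-- prints ["one","two","three"]
```

Folding case after tokenization keeps the tokens themselves unnormalized, which matches the module description.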

getWords :: [Token] -> [Text]

Extract all words from tokens

getWords (tokenize "one two.") == ["one", "two"]

tokenize :: Text -> [Token]

Split text into tokens

tokenize "one two." == [Word "one", Whitespace, Word "two", Separator "."]
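
To illustrate how tokenizing and word extraction fit together, here is a self-contained sketch that does not use the library. The type Tok, its constructors, and the functions tokenizeSketch and wordsSketch are all hypothetical; the library's actual Token constructors and splitting rules may differ.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isAlpha, isDigit, isSpace)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical token type; the library's real constructors may differ.
data Tok = TWord Text | TNumber Text | TSpace | TOther Text
  deriving (Eq, Show)

-- Split text into tokens by looking at the class of the first character.
tokenizeSketch :: Text -> [Tok]
tokenizeSketch t = case T.uncons t of
  Nothing -> []
  Just (c, _)
    | isAlpha c -> chunk isAlpha TWord
    | isDigit c -> chunk isDigit TNumber
    | isSpace c -> chunk isSpace (const TSpace)
    | otherwise -> TOther (T.singleton c) : tokenizeSketch (T.drop 1 t)
  where
    -- Consume the longest prefix matching p and wrap it with mk.
    chunk p mk = let (h, rest) = T.span p t in mk h : tokenizeSketch rest

-- Keep only the word tokens, mirroring what getWords is described to do.
wordsSketch :: [Tok] -> [Text]
wordsSketch toks = [w | TWord w <- toks]

main :: IO ()
main = do
  let toks = tokenizeSketch "one two."
  print toks               -- [TWord "one",TSpace,TWord "two",TOther "."]
  print (wordsSketch toks) -- ["one","two"]
```

Note that the word extractor works on the token list, not on raw text, so the text has to be tokenized first.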

wordParser :: Parser

Parse word

numberParser :: Parser

Parse number

punctuationParser :: Parser

Parse punctuation

symbolParser :: Parser

Parse symbol

spaceParser :: Parser

Parse whitespace

allParser :: Parser

Apply all parsers to the input. Return the result of the first parser that successfully parses the given text.
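
The "first parser that succeeds" strategy can be sketched without the library. The page does not show how Parser is defined, so the type P below, along with runOf and firstOf, is a hypothetical simplification over String and not the library's implementation.

```haskell
import Data.Char (isAlpha, isDigit)
import Data.Maybe (listToMaybe, mapMaybe)

-- Hypothetical parser type: on success, return the consumed text
-- and the remaining input.
type P = String -> Maybe (String, String)

-- Build a parser that consumes a non-empty run of characters satisfying p.
runOf :: (Char -> Bool) -> P
runOf p s = case span p s of
  ([], _)   -> Nothing
  (h, rest) -> Just (h, rest)

-- Try each parser in order and keep the first successful result.
firstOf :: [P] -> P
firstOf ps s = listToMaybe (mapMaybe ($ s) ps)

main :: IO ()
main = do
  let p = firstOf [runOf isAlpha, runOf isDigit]
  print (p "abc123")    -- Just ("abc","123")
  print (p "42 apples") -- Just ("42"," apples")
  print (p "!?")        -- Nothing
```

Because the list is evaluated lazily, parsers after the first success are never run, so the order of the list decides which parser wins when several could match.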