Portability	portable
Stability	alpha
Maintainer	Krzysztof Langner <klangner@gmail.com>
Safe Haskell	Safe-Inferred

Glider.NLP.Tokenizer

Description

This module contains functions that parse text into tokens. Tokens are not normalized. To get every token in a document, use tokenize. To get only the words (no dots, numbers, etc.), use getWords.

Documentation

data Token

Token type

Constructors

Word Text
Number Text
Punctuation Char
Symbol Char
Whitespace
Unknown Char

Instances

Eq Token
Show Token
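
As a reading aid, the constructors and instances listed above correspond to a declaration along the following lines. This is a sketch reconstructed from this page, not the module's source; `describe` is a hypothetical helper added only to show pattern matching on each constructor.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Constructors as listed on this page; the deriving clause is an
-- assumption that matches the documented Eq and Show instances.
data Token
  = Word Text
  | Number Text
  | Punctuation Char
  | Symbol Char
  | Whitespace
  | Unknown Char
  deriving (Eq, Show)

-- Hypothetical helper (not part of Glider.NLP.Tokenizer):
-- a short human-readable label for each kind of token.
describe :: Token -> String
describe (Word w)        = "word " ++ T.unpack w
describe (Number n)      = "number " ++ T.unpack n
describe (Punctuation c) = "punctuation " ++ [c]
describe (Symbol c)      = "symbol " ++ [c]
describe Whitespace      = "whitespace"
describe (Unknown c)     = "unknown " ++ [c]
```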

foldCase :: [Text] -> [Text]

Convert all words to the same case
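
The page does not say which case is used. A minimal sketch, assuming Unicode case folding via `Data.Text.toCaseFold`, could look like this; `foldCaseSketch` is a hypothetical name and the real function may simply lowercase instead.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Assumption: "the same case" is interpreted as Unicode case folding.
-- The module's actual foldCase may use a different conversion.
foldCaseSketch :: [Text] -> [Text]
foldCaseSketch = map T.toCaseFold
```

Folding before comparison lets "One" and "ONE" be treated as the same word.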

getWords :: [Token] -> [Text]

Extract all words from a list of tokens

 getWords (tokenize "one two.") == ["one", "two"]
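
One plausible way to write such a filter is a list comprehension that keeps only the `Word` payloads. The sketch below redeclares the token type from this page so that it compiles on its own; `getWordsSketch` and `exampleWords` are hypothetical names, not the module's implementation.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Local copy of the token type documented on this page.
data Token
  = Word Text
  | Number Text
  | Punctuation Char
  | Symbol Char
  | Whitespace
  | Unknown Char
  deriving (Eq, Show)

-- Keep the text of every Word token; tokens that fail the pattern
-- match (numbers, punctuation, whitespace, ...) are skipped.
getWordsSketch :: [Token] -> [Text]
getWordsSketch tokens = [w | Word w <- tokens]

-- Hand-built token list standing in for the output of tokenize.
exampleWords :: [Text]
exampleWords =
  getWordsSketch
    [Word (T.pack "one"), Whitespace, Word (T.pack "two"), Punctuation '.']
```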

tokenize :: Text -> [Token]

Split text into tokens

 tokenize "one two." == [Word "one", Whitespace, Word "two", Punctuation '.']
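
To make the behaviour concrete, here is a rough sketch of a tokenizer that classifies characters with the `Data.Char` predicates. It is not the module's implementation: the classification rules, the collapsing of a whitespace run into a single `Whitespace` token, and the name `tokenizeSketch` are all assumptions.

```haskell
import Data.Char (isAlpha, isDigit, isPunctuation, isSpace, isSymbol)
import Data.Text (Text)
import qualified Data.Text as T

-- Local copy of the token type documented on this page.
data Token
  = Word Text
  | Number Text
  | Punctuation Char
  | Symbol Char
  | Whitespace
  | Unknown Char
  deriving (Eq, Show)

-- Look at the first character, consume the longest matching run
-- (for words, numbers and whitespace) or a single character
-- (for everything else), then continue with the remainder.
tokenizeSketch :: Text -> [Token]
tokenizeSketch input = case T.uncons input of
  Nothing -> []
  Just (c, rest)
    | isAlpha c ->
        let (w, r) = T.span isAlpha input in Word w : tokenizeSketch r
    | isDigit c ->
        let (n, r) = T.span isDigit input in Number n : tokenizeSketch r
    | isSpace c       -> Whitespace : tokenizeSketch (T.dropWhile isSpace rest)
    | isPunctuation c -> Punctuation c : tokenizeSketch rest
    | isSymbol c      -> Symbol c : tokenizeSketch rest
    | otherwise       -> Unknown c : tokenizeSketch rest
```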

wordParser :: Parser

Parse word

numberParser :: Parser

Parse number

punctuationParser :: Parser

Parse punctuation

symbolParser :: Parser

Parse symbol

spaceParser :: Parser

Parse whitespace

allParser :: Parser

Try each parser on the input in turn and return the result of the first one that succeeds.
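
This page does not show the definition of `Parser`, so the following is only an illustration of the "first parser that succeeds" idea. It assumes a hypothetical parser shape, `Text -> Maybe (Token, Text)`, and the names `SketchParser`, `spanParser`, `wordP`, `numberP` and `firstOf` are invented for the example.

```haskell
import Data.Char (isAlpha, isDigit)
import Data.Foldable (asum)
import Data.Text (Text)
import qualified Data.Text as T

-- Local copy of the token type documented on this page.
data Token
  = Word Text
  | Number Text
  | Punctuation Char
  | Symbol Char
  | Whitespace
  | Unknown Char
  deriving (Eq, Show)

-- Assumed parser shape: on success, return a token and the unconsumed input.
type SketchParser = Text -> Maybe (Token, Text)

-- Build a parser that consumes a non-empty run of matching characters.
spanParser :: (Char -> Bool) -> (Text -> Token) -> SketchParser
spanParser ok mk input =
  let (match, rest) = T.span ok input
  in if T.null match then Nothing else Just (mk match, rest)

wordP, numberP :: SketchParser
wordP   = spanParser isAlpha Word
numberP = spanParser isDigit Number

-- Try the parsers from left to right; asum on Maybe returns the first Just
-- and does not evaluate the parsers after it.
firstOf :: [SketchParser] -> SketchParser
firstOf parsers input = asum [p input | p <- parsers]
```

With this shape, `firstOf [wordP, numberP]` applied to "42x" skips the word parser, which fails on a digit, and returns the number token together with the remaining "x".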