glider-nlp-0.1: Natural Language Processing library

Portability: portable
Stability: alpha
Maintainer: Krzysztof Langner <klangner@gmail.com>
Safe Haskell: Safe-Inferred

Glider.NLP.Tokenizer

Description

This module contains functions that parse text into tokens. Tokens are not normalized. If you need all tokens from the document, use the function tokenize. If you need only words (no dots, numbers, etc.), use the function getWords.

Documentation

data Token

Token type

foldCase :: [Text] -> [Text]

Convert all words to the same case
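
As a rough illustration of what case folding over a word list can look like, here is a minimal sketch. The name `foldCaseSketch` is hypothetical, and the use of `Data.Text.toCaseFold` is an assumption; the library may simply lowercase instead.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical stand-in for foldCase: bring every word to a common case.
-- Using toCaseFold (rather than toLower) is an assumption of this sketch.
foldCaseSketch :: [Text] -> [Text]
foldCaseSketch = map T.toCaseFold

main :: IO ()
main = print (foldCaseSketch ["One", "TWO", "three"])
-- prints ["one","two","three"]
```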

getWords :: [Token] -> [Text]

Extract all words from tokens

 getWords (tokenize "one two.") == ["one", "two"] 
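
Extracting words amounts to filtering a token list for word tokens. The sketch below assumes a simplified, hypothetical `Token` type that models only the constructors named in the examples on this page; the library's real type may have more constructors, and `getWordsSketch` is not the library's implementation.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)

-- Hypothetical token type, modeling only Word, Separator and Whitespace.
data Token
  = Word Text
  | Separator Text
  | Whitespace
  deriving (Eq, Show)

-- Keep the text of every Word token and drop everything else.
getWordsSketch :: [Token] -> [Text]
getWordsSketch tokens = [w | Word w <- tokens]

main :: IO ()
main = print (getWordsSketch [Word "one", Whitespace, Word "two", Separator "."])
-- prints ["one","two"]
```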

tokenize :: Text -> [Token]

Split text into tokens

 tokenize "one two." == [Word "one", Whitespace, Word "two", Separator "."] 
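
To make the shape of the result concrete, here is a hand-rolled sketch that reproduces the example above without any parser library. The `Token` type and `tokenizeSketch` are hypothetical stand-ins. The classification rules are assumptions: letters form words, runs of whitespace collapse into one `Whitespace` token, and every other character becomes a single-character `Separator`. Numbers and symbols are not modeled.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Char (isAlpha, isSpace)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical token type, modeling only the constructors from the example.
data Token
  = Word Text
  | Separator Text
  | Whitespace
  deriving (Eq, Show)

-- Walk the input left to right, emitting one token per step.
tokenizeSketch :: Text -> [Token]
tokenizeSketch input = case T.uncons input of
  Nothing -> []
  Just (c, rest)
    | isAlpha c ->
        -- Take the longest run of letters as one word.
        let (w, remaining) = T.span isAlpha input
        in Word w : tokenizeSketch remaining
    | isSpace c ->
        -- Collapse a run of whitespace into a single token.
        Whitespace : tokenizeSketch (T.dropWhile isSpace rest)
    | otherwise ->
        -- Anything else is treated as a one-character separator.
        Separator (T.singleton c) : tokenizeSketch rest

main :: IO ()
main = print (tokenizeSketch "one two.")
-- prints [Word "one",Whitespace,Word "two",Separator "."]
```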

wordParser :: Parser

Parse word

numberParser :: Parser

Parse number

punctuationParser :: Parser

Parse punctuation

symbolParser :: Parser

Parse symbol

spaceParser :: Parser

Parse whitespace

allParser :: Parser

Apply all parsers to the input. Return the result of the first parser that successfully parses the given text.
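
The "first parser that succeeds" strategy can be sketched with a small hand-rolled parser type. Everything here is hypothetical: `ParserSketch` is a stand-in for the library's `Parser` type (whose definition is not shown on this page), the `Number` constructor is assumed, and only three of the parsers are modeled. Alternatives are tried in order with `<|>` on `Maybe`, so the first `Just` wins.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Applicative ((<|>))
import Data.Char (isAlpha, isDigit, isSpace)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical token type; the Number constructor is an assumption.
data Token
  = Word Text
  | Number Text
  | Whitespace
  deriving (Eq, Show)

-- Hypothetical parser type: on success, return a token and the unconsumed input.
type ParserSketch = Text -> Maybe (Token, Text)

-- Build a parser that consumes a non-empty run of characters matching p.
spanParser :: (Char -> Bool) -> (Text -> Token) -> ParserSketch
spanParser p mk input
  | T.null matched = Nothing
  | otherwise = Just (mk matched, rest)
  where
    (matched, rest) = T.span p input

wordSketch, numberSketch, spaceSketch :: ParserSketch
wordSketch = spanParser isAlpha Word
numberSketch = spanParser isDigit Number
spaceSketch = spanParser isSpace (const Whitespace)

-- Try each parser in order and return the first successful result.
allSketch :: ParserSketch
allSketch input = wordSketch input <|> numberSketch input <|> spaceSketch input

main :: IO ()
main = do
  print (allSketch "42 apples")  -- prints Just (Number "42"," apples")
  print (allSketch "?")          -- prints Nothing
```

Because the alternatives are tried left to right, the order matters whenever two parsers could accept the same prefix.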