Copyright	Copyright (C) 2013-2014 Krzysztof Langner
License	BSD3
Maintainer	Krzysztof Langner <klangner@gmail.com>
Stability	alpha
Portability	portable
Safe Haskell	Safe
Language	Haskell2010

Glider.NLP.Tokenizer

Description

This module contains functions that parse text into tokens. Tokens are not normalized. If you need all tokens from a document, use the function "tokenize". If you need only the words (no punctuation, numbers, etc.), pass the tokens to the function "getWords".

Documentation

data Token

Token type

Constructors

Word Text
Number Text
Punctuation Char
Symbol Char
Whitespace
Unknown Char

Instances

Eq Token
Methods
(==) :: Token -> Token -> Bool
(/=) :: Token -> Token -> Bool
Show Token
Methods
showsPrec :: Int -> Token -> ShowS
show :: Token -> String
showList :: [Token] -> ShowS
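
The constructors above can be consumed by ordinary pattern matching. The following is a minimal sketch, not code from the library: it declares a local mirror of the documented Token type so that it compiles on its own, and the helper tokenText is a hypothetical name introduced only for illustration. Rendering Whitespace as a single space is an assumption, since that constructor carries no payload.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Local mirror of the documented Token type, so the sketch is self-contained.
-- In real code you would import it from Glider.NLP.Tokenizer instead.
data Token
  = Word Text
  | Number Text
  | Punctuation Char
  | Symbol Char
  | Whitespace
  | Unknown Char
  deriving (Eq, Show)

-- Hypothetical helper: the text a token stands for.
-- Assumption: Whitespace is rendered as one space.
tokenText :: Token -> Text
tokenText (Word w)        = w
tokenText (Number n)      = n
tokenText (Punctuation c) = T.singleton c
tokenText (Symbol c)      = T.singleton c
tokenText Whitespace      = T.pack " "
tokenText (Unknown c)     = T.singleton c

main :: IO ()
main = do
  let tokens = [Word (T.pack "one"), Whitespace, Number (T.pack "2"), Punctuation '.']
  -- Show instance: prints the constructors as written.
  print tokens
  -- Eq instance: structural comparison of constructors and payloads.
  print (Whitespace == Whitespace)
  -- Concatenate the rendered tokens back into one Text value.
  print (T.concat (map tokenText tokens))
```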

foldCase :: [Text] -> [Text]

Convert all words to the same case
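
The documentation does not say which case the words are converted to. As a hedged illustration only, the sketch below assumes lower case and uses Data.Text.toLower. The name foldCaseSketch is hypothetical and the library's actual behaviour may differ.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Sketch only. Assumption: "the same case" means lower case.
foldCaseSketch :: [Text] -> [Text]
foldCaseSketch = map T.toLower

main :: IO ()
main = print (foldCaseSketch (map T.pack ["One", "TWO", "three"]))
-- prints ["one","two","three"]
```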

getWords :: [Token] -> [Text]

Extract all words from tokens

getWords (tokenize "one two.") == ["one", "two"]
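
One way to picture this extraction is a list comprehension that keeps only the Word payloads. This is a sketch under assumptions, not the library's source: Token is mirrored locally and getWordsSketch is a hypothetical name.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Local mirror of the documented Token type, so the sketch is self-contained.
data Token
  = Word Text
  | Number Text
  | Punctuation Char
  | Symbol Char
  | Whitespace
  | Unknown Char
  deriving (Eq, Show)

-- Keep the payload of every Word token and drop everything else.
-- Tokens that fail the pattern match are skipped by the comprehension.
getWordsSketch :: [Token] -> [Text]
getWordsSketch tokens = [w | Word w <- tokens]

main :: IO ()
main = print (getWordsSketch
  [Word (T.pack "one"), Whitespace, Word (T.pack "two"), Punctuation '.'])
-- prints ["one","two"]
```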

tokenize :: Text -> [Token]

Split text into tokens

tokenize "one two." == [Word "one", Whitespace, Word "two", Punctuation '.']
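
To make the token classes concrete, here is a simplified stand-alone tokenizer. It is a sketch, not the library's implementation: the classification relies on the Data.Char predicates, a run of whitespace is assumed to collapse into a single Whitespace token, and tokenizeSketch is a hypothetical name.

```haskell
import Data.Char (isAlpha, isDigit, isPunctuation, isSpace, isSymbol)
import Data.Text (Text)
import qualified Data.Text as T

-- Local mirror of the documented Token type, so the sketch is self-contained.
data Token
  = Word Text
  | Number Text
  | Punctuation Char
  | Symbol Char
  | Whitespace
  | Unknown Char
  deriving (Eq, Show)

-- Classify the input by its first character and recurse on the remainder.
-- Every branch consumes at least one character, so the recursion terminates.
tokenizeSketch :: Text -> [Token]
tokenizeSketch input = case T.uncons input of
  Nothing -> []
  Just (c, rest)
    | isAlpha c ->
        let (w, r) = T.span isAlpha input in Word w : tokenizeSketch r
    | isDigit c ->
        let (n, r) = T.span isDigit input in Number n : tokenizeSketch r
    -- Assumption: consecutive whitespace yields one Whitespace token.
    | isSpace c       -> Whitespace : tokenizeSketch (T.dropWhile isSpace rest)
    | isPunctuation c -> Punctuation c : tokenizeSketch rest
    | isSymbol c      -> Symbol c : tokenizeSketch rest
    | otherwise       -> Unknown c : tokenizeSketch rest

main :: IO ()
main = print (tokenizeSketch (T.pack "one two."))
```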

wordParser :: Parser

Parse word

numberParser :: Parser

Parse number

punctuationParser :: Parser

Parse punctuation

symbolParser :: Parser

Parse symbol

spaceParser :: Parser

Parse whitespace

allParser :: Parser

Try each parser on the input in turn. Return the result of the first one that successfully parses the given text.
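
The first-success behaviour can be illustrated with a small stand-alone sketch. The library's Parser type is not shown in this documentation, so the representation below (a function from Text to an optional token plus the remaining input) is an assumption, as are the names ParserSketch, wordP, numberP and firstOf. The Token type is a reduced local mirror.

```haskell
import Data.Char (isAlpha, isDigit)
import Data.Foldable (asum)
import Data.Text (Text)
import qualified Data.Text as T

-- Reduced local mirror of the documented Token type.
data Token = Word Text | Number Text
  deriving (Eq, Show)

-- Hypothetical parser representation: on success, return the token
-- and the unconsumed rest of the input; on failure, return Nothing.
type ParserSketch = Text -> Maybe (Token, Text)

-- Build a parser that consumes a non-empty run of matching characters.
runOf :: (Char -> Bool) -> (Text -> Token) -> ParserSketch
runOf p mk input =
  let (match, rest) = T.span p input
  in if T.null match then Nothing else Just (mk match, rest)

wordP :: ParserSketch
wordP = runOf isAlpha Word

numberP :: ParserSketch
numberP = runOf isDigit Number

-- Try the parsers in list order and keep the first success.
-- asum over Maybe returns the first Just, or Nothing if all fail.
firstOf :: [ParserSketch] -> ParserSketch
firstOf parsers input = asum [p input | p <- parsers]

main :: IO ()
main = do
  print (firstOf [wordP, numberP] (T.pack "42 apples"))
  print (firstOf [wordP, numberP] (T.pack "!"))
```

Because the parsers are tried in order, the position of a parser in the list decides which one wins when more than one could match the same prefix.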