Copyright | Copyright (C) 2013-2014 Krzysztof Langner |
---|---|
License | BSD3 |
Maintainer | Krzysztof Langner <klangner@gmail.com> |
Stability | alpha |
Portability | portable |
Safe Haskell | Safe |
Language | Haskell2010 |
Glider.NLP.Tokenizer
Description
This module contains functions that parse text into tokens. Tokens are not normalized. If you need every token in the document, use the function "tokenize". If you need only the words (no punctuation, numbers, etc.), pass the tokens to the function "getWords".
- data Token
- foldCase :: [Text] -> [Text]
- getWords :: [Token] -> [Text]
- tokenize :: Text -> [Token]
- wordParser :: Parser
- numberParser :: Parser
- punctuationParser :: Parser
- symbolParser :: Parser
- spaceParser :: Parser
- allParser :: Parser
Documentation
data Token
Token type
getWords :: [Token] -> [Text]
Extract all words from tokens
getWords (tokenize "one two.") == ["one", "two"]
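The word extraction described above amounts to filtering a token list. The following is a minimal, self-contained sketch of that idea, not the library's implementation: the `Token` type is a stand-in that contains only the constructors shown in the examples on this page, and `extractWords` is a hypothetical name chosen to avoid confusion with the real `getWords`.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

-- Stand-in token type (assumption): only the constructors that appear
-- in the examples on this page, not the library's full definition.
data Token
  = Word Text
  | Whitespace
  | Separator Text
  deriving (Eq, Show)

-- Keep the text of every Word token and drop all other tokens.
-- A failed pattern match in the list comprehension skips the element.
extractWords :: [Token] -> [Text]
extractWords tokens = [w | Word w <- tokens]

main :: IO ()
main = print (extractWords [Word "one", Whitespace, Word "two", Separator "."])
```

Running the sketch prints `["one","two"]`: the whitespace and separator tokens are discarded and the word order is preserved.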
tokenize :: Text -> [Token]
Split text into tokens
tokenize "one two." == [Word "one", Whitespace, Word "two", Separator "."]
wordParser :: Parser
Parse word
numberParser :: Parser
Parse number
punctuationParser :: Parser
Parse punctuation
symbolParser :: Parser
Parse symbol
spaceParser :: Parser
Parse whitespace
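This page does not show the definition of `Parser`, so the following sketch only illustrates one way that per-category parsers of this kind could be combined into a tokenizer. Everything in it is an assumption made for illustration: the `SketchParser` type, the helper names (`spanParser`, `firstOf`, `tokenizeSketch`), the `Number` and `Unknown` constructors, and the character classes used for each category. It does not reproduce the library's behaviour.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isAlpha, isDigit, isPunctuation, isSpace)
import Data.Foldable (asum)
import Data.Text (Text)
import qualified Data.Text as T

-- Stand-in token type (assumption); the library's constructors may differ.
data Token
  = Word Text
  | Number Text
  | Separator Text
  | Whitespace
  | Unknown Char
  deriving (Eq, Show)

-- Hypothetical parser shape: consume a prefix of the input and return
-- the token plus the remaining text, or Nothing if nothing matches.
type SketchParser = Text -> Maybe (Token, Text)

-- Build a parser that consumes the longest non-empty prefix whose
-- characters all satisfy the predicate.
spanParser :: (Char -> Bool) -> (Text -> Token) -> SketchParser
spanParser p mk input =
  let (matched, rest) = T.span p input
  in if T.null matched then Nothing else Just (mk matched, rest)

wordP, numberP, punctuationP, spaceP :: SketchParser
wordP        = spanParser isAlpha Word
numberP      = spanParser isDigit Number
punctuationP = spanParser isPunctuation Separator
spaceP       = spanParser isSpace (const Whitespace)

-- Try each parser in order and keep the first success.
firstOf :: [SketchParser] -> SketchParser
firstOf parsers input = asum (map ($ input) parsers)

-- Apply the combined parser repeatedly until the input is empty.
-- A character that no parser accepts becomes an Unknown token, so the
-- loop always consumes at least one character per step.
tokenizeSketch :: Text -> [Token]
tokenizeSketch input = case T.uncons input of
  Nothing -> []
  Just (c, rest) ->
    case firstOf [wordP, numberP, punctuationP, spaceP] input of
      Just (tok, remaining) -> tok : tokenizeSketch remaining
      Nothing               -> Unknown c : tokenizeSketch rest

main :: IO ()
main = print (tokenizeSketch "one two.")
```

Because every successful parser consumes a non-empty prefix, the order of the list passed to `firstOf` only matters when character classes overlap. In this sketch the classes are disjoint, so the order is a matter of style.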