Copyright | (c) Vjeran Crnjak, 2014 |
---|---|
License | BSD3 |
Maintainer | vjeran.crnjak@gmail.com |
Stability | experimental |
Portability | portable |
Safe Haskell | None |
Language | Haskell2010 |
Implementation of a space-efficient morphosyntactic analyzer.
It solves the problem of providing a set of possible tags for a given word. Rather than relying only on exact matches between a word and its tag set, it assumes that the suffixes of an unknown word also carry information about that set.
This library provides that kind of analysis. One example of where it might be useful is the concraft tagging library: before POS tagging, one needs a set of possible tags for each word, from which the correct one is then disambiguated.
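The suffix idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the library's actual data structure or algorithm: the names ToyAnalyzer, buildToy, properSuffixes and toyTags are hypothetical, tags are plain Text values, and the fallback simply takes the longest known suffix.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S
import qualified Data.Text as T
import Data.Text (Text)

-- | Hypothetical toy model (not the library's Analyzer): tag sets indexed
-- by whole words and, separately, by suffixes of those words.
data ToyAnalyzer = ToyAnalyzer
  { knownWords :: M.Map Text (S.Set Text)
  , suffixes   :: M.Map Text (S.Set Text)
  , minSuffix  :: Int  -- smallest suffix length that is still indexed
  }

-- | Proper suffixes of a word with at least n characters, longest first.
properSuffixes :: Int -> Text -> [Text]
properSuffixes n w = [ s | s <- drop 1 (T.tails w), T.length s >= n ]

-- | Build the toy model from (word, tags) pairs.
buildToy :: Int -> [(Text, [Text])] -> ToyAnalyzer
buildToy n corpus = ToyAnalyzer ws sfx n
  where
    ws  = M.fromListWith S.union [ (w, S.fromList ts) | (w, ts) <- corpus ]
    sfx = M.fromListWith S.union
            [ (s, S.fromList ts) | (w, ts) <- corpus, s <- properSuffixes n w ]

-- | Exact match first; otherwise the tag set of the longest known suffix.
-- The result is empty when neither the word nor any suffix is known.
toyTags :: ToyAnalyzer -> Text -> S.Set Text
toyTags an w =
  case M.lookup w (knownWords an) of
    Just ts -> ts
    Nothing ->
      case [ ts | s <- properSuffixes (minSuffix an) w
                , Just ts <- [M.lookup s (suffixes an)] ] of
        (ts : _) -> ts
        []       -> S.empty
```

With a toy corpus containing "walking" tagged VERB and "talking" tagged VERB and NOUN, and a minimum suffix length of 3, the unseen word "jumping" falls back to the suffix "ing" and receives both tags, while a word with no known suffix gets an empty set.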
For a sufficiently large construction corpus, this analyzer might only benefit from additional regular expressions for punctuation and number matching. The returned set of possible tags may be incomplete, that is, it may not contain the correct tag. If the construction corpus isn't sufficiently large, there might be a fair number of incomplete sets for unseen named entities (person names, corporation names, etc.).
If one needs the analyzer to be less aggressive, it is recommended to extend the functionality and remove the sets of possible tags from words that might be names (e.g. capitalized words in the middle of a sentence). This matters mostly in use cases where the part-of-speech tags of a language encode whether a word represents a named entity; if this is not the case, there is no need to extend the current functionality.
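One possible reading of this recommendation is a thin wrapper applied at lookup time. The sketch below is hypothetical and not part of the library: cautiousTags and startsUpper are illustrative names, and the lookup function is passed in as a parameter (with this library it could be getTags applied to an analyzer).

```haskell
import Data.Char (isUpper)
import qualified Data.Set as S
import qualified Data.Text as T
import Data.Text (Text)

-- | True when the token begins with an uppercase character.
startsUpper :: Text -> Bool
startsUpper tok = case T.uncons tok of
  Just (c, _) -> isUpper c
  Nothing     -> False

-- | Tag the tokens of one sentence with a caller-supplied lookup, but
-- return an empty tag set for capitalized tokens that are not
-- sentence-initial, leaving them to a later processing stage.
cautiousTags :: (Text -> S.Set tag) -> [Text] -> [(Text, S.Set tag)]
cautiousTags _ [] = []
cautiousTags lookupTags (first : rest) =
  (first, lookupTags first) : map tagLater rest
  where
    tagLater tok
      | startsUpper tok = (tok, S.empty)
      | otherwise       = (tok, lookupTags tok)
```

The first token is always looked up, because sentence-initial capitalization says nothing about whether the word is a name.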
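A simple example of using GHCi for construction:

    :set -XOverloadedStrings
    -- T.words, T.lines and T.null come from Data.Text; T.readFile from Data.Text.IO.
    import qualified Data.Text as T
    import qualified Data.Text.IO as T
    import qualified Data.Tagset.Positional as P
    -- Assumption: M in the AConf call below refers to Data.Map.
    import qualified Data.Map as M
    -- create, AConf and save are exported by this module, which must also be in scope.
    f <- readFile "tagset.cfg"
    let tset = P.parseTagset "tagset1" f
    -- Each non-empty line of the dictionary holds a word followed by its tags.
    f <- T.readFile "fulldict.txt"
    let train = map (\(word:tags) -> (word, map (P.parseTag tset) tags)) . map T.words . filter (not . T.null) . T.lines $ f
    let an = create tset (AConf 3 [] M.empty) train
    save "analyzer.gz" an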
parseTag assumes that tag attributes are separated with a colon (:). For other tag formats, one could write a different parsing function.
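The lambda in the construction example only matches non-empty word lists, so a line containing nothing but whitespace would pass the T.null filter and then fail the pattern match. A more defensive way to split the dictionary could look like the following sketch. It assumes the same format as the example (a word followed by whitespace-separated tags on each line); parseEntry and parseEntries are hypothetical helper names, and the raw tag strings would still need to be converted with P.parseTag tset afterwards.

```haskell
import Data.Maybe (mapMaybe)
import qualified Data.Text as T
import Data.Text (Text)

-- | Split one dictionary line into a word and its raw tag strings.
-- Returns Nothing for empty or whitespace-only lines.
parseEntry :: Text -> Maybe (Text, [Text])
parseEntry line = case T.words line of
  (word : tags) -> Just (word, tags)
  []            -> Nothing

-- | Parse a whole dictionary file, silently skipping blank lines.
parseEntries :: Text -> [(Text, [Text])]
parseEntries = mapMaybe parseEntry . T.lines
```

In the GHCi session this would replace the filter, T.words and lambda steps, for example: let train = [ (w, map (P.parseTag tset) ts) | (w, ts) <- parseEntries f ].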
Model
elem :: Text -> Analyzer -> Bool
Checks whether a word is in the analyzer. If it is, the set of tags returned by getTags will be non-empty.
getTags :: Analyzer -> Text -> Set Tag
Gives the set of possible tags for a given word. The returned set may be empty.
save :: FilePath -> Analyzer -> IO ()
Saves the analyzer to a file. The data is compressed using the gzip format.
create
:: Tagset | Tagset used in the construction corpus. |
-> AConf | Configuration of the analyzer. |
-> [(Text, [Tag])] | Construction corpus. |
-> Analyzer | Morphological analyzer. |
Creates a morphological analyzer from a tagset, a configuration (which carries the smallest suffix length and a list of regular expressions for additional matching) and a construction corpus.
Token matching
Removes the need to write regular expressions for simple matching. Tokens can be matched as punctuation, numbers, alphanumeric or upper-case and lower-case tokens, or against a regular expression.
Punct | Matches a token consisting only of punctuation characters. |
Number | Matches a token consisting only of Unicode numeral characters. |
AlphaNum | Matches a token consisting only of alphanumeric characters. |
AnyUpper | Matches a token with at least one uppercase character. |
AllUpper | Matches a token consisting only of uppercase characters. |
AnyLower | Matches a token with at least one lowercase character. |
AllLower | Matches a token consisting only of lowercase characters. |
Capital | Matches a capitalized token. |
RegExpr Text | Matches on a regular expression. |
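To make the descriptions above concrete, here is a minimal sketch of how such matchers could be expressed with Data.Char predicates. It is an illustration only, not the library's implementation: TokenClass and matches are hypothetical names, RegExpr is left out to avoid a regex dependency, empty tokens are assumed to match nothing, and "capitalized" is assumed to mean an uppercase first character followed only by lowercase characters.

```haskell
import Data.Char (isAlphaNum, isLower, isNumber, isPunctuation, isUpper)
import qualified Data.Text as T
import Data.Text (Text)

-- | Hypothetical stand-in for the matcher constructors listed above
-- (RegExpr is omitted in this sketch).
data TokenClass
  = Punct | Number | AlphaNum
  | AnyUpper | AllUpper | AnyLower | AllLower
  | Capital
  deriving (Eq, Show)

-- | Decide whether a token belongs to a class. Empty tokens never match.
matches :: TokenClass -> Text -> Bool
matches cls tok = not (T.null tok) && case cls of
  Punct    -> T.all isPunctuation tok
  Number   -> T.all isNumber tok
  AlphaNum -> T.all isAlphaNum tok
  AnyUpper -> T.any isUpper tok
  AllUpper -> T.all isUpper tok
  AnyLower -> T.any isLower tok
  AllLower -> T.all isLower tok
  -- Assumption: first character uppercase, all remaining ones lowercase.
  Capital  -> isUpper (T.head tok) && T.all isLower (T.tail tok)
```

The Data.Char predicates are Unicode-aware, so isNumber also accepts numeral characters outside the ASCII digits.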
Configuration
Configuration for the analyzer.
AConf | |