Copyright | Rogan Creswick, 2014 |
---|---|
Maintainer | creswick@gmail.com |
Stability | experimental |
Safe Haskell | None |
Language | Haskell2010 |
This module aims to make tagging text with parts of speech trivially easy.
If you're new to chatter and POS-tagging, then I suggest you simply try:
>>> tagger <- defaultTagger
>>> tagStr tagger "This is a sample sentence."
"This/dt is/bez a/at sample/nn sentence/nn ./."
Note that we used tagStr rather than tag or tagText. Many people don't (yet!) use Data.Text by default, so there is a wrapper around tag that packs and unpacks the String. This is inefficient, but it's just to get you started, and tagStr can be very handy when you're debugging a tagger in ghci (or cabal repl).
tag exposes more details of the tokenization and tagging, since it returns a list of TaggedSentences, but it doesn't print results as nicely.
- tag :: Tag t => POSTagger t -> Text -> [TaggedSentence t]
- tagStr :: Tag t => POSTagger t -> String -> String
- tagText :: Tag t => POSTagger t -> Text -> Text
- train :: Tag t => POSTagger t -> [TaggedSentence t] -> IO (POSTagger t)
- trainStr :: Tag t => POSTagger t -> String -> IO (POSTagger t)
- trainText :: Tag t => POSTagger t -> Text -> IO (POSTagger t)
- tagTokens :: Tag t => POSTagger t -> [Sentence] -> [TaggedSentence t]
- eval :: Tag t => POSTagger t -> [TaggedSentence t] -> Double
- serialize :: Tag t => POSTagger t -> ByteString
- deserialize :: Tag t => Map ByteString (ByteString -> Maybe (POSTagger t) -> Either String (POSTagger t)) -> ByteString -> Either String (POSTagger t)
- taggerTable :: Tag t => Map ByteString (ByteString -> Maybe (POSTagger t) -> Either String (POSTagger t))
- saveTagger :: Tag t => POSTagger t -> FilePath -> IO ()
- loadTagger :: Tag t => FilePath -> IO (POSTagger t)
- defaultTagger :: IO (POSTagger Tag)
- conllTagger :: IO (POSTagger Tag)
- brownTagger :: IO (POSTagger Tag)
Documentation
tag :: Tag t => POSTagger t -> Text -> [TaggedSentence t]
Tag a chunk of input text with part-of-speech tags, using the sentence splitter, tokenizer, and tagger contained in the POSTagger.
tagStr :: Tag t => POSTagger t -> String -> String
Tag the tokens in a string.
Returns a space-separated string of tokens, each token suffixed with the part of speech. For example:
>>> tagStr tagger "the dog jumped ."
"the/at dog/nn jumped/vbd ./."
train :: Tag t => POSTagger t -> [TaggedSentence t] -> IO (POSTagger t)
Train a POSTagger on a corpus of sentences.
This will recurse through the POSTagger stack, training all the backoff taggers as well. In order to do that, this function has to be generic to the kind of taggers used, so it is not possible to train up a new POSTagger from nothing: train wouldn't know what tagger to create.
To get around that restriction, you can use the various mkTagger implementations exported by the individual tagger modules, such as NLP.POS.AvgPerceptronTagger.mkTagger. For example:
import NLP.POS.AvgPerceptronTagger as APT

let newTagger = APT.mkTagger APT.emptyPerceptron Nothing
posTgr <- train newTagger trainingExamples
trainStr :: Tag t => POSTagger t -> String -> IO (POSTagger t)
Train a tagger on string input in the standard form for POS tagged corpora:
trainStr tagger "the/at dog/nn jumped/vbd ./."
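The token/tag layout above is the same one shown in the tagStr example. As a rough, standalone illustration of that layout (not chatter's own parser), the sketch below splits each whitespace-separated item at its last '/' and renders pairs back into a string. The names splitTagged, parseTagged, and renderTagged are hypothetical, plain Strings stand in for chatter's token and tag types, and splitting at the last slash is an assumption.

```haskell
-- Standalone sketch; none of these names come from chatter.

-- | Split one "token/tag" item at its LAST '/', so that a token which
-- itself contains '/' keeps it. An item without '/' gets an empty tag.
splitTagged :: String -> (String, String)
splitTagged item =
  case break (== '/') (reverse item) of
    (revTag, _ : revTok) -> (reverse revTok, reverse revTag)
    (revTok, [])         -> (reverse revTok, "")

-- | Parse a whitespace-separated run of "token/tag" items.
parseTagged :: String -> [(String, String)]
parseTagged = map splitTagged . words

-- | Render pairs back into the space-separated "token/tag" form.
renderTagged :: [(String, String)] -> String
renderTagged pairs = unwords [tok ++ "/" ++ tg | (tok, tg) <- pairs]
```

Loaded into ghci, parseTagged "the/at dog/nn jumped/vbd ./." yields [("the","at"),("dog","nn"),("jumped","vbd"),(".",".")], and renderTagged turns that list back into the original string.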
eval :: Tag t => POSTagger t -> [TaggedSentence t] -> Double
Evaluate a POSTagger.
Measures accuracy over all tags in the test corpus.
Accuracy is calculated as:
|tokens tagged correctly| / |all tokens|
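The ratio above can be written out directly. The following standalone sketch illustrates the formula only and is not eval itself: the hypothetical accuracy function compares two already-aligned lists of tags (gold and predicted), and it returns 0 for an empty corpus, which is a choice made here rather than documented behavior.

```haskell
-- Standalone sketch of the accuracy ratio; `accuracy` is a hypothetical
-- helper, not part of chatter.

-- | Fraction of positions where the predicted tag equals the gold tag.
-- Assumes both lists are aligned token-for-token. The denominator is the
-- number of gold tags; an empty gold list yields 0 (an assumption).
accuracy :: Eq t => [t] -> [t] -> Double
accuracy gold predicted
  | null gold = 0
  | otherwise = fromIntegral correct / fromIntegral (length gold)
  where
    -- Count the aligned positions whose tags match.
    correct = length (filter id (zipWith (==) gold predicted))
```

For instance, accuracy ["at","nn","vbd","."] ["at","nn","nn","."] evaluates to 0.75, since three of the four tags match.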
serialize :: Tag t => POSTagger t -> ByteString
deserialize :: Tag t => Map ByteString (ByteString -> Maybe (POSTagger t) -> Either String (POSTagger t)) -> ByteString -> Either String (POSTagger t)
taggerTable :: Tag t => Map ByteString (ByteString -> Maybe (POSTagger t) -> Either String (POSTagger t))
The default table mapping tagger IDs to readTagger functions. Each tagger packaged with Chatter should have an entry here. By convention, the IDs used are the fully qualified module name of the tagger package.
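To make the shape of such a table concrete, here is a small standalone sketch of looking up a reader function by ID. It is a simplification under stated assumptions, not a description of how deserialize is implemented: String stands in for ByteString, the hypothetical Reader type omits the Maybe (POSTagger t) backoff argument seen in the real signature, and readWith, demoTable, and the ID "Demo.Upper" are illustrative names that do not exist in chatter. It uses Data.Map from the containers package that ships with GHC.

```haskell
import qualified Data.Map as Map
import Data.Char (toUpper)

-- Standalone sketch; every name below is hypothetical.

-- | A reader turns a serialized payload into a value, or an error message.
type Reader a = String -> Either String a

-- | A toy table with one entry whose "reader" just upper-cases the payload.
demoTable :: Map.Map String (Reader String)
demoTable = Map.fromList [("Demo.Upper", Right . map toUpper)]

-- | Look up the reader registered under an ID and apply it to the payload.
-- An unknown ID is reported as a Left value instead of raising an error.
readWith :: Map.Map String (Reader a) -> String -> String -> Either String a
readWith table taggerId payload =
  case Map.lookup taggerId table of
    Just reader -> reader payload
    Nothing     -> Left ("No reader registered for ID: " ++ taggerId)
```

With this toy table, readWith demoTable "Demo.Upper" "abc" gives Right "ABC", while an unregistered ID gives a Left with an explanatory message.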
loadTagger :: Tag t => FilePath -> IO (POSTagger t)
Load a tagger, using the internal taggerTable. If you need to specify your own mappings for new composite taggers, you should use deserialize.
This function checks the filename to determine if the content should be decompressed. If the filename ends with ".gz", then we assume it is a gzipped model.
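The filename check described above might look roughly like the following standalone sketch. It is an illustration rather than chatter's code: isGzippedModel and prepareModel are hypothetical names, and the decompression step is passed in as a function so that the sketch needs nothing beyond base.

```haskell
import Data.List (isSuffixOf)

-- Standalone sketch; these helpers are hypothetical, not chatter's API.

-- | Treat a path as a gzipped model when it ends with ".gz".
isGzippedModel :: FilePath -> Bool
isGzippedModel path = ".gz" `isSuffixOf` path

-- | Apply the supplied decompression function only for ".gz" paths;
-- otherwise hand the raw content back unchanged.
prepareModel :: (bytes -> bytes) -> FilePath -> bytes -> bytes
prepareModel decompress path raw
  | isGzippedModel path = decompress raw
  | otherwise           = raw
```

In a real program the decompress argument would be an actual gzip decoder; here any function of the right type works, which keeps the dispatch logic easy to try out in ghci.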
defaultTagger :: IO (POSTagger Tag)
A basic POS tagger.
conllTagger :: IO (POSTagger Tag)
A POS tagger that has been trained on the CoNLL-2000 POS tags.
brownTagger :: IO (POSTagger Tag)
A POS tagger trained on a subset of the Brown corpus.