chatter-0.1.0.4: A library of simple NLP algorithms.

Safe Haskell: None

NLP.POS.AvgPerceptronTagger

Description

Averaged Perceptron Tagger

Adapted from Matthew Honnibal's Python implementation (distributed as textblob-aptagger).

Documentation

mkTagger :: Perceptron -> Maybe POSTagger -> POSTagger

Create an Averaged Perceptron Tagger, using the supplied back-off tagger as a fall-back if one is provided.

This uses a tokenizer adapted from the tokenize package, and Eric Kow's fullstop sentence segmenter (http://hackage.haskell.org/package/fullstop) as a sentence splitter.
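
To illustrate, here is a minimal sketch of constructing a tagger with no back-off (pass Just someTagger instead of Nothing to chain a fall-back):

 {-# LANGUAGE OverloadedStrings #-}
 -- Minimal sketch: train a throwaway model, then wrap it as a POSTagger.
 main :: IO ()
 main = do
   perceptron <- trainNew "The/DT dog/NN jumped/VB ./.\n"
   let tagger = mkTagger perceptron Nothing  -- no back-off tagger
   return () -- 'tagger' is now usable with the POSTagger API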

trainNew :: Text -> IO Perceptron

Train a new Perceptron.

The training corpus should be a collection of sentences, one sentence per line, with each token tagged with a part of speech.

For example, the input:

 "The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./."

defines two training sentences.

>>> tagger <- trainNew "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear Sirs : Let"
[[("Dear",Tag "jj"),("Sirs",Tag "nns"),(":",Tag ":"),("Let",Tag "vb")]]

trainOnFiles :: [FilePath] -> IO Perceptron

Train a new Perceptron on a corpus of files.
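
For example (a sketch; the corpus paths are hypothetical, and each file must contain tagged sentences in the format described for trainNew):

 main :: IO ()
 main = do
   -- Hypothetical files, each holding one tagged sentence per line.
   perceptron <- trainOnFiles ["corpus/part1.txt", "corpus/part2.txt"]
   let tagger = mkTagger perceptron Nothing
   return ()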

train

Arguments

:: Perceptron

The initial model.

-> Text

Training data; formatted with one sentence per line, and standard POS tags after each space-delimited token.

-> IO Perceptron 

Add training examples to a perceptron.

>>> tagger <- train emptyPerceptron "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear Sirs : Let"
[[("Dear",Tag "jj"),("Sirs",Tag "nns"),(":",Tag ":"),("Let",Tag "vb")]]

If you're training from multiple input files, this function can be folded over the files to build one model incrementally, which improves performance; see trainOnFiles for an example, and the sketch below.
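
The folding pattern looks roughly like this (a sketch of the idea, not necessarily the library's own implementation; it assumes this module's exports are in scope):

 import Control.Monad (foldM)
 import qualified Data.Text.IO as T

 -- Thread one Perceptron through every corpus file in turn.
 trainOnFiles' :: [FilePath] -> IO Perceptron
 trainOnFiles' = foldM step emptyPerceptron
   where
     step model path = T.readFile path >>= train model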

trainInt

Arguments

:: Int

The number of times to iterate over the training data, randomly shuffling after each iteration. (5 is a reasonable choice.)

-> Perceptron

The Perceptron to train.

-> [TaggedSentence]

The training data (each sentence is a list of (Text, Tag) pairs).

-> IO Perceptron

A trained perceptron. IO is needed for randomization.

Train a model from sentences.

Ported from Python:

 def train(self, sentences, save_loc=None, nr_iter=5):
     self._make_tagdict(sentences)
     self.model.classes = self.classes
     prev, prev2 = START
     for iter_ in range(nr_iter):
         c = 0
         n = 0
         for words, tags in sentences:
             context = START + [self._normalize(w) for w in words] + END
             for i, word in enumerate(words):
                 guess = self.tagdict.get(word)
                 if not guess:
                     feats = self._get_features(i, word, context, prev, prev2)
                     guess = self.model.predict(feats)
                     self.model.update(tags[i], guess, feats)
                 prev2 = prev; prev = guess
                 c += guess == tags[i]
                 n += 1
         random.shuffle(sentences)
         logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
     self.model.average_weights()
     # Pickle as a binary file
     if save_loc is not None:
         pickle.dump((self.model.weights, self.tagdict, self.classes),
                      open(save_loc, 'wb'), -1)
     return None
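
A minimal Haskell sketch of calling trainInt directly; TaggedSentence is a list of (Text, Tag) pairs per the argument docs above, and the Tag constructor is assumed here:

 {-# LANGUAGE OverloadedStrings #-}
 main :: IO ()
 main = do
   -- One tiny training sentence; 5 iterations, as suggested above.
   -- The 'Tag' data constructor is an assumption, not confirmed by these docs.
   let sentences = [[("The", Tag "DT"), ("dog", Tag "NN"), ("barks", Tag "VB")]]
   model <- trainInt 5 emptyPerceptron sentences
   return ()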

tag :: Perceptron -> [Sentence] -> [TaggedSentence]

Tag a document (represented as a list of Sentences) with a trained Perceptron.

Ported from Python:

 def tag(self, corpus, tokenize=True):
     '''Tags a string `corpus`.'''
     # Assume untokenized corpus has \n between sentences and ' ' between words
     s_split = nltk.sent_tokenize if tokenize else lambda t: t.split('\n')
     w_split = nltk.word_tokenize if tokenize else lambda s: s.split()
     def split_sents(corpus):
         for s in s_split(corpus):
             yield w_split(s)
     prev, prev2 = self.START
     tokens = []
     for words in split_sents(corpus):
         context = self.START + [self._normalize(w) for w in words] + self.END
         for i, word in enumerate(words):
             tag = self.tagdict.get(word)
             if not tag:
                 features = self._get_features(i, word, context, prev, prev2)
                 tag = self.model.predict(features)
             tokens.append((word, tag))
             prev2 = prev
             prev = tag
     return tokens
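
The corresponding Haskell call is a one-liner (a sketch; 'perceptron' is assumed to be a trained model already in scope, and Sentence a list of Text tokens, as the signature above suggests):

 {-# LANGUAGE OverloadedStrings #-}
 import qualified Data.Text as T

 tagged :: [TaggedSentence]
 tagged = tag perceptron (map T.words ["The dog barks", "The cat sleeps"])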

tagSentence :: Perceptron -> Sentence -> TaggedSentence

Tag a single sentence.
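
For a single sentence (same assumptions and imports as the sketch above):

 tagged :: TaggedSentence
 tagged = tagSentence perceptron (T.words "The dog barks")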

emptyPerceptron :: Perceptron

An empty perceptron, used to start training.