chatter-0.1.0.4: A library of simple NLP algorithms.

Safe Haskell: None

NLP.POS.AvgPerceptronTagger

Description

Averaged Perceptron Tagger

Adapted from Matthew Honnibal's Python implementation (distributed as textblob-aptagger).

Documentation

mkTagger :: Perceptron -> Maybe POSTagger -> POSTagger

Create an Averaged Perceptron Tagger, using the supplied back-off tagger as a fall-back if one is provided.

This uses a tokenizer adapted from the tokenize package, and Eric Kow's fullstop sentence segmenter (http://hackage.haskell.org/package/fullstop) as a sentence splitter.
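
To illustrate, here is a minimal sketch of constructing a tagger with no back-off (pass Just someTagger instead of Nothing to chain a fall-back):

 {-# LANGUAGE OverloadedStrings #-}
 -- Minimal sketch: train a throwaway model, then wrap it as a POSTagger.
 main :: IO ()
 main = do
   perceptron <- trainNew "The/DT dog/NN jumped/VB ./.\n"
   let tagger = mkTagger perceptron Nothing  -- no back-off tagger
   return () -- 'tagger' is now usable with the POSTagger API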

trainNew :: Text -> IO Perceptron

Train a new Perceptron.

The training corpus should be a collection of sentences, one sentence per line, with each token tagged with a part of speech.

For example, the input:

 "The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./."

defines two training sentences.

>>> tagger <- trainNew "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear Sirs : Let"
[[("Dear",Tag "jj"),("Sirs",Tag "nns"),(":",Tag ":"),("Let",Tag "vb")]]

trainOnFiles :: [FilePath] -> IO Perceptron

Train a new Perceptron on a corpus of files.
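
For example (a sketch; the corpus paths are hypothetical, and each file must contain tagged sentences in the format described for trainNew):

 main :: IO ()
 main = do
   -- Hypothetical files, each holding one tagged sentence per line.
   perceptron <- trainOnFiles ["corpus/part1.txt", "corpus/part2.txt"]
   let tagger = mkTagger perceptron Nothing
   return ()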

train

Arguments

:: Perceptron

The initial model.

-> Text

Training data; formatted with one sentence per line, and standard POS tags after each space-delimited token.

-> IO Perceptron 

Add training examples to a perceptron.

>>> tagger <- train emptyPerceptron "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear Sirs : Let"
[[("Dear",Tag "jj"),("Sirs",Tag "nns"),(":",Tag ":"),("Let",Tag "vb")]]

If you're training from multiple input files, this function can be folded over the files to build one model incrementally, which improves performance; see trainOnFiles for an example, and the sketch below.
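
The folding pattern looks roughly like this (a sketch of the idea, not necessarily the library's own implementation; it assumes this module's exports are in scope):

 import Control.Monad (foldM)
 import qualified Data.Text.IO as T

 -- Thread one Perceptron through every corpus file in turn.
 trainOnFiles' :: [FilePath] -> IO Perceptron
 trainOnFiles' = foldM step emptyPerceptron
   where
     step model path = T.readFile path >>= train model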

trainInt

Arguments

:: Int

The number of times to iterate over the training data, randomly shuffling after each iteration. (5 is a reasonable choice.)

-> Perceptron

The Perceptron to train.

-> [TaggedSentence]

The training data (each sentence is a list of (Text, Tag) pairs).

-> IO Perceptron

A trained perceptron. IO is needed for randomization.

Train a model from sentences.

Ported from Python:

 def train(self, sentences, save_loc=None, nr_iter=5):
     self._make_tagdict(sentences)
     self.model.classes = self.classes
     prev, prev2 = START
     for iter_ in range(nr_iter):
         c = 0
         n = 0
         for words, tags in sentences:
             context = START + [self._normalize(w) for w in words] + END
             for i, word in enumerate(words):
                 guess = self.tagdict.get(word)
                 if not guess:
                     feats = self._get_features(i, word, context, prev, prev2)
                     guess = self.model.predict(feats)
                     self.model.update(tags[i], guess, feats)
                 prev2 = prev; prev = guess
                 c += guess == tags[i]
                 n += 1
         random.shuffle(sentences)
         logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
     self.model.average_weights()
     # Pickle as a binary file
     if save_loc is not None:
         pickle.dump((self.model.weights, self.tagdict, self.classes),
                      open(save_loc, 'wb'), -1)
     return None
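
A minimal Haskell sketch of calling trainInt directly; TaggedSentence is a list of (Text, Tag) pairs per the argument docs above, and the Tag constructor is assumed here:

 {-# LANGUAGE OverloadedStrings #-}
 main :: IO ()
 main = do
   -- One tiny training sentence; 5 iterations, as suggested above.
   -- The 'Tag' data constructor is an assumption, not confirmed by these docs.
   let sentences = [[("The", Tag "DT"), ("dog", Tag "NN"), ("barks", Tag "VB")]]
   model <- trainInt 5 emptyPerceptron sentences
   return ()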

tag :: Perceptron -> [Sentence] -> [TaggedSentence]

Tag a document (represented as a list of Sentences) with a trained Perceptron.

Ported from Python:

 def tag(self, corpus, tokenize=True):
     '''Tags a string `corpus`.'''
     # Assume untokenized corpus has \n between sentences and ' ' between words
     s_split = nltk.sent_tokenize if tokenize else lambda t: t.split('\n')
     w_split = nltk.word_tokenize if tokenize else lambda s: s.split()
     def split_sents(corpus):
         for s in s_split(corpus):
             yield w_split(s)
     prev, prev2 = self.START
     tokens = []
     for words in split_sents(corpus):
         context = self.START + [self._normalize(w) for w in words] + self.END
         for i, word in enumerate(words):
             tag = self.tagdict.get(word)
             if not tag:
                 features = self._get_features(i, word, context, prev, prev2)
                 tag = self.model.predict(features)
             tokens.append((word, tag))
             prev2 = prev
             prev = tag
     return tokens
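
The corresponding Haskell call is a one-liner (a sketch; 'perceptron' is assumed to be a trained model already in scope, and Sentence a list of Text tokens, as the signature above suggests):

 {-# LANGUAGE OverloadedStrings #-}
 import qualified Data.Text as T

 tagged :: [TaggedSentence]
 tagged = tag perceptron (map T.words ["The dog barks", "The cat sleeps"])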

tagSentence :: Perceptron -> Sentence -> TaggedSentence

Tag a single sentence.
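
For a single sentence (same assumptions and imports as the sketch above):

 tagged :: TaggedSentence
 tagged = tagSentence perceptron (T.words "The dog barks")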

emptyPerceptron :: Perceptron

An empty perceptron, used to start training.