chatter-0.9.1.0: A library of simple NLP algorithms.

Safe Haskell: None
Language: Haskell2010

NLP.POS.AvgPerceptronTagger

Description

Averaged Perceptron Tagger

Adapted from the Python implementation found here:


Documentation

mkTagger :: Tag t => Perceptron -> Maybe (POSTagger t) -> POSTagger t Source #

Create an Averaged Perceptron Tagger, falling back to the given back-off tagger if one is supplied.

This uses a tokenizer adapted from the tokenize package, and Eric Kow's fullstop sentence segmenter (http://hackage.haskell.org/package/fullstop) as a sentence splitter.
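
As a hedged usage sketch (not taken from this page: the RawTag tag set, its parseTag parser, and the NLP.Types imports are assumptions), creating a tagger with no back-off might look like this:

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import NLP.POS.AvgPerceptronTagger (mkTagger, trainNew)
import NLP.Types (POSTagger, RawTag, Tag(parseTag))

-- Train a small perceptron and wrap it as a POSTagger, passing Nothing so
-- that no back-off tagger is consulted.  The tiny corpus is illustrative.
buildTagger :: IO (POSTagger RawTag)
buildTagger = do
  perceptron <- trainNew (parseTag :: Text -> RawTag)
                         "The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./.\n"
  return (mkTagger perceptron Nothing)

Because the wrapper supplies its own tokenizer and sentence splitter, the resulting POSTagger is what the higher-level, whole-document tagging functions (in NLP.POS) expect.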

trainNew :: Tag t => (Text -> t) -> Text -> IO Perceptron Source #

Train a new Perceptron.

The training corpus should be a collection of sentences, one sentence per line, with each token tagged with its part of speech.

For example, the input:

"The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./."

defines two training sentences.

>>> tagger <- trainNew "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear sir"
"Dear/jj Sirs/nns :/: Let/vb"

trainOnFiles :: Tag t => (Text -> t) -> [FilePath] -> IO Perceptron Source #

Train a new Perceptron on a corpus of files.
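
A hedged usage sketch (the file names are hypothetical, and RawTag/parseTag are assumed to come from NLP.Types); each file should hold tagged sentences in the same one-sentence-per-line, word/TAG format that trainNew expects:

import Data.Text (Text)
import NLP.POS.AvgPerceptronTagger (trainOnFiles)
import NLP.Types (RawTag, Tag(parseTag))

-- Train a single perceptron from several corpus files in sequence.
modelFromFiles = trainOnFiles (parseTag :: Text -> RawTag)
                              ["corpus-part1.txt", "corpus-part2.txt"]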

train Source #

Arguments

:: Tag t 
=> (Text -> t)

The POS tag parser.

-> Perceptron

The initial model.

-> Text

Training data; formatted with one sentence per line, and standard POS tags after each space-delimited token.

-> IO Perceptron 

Add training examples to a perceptron.

>>> tagger <- train emptyPerceptron "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear sir"
"Dear/jj Sirs/nns :/: Let/vb"

If you're training from multiple input files, this is useful for performance: fold train over the files, threading the resulting Perceptron through each call (see trainOnFiles, or the sketch below).
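
A hedged sketch of that folding pattern (parseTag and RawTag from NLP.Types are assumptions; this is roughly what trainOnFiles presumably does, but treat it as illustrative):

import Control.Monad (foldM)
import Data.Text (Text)
import qualified Data.Text.IO as TIO
import NLP.POS.AvgPerceptronTagger (emptyPerceptron, train)
import NLP.Types (RawTag, Tag(parseTag))

-- Fold train over a list of corpus files, threading the perceptron through
-- so that each file's sentences are added to the model built so far.
trainOverFiles files = foldM step emptyPerceptron files
  where
    step model path = do
      corpus <- TIO.readFile path
      train (parseTag :: Text -> RawTag) model corpus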

trainInt Source #

Arguments

:: Tag t 
=> Int

The number of times to iterate over the training data, randomly shuffling after each iteration. (5 is a reasonable choice.)

-> Perceptron

The Perceptron to train.

-> [TaggedSentence t]

The training data: a list of TaggedSentences (each, in effect, a list of (Text, Tag) pairs).

-> IO Perceptron

A trained perceptron. IO is needed for randomization.

Train a model from sentences.

Ported from Python:

def train(self, sentences, save_loc=None, nr_iter=5):
    self._make_tagdict(sentences)
    self.model.classes = self.classes
    prev, prev2 = START
    for iter_ in range(nr_iter):
        c = 0
        n = 0
        for words, tags in sentences:
            context = START + [self._normalize(w) for w in words] + END
            for i, word in enumerate(words):
                guess = self.tagdict.get(word)
                if not guess:
                    feats = self._get_features(i, word, context, prev, prev2)
                    guess = self.model.predict(feats)
                    self.model.update(tags[i], guess, feats)
                prev2 = prev; prev = guess
                c += guess == tags[i]
                n += 1
        random.shuffle(sentences)
        logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
    self.model.average_weights()
    # Pickle as a binary file
    if save_loc is not None:
        pickle.dump((self.model.weights, self.tagdict, self.classes),
                     open(save_loc, 'wb'), -1)
    return None
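
For comparison, a hedged Haskell sketch of driving trainInt directly; readPOS is assumed to come from NLP.Corpora.Parsing and to parse one word/TAG sentence, and RawTag is an illustrative tag set:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import NLP.Corpora.Parsing (readPOS)
import NLP.POS.AvgPerceptronTagger (emptyPerceptron, trainInt)
import NLP.Types (RawTag, TaggedSentence)

-- Parse one tagged sentence per line, then run five iterations over the
-- (shuffled) data, starting from an empty perceptron.
trainFiveIterations =
  let sentences :: [TaggedSentence RawTag]
      sentences = map readPOS (T.lines "The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./.")
  in trainInt 5 emptyPerceptron sentences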

tag :: Tag t => Perceptron -> [Sentence] -> [TaggedSentence t] Source #

Tag a document (represented as a list of Sentences) with a trained Perceptron.

Ported from Python:

def tag(self, corpus, tokenize=True):
    '''Tags a string `corpus`.'''
    # Assume untokenized corpus has \n between sentences and ' ' between words
    s_split = nltk.sent_tokenize if tokenize else lambda t: t.split('\n')
    w_split = nltk.word_tokenize if tokenize else lambda s: s.split()
    def split_sents(corpus):
        for s in s_split(corpus):
            yield w_split(s)
    prev, prev2 = self.START
    tokens = []
    for words in split_sents(corpus):
        context = self.START + [self._normalize(w) for w in words] + self.END
        for i, word in enumerate(words):
            tag = self.tagdict.get(word)
            if not tag:
                features = self._get_features(i, word, context, prev, prev2)
                tag = self.model.predict(features)
            tokens.append((word, tag))
            prev2 = prev
            prev = tag
    return tokens
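
A hedged sketch of how document tagging relates to the per-sentence function below; whether tag is implemented as exactly this map is not shown on this page, and the Sentence values are assumed to come from chatter's tokenizer and sentence splitter:

import NLP.POS.AvgPerceptronTagger (tagSentence)

-- Tag a document one already-tokenized sentence at a time.
tagDocument model = map (tagSentence model)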

tagSentence :: Tag t => Perceptron -> Sentence -> TaggedSentence t Source #

Tag a single sentence.

emptyPerceptron :: Perceptron Source #

An empty perceptron, used to start training.