| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
NLP.POS.AvgPerceptronTagger
Description
Averaged Perceptron Tagger
Adapted from the Python implementation found here:
- https://github.com/sloria/textblob-aptagger
- mkTagger :: Tag t => Perceptron -> Maybe (POSTagger t) -> POSTagger t
- trainNew :: Tag t => (Text -> t) -> Text -> IO Perceptron
- trainOnFiles :: Tag t => (Text -> t) -> [FilePath] -> IO Perceptron
- train :: Tag t => (Text -> t) -> Perceptron -> Text -> IO Perceptron
- trainInt :: Tag t => Int -> Perceptron -> [TaggedSentence t] -> IO Perceptron
- tag :: Tag t => Perceptron -> [Sentence] -> [TaggedSentence t]
- tagSentence :: Tag t => Perceptron -> Sentence -> TaggedSentence t
- emptyPerceptron :: Perceptron
- taggerID :: ByteString
- readTagger :: Tag t => ByteString -> Maybe (POSTagger t) -> Either String (POSTagger t)
Documentation
mkTagger :: Tag t => Perceptron -> Maybe (POSTagger t) -> POSTagger t Source
Create an Averaged Perceptron Tagger using the specified back-off tagger as a fall-back, if one is specified.
This uses a tokenizer adapted from the tokenize package, and Erik Kow's fullstop sentence segmenter (http://hackage.haskell.org/package/fullstop) as a sentence splitter.
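For instance, a trained model can be wrapped into a full POSTagger. A minimal sketch, assuming chatter's RawTag and POSTagger types live in NLP.Types (the module locations are assumptions); no back-off tagger is supplied here:

import NLP.POS.AvgPerceptronTagger (mkTagger, emptyPerceptron)
import NLP.Types (POSTagger, RawTag)

-- Wrap a perceptron model as a POSTagger with no back-off tagger;
-- emptyPerceptron stands in for a trained model here.
avgTagger :: POSTagger RawTag
avgTagger = mkTagger emptyPerceptron Nothing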
trainNew :: Tag t => (Text -> t) -> Text -> IO Perceptron Source
Train a new Perceptron.
The training corpus should be a collection of sentences, one sentence per line, with each token tagged with a part of speech.
For example, the input:
"The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./."
defines two training sentences.
>>> tagger <- trainNew "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear sir"
"Dear/jj Sirs/nns :/: Let/vb"
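With the signature above, the tag parser is passed explicitly. A minimal sketch, assuming chatter's RawTag constructor (which wraps the tag text directly; an assumption about its shape):

{-# LANGUAGE OverloadedStrings #-}
import NLP.POS.AvgPerceptronTagger (trainNew)
import NLP.Types (RawTag(..))

main :: IO ()
main = do
  -- One sentence per line; each token tagged as token/TAG.
  model <- trainNew RawTag "The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./.\n"
  model `seq` putStrLn "model trained"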
trainOnFiles :: Tag t => (Text -> t) -> [FilePath] -> IO Perceptron Source
Train a new Perceptron on a corpus of files.
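A sketch of calling trainOnFiles, assuming RawTag as above; the file names are placeholders for tagged corpora on disk, each in the one-sentence-per-line format described for trainNew:

import NLP.POS.AvgPerceptronTagger (trainOnFiles)
import NLP.Types (RawTag(..))

main :: IO ()
main = do
  -- Each file holds one tagged sentence per line, as for trainNew.
  model <- trainOnFiles RawTag ["corpus-a.pos", "corpus-b.pos"]
  model `seq` putStrLn "model trained"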
train Source
Arguments
| :: Tag t | |
| => (Text -> t) | The POS tag parser. |
| -> Perceptron | The initial model. |
| -> Text | Training data; formatted with one sentence per line, and standard POS tags after each space-delimited token. |
| -> IO Perceptron | |
Add training examples to a perceptron.
>>> tagger <- train emptyPerceptron "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear sir"
"Dear/jj Sirs/nns :/: Let/vb"
If you're using multiple input files, this can be useful for improving performance (by folding over the files). For example, see trainOnFiles, or the sketch below.
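A sketch of that fold, assuming RawTag as above and that the Perceptron type is importable from NLP.POS.AvgPerceptron (an assumption); this is roughly what trainOnFiles does internally:

import Control.Monad (foldM)
import qualified Data.Text.IO as T
import NLP.POS.AvgPerceptron (Perceptron)
import NLP.POS.AvgPerceptronTagger (train, emptyPerceptron)
import NLP.Types (RawTag(..))

-- Thread a single model through every corpus file, instead of
-- retraining from scratch for each file.
trainFiles :: [FilePath] -> IO Perceptron
trainFiles = foldM step emptyPerceptron
  where
    step model path = do
      corpus <- T.readFile path
      train RawTag model corpus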
trainInt Source
Arguments
| :: Tag t | |
| => Int | The number of times to iterate over the training data, randomly shuffling after each iteration. (5 is a reasonable choice.) |
| -> Perceptron | The initial Perceptron to train. |
| -> [TaggedSentence t] | The training data. (A list of TaggedSentence t values.) |
| -> IO Perceptron | A trained perceptron. IO is needed for randomization. |
Train a model from sentences.
Ported from Python:
def train(self, sentences, save_loc=None, nr_iter=5):
    self._make_tagdict(sentences)
    self.model.classes = self.classes
    prev, prev2 = START
    for iter_ in range(nr_iter):
        c = 0
        n = 0
        for words, tags in sentences:
            context = START + [self._normalize(w) for w in words] + END
            for i, word in enumerate(words):
                guess = self.tagdict.get(word)
                if not guess:
                    feats = self._get_features(i, word, context, prev, prev2)
                    guess = self.model.predict(feats)
                    self.model.update(tags[i], guess, feats)
                prev2 = prev; prev = guess
                c += guess == tags[i]
                n += 1
        random.shuffle(sentences)
        logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
    self.model.average_weights()
    # Pickle as a binary file
    if save_loc is not None:
        pickle.dump((self.model.weights, self.tagdict, self.classes),
                    open(save_loc, 'wb'), -1)
    return None
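A sketch of calling trainInt from Haskell, assuming TaggedSentence t is a list of (token, tag) pairs as in early chatter versions (an assumption about its shape):

{-# LANGUAGE OverloadedStrings #-}
import NLP.POS.AvgPerceptronTagger (trainInt, emptyPerceptron)
import NLP.Types (RawTag(..))

main :: IO ()
main = do
  -- One training sentence, already tokenized and tagged.
  let sentences = [ [ ("The", RawTag "DT"), ("dog", RawTag "NN")
                    , ("jumped", RawTag "VB"), (".", RawTag ".") ] ]
  model <- trainInt 5 emptyPerceptron sentences
  model `seq` putStrLn "model trained"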
tag :: Tag t => Perceptron -> [Sentence] -> [TaggedSentence t] Source
Tag a document (represented as a list of Sentences) with a trained Perceptron.
Ported from Python:
def tag(self, corpus, tokenize=True):
    '''Tags a string `corpus`.'''
    # Assume untokenized corpus has \n between sentences and ' ' between words
    s_split = nltk.sent_tokenize if tokenize else lambda t: t.split('\n')
    w_split = nltk.word_tokenize if tokenize else lambda s: s.split()
    def split_sents(corpus):
        for s in s_split(corpus):
            yield w_split(s)
    prev, prev2 = self.START
    tokens = []
    for words in split_sents(corpus):
        context = self.START + [self._normalize(w) for w in words] + self.END
        for i, word in enumerate(words):
            tag = self.tagdict.get(word)
            if not tag:
                features = self._get_features(i, word, context, prev, prev2)
                tag = self.model.predict(features)
            tokens.append((word, tag))
            prev2 = prev
            prev = tag
    return tokens
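A sketch tying training and tagging together, assuming Sentence is a tokenized sentence ([Text]) and TaggedSentence t a list of (token, tag) pairs, as in early chatter versions (both assumptions); tagSentence below behaves the same way on a single Sentence:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import NLP.POS.AvgPerceptronTagger (tag, trainNew)
import NLP.Types (RawTag(..), TaggedSentence)

main :: IO ()
main = do
  model <- trainNew RawTag "The/DT dog/NN jumped/VB ./.\n"
  -- Split raw text into sentences (lines) and tokens (words).
  let sentences = map T.words (T.lines "The dog jumped .")
      tagged = tag model sentences :: [TaggedSentence RawTag]
  print tagged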
tagSentence :: Tag t => Perceptron -> Sentence -> TaggedSentence t Source
Tag a single sentence.
emptyPerceptron :: Perceptron Source
An empty perceptron, used to start training.
readTagger :: Tag t => ByteString -> Maybe (POSTagger t) -> Either String (POSTagger t) Source