Safe Haskell | None |
---|---|
Language | Haskell2010 |
Averaged Perceptron Tagger
Adapted from the Python implementation found here:
- mkTagger :: Tag t => Perceptron -> Maybe (POSTagger t) -> POSTagger t
- trainNew :: Tag t => (Text -> t) -> Text -> IO Perceptron
- trainOnFiles :: Tag t => (Text -> t) -> [FilePath] -> IO Perceptron
- train :: Tag t => (Text -> t) -> Perceptron -> Text -> IO Perceptron
- trainInt :: Tag t => Int -> Perceptron -> [TaggedSentence t] -> IO Perceptron
- tag :: Tag t => Perceptron -> [Sentence] -> [TaggedSentence t]
- tagSentence :: Tag t => Perceptron -> Sentence -> TaggedSentence t
- emptyPerceptron :: Perceptron
- taggerID :: ByteString
- readTagger :: Tag t => ByteString -> Maybe (POSTagger t) -> Either String (POSTagger t)
Documentation
mkTagger :: Tag t => Perceptron -> Maybe (POSTagger t) -> POSTagger t Source #
Create an Averaged Perceptron Tagger using the specified back-off tagger as a fall-back, if one is specified.
This uses a tokenizer adapted from the tokenize package, and Erik Kow's fullstop sentence segmenter (http://hackage.haskell.org/package/fullstop) as a sentence splitter.
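As a minimal sketch of constructing one (assuming the RawTag instance and POSTagger type exported from NLP.Types, and no back-off tagger):

```haskell
import NLP.POS.AvgPerceptronTagger (mkTagger, emptyPerceptron)
import NLP.Types (POSTagger, RawTag)

-- An (as yet untrained) averaged perceptron tagger with no back-off.
-- Pass (Just otherTagger) instead of Nothing to fall back on another
-- POSTagger when this one cannot handle a token.
untrained :: POSTagger RawTag
untrained = mkTagger emptyPerceptron Nothing
```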
trainNew :: Tag t => (Text -> t) -> Text -> IO Perceptron Source #
Train a new Perceptron.
The training corpus should be a collection of sentences, one sentence per line, with each token tagged with a part of speech.
For example, the input:
"The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./."
defines two training sentences.
>>> tagger <- trainNew "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear sir"
"Dear/jj Sirs/nns :/: Let/vb"
trainOnFiles :: Tag t => (Text -> t) -> [FilePath] -> IO Perceptron Source #
Train a new Perceptron on a corpus of files.
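A short usage sketch (the RawTag constructor as the tag parser, the Perceptron import from NLP.POS.AvgPerceptron, and the file names are assumptions):

```haskell
import NLP.POS.AvgPerceptron (Perceptron)
import NLP.POS.AvgPerceptronTagger (trainOnFiles)
import NLP.Types (RawTag(..))

-- Train one perceptron over several files, each in the same
-- one-sentence-per-line, token/TAG format used by trainNew.
trainAll :: IO Perceptron
trainAll = trainOnFiles RawTag ["corpus-part1.pos", "corpus-part2.pos"]
```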
train Source #
:: Tag t | |
=> (Text -> t) | The POS tag parser. |
-> Perceptron | The initial model. |
-> Text | Training data; formatted with one sentence per line, and standard POS tags after each space-delimited token. |
-> IO Perceptron | |
Add training examples to a perceptron.
>>> tagger <- train emptyPerceptron "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear sir"
"Dear/jj Sirs/nns :/: Let/vb"
If you're using multiple input files, this can be useful to improve performance (by folding over the files). For example, see trainOnFiles, or the fold sketched below.
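The fold itself might look like this sketch (assuming RawTag as the tag parser, the Perceptron type from NLP.POS.AvgPerceptron, and Data.Text.IO for reading; file names are hypothetical):

```haskell
import Control.Monad (foldM)
import qualified Data.Text.IO as TIO
import NLP.POS.AvgPerceptron (Perceptron)
import NLP.POS.AvgPerceptronTagger (emptyPerceptron, train)
import NLP.Types (RawTag(..))

-- Thread one Perceptron through every corpus file instead of
-- training a fresh model per file.
trainFiles :: [FilePath] -> IO Perceptron
trainFiles = foldM step emptyPerceptron
  where
    step model path = TIO.readFile path >>= train RawTag model
```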
trainInt Source #
:: Tag t | |
=> Int | The number of times to iterate over the training data, randomly shuffling after each iteration. (The Python implementation defaults to 5.) |
-> Perceptron | The Perceptron to train. |
-> [TaggedSentence t] | The training data. (A list of TaggedSentences.) |
-> IO Perceptron | A trained perceptron. IO is needed for randomization. |
Train a model from sentences.
Ported from Python:
```python
def train(self, sentences, save_loc=None, nr_iter=5):
    self._make_tagdict(sentences)
    self.model.classes = self.classes
    prev, prev2 = START
    for iter_ in range(nr_iter):
        c = 0
        n = 0
        for words, tags in sentences:
            context = START + [self._normalize(w) for w in words] + END
            for i, word in enumerate(words):
                guess = self.tagdict.get(word)
                if not guess:
                    feats = self._get_features(i, word, context, prev, prev2)
                    guess = self.model.predict(feats)
                    self.model.update(tags[i], guess, feats)
                prev2 = prev; prev = guess
                c += guess == tags[i]
                n += 1
        random.shuffle(sentences)
        logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
    self.model.average_weights()
    # Pickle as a binary file
    if save_loc is not None:
        pickle.dump((self.model.weights, self.tagdict, self.classes),
                    open(save_loc, 'wb'), -1)
    return None
```
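Compared with train, trainInt takes sentences that are already parsed into TaggedSentences. A minimal sketch of calling it directly (the readPOS helper from NLP.Corpora.Parsing, the Perceptron import, and RawTag are assumptions about chatter's module layout):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import NLP.Corpora.Parsing (readPOS)   -- assumed parser for one token/TAG line
import NLP.POS.AvgPerceptron (Perceptron)
import NLP.POS.AvgPerceptronTagger (emptyPerceptron, trainInt)
import NLP.Types (RawTag, TaggedSentence)

trainSmall :: IO Perceptron
trainSmall = trainInt 5 emptyPerceptron sents  -- 5 passes, as in the Python default
  where
    sents :: [TaggedSentence RawTag]
    sents = map readPOS
              (T.lines "The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./.")
```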
tag :: Tag t => Perceptron -> [Sentence] -> [TaggedSentence t] Source #
Tag a document (represented as a list of Sentences) with a trained Perceptron.
Ported from Python:
```python
def tag(self, corpus, tokenize=True):
    '''Tags a string `corpus`.'''
    # Assume untokenized corpus has \n between sentences and ' ' between words
    s_split = nltk.sent_tokenize if tokenize else lambda t: t.split('\n')
    w_split = nltk.word_tokenize if tokenize else lambda s: s.split()
    def split_sents(corpus):
        for s in s_split(corpus):
            yield w_split(s)

    prev, prev2 = self.START
    tokens = []
    for words in split_sents(corpus):
        context = self.START + [self._normalize(w) for w in words] + self.END
        for i, word in enumerate(words):
            tag = self.tagdict.get(word)
            if not tag:
                features = self._get_features(i, word, context, prev, prev2)
                tag = self.model.predict(features)
            tokens.append((word, tag))
            prev2 = prev
            prev = tag
    return tokens
```
tagSentence :: Tag t => Perceptron -> Sentence -> TaggedSentence t Source #
Tag a single sentence.
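For example (a sketch; here a Sentence is taken to be a list of Text tokens, as in the tag examples above, and RawTag is the assumed tag type):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import NLP.POS.AvgPerceptronTagger (tagSentence, trainNew)
import NLP.Types (RawTag(..), TaggedSentence)

main :: IO ()
main = do
  -- Train on a tiny corpus, then tag one pre-tokenized sentence.
  model <- trainNew RawTag "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
  let tagged = tagSentence model (T.words "Dear Sirs") :: TaggedSentence RawTag
  print tagged
```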
emptyPerceptron :: Perceptron Source #
An empty perceptron, used to start training.
readTagger :: Tag t => ByteString -> Maybe (POSTagger t) -> Either String (POSTagger t) Source #
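readTagger has no prose here, but from the type it deserializes stored tagger state (with an optional back-off tagger) and reports failures as Left. A loading sketch, assuming the strict Data.ByteString variant and a hypothetical file name:

```haskell
import qualified Data.ByteString as BS
import NLP.POS.AvgPerceptronTagger (readTagger)
import NLP.Types (POSTagger, RawTag)

-- Read previously serialized tagger state from disk; no back-off
-- tagger is supplied here.
loadTagger :: FilePath -> IO (Either String (POSTagger RawTag))
loadTagger path = do
  bytes <- BS.readFile path
  return (readTagger bytes Nothing)
```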