Data.DefaultMap

DefaultMap: a defaulting Map; a Map that returns a default value when queried for a key that does not exist.

empty: Create an empty DefaultMap.

lookup: Query the map for a value. Returns the default if the key is not found.

fromList: Create a DefaultMap from a default value and a list.

keys: Access the keys as a list.

foldl: Fold over the values in the map. Note that this does *not* fold over the default value; this fold behaves in the same way as a standard Data.Map foldl.

NLP.Corpora.Email

plugDataPath: Path to the directory containing all the PLUG archives.

NLP.Types

Corpus: Document corpus. This is a simple hashed corpus; the document content is not stored.

corpLength: The number of documents in the corpus.

corpTermCounts: A count of the number of documents each term occurred in.

POSTagger: Part-of-speech tagger, with back-off tagger.

A sequence of POS taggers can be assembled by using backoff taggers. When tagging text, the first tagger is run on the input, possibly tagging some tokens as unknown ('Tag Unk'). The first backoff tagger is then recursively invoked on the text to fill in the unknown tags, but that may still leave some tokens marked with 'Tag Unk'. This process repeats until no more taggers are found. (The current implementation is not very efficient in this respect.)

Backoff taggers are particularly useful when there is a set of domain-specific vernacular that a general-purpose statistical tagger does not know of. A LiteralTagger can be created to map terms to fixed POS tags, and then delegate the bulk of the text to a statistical backoff tagger, such as an AvgPerceptronTagger.

POSTagger values can be serialized and deserialized by using serialize and NLP.POS.deserialize. This is a bit tricky because the POSTagger abstracts away the implementation details of the particular tagging algorithm, and the model for that tagger (if any). To support serialization, each POSTagger value must provide a serialize value that can be used to generate a ByteString representation of the model, as well as a unique id (also a ByteString). Furthermore, that ID must be added to a `Map ByteString (ByteString -> Maybe POSTagger -> Either String POSTagger)` that is provided to deserialize. The function in the map takes the output of serialize, and possibly a backoff tagger, and reconstitutes the POSTagger that was serialized (assigning the proper functions, setting up closures as needed, etc.). Look at the source of NLP.POS.LiteralTagger and NLP.POS.AvgPerceptronTagger for examples.

The POSTagger record fields:

posTagger: The initial part-of-speech tagger.
posTrainer: Training function to train the immediate POS tagger.
posBackoff: A tagger to invoke on unknown tokens.
posTokenizer: A tokenizer (the tokenize function from the tokenize package will work).
posSplitter: A sentence splitter. If your input is formatted as one sentence per line, then use Data.Text.lines; otherwise try Erik Kow's fullstop library.
posSerialize: Store this POS tagger to a bytestring. This does not serialize the backoff taggers.
posID: A unique id that identifies the algorithm used for this POS tagger. This is used in deserialization.

CaseSensitive: Boolean type to indicate case sensitivity for textual comparisons.

contains: True if the input sentence contains the given text token. Does not do partial or approximate matching; the comparison is fully case-sensitive.

containsTag: True if the input sentence contains the given POS tag. Does not do partial matching (such as prefix matching).

stripTags: Remove the tags from a tagged sentence.

tagUNK: Constant tag for "unknown".

termCounts: Get the number of documents that a term occurred in.

addDocument: Add a document to the corpus. This can be dangerous if the documents are pre-processed differently. All corpus-related functions assume that the documents have all been tokenized and the tokens normalized, in the same way.

mkCorpus: Create a corpus from a list of documents, represented by normalized tokens.
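As a concrete illustration of the corpus functions above, here is a minimal sketch. The argument order of termCounts and addDocument (corpus first) is an assumption, so check the actual signatures before relying on it.

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.Text as T
    import NLP.Types (mkCorpus, addDocument, termCounts)

    -- Two documents, tokenized and case-normalized the same way, as the
    -- documentation above requires.
    docs :: [[T.Text]]
    docs = map (T.words . T.toLower) ["The cat sat", "The dog barked"]

    main :: IO ()
    main = do
      let corpus  = mkCorpus docs
          corpus' = addDocument corpus (T.words "the cat returned")
      -- "cat" occurs in two of the three documents, so we expect 2 here.
      print (termCounts corpus' "cat")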
NLP.POS.LiteralTagger

mkTagger: Create a Literal Tagger using the specified back-off tagger as a fall-back, if one is specified. This uses a tokenizer adapted from the tokenize package for a tokenizer, and Erik Kow's fullstop sentence segmenter as a sentence splitter.

protectTerms: Create a tokenizer that protects the provided terms (to tokenize multi-word terms).

readTagger: Deserialization for Literal Taggers. The serialization logic is in the posSerialize record of the POSTagger created in mkTagger.

NLP.POS.UnambiguousTagger

mkTagger: Create an unambiguous tagger, using the supplied Map as a source of tags.

train: Trainer method for unambiguous taggers.

NLP.POS.AvgPerceptron

Perceptron: The perceptron model.

weights: Each feature gets its own weight vector, so weights is a dict-of-dicts.

totals: The accumulated values, for the averaging. These will be keyed by feature/class tuples.

tstamps: The last time the feature was changed, for the averaging. Also keyed by feature/class tuples. (tstamps is short for timestamps.)

instances: Number of instances seen.

Weight: Typedef for doubles, to make the code easier to read and to make this simple to change if necessary.

Class: The classes that the perceptron assigns are represented with a newtype-wrapped String. Eventually, I think this should become a typeclass, so the classes can be defined by the users of the Perceptron (such as custom POS tag ADTs, or more complex classes).

emptyPerceptron: An empty perceptron, used to start training.

predict: Predict a class given a feature vector.

Ported from Python:

    def predict(self, features):
        '''Dot-product the features and current weights and return the best label.'''
        scores = defaultdict(float)
        for feat, value in features.items():
            if feat not in self.weights or value == 0:
                continue
            weights = self.weights[feat]
            for label, weight in weights.items():
                scores[label] += value * weight
        # Do a secondary alphabetic sort, for stability
        return max(self.classes, key=lambda label: (scores[label], label))

update: Update the perceptron with a new example.

Ported from Python (note that setdefault is Map.findWithDefault, and destructive):

    def update(self, truth, guess, features):
        '''Update the feature weights.'''
        def upd_feat(c, f, w, v):
            param = (f, c)
            self._totals[param] += (self.i - self._tstamps[param]) * w
            self._tstamps[param] = self.i
            self.weights[f][c] = w + v

        self.i += 1
        if truth == guess:
            return None
        for f in features:
            weights = self.weights.setdefault(f, {})
            upd_feat(truth, f, weights.get(truth, 0.0), 1.0)
            upd_feat(guess, f, weights.get(guess, 0.0), -1.0)
        return None

averageWeights: Average the weights.

Ported from Python:

    def average_weights(self):
        for feat, weights in self.weights.items():
            new_feat_weights = {}
            for clas, weight in weights.items():
                param = (feat, clas)
                total = self._totals[param]
                total += (self.i - self._tstamps[param]) * weight
                averaged = round(total / float(self.i), 3)
                if averaged:
                    new_feat_weights[clas] = averaged
            self.weights[feat] = new_feat_weights
        return None

roundTo: Round a fractional number to a specified decimal place.

    >>> roundTo 2 3.1459
    3.15
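The prediction step in the Python reference above amounts to summing, per class, the weights of the active features and taking the best-scoring class. Below is a standalone Haskell sketch of that step using ordinary Data.Map values; it is an illustration, not chatter's internal Perceptron code, and the Feat/Class synonyms are placeholders for the library's own types.

    import           Data.List (maximumBy)
    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map
    import           Data.Ord (comparing)

    type Feat  = String
    type Class = String

    -- Score every class reachable from the active features and pick the best,
    -- using the class name itself as a secondary (alphabetic) tie-break.
    predictSketch :: Map Feat (Map Class Double)  -- weights: feature -> class -> weight
                  -> Map Feat Double              -- feature vector for one token
                  -> Maybe Class
    predictSketch weights features
      | Map.null scores = Nothing
      | otherwise       = Just . fst $ maximumBy (comparing rank) (Map.toList scores)
      where
        rank (cls, score) = (score, cls)  -- secondary alphabetic sort, for stability
        scores = Map.foldlWithKey' addFeat Map.empty features
        addFeat acc feat value
          | value == 0 = acc
          | otherwise  = case Map.lookup feat weights of
              Nothing -> acc
              Just ws -> Map.unionWith (+) acc (Map.map (* value) ws)

Unlike the Python version, which ranges over all known classes, this sketch only considers classes that appear in some active feature's weight vector.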
NLP.Similarity.VectorSim

TermVector: An efficient(ish) representation for documents in the "bag of words" sense.

mkVector: Generate a TermVector from a tokenized document.

sim: Invokes similarity on full strings, using tokenize for tokenization, and no stemming. There *must* be at least one document in the corpus.

similarity: Determine how similar two documents are.

This function assumes that each document has been tokenized and (if desired) stemmed/case-normalized. It is a wrapper around tvSim, which is a *much* more efficient implementation. If you need to run similarity against any single document more than once, then you should create TermVectors for each of your documents and use tvSim instead of similarity. There *must* be at least one document in the corpus.

tvSim: Determine how similar two documents are. Calculates the similarity between two documents, represented as TermVectors.

tf: Return the raw frequency of a term in a body of text. The first argument is the term to find, the second is a tokenized document. This function does not do any stemming or additional text modification.

idf: Calculate the inverse document frequency. The IDF is, roughly speaking, a measure of how popular a term is.

tf_idf: Calculate the tf*idf measure for a term given a document and a corpus.

magnitude: Calculate the magnitude of a vector.

dotProd: Find the dot product of two vectors.
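To make the tf, idf and cosine-similarity definitions above concrete, here is a small self-contained sketch over pre-tokenized documents. It mirrors the descriptions given here rather than chatter's actual implementation; in particular, the +1 smoothing in the idf denominator is an assumption, not necessarily what tf_idf uses.

    import qualified Data.Text as T

    -- Raw frequency of a term in one tokenized document (cf. tf).
    tfSketch :: T.Text -> [T.Text] -> Double
    tfSketch term doc = fromIntegral . length $ filter (== term) doc

    -- Inverse document frequency over a corpus of tokenized documents (cf. idf).
    -- The +1 in the denominator is a common smoothing choice, assumed here.
    idfSketch :: T.Text -> [[T.Text]] -> Double
    idfSketch term docs =
      let n        = fromIntegral (length docs)
          withTerm = fromIntegral . length $ filter (term `elem`) docs
      in log (n / (1 + withTerm))

    -- tf * idf for a term, given one document and the whole corpus (cf. tf_idf).
    tfIdfSketch :: T.Text -> [T.Text] -> [[T.Text]] -> Double
    tfIdfSketch term doc docs = tfSketch term doc * idfSketch term docs

    -- Cosine similarity of two term vectors, via dot product and magnitudes
    -- (cf. dotProd and magnitude).
    cosineSketch :: [Double] -> [Double] -> Double
    cosineSketch xs ys = dot xs ys / (magnitude xs * magnitude ys)
      where
        dot as bs    = sum (zipWith (*) as bs)
        magnitude vs = sqrt (sum (map (^ 2) vs))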
NLP.Extraction.Parsec

Extractor: A Parsec parser.

Example usage:

    > :set -XOverloadedStrings
    > import Text.Parsec.Prim
    > parse myExtractor "interactive repl" someTaggedSentence

posTok: Consume a token with the given POS Tag.

posPrefix: Consume a token with the specified POS prefix.

    >>> parse (posPrefix "n") "ghci" [("Bob", Tag "np")]
    Right [("Bob", Tag "np")]

matches: Text equality matching with optional case sensitivity.

txtTok: Consume a token with the given lexical representation.

anyToken: Consume any one non-empty token.

followedBy: Skips any number of fill tokens, ending with the end parser, and returning the last parsed result. This is useful when you know what you're looking for and (for instance) don't care what comes first.
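Since an Extractor is an ordinary Parsec parser, it composes with the usual Parsec combinators. The sketch below chains two of the primitives described above; the precise types of posPrefix and anyToken (and what they return) are assumptions based on the posPrefix example, so check the real signatures before using this.

    {-# LANGUAGE OverloadedStrings #-}
    import Text.Parsec.Prim (parse)
    import NLP.Extraction.Parsec (posPrefix, anyToken)
    import NLP.Types (Tag(..))

    -- Match a noun-ish token (POS prefix "n") followed by any other token.
    -- Assumed: each primitive consumes one token and returns what it matched.
    nounThenAnything = do
      noun <- posPrefix "n"
      next <- anyToken
      return (noun, next)

    demo = parse nounThenAnything "demo" [("Bob", Tag "np"), ("runs", Tag "vbz")]

For patterns where arbitrary filler precedes the token you actually want, followedBy (described above) is the more direct tool.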
NLP.Corpora.Parsing

readPOS: Read a POS-tagged corpus out of a Text string of the form "token/tag token/tag ...".

    >>> readPOS "Dear/jj Sirs/nns :/: Let/vb"
    [("Dear",JJ),("Sirs",NNS),(":",Other ":"),("Let",VB)]

safeInit: Returns all but the last element of a string, unless the string is empty, in which case it returns that string.

NLP.POS.AvgPerceptronTagger

mkTagger: Create an Averaged Perceptron Tagger using the specified back-off tagger as a fall-back, if one is specified. This uses a tokenizer adapted from the tokenize package for a tokenizer, and Erik Kow's fullstop sentence segmenter (http://hackage.haskell.org/package/fullstop) as a sentence splitter.

trainNew: Train a new Perceptron.

The training corpus should be a collection of sentences, one sentence on each line, and with each token tagged with a part of speech.

For example, the input:

    "The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./."

defines two training sentences.

    >>> tagger <- trainNew "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
    >>> tag tagger $ map T.words $ T.lines "Dear sir"
    "Dear/jj Sirs/nns :/: Let/vb"

trainOnFiles: Train a new Perceptron on a corpus of files.

train: Add training examples to a perceptron.

    >>> tagger <- train emptyPerceptron "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
    >>> tag tagger $ map T.words $ T.lines "Dear sir"
    "Dear/jj Sirs/nns :/: Let/vb"

If you're using multiple input files, this can be useful to improve performance (by folding over the files). For example, see trainOnFiles.

startToks: Start markers to ensure all features in context are valid, even for the first "real" tokens.

endToks: End markers to ensure all features are valid, even for the last "real" tokens.

tag: Tag a document (represented as a list of Sentences) with a trained Perceptron.

Ported from Python:

    def tag(self, corpus, tokenize=True):
        '''Tags a string `corpus`.'''
        # Assume untokenized corpus has \n between sentences and ' ' between words
        s_split = nltk.sent_tokenize if tokenize else lambda t: t.split('\n')
        w_split = nltk.word_tokenize if tokenize else lambda s: s.split()

        def split_sents(corpus):
            for s in s_split(corpus):
                yield w_split(s)

        prev, prev2 = self.START
        tokens = []
        for words in split_sents(corpus):
            context = self.START + [self._normalize(w) for w in words] + self.END
            for i, word in enumerate(words):
                tag = self.tagdict.get(word)
                if not tag:
                    features = self._get_features(i, word, context, prev, prev2)
                    tag = self.model.predict(features)
                tokens.append((word, tag))
                prev2 = prev
                prev = tag
        return tokens

tagSentence: Tag a single sentence.

trainInt: Train a model from sentences.

Ported from Python:

    def train(self, sentences, save_loc=None, nr_iter=5):
        self._make_tagdict(sentences)
        self.model.classes = self.classes
        prev, prev2 = START
        for iter_ in range(nr_iter):
            c = 0
            n = 0
            for words, tags in sentences:
                context = START + [self._normalize(w) for w in words] + END
                for i, word in enumerate(words):
                    guess = self.tagdict.get(word)
                    if not guess:
                        feats = self._get_features(i, word, context, prev, prev2)
                        guess = self.model.predict(feats)
                        self.model.update(tags[i], guess, feats)
                    prev2 = prev; prev = guess
                    c += guess == tags[i]
                    n += 1
            random.shuffle(sentences)
            logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
        self.model.average_weights()
        # Pickle as a binary file
        if save_loc is not None:
            pickle.dump((self.model.weights, self.tagdict, self.classes),
                        open(save_loc, 'wb'), -1)
        return None

trainSentence: Train on one sentence.

Adapted from this portion of the Python train method:

    context = START + [self._normalize(w) for w in words] + END
    for i, word in enumerate(words):
        guess = self.tagdict.get(word)
        if not guess:
            feats = self._get_features(i, word, context, prev, prev2)
            guess = self.model.predict(feats)
            self.model.update(tags[i], guess, feats)
        prev2 = prev; prev = guess
        c += guess == tags[i]
        n += 1

predictPos: Predict a part of speech, defaulting to the Unk tag if no classification is found.

getFeatures: Default feature set, ported from the following Python:

    def _get_features(self, i, word, context, prev, prev2):
        '''Map tokens into a feature representation, implemented as a
        {hashable: float} dict. If the features change, a new model must
        be trained.
        '''
        def add(name, *args):
            features[' '.join((name,) + tuple(args))] += 1

        i += len(self.START)
        features = defaultdict(int)
        # It's useful to have a constant feature, which acts sort of like a prior
        add('bias')
        add('i suffix', word[-3:])
        add('i pref1', word[0])
        add('i-1 tag', prev)
        add('i-2 tag', prev2)
        add('i tag+i-2 tag', prev, prev2)
        add('i word', context[i])
        add('i-1 tag+i word', prev, context[i])
        add('i-1 word', context[i-1])
        add('i-1 suffix', context[i-1][-3:])
        add('i-2 word', context[i-2])
        add('i+1 word', context[i+1])
        add('i+1 suffix', context[i+1][-3:])
        add('i+2 word', context[i+2])
        return features

Parameters: train takes the initial model and the training data, formatted with one sentence per line and standard POS tags after each space-delimited token. trainInt takes the number of times to iterate over the training data, randomly shuffling after each iteration (5 is a reasonable choice), the Perceptron to train, and the training data (a list of [(Text, Tag)]'s); it returns a trained perceptron. IO is needed for randomization.
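The "folding over the files" idea mentioned for train can be sketched as follows, assuming train has the Perceptron -> Text -> IO Perceptron shape suggested by its example above; trainOnFiles is the library's own packaged version of this pattern.

    import           Control.Monad (foldM)
    import qualified Data.Text.IO as TIO
    import           NLP.POS.AvgPerceptron (Perceptron, emptyPerceptron)
    import           NLP.POS.AvgPerceptronTagger (train)

    -- Thread one perceptron through a list of POS-tagged corpus files,
    -- adding the examples from each file in turn. (The signature of train
    -- is an assumption here; see the note above.)
    trainFiles :: [FilePath] -> IO Perceptron
    trainFiles = foldM step emptyPerceptron
      where
        step model path = do
          corpus <- TIO.readFile path
          train model corpus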
NLP.POS

taggerTable: The default table of tagger IDs to readTagger functions. Each tagger packaged with Chatter should have an entry here. By convention, the IDs used are the fully qualified module name of the tagger package.

saveTagger: Store a POSTagger to a file.

loadTagger: Load a tagger, using the internal taggerTable. If you need to specify your own mappings for new composite taggers, you should use deserialize. This function checks the filename to determine if the content should be decompressed: if the file ends with ".gz", then we assume it is a gzipped model.

tag: Tag a chunk of input text with part-of-speech tags, using the sentence splitter, tokenizer, and tagger contained in the POSTagger.

combine: Combine the results of POS taggers, using the second param to fill in tagUNK entries, where possible.

pickTag: Returns the first param, unless it is tagged tagUNK. Throws an error if the text does not match.

tagStr: Tag the tokens in a string. Returns a space-separated string of tokens, each token suffixed with the part of speech. For example:

    >>> tagStr tagger "the dog jumped ."
    "the/at dog/nn jumped/vbd ./."

tagText: Text version of tagStr.

trainStr: Train a tagger on string input in the standard form for POS-tagged corpora:

    >>> trainStr tagger "the/at dog/nn jumped/vbd ./."

trainText: The Text version of trainStr.

train: Train a POSTagger on a corpus of sentences.

This will recurse through the POSTagger stack, training all the backoff taggers as well. In order to do that, this function has to be generic to the kind of taggers used, so it is not possible to train up a new POSTagger from nothing: train wouldn't know what tagger to create.

To get around that restriction, you can use the various mkTagger implementations, such as NLP.POS.LiteralTagger.mkTagger or NLP.POS.AvgPerceptronTagger.mkTagger. For example:

    import NLP.POS.AvgPerceptronTagger as APT

    let newTagger = APT.mkTagger APT.emptyPerceptron Nothing
    posTgr <- train newTagger trainingExamples

eval: Evaluate a POSTagger. Measures accuracy over all tags in the test corpus. Accuracy is calculated as |tokens tagged correctly| / |all tokens|.
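As a final end-to-end sketch, the snippet below loads the tagger distributed with chatter and tags a short string. It assumes defaultTagger is an IO action yielding a POSTagger and that tagStr takes the tagger and a String, as the example above suggests.

    import NLP.POS (defaultTagger, tagStr)

    main :: IO ()
    main = do
      tagger <- defaultTagger   -- assumed: loads the model shipped with chatter
      putStrLn (tagStr tagger "the dog jumped .")
      -- expected shape of the output: "the/at dog/nn jumped/vbd ./."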