chatter-0.0.0.4

Data.DefaultMap

DefaultMap: a defaulting Map; a Map that returns a default value when queried for a key that does not exist.

empty: Create an empty DefaultMap.

lookup: Query the map for a value. Returns the default if the key is not found.

fromList: Create a DefaultMap from a default value and a list.

keys: Access the keys as a list.

foldl: Fold over the values in the map. Note that this does *not* fold over the default value -- this fold behaves in the same way as a standard Data.Map foldl.

A Serialize instance is also provided for DefaultMap.

NLP.Tokenize

The EitherList is a newtype-wrapped list of Eithers.

A Tokenizer is a function which takes a list and returns a list of Eithers (wrapped in a newtype). Right Strings will be passed on for processing to tokenizers down the pipeline. Left Strings will be passed through the pipeline unchanged. Use a Left String in a tokenizer to protect certain tokens from further processing (e.g. see the uris tokenizer).

tokenize: Split a string into words using the default tokenizer pipeline.

run: Run a tokenizer.

uris: Detect common URIs and freeze them.

punctuation: Split off initial and final punctuation.

finalPunctuation: Split off word-final punctuation.

initialPunctuation: Split off word-initial punctuation.

negatives: Split words ending in n't, and freeze n't.

contractions: Split common contractions off and freeze them. Currently deals with: 'm, 's, 'd, 've, 'll.

whitespace: Split a string on whitespace. This is just a wrapper for Data.List.words.
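Because EitherList has a Monad instance, tokenizers can be chained with Kleisli composition. The following is a minimal sketch, assuming the String-based types of the adapted tokenize package; the particular pipeline shown here is illustrative and is not the package's defaultTokenizer.

    import Control.Monad ((>=>))
    import NLP.Tokenize  (run, whitespace, uris, punctuation)

    -- Whitespace-split first, then freeze URIs (as Lefts), then split
    -- punctuation off everything that was not frozen.
    myTokenize :: String -> [String]
    myTokenize = run (whitespace >=> uris >=> punctuation)

    -- e.g. myTokenize "Go to http://example.com now."
    -- should split the final "." off "now." while the URI token,
    -- having been frozen by uris, is passed through untouched.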
NLP.Corpora.Email

plugDataPath: Path to the directory containing all the PLUG archives.

NLP.Types

Corpus: Document corpus. This is a simple hashed corpus; the document content is not stored.

corpLength: The number of documents in the corpus.

corpTermCounts: A count of the number of documents each term occurred in.

POSTagger: Part-of-speech tagger, with back-off tagger.

A sequence of POS taggers can be assembled by using backoff taggers. When tagging text, the first tagger is run on the input, possibly tagging some tokens as unknown ('Tag Unk'). The first backoff tagger is then recursively invoked on the text to fill in the unknown tags, but that may still leave some tokens marked with 'Tag Unk'. This process repeats until no more taggers are found. (The current implementation is not very efficient in this respect.)

Back-off taggers are particularly useful when there is a set of domain-specific vernacular that a general-purpose statistical tagger does not know of. A LiteralTagger can be created to map terms to fixed POS tags, and then delegate the bulk of the text to a statistical back-off tagger, such as an AvgPerceptronTagger.

POSTagger values can be serialized and deserialized by using NLP.POS.serialize and NLP.POS.deserialize. This is a bit tricky because the POSTagger abstracts away the implementation details of the particular tagging algorithm, and the model for that tagger (if any). To support serialization, each POSTagger value must provide a serialize value that can be used to generate a ByteString representation of the model, as well as a unique id (also a ByteString). Furthermore, that ID must be added to a Map ByteString (ByteString -> Maybe POSTagger -> Either String POSTagger) that is provided to deserialize. The function in the map takes the output of posSerialize, and possibly a backoff tagger, and reconstitutes the POSTagger that was serialized (assigning the proper functions, setting up closures as needed, etc.). Look at the source for NLP.POS.LiteralTagger and NLP.POS.UnambiguousTagger for examples.

The POSTagger fields:

posTagger: The initial part-of-speech tagger.

posTrainer: Training function to train the immediate POS tagger.

posBackoff: A tagger to invoke on unknown tokens.

posTokenizer: A tokenizer; tokenize will work.

posSplitter: A sentence splitter. If your input is formatted as one sentence per line, then use Data.Text.lines; otherwise try Erik Kow's fullstop library.

posSerialize: Store this POS tagger to a ByteString. This does not serialize the backoff taggers.

posID: A unique id that will identify the algorithm used for this POS tagger. This is used in deserialization.

stripTags: Remove the tags from a tagged sentence.

tagUNK: Constant tag for "unknown".

termCounts: Get the number of documents that a term occurred in.

addDocument: Add a document to the corpus. This can be dangerous if the documents are pre-processed differently. All corpus-related functions assume that the documents have all been tokenized and the tokens normalized, in the same way.

mkCorpus: Create a corpus from a list of documents, represented by normalized tokens.

NLP.POS.LiteralTagger

mkTagger: Create a Literal Tagger using the specified back-off tagger as a fall-back, if one is specified. This uses a tokenizer adapted from the tokenize package, and Erik Kow's fullstop sentence segmenter as a sentence splitter.

NLP.POS.UnambiguousTagger

mkTagger: Create an unambiguous tagger, using the supplied Map as a source of tags.

train: Trainer method for unambiguous taggers.

NLP.POS.AvgPerceptron

Perceptron: The perceptron model.

weights: Each feature gets its own weight vector, so weights is a dict-of-dicts.

totals: The accumulated values, for the averaging. These will be keyed by feature/class tuples.

tstamps: The last time the feature was changed, for the averaging. Also keyed by feature/class tuples. (tstamps is short for timestamps.)

instances: Number of instances seen.

Weight: Typedef for doubles, to make the code easier to read and to make this simple to change if necessary.

Class: The classes that the perceptron assigns are represented with a newtype-wrapped String. Eventually, I think this should become a typeclass, so the classes can be defined by the users of the Perceptron (such as custom POS tag ADTs, or more complex classes).

emptyPerceptron: An empty perceptron, used to start training.

predict: Predict a class given a feature vector. Ported from Python:

    def predict(self, features):
        '''Dot-product the features and current weights and return the best label.'''
        scores = defaultdict(float)
        for feat, value in features.items():
            if feat not in self.weights or value == 0:
                continue
            weights = self.weights[feat]
            for label, weight in weights.items():
                scores[label] += value * weight
        # Do a secondary alphabetic sort, for stability
        return max(self.classes, key=lambda label: (scores[label], label))

update: Update the perceptron with a new example. Ported from Python:

    def update(self, truth, guess, features):
        '''Update the feature weights.'''
        def upd_feat(c, f, w, v):
            param = (f, c)
            self._totals[param] += (self.i - self._tstamps[param]) * w
            self._tstamps[param] = self.i
            self.weights[f][c] = w + v

        self.i += 1
        if truth == guess:
            return None
        for f in features:
            weights = self.weights.setdefault(f, {})  # setdefault is Map.findWithDefault, and destructive.
            upd_feat(truth, f, weights.get(truth, 0.0), 1.0)
            upd_feat(guess, f, weights.get(guess, 0.0), -1.0)
        return None
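For readers following the port, here is a rough Haskell rendering of the predict step above. It uses plain Strings and Data.Map in place of chatter's newtype-wrapped Feature, Class and Weight types, so all names and types here are simplified stand-ins, not the library's internals.

    import qualified Data.Map.Strict as Map
    import           Data.List       (maximumBy)
    import           Data.Ord        (comparing)

    -- Simplified stand-ins for chatter's Feature, Class and Weight types.
    type Feat   = String
    type Label  = String
    type Weight = Double

    -- Dot-product the feature vector against the per-feature weight maps
    -- and return the best label, breaking ties alphabetically (the same
    -- "secondary alphabetic sort" as the Python above).  The class list
    -- is assumed to be non-empty.
    predictSketch :: Map.Map Feat (Map.Map Label Weight)  -- weights
                  -> [Label]                              -- known classes
                  -> Map.Map Feat Weight                  -- features
                  -> Label
    predictSketch ws classes feats = maximumBy (comparing rank) classes
      where
        rank label = (Map.findWithDefault 0 label scores, label)
        scores = Map.foldrWithKey addFeat Map.empty feats
        addFeat feat value acc
          | value == 0 = acc
          | otherwise  =
              case Map.lookup feat ws of
                Nothing       -> acc
                Just labelWts -> Map.foldrWithKey
                                   (\lbl w m -> Map.insertWith (+) lbl (value * w) m)
                                   acc labelWts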
averageWeights: Average the weights. Ported from Python:

    def average_weights(self):
        for feat, weights in self.weights.items():
            new_feat_weights = {}
            for clas, weight in weights.items():
                param = (feat, clas)
                total = self._totals[param]
                total += (self.i - self._tstamps[param]) * weight
                averaged = round(total / float(self.i), 3)
                if averaged:
                    new_feat_weights[clas] = averaged
            self.weights[feat] = new_feat_weights
        return None

roundTo: Round a fractional number to a specified decimal place.

>>> roundTo 2 3.1459
3.15

NLP.Similarity.VectorSim

TermVector: An efficient (ish) representation for documents in the bag-of-words sense.

mkVector: Generate a TermVector from a tokenized document.

sim: Invokes similarity on full strings, using tokenize for tokenization, and no stemming. There *must* be at least one document in the corpus.

similarity: Determine how similar two documents are. This function assumes that each document has been tokenized and (if desired) stemmed/case-normalized. It is a wrapper around tvSim, which is a *much* more efficient implementation. If you need to run similarity against any single document more than once, then you should create TermVectors for each of your documents and use tvSim instead of similarity. There *must* be at least one document in the corpus.

tvSim: Determine how similar two documents are. Calculates the similarity between two documents, represented as TermVectors.

tf: Return the raw frequency of a term in a body of text. The first argument is the term to find, the second is a tokenized document. This function does not do any stemming or additional text modification.

idf: Calculate the inverse document frequency. The IDF is, roughly speaking, a measure of how popular a term is.

tf_idf: Calculate the tf*idf measure for a term given a document and a corpus.

magnitude: Calculate the magnitude of a vector.

dotProd: Find the dot product of two vectors.
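To make the arithmetic behind these functions concrete, here is a small self-contained sketch of term frequency, a smoothed inverse document frequency, and cosine similarity over plain lists and Maps. The add-one smoothing in idfSketch is an illustrative choice, not necessarily the formula chatter uses, and the types are deliberately simpler than Corpus and TermVector.

    import qualified Data.Map.Strict as Map

    -- Raw term frequency: how many times a term occurs in a tokenized document.
    tfSketch :: Eq a => a -> [a] -> Int
    tfSketch term doc = length (filter (== term) doc)

    -- Inverse document frequency with add-one smoothing (an assumption, see
    -- the note above): log of corpus size over documents containing the term.
    idfSketch :: Eq a => a -> [[a]] -> Double
    idfSketch term docs = log (total / (1 + containing))
      where
        total      = fromIntegral (length docs)
        containing = fromIntegral (length (filter (term `elem`) docs))

    -- Cosine similarity of two term vectors: dot product over the product of
    -- magnitudes.  Both vectors are assumed to be non-zero.
    cosineSketch :: Ord k => Map.Map k Double -> Map.Map k Double -> Double
    cosineSketch a b = dot a b / (mag a * mag b)
      where
        dot x y = sum (Map.elems (Map.intersectionWith (*) x y))
        mag v   = sqrt (sum (map (^ 2) (Map.elems v)))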
NLP.Corpora.Parsing

readPOS: Read a POS-tagged corpus out of a Text string of the form "token/tag token/tag ...":

>>> readPOS "Dear/jj Sirs/nns :/: Let/vb"
[("Dear",JJ),("Sirs",NNS),(":",Other ":"),("Let",VB)]

safeInit: Returns all but the last element of a string, unless the string is empty, in which case it returns that string.

NLP.POS.AvgPerceptronTagger

mkTagger: Create an Averaged Perceptron Tagger using the specified back-off tagger as a fall-back, if one is specified. This uses a tokenizer adapted from the tokenize package, and Erik Kow's fullstop sentence segmenter (http://hackage.haskell.org/package/fullstop) as a sentence splitter.

trainNew: Train a new Perceptron. The training corpus should be a collection of sentences, one sentence on each line, and with each token tagged with a part of speech. For example, the input

    "The/DT dog/NN jumped/VB ./.\nThe/DT cat/NN slept/VB ./."

defines two training sentences.

>>> tagger <- trainNew "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear sir"
"Dear/jj Sirs/nns :/: Let/vb"

trainOnFiles: Train a new Perceptron on a corpus of files.

train: Add training examples to a perceptron.

>>> tagger <- train emptyPerceptron "Dear/jj Sirs/nns :/: Let/vb\nUs/nn begin/vb\n"
>>> tag tagger $ map T.words $ T.lines "Dear sir"
"Dear/jj Sirs/nns :/: Let/vb"

If you're using multiple input files, this can be useful to improve performance (by folding over the files). For example, see trainOnFiles.

startToks: Start markers to ensure all features in context are valid, even for the first real tokens.

endToks: End markers to ensure all features are valid, even for the last real tokens.

tag: Tag a document (represented as a list of sentences) with a trained Perceptron. Ported from Python:

    def tag(self, corpus, tokenize=True):
        '''Tags a string `corpus`.'''
        # Assume untokenized corpus has \n between sentences and ' ' between words
        s_split = nltk.sent_tokenize if tokenize else lambda t: t.split('\n')
        w_split = nltk.word_tokenize if tokenize else lambda s: s.split()
        def split_sents(corpus):
            for s in s_split(corpus):
                yield w_split(s)

        prev, prev2 = self.START
        tokens = []
        for words in split_sents(corpus):
            context = self.START + [self._normalize(w) for w in words] + self.END
            for i, word in enumerate(words):
                tag = self.tagdict.get(word)
                if not tag:
                    features = self._get_features(i, word, context, prev, prev2)
                    tag = self.model.predict(features)
                tokens.append((word, tag))
                prev2 = prev
                prev = tag
        return tokens

tagSentence: Tag a single sentence.

The internal training loop (train a model from sentences) is ported from Python:

    def train(self, sentences, save_loc=None, nr_iter=5):
        self._make_tagdict(sentences)
        self.model.classes = self.classes
        prev, prev2 = START
        for iter_ in range(nr_iter):
            c = 0
            n = 0
            for words, tags in sentences:
                context = START + [self._normalize(w) for w in words] + END
                for i, word in enumerate(words):
                    guess = self.tagdict.get(word)
                    if not guess:
                        feats = self._get_features(i, word, context, prev, prev2)
                        guess = self.model.predict(feats)
                        self.model.update(tags[i], guess, feats)
                    prev2 = prev; prev = guess
                    c += guess == tags[i]
                    n += 1
            random.shuffle(sentences)
            logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
        self.model.average_weights()
        # Pickle as a binary file
        if save_loc is not None:
            pickle.dump((self.model.weights, self.tagdict, self.classes),
                        open(save_loc, 'wb'), -1)
        return None

trainSentence: Train on one sentence. Adapted from this portion of the Python train method:

    context = START + [self._normalize(w) for w in words] + END
    for i, word in enumerate(words):
        guess = self.tagdict.get(word)
        if not guess:
            feats = self._get_features(i, word, context, prev, prev2)
            guess = self.model.predict(feats)
            self.model.update(tags[i], guess, feats)
        prev2 = prev; prev = guess
        c += guess == tags[i]
        n += 1

predictPos: Predict a part of speech, defaulting to the Unk tag if no classification is found.

getFeatures: Default feature set.

    def _get_features(self, i, word, context, prev, prev2):
        '''Map tokens into a feature representation, implemented as a
        {hashable: float} dict. If the features change, a new model must be
        trained.
        '''
        def add(name, *args):
            features[' '.join((name,) + tuple(args))] += 1

        i += len(self.START)
        features = defaultdict(int)
        # It's useful to have a constant feature, which acts sort of like a prior
        add('bias')
        add('i suffix', word[-3:])
        add('i pref1', word[0])
        add('i-1 tag', prev)
        add('i-2 tag', prev2)
        add('i tag+i-2 tag', prev, prev2)
        add('i word', context[i])
        add('i-1 tag+i word', prev, context[i])
        add('i-1 word', context[i-1])
        add('i-1 suffix', context[i-1][-3:])
        add('i-2 word', context[i-2])
        add('i+1 word', context[i+1])
        add('i+1 suffix', context[i+1][-3:])
        add('i+2 word', context[i+2])
        return features

Argument documentation for the training functions: the initial model; the training data, formatted with one sentence per line and standard POS tags after each space-delimited token; the number of times to iterate over the training data, randomly shuffling after each iteration (5 is a reasonable choice); the Perceptron to train; the training data (a list of [(Text, Tag)]s). The result is a trained perceptron; IO is needed for randomization.
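The Haskell port builds the same kind of string-keyed features. The following is a simplified illustration (a hypothetical helper, plain Strings, and only a subset of the features above); it assumes the context list has already been padded with the start/end markers so the index arithmetic is safe, as in the Python `i += len(START)`.

    -- Build a few of the contextual features for the token at position i.
    -- `context` is assumed to be padded with start/end markers, and i is an
    -- index into that padded list.
    featureSketch :: Int -> String -> [String] -> String -> String -> [String]
    featureSketch i word context prev prev2 =
      [ "bias"
      , "i suffix "  ++ suffix3 word
      , "i pref1 "   ++ take 1 word
      , "i-1 tag "   ++ prev
      , "i-2 tag "   ++ prev2
      , "i word "    ++ context !! i
      , "i-1 word "  ++ context !! (i - 1)
      , "i+1 word "  ++ context !! (i + 1)
      ]
      where
        suffix3 s = drop (length s - 3) s   -- last three characters (or fewer)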
NLP.POS

taggerTable: The default table of tagger IDs to readTagger functions. Each tagger packaged with Chatter should have an entry here. By convention, the IDs used are the fully qualified module name of the tagger package.

saveTagger: Store a POSTagger to a file.

loadTagger: Load a tagger, using the internal taggerTable. If you need to specify your own mappings for new composite taggers, you should use deserialize. This function checks the filename to determine if the content should be decompressed: if the file ends with ".gz", then we assume it is a gzipped model. (A round-trip sketch follows the trainText entry below.)

tag: Tag a chunk of input text with part-of-speech tags, using the sentence splitter, tokenizer, and tagger contained in the POSTagger.

The internal combining step merges the results of POS taggers, using the second parameter to fill in tagUNK entries where possible. It returns the first parameter, unless it is tagged tagUNK, and throws an error if the text does not match.

tagStr: Tag the tokens in a string. Returns a space-separated string of tokens, each token suffixed with the part of speech. For example:

>>> tagStr tagger "the dog jumped ."
"the/at dog/nn jumped/vbd ./."

tagText: Text version of tagStr.

trainStr: Train a tagger on string input in the standard form for POS-tagged corpora:

    trainStr tagger "the/at dog/nn jumped/vbd ./."

trainText: The Text version of trainStr.
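As a usage sketch of the serialization entry points above: load the packaged tagger, save it, reload it, and tag some text. The function names come from this module's exports, but the exact signatures (defaultTagger :: IO POSTagger, saveTagger taking the tagger before the file path, tagText :: POSTagger -> Text -> Text) are assumptions here, not a copy of chatter's API.

    {-# LANGUAGE OverloadedStrings #-}
    import NLP.POS (defaultTagger, saveTagger, loadTagger, tagText)
    import qualified Data.Text.IO as T

    -- Round-trip a tagger through a file and use the reloaded copy.
    -- The argument order of saveTagger is assumed (tagger, then path).
    main :: IO ()
    main = do
      tgr  <- defaultTagger
      saveTagger tgr "/tmp/chatter-default.model"
      tgr' <- loadTagger "/tmp/chatter-default.model"
      T.putStrLn (tagText tgr' "The dog jumped .")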
train: Train a POSTagger on a corpus of sentences. This will recurse through the POSTagger stack, training all the backoff taggers as well. In order to do that, this function has to be generic to the kind of taggers used, so it is not possible to train up a new POSTagger from nothing: train wouldn't know what tagger to create. To get around that restriction, you can use the various mkTagger implementations, such as NLP.POS.LiteralTagger.mkTagger or NLP.POS.AvgPerceptronTagger.mkTagger. For example:

    import NLP.POS.AvgPerceptronTagger as APT

    let newTagger = APT.mkTagger APT.emptyPerceptron Nothing
    posTgr <- train newTagger trainingExamples

eval: Evaluate a POSTagger. Measures accuracy over all tags in the test corpus. Accuracy is calculated as:

    |tokens tagged correctly| / |all tokens|
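For reference, the accuracy measure above amounts to the following small calculation, written here over plain token/tag pair lists rather than chatter's TaggedSentence type (a hand-rolled sketch, not the library's eval).

    -- Fraction of tokens whose predicted tag matches the gold tag.  Both
    -- arguments are lists of sentences, each a list of (token, tag), and
    -- are assumed to be non-empty and token-aligned.
    accuracySketch :: Eq tag => [[(String, tag)]] -> [[(String, tag)]] -> Double
    accuracySketch golds preds = fromIntegral correct / fromIntegral total
      where
        pairs   = zip (concat golds) (concat preds)
        total   = length pairs
        correct = length [ () | ((_, g), (_, p)) <- pairs, g == p ]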