Copyright  © 20162017 George Steel and Peter Jurgec 

License  GPL2+ 
Maintainer  george.steel@gmail.com 
Safe Haskell  None 
Language  Haskell2010 
Library for using DFAs to represent maxent grammars. A mexent grammar consists of a set of constraints, each of which is given a weight, which define a probability diatribution over the set of strings of each given length.
The relative probability (maxent score) of each string is equal to the negative exponential of the the total weight of the violated constraints. In this module, such a grammar is reperesented by a DFST
which can count violations and a Vec
of weights.
This module is mainly concerned with calculating probabilities of samples of text and finding the optimal weights to maximize that probability. There are also functions to randomly generate text using the distribution implied by a mexent grammar.
 type Length = Int
 data Lexicon sigma = Lex {
 totalWords :: Int
 lengthFreqs :: Array Length Int
 wordFreqs :: [([sigma], Int)]
 sortLexicon :: Ord sigma => [([sigma], Int)] > Lexicon sigma
 lengthCdf :: Lexicon sigma > Cdf Length
 lengthPdf :: Lexicon sigma > [(Length, Double)]
 maxentProb :: Vec > Multicount > Double
 lexLogProbTotalDeriv :: Ix sigma => MulticountDFST sigma > Array Length Int > Vec > Vec > (Double, Vec)
 lexLogProbPartialDeriv :: Ix sigma => MulticountDFST sigma > Array Length Int > Vec > Vec > Vec > Double
 llpOptimizeWeights :: Ix sigma => Array Length Int > PackedText sigma > MulticountDFST sigma > Vec > Vec
 sampleWord :: forall g sigma m. (RandomGen g, Ix sigma, MonadState g m) => DFST Int sigma Double > Length > Length > m [sigma]
 sampleWordSalad :: (RandomGen g, Ix sigma, MonadState g m) => DFST Int sigma Double > Cdf Length > Int > m [[sigma]]
Documentation
Returns the probability (as a logarithm) of a lexicon with aand associated length distribution.
Lex  

sortLexicon :: Ord sigma => [([sigma], Int)] > Lexicon sigma Source #
Convert jumbled list of words and frequencies to sorted lexicon.
lengthCdf :: Lexicon sigma > Cdf Length Source #
Retrieve length distribution as a Cdf
for sampling.
lengthPdf :: Lexicon sigma > [(Length, Double)] Source #
Retrieve length distribution as a normalized probability mass function. Probabilities add up to 1.
maxentProb :: Vec > Multicount > Double Source #
Apply weights to violation counts to get a relative probability.
:: Ix sigma  
=> MulticountDFST sigma  DFST counting constraint violations 
> Array Length Int  Length distribution of lexicon 
> Vec  Observed violations in lexicon 
> Vec  Weights to give constraints 
> (Double, Vec)  Probability and its derivative w.r.t. the weights 
For a given set of consteraints (reperesented by a DFST counting violations), lexicon (reprersented as length distribution and total violation count, which should be precomputed), and weight vector, returns the absolute probability (as a negative logarithm) and its derivative with respect to the weight vector.
Minimize this to find the optimal weights. To prevent overfitting, this function includes an exponential (L₁) prior equivalent to each constraint being violated once for existing. This intentionally differs from Hayes and Wilson since their gaussian (L₂²) prior had a strong preference for as many simillar constraints as possible as opposed to a single constraint. The exponential prior was chosen since it is independent of splitting constraints into duplicates with the weight distributed between them.
lexLogProbPartialDeriv :: Ix sigma => MulticountDFST sigma > Array Length Int > Vec > Vec > Vec > Double Source #
Compute partial derivative of lexicon probability. Much faster equivalent of
lexLogProbPartialDeriv ctr lengths oviols weights dir = dir `innerProd` snd (lexLogProbTotalDeriv ctr lengths oviols weights)
llpOptimizeWeights :: Ix sigma => Array Length Int > PackedText sigma > MulticountDFST sigma > Vec > Vec Source #
Calculate weights to maximize probability of lexicon.
Takes starting position of search which MUST have the correct dimensionality (do not use zero
)
:: (RandomGen g, Ix sigma, MonadState g m)  
=> DFST Int sigma Double  Probability DFST 
> Length  Maximum length to greate generator fot 
> Length > m [sigma]  Random generator taking length and returning action. 
Returns a monadic action to sample random words from a probability transducer,
which may be generated from a violation counter with (
).
For efficiency, evaluate this once then sequence the action repeatedly as intermediate values will be memoized.fmap
(maxentProb
weights) ctr)