The maxent-learner-hw package

Provides an implementation of Hayes and Wilson's machine learning algorithm for maxent phonotactic grammars, as both a command-line tool and a function library. The learner takes in a lexicon and produces a list of weighted constraints penalizing certain sound sequences, aiming for a probability distribution over words that maximizes the probability of the lexicon. Once such a set of constraints is generated, it can be tested by using it to generate random pronounceable text.

This package is an implementation of the algorithm described in Hayes and Wilson's paper A Maximum Entropy Model of Phonotactics and Phonotactic Learning (available at http://www.linguistics.ucla.edu/people/hayes/Phonotactics/Index.htm).
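
The relationship between weighted constraints and word probabilities can be sketched as follows. In a maxent grammar, a word's score is the exponential of the negated weighted sum of its constraint violations, normalized over the candidate words. The Python sketch below is independent of this package's Haskell API; the constraints, their weights, and the restriction to fixed-length strings are illustrative assumptions, not output of the learner.

```python
import math
from itertools import product

# Hypothetical toy grammar: each constraint is (weight, banned bigram).
# The weights are made up for illustration, not learned from data.
CONSTRAINTS = [
    (2.0, ("t", "n")),  # penalize the sequence "tn"
    (1.0, ("n", "n")),  # penalize the sequence "nn"
]

def penalty(word):
    """Weighted sum of violation counts over all constraints."""
    bigrams = list(zip(word, word[1:]))
    return sum(weight * bigrams.count(bigram) for weight, bigram in CONSTRAINTS)

def distribution(segments, length):
    """Maxent distribution over all strings of one fixed length."""
    words = ["".join(p) for p in product(segments, repeat=length)]
    scores = {w: math.exp(-penalty(w)) for w in words}
    z = sum(scores.values())  # normalizing constant
    return {w: s / z for w, s in scores.items()}

probs = distribution("ant", 2)
```

Words that violate heavier constraints receive lower probability, so here "tn" is less likely than "nn", which in turn is less likely than an unpenalized word such as "an".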


Properties

Versions: 0.1.0, 0.1.0, 0.1.1, 0.1.2
Dependencies: array (>=0.3 && <0.6), base (>=4.7 && <5), containers (==0.5.*), csv (==0.1.*), deepseq (==1.4.*), file-embed, maxent-learner-hw, mtl (>=2.1 && <2.3), optparse-applicative, parallel (==3.2.*), random (==1.1), text (==1.2.*), vector (>=0.10)
License: GPL
Copyright: 2016 George Steel and Peter Jurgec
Author: George Steel
Maintainer: george.steel@gmail.com
Category: Linguistics
Home page: https://github.com/george-steel/maxent-learner
Source repository: head: git clone https://github.com/githubuser/maxent-learner-hw
Executables: phono-learner-hw
Uploaded: Thu Feb 16 03:54:02 UTC 2017 by gtsteel

Readme for maxent-learner-hw-0.1.0

Maxent Phonotactic Learner

A tool for automatically inferring phonotactic grammars from a lexicon and using those grammars to generate random text, based on Hayes and Wilson's A Maximum Entropy Model of Phonotactics and Phonotactic Learning. This package provides functionality both as a Haskell library and as a command line tool.

To compile this package, run stack build in the root of this repository. Run stack haddock to build the library documentation. The library may be useful if you wish to use a custom set of candidate constraints beyond the generators offered by the command line tool.

Command line usage

The command line tool (phono-learner-hw) has two commands: learn, which infers grammars, and gensalad, which generates random text using those grammars. The learn command takes the name of a lexicon file as an argument and outputs a grammar (note that this is quite slow). By default, the candidate constraints consist of single classes and bigrams; several more constraint types can be added with options. The gensalad command takes a grammar generated by learn and uses it to generate random text. Both commands also accept global options to write their final results to a file, to use a custom feature table for generating natural classes, and to control how text is divided into segments.

The command line works as follows:

phono-learner-hw COMMAND [-t|--featuretable CSVFILE] ([-c|--charsegs] | [-w|--wordsegs] | [--fierosegs]) [-n|--samples ARG] [-o|--output OUTFILE]

| Option | Description |
| --- | --- |
| -t, --featuretable CSVFILE | Use the features and segment list from a feature table in CSV format (a table for IPA is used by default). |
| -c, --charsegs | Use characters as segments (default). |
| -w, --wordsegs | Separate segments by spaces. |
| --fierosegs | Parse segments by repeatedly taking the longest possible match and use ' to break up unintended digraphs (used for Fiero orthography). |
| -n, --samples N | Number of samples to use for salad generation. |
| -o, --output OUTFILE | Record final output to OUTFILE as well as stdout. |
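
The longest-match segmentation described for --fierosegs can be sketched roughly as below. This is an illustration in Python rather than the package's own code: the function name, the example inventory, and the assumption that the ' character is never itself a segment are all hypothetical.

```python
def segment_longest_match(text, inventory, breaker="'"):
    """Split text into segments by repeatedly taking the longest match.

    Assumes inventory is a non-empty set of segment strings and that the
    breaker character only separates segments (it is not a segment itself).
    """
    longest = max(len(s) for s in inventory)
    segments, i = [], 0
    while i < len(text):
        if text[i] == breaker:
            i += 1  # skip the breaker; it only blocks an unintended digraph
            continue
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in inventory:
                segments.append(candidate)
                i += size
                break
        else:
            raise ValueError(f"no segment matches at position {i}: {text[i:]!r}")
    return segments

# Hypothetical inventory containing the digraphs "sh" and "aa".
inventory = {"s", "h", "sh", "a", "aa", "n"}
```

With this inventory, "shaan" is read with the digraphs sh and aa, while "s'haan" forces s and h to be separate segments. By comparison, character segmentation corresponds to `list(text)` and space-separated segmentation to `text.split()`.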

phono-learner-hw learn LEXICON [--thresholds THRESHOLDS] [-f|--freqs] [-e|--edges] [-3|--trigrams COREFEATURES] [-l|--longdistance SKIPFEATURES] [GLOBALOPTIONS]

| Option | Description |
| --- | --- |
| --thresholds THRESHOLDS | Thresholds to use for candidate selection (default is [0.01, 0.1, 0.2, 0.3]). |
| -f, --freqs | Lexicon file contains word frequencies. |
| -e, --edges | Allow constraints involving word boundaries. |
| -3, --trigrams COREFEATURES | Allow trigram constraints where at least one class uses a single one of the following features (space separated in quotes). |
| -l, --longdistance SKIPFEATURES | Allow constraints with two classes separated by a run of characters possibly restricted to all having one of the following features. |
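
To make the long-distance constraint type more concrete, the sketch below counts violations of a constraint made of two classes separated by a run of intervening segments. It is written in Python and does not use the package's API. The function name and the example classes are hypothetical, and treating an empty run (two adjacent segments) as a match is an assumption that the description above does not settle.

```python
def count_long_distance(word, first, second, skip=None):
    """Count pairs i < j with word[i] in first and word[j] in second,
    where every segment strictly between them belongs to skip.

    word is a list of segments; first, second and skip are sets of segments.
    skip=None means the intervening run is unrestricted.
    """
    count = 0
    for i, seg in enumerate(word):
        if seg not in first:
            continue
        for j in range(i + 1, len(word)):
            if word[j] in second:
                count += 1
            if skip is not None and word[j] not in skip:
                break  # the run cannot extend past a segment outside skip
    return count

# Hypothetical classes: two kinds of sibilant, with vowels as the skip class.
vowels = {"a", "i"}
```

For example, with first = {"s"} and second = {"ʃ"}, the word s a ʃ a has one violation when only vowels may intervene, whereas s a t a ʃ has none because t interrupts the run. With an unrestricted run, the second word has one violation.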

phono-learner-hw gensalad GRAMMAR [GLOBALOPTIONS]

Example usage

The following two commands calculate a grammar from Hayes and Wilson's Shona test data, using their selection of trigram restrictions, and then generate random text from it.

phono-learner-hw learn ShonaLearningData.txt -f -e -3 "syllabic consonantal sonorant" -t ShonaFeatures.csv -w -o ShonaGrammar.txt
phono-learner-hw gensalad ShonaGrammar.txt -t ShonaFeatures.csv -w -o ShonaSalad.txt

Feature Table Format

To use a feature table other than the default IPA one, you may define it in CSV format (RFC 4180). The segment names are defined by the first row (they may be any strings as long as they are all distinct, i.e. no duplicate names) and the feature names are defined by the first column (they are not hard-coded). Data cells should contain +, -, or 0 for binary features and + or 0 for privative features (where we do not want a minus set that could form classes).

As a simple example, consider the following CSV file, defining three segments (a, n, and t), and two features (vowel and nasal).

     ,a,n,t
vowel,+,-,-
nasal,0,+,-

If a row contains a different number of cells (separated by commas) than the header line, it is rejected as invalid and does not define a feature (and will not be displayed in the formatted feature table). If the CSV has duplicate segment names, no segments, or no valid features, the entire table is rejected (indicated by a red border around the text area; green is normal) and the last valid table is used and displayed.
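
The validation rules above can be sketched in Python as follows. This is not the package's parser: the function names are hypothetical, stripping whitespace around cells is an assumption (made so that the padded header in the example reads cleanly), and cell values are not checked against +, -, and 0.

```python
import csv
import io

def parse_feature_table(text):
    """Return {feature: {segment: value}}, or None if the table is rejected."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return None
    header = rows[0]
    # Assumption: surrounding whitespace in cells is not significant.
    segments = [cell.strip() for cell in header[1:]]
    if not segments or len(set(segments)) != len(segments):
        return None  # no segments, or duplicate segment names
    features = {}
    for row in rows[1:]:
        if len(row) != len(header):
            continue  # wrong cell count: this row defines no feature
        values = [cell.strip() for cell in row[1:]]
        features[row[0].strip()] = dict(zip(segments, values))
    return features or None  # a table with no valid features is rejected

def natural_class(table, feature, value):
    """Segments that carry the given value for the given feature."""
    return sorted(seg for seg, v in table[feature].items() if v == value)

example = "     ,a,n,t\nvowel,+,-,-\nnasal,0,+,-\n"
table = parse_feature_table(example)
```

On the example table, `natural_class(table, "vowel", "-")` picks out n and t, and `natural_class(table, "nasal", "+")` picks out n alone. A caller that wants the "last valid table" behaviour would keep its previous table whenever `parse_feature_table` returns None.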


Copyright © 2016-2017 George Steel and Peter Jurgec.

This project is supported by the University of Toronto Advancing Teaching and Learning in Arts and Science (ATLAS) grant to Peter Jurgec.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.