# Maxent Phonotactic Learner

A tool for automatically inferring phonotactic grammars from a lexicon and using those grammars to generate random text, based on Hayes and Wilson's [A Maximum Entropy Model of Phonotactics and Phonotactic Learning](http://www.linguistics.ucla.edu/people/hayes/Phonotactics/Index.htm).  This package provides functionality both as a Haskell library and as a command line tool.

To compile this package, run `stack build` in the root of this repository. Run `stack haddock` to build the library documentation. The library may be useful if you wish to use a custom set of candidate constraints beyond the generators offered by the command line tool.

## Command line usage

The command line tool (`phono-learner-hw`) has two commands: `learn`, which infers grammars, and `gensalad`, which generates random text using those grammars. The `learn` command takes the name of a lexicon file as an argument and outputs a grammar (note that this is quite slow). By default the candidate constraints consist of single classes and bigrams; several more constraint types can be added with options. The `gensalad` command takes a grammar generated by `learn` and uses it to generate random text. Both commands also accept global options to write their final results to a file, to use a custom feature table for generating natural classes, and to control how text is divided into segments.

The command line works as follows:

    phono-learner-hw COMMAND [-t|--featuretable CSVFILE] ([-c|--charsegs] | [-w|--wordsegs] | [--fierosegs]) [-n|--samples ARG] [-o|--output OUTFILE]


| Option | Description |
| --- | --- |
| -t, --featuretable *CSVFILE* | Use the features and segment list from a feature table in CSV format (a table for IPA is used by default). |
| -c, --charsegs             | Use characters as segments (default). |
| -w, --wordsegs             | Separate segments by spaces. |
| --fierosegs              | Parse segments by repeatedly taking the longest possible match and use ' to break up unintended digraphs (used for Fiero orthography). |
| -n, --samples *N*          | Number of samples to use for salad generation. |
| -o, --output *OUTFILE*       | Record final output to OUTFILE as well as stdout. |
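
The three segmentation modes can be illustrated with a minimal Haskell sketch. This is not the package's actual implementation: the function names (`charSegs`, `wordSegs`, `longestMatchSegs`) and the explicit inventory argument are hypothetical, and the handling of characters that match nothing in the inventory is an assumption.

```haskell
import Data.List (isPrefixOf, sortOn)
import Data.Ord (Down (..))

-- | Character segmentation: one segment per character.
charSegs :: String -> [String]
charSegs = map (: [])

-- | Word segmentation: segments are separated by spaces
-- (`words` also splits on other whitespace).
wordSegs :: String -> [String]
wordSegs = words

-- | Greedy longest-match segmentation against a known segment inventory.
-- An apostrophe is consumed as a break, so "n'g" reads as n + g even when
-- "ng" is in the inventory. Characters that match nothing become
-- single-character segments (an assumption of this sketch).
longestMatchSegs :: [String] -> String -> [String]
longestMatchSegs inventory = go
  where
    -- Try longer inventory entries first.
    byLength = sortOn (Down . length) (filter (not . null) inventory)
    go [] = []
    go ('\'' : rest) = go rest
    go input@(c : rest) =
      case filter (`isPrefixOf` input) byLength of
        (seg : _) -> seg : go (drop (length seg) input)
        [] -> [c] : go rest
```

With the inventory `["a", "n", "g", "ng"]`, `longestMatchSegs` turns `"ang"` into `["a", "ng"]` and `"an'g"` into `["a", "n", "g"]`.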

    phono-learner-hw learn LEXICON [--thresholds THRESHOLDS] [-f|--freqs] [-e|--edges] [-3|--trigrams COREFEATURES] [-l|--longdistance SKIPFEATURES] [GLOBALOPTIONS]

| Option | Description |
| --- | --- |
| --thresholds *THRESHOLDS* | Thresholds to use for candidate selection (default is `[0.01, 0.1, 0.2, 0.3]`). |
| -f, --freqs | Lexicon file contains word frequencies. |
| -e, --edges | Allow constraints involving word boundaries. |
| -3, --trigrams *COREFEATURES* | Allow trigram constraints where at least one class uses a single one of the following features (space separated in quotes). |
| -l, --longdistance *SKIPFEATURES* | Allow constraints with two classes separated by a run of characters, possibly restricted to all having one of the following features. |

    phono-learner-hw gensalad GRAMMAR [GLOBALOPTIONS]

### Example usage

The following two commands calculate a grammar from Hayes and Wilson's Shona test data, using their selection of trigram restrictions, and then generate random text from that grammar.


    phono-learner-hw learn ShonaLearningData.txt -f -e -3 "syllabic consonantal sonorant" -t ShonaFeatures.csv -w -o ShonaGrammar.txt
    phono-learner-hw gensalad ShonaGrammar.txt -t ShonaFeatures.csv -w -o ShonaSalad.txt


## Feature Table Format

To use a feature table other than the default IPA one, you may define it in CSV format (RFC 4180). The segment names are defined by the first row (they may be any strings, as long as they are all distinct) and the feature names are defined by the first column (they are not hard-coded). Data cells should contain `+`, `-`, or `0` for binary features and `+` or `0` for privative features (where we do not want a minus set that could form classes).

As a simple example, consider the following CSV file, defining three segments (a, n, and t), and two features (vowel and nasal).

         ,a,n,t
    vowel,+,-,-
    nasal,0,+,-
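
To make the link between cell values and natural classes concrete, here is a small Haskell sketch over the example table. The names `segments`, `table`, and `naturalClass` are hypothetical and not part of the library's API, and treating `0` as "in neither the plus nor the minus class" is this sketch's reading of the format.

```haskell
-- The example table: segment names from the header row, then one
-- (feature name, values) pair per feature row.
segments :: [String]
segments = ["a", "n", "t"]

table :: [(String, String)]
table =
  [ ("vowel", "+--")
  , ("nasal", "0+-")
  ]

-- | Segments carrying the given value ('+' or '-') for the named feature.
-- Segments marked '0' never match; an unknown feature gives an empty class.
naturalClass :: Char -> String -> [String]
naturalClass value feature =
  case lookup feature table of
    Nothing -> []
    Just values -> [seg | (seg, v) <- zip segments values, v == value]
```

Here `naturalClass '-' "vowel"` gives `["n", "t"]`, `naturalClass '+' "nasal"` gives `["n"]`, and `naturalClass '-' "nasal"` gives `["t"]`. The segment `a` belongs to neither nasal class because its cell is `0`.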

If a row contains a different number of cells (separated by commas) than the header line, it is rejected as invalid and does not define a feature (and will not be displayed in the formatted feature table). If the CSV entered has duplicate segment names, no segments, or no valid features, the entire table is rejected (indicated by a red border around the text area; green is normal) and the last valid table is used and displayed.
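
The validation rules above can be sketched in Haskell as follows. This is an illustration under stated assumptions, not the library's parser: `FeatureTable`, `splitCells`, and `parseFeatureTable` are hypothetical names, quoted CSV fields are not handled, blank lines are skipped, and cell values are not checked.

```haskell
import Data.Char (isSpace)
import Data.List (nub)

data FeatureTable = FeatureTable
  { tableSegments :: [String]
  , tableFeatures :: [(String, [String])]
  } deriving (Show, Eq)

-- | Split a line on commas and trim whitespace around each cell.
-- Quoted fields (RFC 4180) are deliberately not handled in this sketch.
splitCells :: String -> [String]
splitCells line = case break (== ',') line of
  (cell, []) -> [trim cell]
  (cell, _ : rest) -> trim cell : splitCells rest
  where
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- | Parse a feature table, dropping rows whose cell count differs from the
-- header and rejecting the whole table on duplicate segment names,
-- no segments, or no valid features.
parseFeatureTable :: String -> Either String FeatureTable
parseFeatureTable input =
  case map splitCells (filter (not . all isSpace) (lines input)) of
    [] -> Left "no header row"
    (header : rows)
      | null segs -> Left "no segments"
      | nub segs /= segs -> Left "duplicate segment names"
      | null feats -> Left "no valid features"
      | otherwise -> Right (FeatureTable segs feats)
      where
        -- The first header cell sits above the feature names and is ignored.
        segs = drop 1 header
        -- Keep only rows with exactly one value per segment.
        feats = [(name, vals) | (name : vals) <- rows, length vals == length segs]
```

Applied to the three-line example table, `parseFeatureTable` returns a `Right` value with segments `["a", "n", "t"]` and both features. A header such as `,a,a` yields `Left "duplicate segment names"`.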

---

Copyright © 2016-2017 George Steel and Peter Jurgec.

This project is supported by the University of Toronto Advancing Teaching and Learning in Arts and Science (ATLAS) grant to Peter Jurgec.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.