ngram: Ngram models for compressing and classifying text.


Please see the README on GitHub at https://github.com/TomLippincott/ngram#README.md


Properties

Versions: 0.1.0.0, 0.1.0.1
Change log: None available
Dependencies: base (>=4.7 && <5), bytestring (>=0.10.8.1), cereal (>=0.5.4.0), cereal-text (>=0.1.0.2), containers (>=0.5.10.2), ngram, optparse-generic (>=1.2.2), text (>=1.2.2), zlib (>=0.6.1)
License: BSD-3-Clause
Copyright: 2018 Tom Lippincott
Author: Tom Lippincott
Maintainer: tom@cs.jhu.edu
Category: natural-language-processing, machine-learning
Home page: https://github.com/TomLippincott/ngram#readme
Bug tracker: https://github.com/TomLippincott/ngram/issues
Source repository: head: git clone https://github.com/TomLippincott/ngram
Executables: ppm
Uploaded: Mon Aug 27 12:21:41 UTC 2018 by TomLippincott



Readme for ngram-0.1.0.0


NGrams

This is a code base for experimenting with various approaches to n-gram-based text modeling. To get started, run:

stack build
stack install

This will build and install the library and binary commands. Generally, the commands expect data to be text files where each line has the format:

${id}<TAB>${label}<TAB>${text}
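
For illustration, a tiny training file in this format can be written from the shell; the filename, document IDs, and "spam"/"ham" labels below are all made up:

```shell
# Create a hypothetical three-document training file.
# printf expands \t to a literal tab, matching the expected format.
printf 'doc1\tspam\tbuy cheap meds now\n'   > train.txt
printf 'doc2\tham\tmeeting moved to noon\n' >> train.txt
printf 'doc3\tspam\tyou have won a prize\n' >> train.txt
```

Each line carries a document identifier, a label, and the document text, separated by single tab characters.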

When a model is applied to data, the output will generally have a header with the format:

ID<TAB>GOLD<TAB>${label_1_name}<TAB>${label_2_name}<TAB>...

and lines with the corresponding format:

${doc_id}<TAB>${gold_label_name}<TAB>${label_1_prob}<TAB>${label_2_prob}<TAB>...

where probabilities are represented as natural logarithms.
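
Because the scores are natural logarithms, they can be mapped back to ordinary probabilities with exp. A quick sketch using awk on a made-up score line (document doc1, two labels):

```shell
# Hypothetical score line; columns 3 and up hold natural-log probabilities.
# awk's exp() converts each log score back to a probability.
printf 'doc1\tspam\t-0.2231\t-1.6094\n' \
  | awk -F'\t' '{printf "%.2f %.2f\n", exp($3), exp($4)}'
# prints: 0.80 0.20
```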

The remainder of this document describes the implemented models, most of which have a corresponding command installed by the steps above. The library aims to be parametric over sequence types, and most commands let users choose whether to operate on bytes, Unicode characters, or whitespace-delimited tokens.

Prediction by Partial Matching

PPM is essentially an n-gram model with a particular backoff logic that can't quite be reduced to more widespread approaches to smoothing, but empirically tends to outperform them on short documents. To create a PPM model, run:

sh> ppm train --train train.txt --dev dev.txt --n 4 --modelFile model.gz
Dev accuracy: 0.8566666666666667

The model can then be applied to new data:

sh> ppm apply --test test.txt --modelFile model.gz --n 4 --scoresFile scores.txt

The value of --n can also be less than the model size, which will run a bit faster and will (perhaps) be less tuned to the original training data.
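
As a post-processing sketch (not part of the package), the most probable label for each document can be read off a scores file with awk. The scores file below is fabricated for illustration; real files come from `ppm apply` as described above:

```shell
# Fabricated scores file: the header names the labels, and data rows hold
# natural-log probabilities (closer to zero means more likely).
printf 'ID\tGOLD\tspam\tham\n'          > scores.txt
printf 'doc1\tham\t-1.6094\t-0.2231\n' >> scores.txt
printf 'doc2\tspam\t-0.1054\t-2.3026\n' >> scores.txt
# Print each document with its highest-scoring label.
awk -F'\t' 'NR == 1 { for (i = 3; i <= NF; i++) name[i] = $i; next }
            { best = 3
              for (i = 4; i <= NF; i++) if ($i > $best) best = i
              print $1, name[best] }' scores.txt
# prints:
# doc1 ham
# doc2 spam
```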