Sequor
======

Sequor is a sequence labeler based on Collins's (2002) perceptron. Sequor
has a flexible feature template language and is meant mainly for NLP
applications such as Named Entity labeling, Part of Speech tagging or
syntactic chunking. It includes the SemiNER named entity recognizer, with
pre-trained models for German and English (see `Named Entity Recognition
(SemiNER)`_).

Sequor is especially useful if your dataset has a large label set: in that
case it is likely to run faster and to use much less RAM than a sequence
labeler based on Conditional Random Fields. Additionally, Sequor implements
options which let you control the size of the model and trade off speed
against accuracy:

- size of the beam
- label dictionary
- feature hashing

See https://bitbucket.org/gchrupala/sequor/wiki/Options for details.

Installation
------------

The easiest way to compile and install Sequor is to:

1. Install the `Haskell platform <http://www.haskell.org/platform/>`_
2. Run::

     cabal update
     cabal install sequor --prefix=`pwd`

Cabal should then download and install the necessary packages, install the
``sequor`` binary in ``./bin``, and install the data files in ``./share``.

Usage
-----

With Sequor you can learn a model from sequences manually annotated with
labels, and then apply this model to new data in order to add labels.
Sequor is meant to be used mainly with linguistic data, for example to
learn Part of Speech tagging, syntactic chunking or Named Entity
labeling::

   Usage: sequor command [OPTION...] [ARG...]

   train: train model
     train [OPTION...] TEMPLATE-FILE TRAIN-FILE MODEL-FILE
         --rate=NUM (0.01)         learning rate
         --beam=INT (10)           beam size
         --iter=INT (10)           number of iterations
         --min-count=INT (100)     minimum feature frequency for label dictionary
         --heldout=FILE            path to heldout data
         --hash                    use hashing instead of feature dictionary
         --hash-sample=INT (1000)  sample size to estimate number of features when hashing
         --hash-max-size=INT       maximum size of parameter vector when hashing

See https://bitbucket.org/gchrupala/sequor/wiki/Options for more details
about the training options.

::

   predict: predict using model
     predict MODEL-FILE

   version: print version
     version

   help: print usage information
     help

Data files should be in the UTF-8 encoding. As an example we can use the
data annotated with syntactic chunk labels in the ``data`` directory::

   ./bin/sequor train data/all.features data/train.conll model \
       --rate 0.1 --beam 10 --iter 5 --hash \
       --heldout data/devel.conll
   ./bin/sequor predict model < data/test.conll > data/test.labels

Feature template syntax
-----------------------

Sequor uses a mini language to specify which features to extract from the
data. For details see
https://bitbucket.org/gchrupala/sequor/wiki/Templates.

Named Entity Recognition (SemiNER)
----------------------------------

Sequor includes the SemiNER named entity recognizer, with pre-trained
models for German and English.

The German model is trained on the CoNLL 2003 data and recognizes the
following labels:

- PER - people
- ORG - organizations
- LOC - locations such as cities and countries
- MISC - miscellaneous entities such as nationalities

The German model is described in [Chrupala_and_Klakow_2010]_.
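To give an idea of what the labeler produces, a German sentence tagged
with these labels could look roughly as follows, one token per line with
the predicted label next to it (a hypothetical snippet; the exact column
layout and label encoding, such as the ``B-``/``I-`` prefixes, depend on
the model and data format)::

   Angela    B-PER
   Merkel    I-PER
   besuchte  O
   Berlin    B-LOC
   .         O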
The English model is trained on the BBN Wall Street Journal data and
recognizes the following labels:

* CARDINAL - cardinal number
* DATE - calendar date
* GPE:CITY - city
* GPE:COUNTRY - country
* GPE:STATE_PROVINCE - state or province
* MONEY - currency
* NORP:NATIONALITY - nationality
* NORP:OTHER -
* NORP:POLITICAL - political affiliation
* ORDINAL - ordinal number
* ORGANIZATION - organization
* PERCENT - percentage
* PERSON - people
* QUANTITY - numerical quantity

See https://bitbucket.org/gchrupala/sequor/wiki/SemiNER for usage
information.

Sequence perceptron
-------------------

Compared to the commonly used Conditional Random Field model, the sequence
perceptron algorithm is simpler and more efficient, and it often achieves
similar performance. The sequence perceptron was introduced in
[Collins_2002]_; a minimal sketch of its update rule follows the
references below.

.. [Collins_2002] Collins, Michael. 2002. Discriminative training methods
   for Hidden Markov Models: Theory and experiments with perceptron
   algorithms. EMNLP 2002. http://ir.hit.edu.cn/~car/research/dishmm.pdf

.. [Chrupala_and_Klakow_2010] Grzegorz Chrupała and Dietrich Klakow. 2010.
   A Named Entity Labeler for German: exploiting Wikipedia and
   distributional clusters. LREC 2010.
   http://grzegorz.chrupala.me/papers/lrec-2010.pdf
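To make the update rule concrete, here is a minimal Haskell sketch of one
training iteration of the sequence perceptron, assuming a user-supplied
feature extractor and decoder. The names and types are illustrative only
and do not correspond to Sequor's internals, and Collins's weight
averaging is omitted::

   -- Sketch of the sequence perceptron update; illustrative, not Sequor's API.
   import qualified Data.Map.Strict as M

   type Feature = String
   type Weights = M.Map Feature Double

   -- Global feature vector of a sentence paired with a label sequence.
   type Extract tok lab = [tok] -> [lab] -> M.Map Feature Double

   -- Highest-scoring label sequence under the current weights
   -- (in Sequor this search is approximated with a beam).
   type Decode tok lab = Weights -> [tok] -> [lab]

   -- One perceptron step: decode, and if the prediction differs from
   -- the gold labels, add the gold features and subtract the predicted
   -- ones, scaled by the learning rate (cf. the --rate option).
   step :: Eq lab
        => Double -> Extract tok lab -> Decode tok lab
        -> Weights -> ([tok], [lab]) -> Weights
   step rate phi decode w (toks, gold)
     | predicted == gold = w
     | otherwise         = M.unionWith (+) w (M.map (* rate) delta)
     where
       predicted = decode w toks
       delta = M.unionWith (+) (phi toks gold)
                               (M.map negate (phi toks predicted))

   -- One pass over the corpus (cf. the --iter option).
   iteration :: Eq lab
             => Double -> Extract tok lab -> Decode tok lab
             -> Weights -> [([tok], [lab])] -> Weights
   iteration rate phi decode = foldl (step rate phi decode)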