Sequor
======

Sequor is a sequence labeler based on Collins's (2002) perceptron. Sequor
has a flexible feature template language and is meant mainly for NLP
applications such as Named Entity labeling, Part of Speech tagging or
syntactic chunking. It includes the SemiNER named entity recognizer, with
pre-trained models for German and English (see `Named Entity Recognition
(SemiNER)`_).

Sequor is especially useful if your dataset has a large label set: in that
case it is likely to run faster and to use much less RAM than a sequence
labeler based on Conditional Random Fields. Additionally, Sequor implements
options which let you control the size of the model and trade off speed
against accuracy:

- size of the beam
- label dictionary
- feature hashing

See https://bitbucket.org/gchrupala/sequor/wiki/Options for details.

Installation
------------

The easiest way to compile and install Sequor is to:

1. Install the `Haskell platform <http://www.haskell.org/platform/>`_
2. Run::

     cabal update
     cabal install sequor --prefix=`pwd`

Cabal should then download and install the necessary packages, install the
``sequor`` binary in ``./bin``, and install the data files in ``./share``.

Usage
-----

With Sequor you can learn a model from sequences manually annotated with
labels, and then apply this model to new data in order to add labels.
Sequor is meant to be used mainly with linguistic data, for example to
learn Part of Speech tagging, syntactic chunking or Named Entity
labeling::

   Usage: sequor command [OPTION...] [ARG...]

   train: train model
     train [OPTION...] TEMPLATE-FILE TRAIN-FILE MODEL-FILE
         --rate=NUM (0.01)         learning rate
         --beam=INT (10)           beam size
         --iter=INT (10)           number of iterations
         --min-count=INT (100)     minimum feature frequency for label dictionary
         --heldout=FILE            path to heldout data
         --hash                    use hashing instead of feature dictionary
         --hash-sample=INT (1000)  sample size to estimate number of features when hashing
         --hash-max-size=INT       maximum size of parameter vector when hashing

See https://bitbucket.org/gchrupala/sequor/wiki/Options for more details
about the training options.

::

   predict: predict using model
     predict MODEL-FILE

   version: print version
     version

   help: print usage information
     help

Data files should be in the UTF-8 encoding. As an example we can use the
data annotated with syntactic chunk labels in the ``data`` directory::

   ./bin/sequor train data/all.features data/train.conll model \
       --rate 0.1 --beam 10 --iter 5 --hash \
       --heldout data/devel.conll
   ./bin/sequor predict model < data/test.conll > data/test.labels

Feature template syntax
-----------------------

Sequor uses a mini language to specify which features to extract from the
data. For details see
https://bitbucket.org/gchrupala/sequor/wiki/Templates.

Named Entity Recognition (SemiNER)
----------------------------------

Sequor includes the SemiNER named entity recognizer, with pre-trained
models for German and English.

The German model is trained on the CoNLL 2003 data and recognizes the
following labels:

- PER - people
- ORG - organizations
- LOC - locations such as cities and countries
- MISC - miscellaneous entities such as nationalities

The German model is described in [Chrupala_and_Klakow_2010]_.
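To give an idea of what the labeler produces, a German sentence tagged
with these labels could look roughly as follows, one token per line with
the predicted label next to it (a hypothetical snippet; the exact column
layout and label encoding, such as the ``B-``/``I-`` prefixes, depend on
the model and data format)::

   Angela    B-PER
   Merkel    I-PER
   besuchte  O
   Berlin    B-LOC
   .         O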
The English model is trained on the BBN Wall Street Journal data and
recognizes the following labels:

* CARDINAL - cardinal number
* DATE - calendar date
* GPE:CITY - city
* GPE:COUNTRY - country
* GPE:STATE_PROVINCE - state or province
* MONEY - currency
* NORP:NATIONALITY - nationality
* NORP:OTHER -
* NORP:POLITICAL - political affiliation
* ORDINAL - ordinal number
* ORGANIZATION - organization
* PERCENT - percentage
* PERSON - people
* QUANTITY - numerical quantity

See https://bitbucket.org/gchrupala/sequor/wiki/SemiNER for usage
information.

Sequence perceptron
-------------------

Compared to the commonly used Conditional Random Field model, the sequence
perceptron algorithm is simpler and more efficient, and it often achieves
similar performance. The sequence perceptron was introduced in
[Collins_2002]_; a minimal sketch of its update rule follows the
references below.

.. [Collins_2002] Collins, Michael. 2002. Discriminative training methods
   for Hidden Markov Models: Theory and experiments with perceptron
   algorithms. EMNLP 2002. http://ir.hit.edu.cn/~car/research/dishmm.pdf

.. [Chrupala_and_Klakow_2010] Grzegorz Chrupała and Dietrich Klakow. 2010.
   A Named Entity Labeler for German: exploiting Wikipedia and
   distributional clusters. LREC 2010.
   http://grzegorz.chrupala.me/papers/lrec-2010.pdf
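To make the update rule concrete, here is a minimal Haskell sketch of one
training iteration of the sequence perceptron, assuming a user-supplied
feature extractor and decoder. The names and types are illustrative only
and do not correspond to Sequor's internals, and Collins's weight
averaging is omitted::

   -- Sketch of the sequence perceptron update; illustrative, not Sequor's API.
   import qualified Data.Map.Strict as M

   type Feature = String
   type Weights = M.Map Feature Double

   -- Global feature vector of a sentence paired with a label sequence.
   type Extract tok lab = [tok] -> [lab] -> M.Map Feature Double

   -- Highest-scoring label sequence under the current weights
   -- (in Sequor this search is approximated with a beam).
   type Decode tok lab = Weights -> [tok] -> [lab]

   -- One perceptron step: decode, and if the prediction differs from
   -- the gold labels, add the gold features and subtract the predicted
   -- ones, scaled by the learning rate (cf. the --rate option).
   step :: Eq lab
        => Double -> Extract tok lab -> Decode tok lab
        -> Weights -> ([tok], [lab]) -> Weights
   step rate phi decode w (toks, gold)
     | predicted == gold = w
     | otherwise         = M.unionWith (+) w (M.map (* rate) delta)
     where
       predicted = decode w toks
       delta = M.unionWith (+) (phi toks gold)
                               (M.map negate (phi toks predicted))

   -- One pass over the corpus (cf. the --iter option).
   iteration :: Eq lab
             => Double -> Extract tok lab -> Decode tok lab
             -> Weights -> [([tok], [lab])] -> Weights
   iteration rate phi decode = foldl (step rate phi decode)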