sequor: A sequence labeler based on Collins's sequence perceptron.

[ bsd3, library, natural-language-processing, program ]


Versions 0.1, 0.2, 0.2.2, 0.2.3, 0.3.0, 0.3.1, 0.3.2, 0.4.2, 0.7.0, 0.7.1, 0.7.2, 0.7.5
Dependencies array (>=0.2), base (>=3 && <5), binary (>=0.5), bytestring (>=0.9), containers (>=0.2), mtl (>=1.1), utf8-string (>=0.3), vector (>=0.5)
License BSD-3-Clause
Author Grzegorz Chrupała
Maintainer gchrupala@lsv.uni-saarland.de
Category Natural Language Processing
Home page http://code.google.com/p/sequor/
Uploaded by GrzegorzChrupala at 2010-09-24T10:49:38Z
Reverse Dependencies 1 direct, 0 indirect
Executables sequor
Downloads 9272 total (18 in the last 30 days)
Status Docs not available
All reported builds failed as of 2016-12-28

Readme for sequor-0.1

sequor 0.1

AUTHOR: Grzegorz Chrupała <pitekus@gmail.com>

Sequor is a sequence labeler based on Collins's sequence
perceptron. Sequor has a flexible feature template language and is
meant mainly for NLP applications such as Part of Speech tagging,
syntactic chunking or Named Entity labeling.
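Collins's sequence perceptron decodes each training sequence with the
current weights and, whenever the predicted label sequence differs from
the gold one, moves the weights toward the gold features and away from
the predicted features. The following is an illustrative Python sketch
only (toy feature map, exhaustive search instead of sequor's beam
search); it does not mirror sequor's actual Haskell implementation:

```python
from collections import defaultdict
from itertools import product

def phi(words, tags):
    """Toy feature map: emission and tag-transition counts."""
    feats = defaultdict(int)
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("emit", w, t)] += 1
        feats[("trans", prev, t)] += 1
        prev = t
    return feats

def decode(words, labels, w):
    """Exhaustive argmax over all tag sequences (fine for toy input;
    a real labeler would use Viterbi or beam search)."""
    best, best_score = None, float("-inf")
    for tags in product(labels, repeat=len(words)):
        score = sum(w[f] * v for f, v in phi(words, tags).items())
        if score > best_score:
            best, best_score = list(tags), score
    return best

def train(data, labels, iterations=5):
    """Perceptron update: w += phi(x, gold) - phi(x, predicted)."""
    w = defaultdict(float)
    for _ in range(iterations):
        for words, gold in data:
            pred = decode(words, labels, w)
            if pred != gold:
                for f, v in phi(words, gold).items():
                    w[f] += v
                for f, v in phi(words, pred).items():
                    w[f] -= v
    return w
```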
 


USAGE

With Sequor you can learn a model from sequences manually annotated
with labels, and then apply this model to new data in order to add
labels. Sequor is meant to be used mainly with linguistic data, for
example to learn Part of Speech tagging, syntactic chunking or Named
Entity labeling.
 
In order to learn a model from labeled data call sequor like this:

sequor train FEATURE-TEMPLATE LEARNING-RATE BEAM-SIZE MAX-ITERATIONS \
             MIN-DICT-COUNT TRAIN-FILE HELDOUT-FILE MODEL-FILE
FEATURE-TEMPLATE - specification of which features to use. For an
                   example see data/AllFeatures.txt
LEARNING-RATE    - positive number (<= 1) which controls how fast
                   learning proceeds; 0.1 is a reasonable default
BEAM-SIZE        - positive integer controlling the size of the search
                   beam
MAX-ITERATIONS   - positive integer controlling for how many iterations
                   to train
MIN-DICT-COUNT   - positive integer specifying how many times an indexed
                   feature has to occur for the label dictionary to be
                   used for this feature. Using a large number will
                   effectively disable use of the label dictionary
TRAIN-FILE       - annotated data in CoNLL format. Sequences separated by
                   blank lines, features separated by spaces
HELDOUT-FILE     - annotated heldout data. To disable, use an empty file
                   (/dev/null)
MODEL-FILE       - name of the file where the learned model will be stored



In order to apply the learned model to new data, call:

sequor predict MODEL-FILE < NEW-DATA > NEW-LABELS

Data files should be in the UTF-8 encoding.

As an example we can use data annotated with syntactic chunk labels in
the data directory. For example:

./bin/sequor train data/all.features 0.1 10 5 50 \
    data/train.conll data/devel.conll model

./bin/sequor predict model < data/test.conll > data/test.labels

FEATURE TEMPLATE SYNTAX

Sequor uses a small language for specifying feature templates to use
when learning. This section gives an informal overview of this
language. Sequor uses the simple CoNLL format for the input files. In
this format sentences are separated by blank lines. Each line
represents a single token (word). Each token should have the same
fixed number of space-separated fields, where the last field is the
label, e.g.

der d ART I-NC O
Europäischen europäisch ADJA I-NC ORG
Union Union NN I-NC ORG
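A minimal reader for this blank-line-separated format could be sketched
in Python as follows (illustrative only; sequor's own reader is part of
its Haskell code):

```python
def read_conll(lines):
    """Parse CoNLL-style input: one token per line, space-separated
    fields, sentences separated by blank lines. Yields one sentence
    at a time as a list of field lists."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split())
    if sentence:  # the last sentence may lack a trailing blank line
        yield sentence
```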

The template language treats the input sentence as a matrix of
features (i.e. field values) and allows you to select and apply some
transformations to those features.

The language consists of a number of predefined functions. By calling
these functions with certain arguments you can specify the feature set
to use. As an example consider the following template:
Cat [ Cell 0 0, Suffix 2 (Cell 0 0), Row -1, Row 1 ].
It specifies the following features: the first field of the current
token, the two-character suffix of the first field of the current
token, all the fields of the previous token and all the fields of the
following token.

Functions:

Cell r c                Selects the field in row r and column c.
Rect r c r' c'          Selects all features in the rectangle
                        whose upper-left corner is in row r column c
                        and lower-right corner is in row r' column c'.
Row r                   Selects all features in row r.
MarkNull f              If the feature does not exist, replaces it with a
                        NULL mark. Typically used when the absence of a
                        feature is significant, e.g. to mark the beginning
                        of the sentence.
Index f                 Marks the feature f to use in indexing for the
                        label dictionary.
Cat [f1,f2,...,fn]      Selects the features in the list.
Cart f f'               Creates the Cartesian product of feature sets f
                        and f'. If f and f' are singletons, simply
                        conjoins the two features.
Lower f                 Maps f to lower case characters.
Suffix i f              Takes the suffix of length i of feature f.
Prefix i f              Takes the prefix of length i of feature f.
WordShape f             Creates a specification of which character classes,
                        such as lower case and upper case letters, digits
                        or punctuation, occur in feature f.
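The exact encoding WordShape produces is not spelled out here; one
common word-shape scheme, shown as an illustrative Python sketch (not
sequor's implementation), maps each character to a class symbol and
collapses runs of the same class:

```python
def char_class(c):
    """Classify a character: X=upper, x=lower, d=digit, p=other."""
    if c.isupper():
        return "X"
    if c.islower():
        return "x"
    if c.isdigit():
        return "d"
    return "p"

def word_shape(word):
    """Collapse runs of identical character classes, so e.g.
    'Union' -> 'Xx' and 'EU-25' -> 'Xpd'."""
    shape = []
    for c in word:
        cls = char_class(c)
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)
```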

Remarks: Rows are indexed relative to the current token (0).  Columns
are indexed starting with 0.  Functions which take features as
arguments can be passed either singleton features or sequences of
features. If passed a sequence they are applied to each of its
elements. For example (Suffix 3 (Row 0)) will return the sequence of
features formed by taking the suffix of length 3 of each field of row
0.
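The relative row indexing described above can be illustrated with a
small Python sketch (hypothetical helper names, not sequor's API),
treating the sentence as a matrix of fields indexed relative to the
current token i:

```python
def cell(sentence, i, r, c):
    """Select the field at row i+r (relative to current token i),
    column c; return None when the row falls outside the sentence
    (cf. MarkNull above)."""
    j = i + r
    if 0 <= j < len(sentence):
        return sentence[j][c]
    return None

def row(sentence, i, r):
    """All fields of the row at relative offset r, or [] if the
    offset falls outside the sentence."""
    j = i + r
    return sentence[j] if 0 <= j < len(sentence) else []
```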

For more examples see the files all.features and example.features in
the data directory.