[![Build Status](https://travis-ci.org/choener/WordAlignment.svg?branch=master)](https://travis-ci.org/choener/WordAlignment)

# Word alignments in natural languages

This library and program are designed for the alignment of *words* in human
languages. Each word is encoded as a list of *characters*. A single
*character*, however, is encoded as a *list of unicode symbols*, not just a
single symbol.

This particular encoding is necessary because there is no general one-to-one
mapping of atomic alphabet symbols to unicode characters. Just consider
character decorations, such as *Umlaute* (even though Umlaute are available in
unicode).

As user, you send a list to stdin that is formatted as follows (with all fields
mandatory; the tab symbol being the separator between each column):

* word id
* language name
* meaning identifier (this is a string, not a number)
* length of the word
* space-separated characters (each character is actually a string)

(In case you read the plain-text version of this document, the 4 whitespace
characters should be left out.)

    0	Albanian_Tosk	1.100	5	\' b o t ə
    2	Albanian_Tosk	1.100	10	r̃ o k u lʸ i a\' lʸ e m


The WordAlignment program comes with two modes, *twowaysimple* and *twoway*.

## TwoWay - Simple

```WordAlign twowaysimple``` aligns words based on a simple scoring model. Gaps
are scored with linear costs. Matches are scored based on a simple scoring file
handling unigrams of characters. Even the simple model allows for the concept
of equivalence classes. This makes it possible to not only score exact matches,
but to give somewhat high scores to, say, two vowels that should be matched.

The following command-line options are provided:

    --scorefile=ITEM
    --lpblock=ITEM,ITEM
    --showmanual
    --prettystupid
    --outfile=ITEM

```--scorefile``` is the simple score file to be used by the mode. the
*defaultSimpleScoring* file can be used as a template.

```--lpblock``` expects a pair of language names (Breton,Breton) or a pair of
integers (3,3 or 4,6) and will then align only the given language pairs with
each other. This option should be very helpful in case you want to parallelize
the program.

```--showmanual``` will show this manual in plain text.

```--prettystupid``` will show a progress bar of the current language pair.
It's pretty and helpful with smaller tasks but should not be used when you
parallelize on a grid engine.

```--outfile``` writes to the given output file, not stdout. Actually required
when ```--prettystupid``` is active.

```--nobacktrack``` disables backtracking. Sometimes just the alignment score
is enough.


## TwoWay (Complex Scoring)

The complex scoring model uses linear gap costs. In contrast to the simple
model above, however, character matching is now performed in a bigram context.

The required score file is currently using an in-house format with the
following columns all required (with whitespace, not tab between the entries):

* language name
* character (bigram 1.1)
* character (bigram 1.2)
* language name
* character (bigram 2.1)
* character (bigram 2.2)
* score

Three example lines

    Albanian_Tosk \' a Albanian_Tosk \' a 4.25238
    Albanian_Tosk \' a Albanian_Tosk \' b 0.402228
    Albanian_Tosk \' a Albanian_Tosk \' g 1.07432

In addition to the scoring file, set via ```--scorefile```, the default score
constant and the gap cost constant need to be set.

```--bigramdef=NUM``` is the score given to unknown matches.

```--gapopen=NUM``` is the score given to each gap character (and not just the
opening score)

```--lpblock``` as above

```--showmanual``` as above

```--prettystupid``` as above

```--outfile``` as above

```--nobacktrack``` as above

An example output is given below.

The output are four lines for each alignment. An info line with the word ids
(IDS), the alignment score (SCORE), the normalized scores (NSCORE) and the
actual words, started by (WORDS) and interleaved by (WORD). The next two lines
are the alignment, with deletions showning up as minus symbols (---) in the
deletion field. Note that a deletion does not delete a character from the
input, it merely aligns an existing character in one alignment with the symbol
for deletion (--) in the other. The final line provides the per-column score
for the alignment. After the alignment follows one empty line.

The normalized score is defined as SCORE / maximum of input word lengths

```cat albanian.input | head -n 3 | ./dist/build/WordAlign/WordAlign twoway -s albanian.scores```

yields:

    IDS: 1 2 SCORE: 32.85 NSCORE: 5.48    WORDS: ^ d o u a r $   WORD   ^ d o u a r $
       ^   d   o   u   a   r   $
       ^   d   o   u   a   r   $
     0.0 4.8 5.9 4.1 6.6 4.6 6.8
    
    IDS: 1 3 SCORE: -12.36 NSCORE: -1.77    WORDS: ^ d o u a r $   WORD   ^ p o u l t r $
         ^     d     -     o     u     a     r     -     -     $
         ^     p     o     -     u     -     l     t     r     $
       0.0   0.3  -5.0  -5.0   4.1  -5.0 -20.0  -5.0  -5.0   6.8


## Performance Notes

Measured on a core i5-3570K @ 3.40 GHz; single-threaded.

The program is not compiled for multi-threading, if you need this consider the
```--lpblock``` option first. Otherwise, send a mail.

The running time for calculating 100 000 alignments is:

    Mode              Seconds     Alignments per Second
    ====              =======     =====================
    TwoWay Simple:     6.7 s        ~ 15 000
    TwoWay Complex:   15.8 s        ~  6 300

Disabling backtracking (to get just the alignment scores) improves performance
by roughly a factor of ```x 2```. Words are very short, backtracking overhead
is quite large!

As in: the number of actual /unicode/ symbols is approximately the same as the
number of forward DP cells to be filled, since each /character/ can be a
combination of multiple unicdoe symbols.


## Three- and Four-way Alignments

These are currently disabled. If you need them, consider contacting me.


#### Contact

Christian Hoener zu Siederdissen  
Leipzig University, Leipzig, Germany  
choener@bioinf.uni-leipzig.de  
http://www.bioinf.uni-leipzig.de/~choener/