lcs-0.2: Find longest common sublist of two lists

Portability non-portable (uses STUArray) provisional igloo@earth.li

Data.List.LCS.HuntSzymanski

Contents

Description

This is an implementation of the Hunt-Szymanski LCS algorithm. Derived from the description in "String searching algorithms" by Graham A Stephen, ISBN 981021829X.

Synopsis

• lcs :: Ord a => [a] -> [a] -> [a]

# Algorithm

We take two sequences, `xs` and `ys`, of length `#xs` and `#ys`.

First we make an array

``` matchlist[i=0..(#xs-1)]
```

such that

``` (matchlist[i] = js) => ((j `elem` js) <=> (xs !! i == ys !! j))
&& sort js == reverse js
```

i.e. `matchlist[i]` is the indices of elements of `ys` equal to the ith element of `xs`, in descending order.

Let `#xys` be the minimum of `#xs` and `#ys`. Trivially this is the maximum possible length of the LCS of `xs` and `ys`. Then we can imagine an array

``` k[i=0..#xs][l=0..#xys]
```

such that `k[i][l] = j` where `j` is the smallest value such that the LCS of `xs[0..i]` and `ys[0..j]` has length `l`. We use `#ys` to mean there is no such `j`.

We will not need to whole array at once, though. Instead we use an array

``` kk[l=0..#xys]
```

representing a row of `kk` for a particular `i`. Initially it is for `i = -1`, so `kk = -1` and `kk[l] = #ys` otherwise. As the algorithm progresses we will increase `i` by one at the outer level and compute the replacement values for `k`'s elements.

But we want more than just the length of the LCS, we also want the LCS itself. Another array

``` revres[l=0..#xys]
```

stores the list of `xs` indices an LCS of length `l`, if one is known, at `revres[l]`.

Now, suppose `kk` contains `k[i-1]`. We consider each `j` in `matchlist[i]` in turn. We find the `l` such that `k[l-1] < j <= k[l]`. If `j < k[l]` then we updated `k[l]` to be `j` and set `revres[l]` to be `i:revres[l-1]`.

Finding `l` is basically binary search, but there are some tricks we can do. First, as the `j`s are decreasing the last `l` we had for this `i` is an upper bound on this `l`. Second, we use another array

``` lastl[j=0..#ys-1]
```

to store the `l` we got last time for this `j`, initially all `1`. As the values in `kk[j]` monotonically decrease this is a lower bound for `l`. We also test to see whether this old `l` is still `l` before we start the binary search.

# LCS

lcs :: Ord a => [a] -> [a] -> [a]Source

The `lcs` function takes two lists and returns a list with a longest common subsequence of the two.