| Portability | non-portable (uses STUArray) |
|---|---|
| Stability | provisional |
| Maintainer | igloo@earth.li |
Data.List.LCS.HuntSzymanski
Description
This is an implementation of the Hunt-Szymanski LCS algorithm. Derived from the description in "String searching algorithms" by Graham A Stephen, ISBN 981021829X.
Algorithm
We take two sequences, xs and ys, of length #xs and #ys.
First we make an array
matchlist[i=0..(#xs-1)]
such that
(matchlist[i] = js) => ((j `elem` js) <=> (xs !! i == ys !! j))
&& sort js == reverse js
i.e. matchlist[i] is the indices of elements of ys equal to the
ith element of xs, in descending order.
Let #xys be the minimum of #xs and #ys. Trivially this is the maximum
possible length of the LCS of xs and ys. Then we can imagine an array
k[i=0..#xs][l=0..#xys]
such that k[i][l] = j where j is the smallest value such that the
LCS of xs[0..i] and ys[0..j] has length l. We use #ys to
mean there is no such j.
We will not need to whole array at once, though. Instead we use an array
kk[l=0..#xys]
representing a row of kk for a particular i. Initially it is for
i = -1, so kk[0] = -1 and kk[l] = #ys otherwise. As the algorithm
progresses we will increase i by one at the outer level and compute
the replacement values for k's elements.
But we want more than just the length of the LCS, we also want the LCS itself. Another array
revres[l=0..#xys]
stores the list of xs indices an LCS of length l, if one is known,
at revres[l].
Now, suppose kk contains k[i-1]. We consider each j in matchlist[i]
in turn. We find the l such that k[l-1] < j <= k[l]. If j < k[l] then
we updated k[l] to be j and set revres[l] to be i:revres[l-1].
Finding l is basically binary search, but there are some tricks we can
do. First, as the js are decreasing the last l we had for this i is
an upper bound on this l. Second, we use another array
lastl[j=0..#ys-1]
to store the l we got last time for this j, initially all 1. As the
values in kk[j] monotonically decrease this is a lower bound for l.
We also test to see whether this old l is still l before we start the
binary search.