Portability | non-portable (uses STUArray) |
---|---|
Stability | provisional |
Maintainer | igloo@earth.li |
Data.List.LCS.HuntSzymanski
Description
This is an implementation of the Hunt-Szymanski LCS algorithm. Derived from the description in "String searching algorithms" by Graham A Stephen, ISBN 981021829X.
Algorithm
We take two sequences, xs
and ys
, of length #xs
and #ys
.
First we make an array
matchlist[i=0..(#xs-1)]
such that
(matchlist[i] = js) => ((j `elem` js) <=> (xs !! i == ys !! j)) && sort js == reverse js
i.e. matchlist[i]
is the indices of elements of ys
equal to the
ith element of xs
, in descending order.
Let #xys
be the minimum of #xs
and #ys
. Trivially this is the maximum
possible length of the LCS of xs
and ys
. Then we can imagine an array
k[i=0..#xs][l=0..#xys]
such that k[i][l] = j
where j
is the smallest value such that the
LCS of xs[0..i]
and ys[0..j]
has length l
. We use #ys
to
mean there is no such j
.
We will not need to whole array at once, though. Instead we use an array
kk[l=0..#xys]
representing a row of kk
for a particular i
. Initially it is for
i = -1
, so kk[0] = -1
and kk[l] = #ys
otherwise. As the algorithm
progresses we will increase i
by one at the outer level and compute
the replacement values for k
's elements.
But we want more than just the length of the LCS, we also want the LCS itself. Another array
revres[l=0..#xys]
stores the list of xs
indices an LCS of length l
, if one is known,
at revres[l]
.
Now, suppose kk
contains k[i-1]
. We consider each j
in matchlist[i]
in turn. We find the l
such that k[l-1] < j <= k[l]
. If j < k[l]
then
we updated k[l]
to be j
and set revres[l]
to be i:revres[l-1]
.
Finding l
is basically binary search, but there are some tricks we can
do. First, as the j
s are decreasing the last l
we had for this i
is
an upper bound on this l
. Second, we use another array
lastl[j=0..#ys-1]
to store the l
we got last time for this j
, initially all 1
. As the
values in kk[j]
monotonically decrease this is a lower bound for l
.
We also test to see whether this old l
is still l
before we start the
binary search.