LCS stands for Longest Common Subsequence, and it is a relatively
challenging problem to find an LCS efficiently. This module implements
the algorithm described in:
An O(ND) Difference Algorithm and its Variations, Eugene Myers,
Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;
especially the variation described in section 4.2 and most refinements
implemented in GNU diff (D is the edit-distance).
There is currently no heuristic to reduce the running time and produce
suboptimal output for large inputs with many differences. It behaves like
GNU diff with the -d option in this regard.
In the first step, a hash value for every line is calculated and collisions
are marked with a special value. This reduces a string comparison to an
int comparison for line tuples where at least one of the hash values is
not equal to the special value. After that, lines which only exists in one
of the files are removed and marked as changed which reduces the running
time of the following difference algorithm. GNU diff additionally removes
lines that appear very often in the other file in some cases.
The last step tries to create longer changed regions and line up deletions
in the first file to insertions in the second by shifting changed lines
forward and backward.
|