Safe Haskell | None |
---|---|
Language | Haskell2010 |
The Needleman-Wunsch global alignment algorithm. This algorithm is extremely simple but provides a good showcase for what ADPfusion offers.
The Needleman-Wunsch algorithm aligns to strings x = x_1 x_2 x_3 ...
and
y = y_1 y_2 y_3 ...
which may be of differing lengths. Assume that x_1
... x_{i-1}
and y_1 ... y_{j-1}
have already been optimally aligned. We
can match x_i
with y_j
, or perform one of two possible insert-deletion
pairs. Either x_i
is aligned with -
or -
is aligned with y_j
. More
general, in each DP step, either one or both inputs are extended by one
character.
For the actual implementation, we assume however, that we work backward.
The entries d
, u
, and l
have already been calculated. Now we want to
compute the entry at x
in the lower right corner.
----- |d|u| ----- |l|x| -----
We introduce a generic naming scheme for each possible move. If we move in
a direction, we call it a step
. If we do not move, then we call it
a loop
, because the index loops for this computation.
We can arrive from d
, making a diagonal step, called step_step
as we
advance by one in both dimensions. This leads to an alignment of two
characters, one from each input, at x
, which is combined with the already
calculated alignment at d
.
We can also just step in the first dimension step_loop
, going from l
to
x
. Which means that the first-dim character does not have a partner,
leading to an insertdeletion or indel. We typically do not care in which
of the two dimensions the in/del happens, just that it does.
The third case is an in/del in the other dimension, giving us loop_step
or going from u
to x
.
Of course, if x
happens to be the uppermost, leftmost cell, we have
nowhere to come from, so we need to inititialize (or terminate depending on
your view point) using the nil_nil
case. That one is the base case.
We also want to know which of the three cases is the best case (coming from
d,l,u
), this requires a "choice" function or h
.
We now implement this algorithm using the low-level ADPfusion library.
Follow the code from top to bottom for a tutorial on usage. The
Needleman-Wunsch tutorial for the FormalGrammars
library provides
a higher-level style of implementation.
http://hackage.haskell.org/package/FormalGrammars/docs/FormalLanguage-Tutorial-NeedlemanWunsch.html
We also provide an implementation based on grammar products, which simplify the design of alignment-type algorithms. The corresponding tutorial is here.
http://hackage.haskell.org/package/GrammarProducts/docs/FormalLanguage-Tutorial-NeedlemanWunsch.html
Don't forget to inline basically everything!
Module Header
We start by importing a bunch of modules, including Data.PrimitiveArray
for low-level arrays and automated filling of the arrays or tables in the
correct order.
We also need to import ADP.Fusion
to access the high-level code for
dynamic programs.
module Main where
import Control.Monad (forM_) import System.Environment (getArgs) import Text.Printf
Streams of parses are the streams defined in the vector
package.
import qualified Data.Vector.Fusion.Stream.Monadic as SM
We use unboxed vectors to hold the input sequences to be aligned. The
terminal parses work with any vector in the vector
package.
import qualified Data.Vector.Unboxed as VU
Data.PrimitiveArray
contains data structures, and index structures for
dynamic programming. Notably, the primitive arrays holding the cell data
with Boxed and Unboxed tables. In addition, linear, context-free, and set
data structures are made available.
import Data.PrimitiveArray as PA hiding (map)
ADP.Fusion.Point
exposes everything necessary for higher-level DP
algorithms. Depending on the type of DP algorithm, different top-level
modules can be imported. .Point
for linear grammars, and .Core
are
provided in this package. .Core
exports only the core modules required to
extend ADPfusion.
import ADP.Fusion.Point
Signature
A signature connects the types of all non-terminals and terminals with evaluation (or attribute) functions. In the grammar below, we not only want to create all possible parses of how two strings can be aligned but also evaluate each parse and choose the optimal one based on Bellman's principle of optimality.
We take a close look at the type signatures. step_step :: x -> (Z:.c:.c)
-> x
tells us that step_step
requires the score from the non-terminal,
typed x
for the alignment up to d
, then we get the two characters with
type c
and produce a new non-terminal typed score x
. For our simple
alignment we'll later choose Char
for c
and Int
for x
.
The type of h :: Stream m x -> m r
is more interesting. We get a Stream
of x
's and produce an r
monadically. The left part is clear Stream
m x
are the results from the four rules, but the right part allows us to
maybe not only return the best case, types x == r
, but maybe the two best
cases r == (x,x)
or similar. While we do not use this feature here, it
makes ADPfusion fully "classified DP" capable, which is a really cool
feature for advanced algorithms.
data Signature m x r c = Signature { step_step :: x -> (Z:.c :.c ) -> x , step_loop :: x -> (Z:.c :.()) -> x , loop_step :: x -> (Z:.():.c ) -> x , nil_nil :: (Z:.():.()) -> x , h :: Stream m x -> m r }
Algebra Product
We also want to be able to backtrace the optimal result. Given our
alignment, knowing that we get an alignment score of 29 doesn't help us
much. But with algebra products we can ask for (optimal <** pretty)
and
get the optimal result and the parse for it.
The (<**)
operator is notoriously difficult to write, so we just compute
it ☺.
Note that haddock actually shows (<**)
, while you just write
makeAlgebraProductH ['h] ''Signature
(the primes are TemplateHaskell,
the only TemplateHaskell we need).
makeAlgebraProduct ''Signature
Grammar
This is the linear grammar in two dimensions describing the
Needleman-Wunsch search space. It will, in principle, enumerate the
exponential number of possible alignment, but due to memoization and the
choice function h
, the calculation time is O(N^2)
and if only one
optimal alignment is requested, backtracking works in linear time.
The grammar first requires a Signature
, we use RecordWildCards
as
a language extension to bind step_step
and friends. We also need
a variable for the single non-terminal (a
), and have two inputs i1
and
i2
. Each grammar starts with a "base case" Z:.
followed by one or more
pairs of non-terminal plus rules. Here we have one pair (a, rules)
, where
in rules
we combine the four different rules.
step_step <<< a % (M:>chr i1:>chr i2)
, for example, means that
step_step
first gets the non-terminal a
, which will give the score up
to d
(see the ascii-art above), followed by (%)
the 2-dim terminal for
i1
and i2
.
Multi-dimensional terminals are built up from the zero-dimension M
,
separated by :|
symbols and just bind the input. In this case we want
individual characters from i1
and i2
, so we write chr i1
or chr i2
.
Different rules are combined with (|||)
and the optimal case is selected
via ... h
. Rules can, of course, be co-recursive, each rule can request
all terminal and non-terminal symbols.
Due to the way nil_nil
works on Epsilon
, nil_nil
is actually only
called once, when we start the alignment, not for every cell!
However, due to this being a high-performance library, we do not provide a runtime scheme to detect misbehaving rules. Some debug-code is in place that performs certain checks in non-optimized GHCI sessions, but optimized code is built for raw speed, without any checks.
If you are a "conventional" DP programmer, you might miss the usual
indices. Just as in the spiritual father of ADPfusion
, Robert Giegerichs
ADP
, we hide the actual index calculations.
grammar Signature{..} !a' !i1 !i2 = let a = TW a' ( step_step <<< a % (M:|chr i1:|chr i2) ||| step_loop <<< a % (M:|chr i1:|Deletion ) ||| loop_step <<< a % (M:|Deletion :|chr i2) ||| nil_nil <<< (M:|Epsilon:|Epsilon) ... h ) in Z:.a
{-# INLINE grammar #-}
Scoring Algebra
A grammar alone is not enough, we also need to say what a step_step
means, or what an optimum is. Here, we do exactly that. We create an
instance of the Signature
from above. Since this is a simple example, we
just say that two aligned characters a
and b
yield a score of x+1
(with x
the previous alignment score) if the characters are identical.
Otherwise we lower the score by 2. In/dels incur a cost of -1
, meaning
that a mismatch is actually the same as two in/dels, one in each dimension.
The start of the alignment gives an initial score of 0.
And the optimal choice, of course, is to start with a default of -999999
and find the maximum of that score and the choices we are given.
sScore :: Monad m => Signature m Int Int Char sScore = Signature { step_step = \x (Z:.a:.b) -> if a==b then x+1 else x-2 , step_loop = \x _ -> x-1 , loop_step = \x _ -> x-1 , nil_nil = const 0 , h = SM.foldl' max (-999999) }
{-# INLINE sScore #-}
Pretty-printing Algebra
Scores alone are still not enough, we also want to pretty-print alignments.
An alignment are basically two strings [String 1, String 2]
, being turned
into a whole stream of alignments, using Char
s for the individual
characters being aligned.
We follow the same theme as in sScore
, but this time h = toList
, that
is the choice function h
does not (!) make choice but rather returns all
alignments. You already heard about <**
, we'll use it below.
sPretty :: Monad m => Signature m [String] [[String]] Char sPretty = Signature { step_step = \[x,y] (Z:.a :.b ) -> [a :x, b :y] , step_loop = \[x,y] (Z:.a :.()) -> [a :x, '-':y] , loop_step = \[x,y] (Z:.():.b ) -> ['-':x, b :y] , nil_nil = const ["",""] , h = SM.toList }
{-# Inline sPretty #-}
The Inside
version
Table-filling for the Inside
version
The forward or table-filling phase. It is possible to inline this code
directly into runNeedlemanWunsch
. Here, this phase is separated. If you
use ghc-core
to examine the GHC Core
language, you can search for
nwInsideForward
and check wether the inside code is optimized well. This
is normally not required, and only done here, because these algorithms
are used to gauge efficiency of the fusion framework as well.
For your own code, you can write as done here, or in the way of
runOutsideNeedlemanWunsch
.
nwInsideForward :: VU.Vector Char -> VU.Vector Char -> Z:.TwITbl Id Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL I:.PointL I) Int nwInsideForward i1 i2 = mutateTablesDefault $ grammar sScore (ITbl 0 0 (Z:.EmptyOk:.EmptyOk) (PA.fromAssocs (Z:.PointL 0:.PointL 0) (Z:.PointL n1:.PointL n2) (-999999) [])) i1 i2 where n1 = VU.length i1 n2 = VU.length i2
{-# NoInline nwInsideForward #-}
We normally do not want to inline a fully specified algorithm. Compilation times are rather long, and in this way, we only compile once for the library, not every time the algorithm is called.
Backtracking optimal results for the Inside
version
nwInsideBacktrack :: VU.Vector Char -> VU.Vector Char -> TwITbl Id Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL I:.PointL I) Int -> [[String]] nwInsideBacktrack i1 i2 t = unId $ axiom b where !(Z:.b) = grammar (sScore <|| sPretty) (toBacktrack t (undefined :: Id a -> Id a)) i1 i2 :: Z:.TwITblBt Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL I:.PointL I) Int Id Id [String]
{-# NoInline nwInsideBacktrack #-}
Glueing the Inside
version parts together
The inside grammar, with efficient table-filling (via nwInsideForward
)
and backtracking. Requests k
co-optimal backtrackings, given the inputs
i1
and i2
. The fst
element returned is the score, the snd
are the
co-optimal parses.
runNeedlemanWunsch :: Int -> String -> String -> (Int,[[String]])
runNeedlemanWunsch k i1' i2' = (d, take k bs) where
i1 = VU.fromList i1'
i2 = VU.fromList i2'
n1 = VU.length i1
n2 = VU.length i2
!(Z:.t) = nwInsideForward i1 i2
d = unId $ axiom t
bs = nwInsideBacktrack i1 i2 t
{-# Noinline runNeedlemanWunsch #-}
The Outside
version
Table-filling for the Outside
version
Again, to be able to observe performance, we have extracted the outside-table-filling part.
nwOutsideForward :: VU.Vector Char -> VU.Vector Char -> Z:.TwITbl Id Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL O:.PointL O) Int nwOutsideForward i1 i2 = mutateTablesDefault $ grammar sScore (ITbl 0 0 (Z:.EmptyOk:.EmptyOk) (PA.fromAssocs (Z:.PointL 0:.PointL 0) (Z:.PointL n1:.PointL n2) (-999999) [])) i1 i2 where n1 = VU.length i1 n2 = VU.length i2
{-# Noinline nwOutsideForward #-}
Backtracking optimal results for the Outside
version
Glueing the Outside
version parts together
The outside version of the Needleman-Wunsch alignment algorithm. The outside grammar is identical to the inside grammar! This is not generally the case, but here it is. Hence we may just use outside tables and the grammar from above.
runOutsideNeedlemanWunsch :: Int -> String -> String -> (Int,[[String]]) runOutsideNeedlemanWunsch k i1' i2' = (d, take k . unId $ axiom b) where i1 = VU.fromList i1' i2 = VU.fromList i2' n1 = VU.length i1 n2 = VU.length i2 !(Z:.t) = nwOutsideForward i1 i2 d = unId $ axiom t !(Z:.b) = grammar (sScore <|| sPretty) (toBacktrack t (undefined :: Id a -> Id a)) i1 i2 :: Z:.TwITblBt Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL O:.PointL O) Int Id Id [String]
{-# Noinline runOutsideNeedlemanWunsch #-}
Wrapping everything up
Alignment wrapper
This wrapper takes a list of input sequences and aligns each odd sequence with the next even sequence. We want one alignment for each such pair.
Since we use basic lists during backtracking, the resulting lists have to be reversed for inside-backtracking. Note that because the Outside grammar is quasi-right-linear, it does not require reversing the two strings.
For real applications, consider using Data.Sequence
which has O(1)
append and prepend.
align _ [] = return () align _ [c] = putStrLn "single last line" align k (a:b:xs) = do putStrLn a putStrLn b putStrLn "" let (sI,rsI) = runNeedlemanWunsch k a b let (sO,rsO) = runOutsideNeedlemanWunsch k a b forM_ rsI $ \[u,l] -> printf "%s\n%s %d\n\n" (reverse u) (reverse l) sI forM_ rsO $ \[u,l] -> printf "%s\n%s %d\n\n" (id u) (id l) sO align k xs
Main
And finally have a minimal main that reads from stdio.
If you are brave enough then put this through ghc-core
and look for
nwInsideForward
or nwOutsideForward
in the CORE. Everything coming from
the forward phase should be beautifully optimized and the algorithm should
run quite fast.
main = do as <- getArgs let k = if null as then 1 else read $ head as ls <- lines <$> getContents align k ls