ADPfusion-0.5.2.2: Efficient, high-level dynamic programming.

Safe HaskellNone
LanguageHaskell2010

ADP.Fusion.Tutorial.NeedlemanWunsch

Contents

Description

The Needleman-Wunsch global alignment algorithm. This algorithm is extremely simple but provides a good showcase for what ADPfusion offers.

The Needleman-Wunsch algorithm aligns to strings x = x_1 x_2 x_3 ... and y = y_1 y_2 y_3 ... which may be of differing lengths. Assume that x_1 ... x_{i-1} and y_1 ... y_{j-1} have already been optimally aligned. We can match x_i with y_j, or perform one of two possible insert-deletion pairs. Either x_i is aligned with - or - is aligned with y_j. More general, in each DP step, either one or both inputs are extended by one character.

For the actual implementation, we assume however, that we work backward. The entries d, u, and l have already been calculated. Now we want to compute the entry at x in the lower right corner.

 -----
 |d|u|
 -----
 |l|x|
 -----

We introduce a generic naming scheme for each possible move. If we move in a direction, we call it a step. If we do not move, then we call it a loop, because the index loops for this computation.

We can arrive from d, making a diagonal step, called step_step as we advance by one in both dimensions. This leads to an alignment of two characters, one from each input, at x, which is combined with the already calculated alignment at d.

We can also just step in the first dimension step_loop, going from l to x. Which means that the first-dim character does not have a partner, leading to an insertdeletion or indel. We typically do not care in which of the two dimensions the in/del happens, just that it does.

The third case is an in/del in the other dimension, giving us loop_step or going from u to x.

Of course, if x happens to be the uppermost, leftmost cell, we have nowhere to come from, so we need to inititialize (or terminate depending on your view point) using the nil_nil case. That one is the base case.

We also want to know which of the three cases is the best case (coming from d,l,u), this requires a "choice" function or h.

We now implement this algorithm using the low-level ADPfusion library. Follow the code from top to bottom for a tutorial on usage. The Needleman-Wunsch tutorial for the FormalGrammars library provides a higher-level style of implementation.

http://hackage.haskell.org/package/FormalGrammars/docs/FormalLanguage-Tutorial-NeedlemanWunsch.html

We also provide an implementation based on grammar products, which simplify the design of alignment-type algorithms. The corresponding tutorial is here.

http://hackage.haskell.org/package/GrammarProducts/docs/FormalLanguage-Tutorial-NeedlemanWunsch.html

Don't forget to inline basically everything!

Synopsis

Module Header

We start by importing a bunch of modules, including Data.PrimitiveArray for low-level arrays and automated filling of the arrays or tables in the correct order.

We also need to import ADP.Fusion to access the high-level code for dynamic programs.

module Main where
import           Control.Monad (forM_)
import           System.Environment (getArgs)
import           Text.Printf

Streams of parses are the streams defined in the vector package.

import qualified Data.Vector.Fusion.Stream.Monadic as SM

We use unboxed vectors to hold the input sequences to be aligned. The terminal parses work with any vector in the vector package.

import qualified Data.Vector.Unboxed as VU

Data.PrimitiveArray contains data structures, and index structures for dynamic programming. Notably, the primitive arrays holding the cell data with Boxed and Unboxed tables. In addition, linear, context-free, and set data structures are made available.

import           Data.PrimitiveArray as PA hiding (map)

ADP.Fusion.Point exposes everything necessary for higher-level DP algorithms. Depending on the type of DP algorithm, different top-level modules can be imported. .Point for linear grammars, and .Core are provided in this package. .Core exports only the core modules required to extend ADPfusion.

import           ADP.Fusion.Point

Signature

A signature connects the types of all non-terminals and terminals with evaluation (or attribute) functions. In the grammar below, we not only want to create all possible parses of how two strings can be aligned but also evaluate each parse and choose the optimal one based on Bellman's principle of optimality.

We take a close look at the type signatures. step_step :: x -> (Z:.c:.c) -> x tells us that step_step requires the score from the non-terminal, typed x for the alignment up to d, then we get the two characters with type c and produce a new non-terminal typed score x. For our simple alignment we'll later choose Char for c and Int for x.

The type of h :: Stream m x -> m r is more interesting. We get a Stream of x's and produce an r monadically. The left part is clear Stream m x are the results from the four rules, but the right part allows us to maybe not only return the best case, types x == r, but maybe the two best cases r == (x,x) or similar. While we do not use this feature here, it makes ADPfusion fully "classified DP" capable, which is a really cool feature for advanced algorithms.

data Signature m x r c = Signature
  { step_step :: x -> (Z:.c :.c ) -> x
  , step_loop :: x -> (Z:.c :.()) -> x
  , loop_step :: x -> (Z:.():.c ) -> x
  , nil_nil   ::      (Z:.():.()) -> x
  , h         :: Stream m x -> m r
  }

Algebra Product

We also want to be able to backtrace the optimal result. Given our alignment, knowing that we get an alignment score of 29 doesn't help us much. But with algebra products we can ask for (optimal <** pretty) and get the optimal result and the parse for it.

The (<**) operator is notoriously difficult to write, so we just compute it ☺.

Note that haddock actually shows (<**), while you just write makeAlgebraProductH ['h] ''Signature (the primes are TemplateHaskell, the only TemplateHaskell we need).

makeAlgebraProduct ''Signature

Grammar

This is the linear grammar in two dimensions describing the Needleman-Wunsch search space. It will, in principle, enumerate the exponential number of possible alignment, but due to memoization and the choice function h, the calculation time is O(N^2) and if only one optimal alignment is requested, backtracking works in linear time.

The grammar first requires a Signature, we use RecordWildCards as a language extension to bind step_step and friends. We also need a variable for the single non-terminal (a), and have two inputs i1 and i2. Each grammar starts with a "base case" Z:. followed by one or more pairs of non-terminal plus rules. Here we have one pair (a, rules), where in rules we combine the four different rules.

step_step <<< a % (M:>chr i1:>chr i2), for example, means that step_step first gets the non-terminal a, which will give the score up to d (see the ascii-art above), followed by (%) the 2-dim terminal for i1 and i2.

Multi-dimensional terminals are built up from the zero-dimension M, separated by :| symbols and just bind the input. In this case we want individual characters from i1 and i2, so we write chr i1 or chr i2.

Different rules are combined with (|||) and the optimal case is selected via ... h. Rules can, of course, be co-recursive, each rule can request all terminal and non-terminal symbols.

Due to the way nil_nil works on Epsilon, nil_nil is actually only called once, when we start the alignment, not for every cell!

However, due to this being a high-performance library, we do not provide a runtime scheme to detect misbehaving rules. Some debug-code is in place that performs certain checks in non-optimized GHCI sessions, but optimized code is built for raw speed, without any checks.

If you are a "conventional" DP programmer, you might miss the usual indices. Just as in the spiritual father of ADPfusion, Robert Giegerichs ADP, we hide the actual index calculations.

grammar Signature{..} !a' !i1 !i2 =
  let a = TW a' ( step_step <<< a % (M:|chr i1:|chr i2)     |||
                  step_loop <<< a % (M:|chr i1:|Deletion  ) |||
                  loop_step <<< a % (M:|Deletion  :|chr i2) |||
                  nil_nil   <<< (M:|Epsilon:|Epsilon)       ... h
                )
  in Z:.a
{-# INLINE grammar #-}

Scoring Algebra

A grammar alone is not enough, we also need to say what a step_step means, or what an optimum is. Here, we do exactly that. We create an instance of the Signature from above. Since this is a simple example, we just say that two aligned characters a and b yield a score of x+1 (with x the previous alignment score) if the characters are identical. Otherwise we lower the score by 2. In/dels incur a cost of -1, meaning that a mismatch is actually the same as two in/dels, one in each dimension. The start of the alignment gives an initial score of 0.

And the optimal choice, of course, is to start with a default of -999999 and find the maximum of that score and the choices we are given.

sScore :: Monad m => Signature m Int Int Char
sScore = Signature
  { step_step = \x (Z:.a:.b) -> if a==b then x+1 else x-2
  , step_loop = \x _         -> x-1
  , loop_step = \x _         -> x-1
  , nil_nil   = const 0
  , h = SM.foldl' max (-999999)
  }
{-# INLINE sScore #-}

Pretty-printing Algebra

Scores alone are still not enough, we also want to pretty-print alignments. An alignment are basically two strings [String 1, String 2], being turned into a whole stream of alignments, using Chars for the individual characters being aligned.

We follow the same theme as in sScore, but this time h = toList, that is the choice function h does not (!) make choice but rather returns all alignments. You already heard about <**, we'll use it below.

sPretty :: Monad m => Signature m [String] [[String]] Char
sPretty = Signature
  { step_step = \[x,y] (Z:.a :.b ) -> [a  :x, b  :y]
  , step_loop = \[x,y] (Z:.a :.()) -> [a  :x, '-':y]
  , loop_step = \[x,y] (Z:.():.b ) -> ['-':x, b  :y]
  , nil_nil   = const ["",""]
  , h = SM.toList
  }
{-# Inline sPretty #-}

The Inside version

Table-filling for the Inside version

The forward or table-filling phase. It is possible to inline this code directly into runNeedlemanWunsch. Here, this phase is separated. If you use ghc-core to examine the GHC Core language, you can search for nwInsideForward and check wether the inside code is optimized well. This is normally not required, and only done here, because these algorithms are used to gauge efficiency of the fusion framework as well.

For your own code, you can write as done here, or in the way of runOutsideNeedlemanWunsch.

nwInsideForward :: VU.Vector Char -> VU.Vector Char -> Z:.TwITbl Id Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL I:.PointL I) Int
nwInsideForward i1 i2 =  mutateTablesDefault $
                          grammar sScore
                          (ITbl 0 0 (Z:.EmptyOk:.EmptyOk) (PA.fromAssocs (Z:.PointL 0:.PointL 0) (Z:.PointL n1:.PointL n2) (-999999) []))
                          i1 i2
  where n1 = VU.length i1
        n2 = VU.length i2
{-# NoInline nwInsideForward #-}

We normally do not want to inline a fully specified algorithm. Compilation times are rather long, and in this way, we only compile once for the library, not every time the algorithm is called.

Backtracking optimal results for the Inside version

nwInsideBacktrack :: VU.Vector Char -> VU.Vector Char -> TwITbl Id Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL I:.PointL I) Int -> [[String]]
nwInsideBacktrack i1 i2 t =  unId $ axiom b
  where !(Z:.b) = grammar (sScore <|| sPretty) (toBacktrack t (undefined :: Id a -> Id a)) i1 i2
                    :: Z:.TwITblBt Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL I:.PointL I) Int Id Id [String]
{-# NoInline nwInsideBacktrack #-}

Glueing the Inside version parts together

The inside grammar, with efficient table-filling (via nwInsideForward) and backtracking. Requests k co-optimal backtrackings, given the inputs i1 and i2. The fst element returned is the score, the snd are the co-optimal parses.

runNeedlemanWunsch :: Int -> String -> String -> (Int,[[String]]) runNeedlemanWunsch k i1' i2' = (d, take k bs) where i1 = VU.fromList i1' i2 = VU.fromList i2' n1 = VU.length i1 n2 = VU.length i2 !(Z:.t) = nwInsideForward i1 i2 d = unId $ axiom t bs = nwInsideBacktrack i1 i2 t {-# Noinline runNeedlemanWunsch #-}

The Outside version

Table-filling for the Outside version

Again, to be able to observe performance, we have extracted the outside-table-filling part.

nwOutsideForward :: VU.Vector Char -> VU.Vector Char -> Z:.TwITbl Id Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL O:.PointL O) Int
nwOutsideForward i1 i2 =  mutateTablesDefault $
                           grammar sScore
                           (ITbl 0 0 (Z:.EmptyOk:.EmptyOk) (PA.fromAssocs (Z:.PointL 0:.PointL 0) (Z:.PointL n1:.PointL n2) (-999999) []))
                           i1 i2
  where n1 = VU.length i1
        n2 = VU.length i2
{-# Noinline nwOutsideForward #-}

Backtracking optimal results for the Outside version

Glueing the Outside version parts together

The outside version of the Needleman-Wunsch alignment algorithm. The outside grammar is identical to the inside grammar! This is not generally the case, but here it is. Hence we may just use outside tables and the grammar from above.

runOutsideNeedlemanWunsch :: Int -> String -> String -> (Int,[[String]])
runOutsideNeedlemanWunsch k i1' i2' =  (d, take k . unId $ axiom b) where
  i1 = VU.fromList i1'
  i2 = VU.fromList i2'
  n1 = VU.length i1
  n2 = VU.length i2
  !(Z:.t) = nwOutsideForward i1 i2
  d = unId $ axiom t
  !(Z:.b) = grammar (sScore <|| sPretty) (toBacktrack t (undefined :: Id a -> Id a)) i1 i2
              :: Z:.TwITblBt Unboxed (Z:.EmptyOk:.EmptyOk) (Z:.PointL O:.PointL O) Int Id Id [String]
{-# Noinline runOutsideNeedlemanWunsch #-}

Wrapping everything up

Alignment wrapper

This wrapper takes a list of input sequences and aligns each odd sequence with the next even sequence. We want one alignment for each such pair.

Since we use basic lists during backtracking, the resulting lists have to be reversed for inside-backtracking. Note that because the Outside grammar is quasi-right-linear, it does not require reversing the two strings.

For real applications, consider using Data.Sequence which has O(1) append and prepend.

align _ [] = return ()
align _ [c] = putStrLn "single last line"
align k (a:b:xs) =  do
  putStrLn a
  putStrLn b
  putStrLn ""
  let (sI,rsI) = runNeedlemanWunsch k a b
  let (sO,rsO) = runOutsideNeedlemanWunsch k a b
  forM_ rsI $ \[u,l] -> printf "%s\n%s  %d\n\n" (reverse u) (reverse l) sI
  forM_ rsO $ \[u,l] -> printf "%s\n%s  %d\n\n" (id      u) (id      l) sO
  align k xs

Main

And finally have a minimal main that reads from stdio.

If you are brave enough then put this through ghc-core and look for nwInsideForward or nwOutsideForward in the CORE. Everything coming from the forward phase should be beautifully optimized and the algorithm should run quite fast.

main = do
  as <- getArgs
  let k = if null as then 1 else read $ head as
  ls <- lines <$> getContents
  align k ls