Safe Haskell	None
Language	Haskell2010

Bio.Adna

Description

Things specific to ancient DNA, e.g. damage models.

For aDNA, we need a substitution probability. We have three options: use an empirically determined PSSM, use an arithmetically defined PSSM based on the Johnson model, use a context sensitive PSSM based on the Johnson model and an alignment. Using Dindel, actual substitutions relative to a called haplotype would be taken into account. Since we're not going to do that, taking alignments into account is difficult, somewhat approximate, and therefore not worth the hassle.

Synopsis

data DmgStats a = DmgStats {
- basecompo5 :: CompositionStats
- basecompo3 :: CompositionStats
- substs5 :: SubstitutionStats
- substs3 :: SubstitutionStats
- substs5d5 :: SubstitutionStats
- substs3d5 :: SubstitutionStats
- substs5d3 :: SubstitutionStats
- substs3d3 :: SubstitutionStats
- substs5dd :: SubstitutionStats
- substs3dd :: SubstitutionStats
- substs5cpg :: SubstitutionStats
- substs3cpg :: SubstitutionStats
- stats_more :: a
}
type CompositionStats = [(Maybe Nucleotide, Vector Int)]
type SubstitutionStats = [(Subst, Vector Int)]
addFragType :: BamMeta -> Enumeratee [BamRaw] [(BamRaw, FragType)] m b
damagePatternsIter :: MonadIO m => Int -> Int -> Iteratee [Alignment] m b -> Iteratee [(BamRec, FragType, Vector Word8, Vector NPair)] m (DmgStats b)
damagePatternsIterMD :: MonadIO m => Int -> Iteratee [Alignment] m b -> Iteratee [(BamRaw, FragType)] m (DmgStats b)
damagePatternsIter2Bit :: MonadIO m => Refs -> TwoBitFile -> Int -> Int -> Iteratee [Alignment] m b -> Iteratee [(BamRaw, FragType)] m (DmgStats b)
alnFromMd :: Vector_Nucs_half Nucleotides -> Vector Cigar -> [MdOp] -> Vector NPair
data DamageParameters float = DP {
- ssd_sigma :: !float
- ssd_delta :: !float
- ssd_lambda :: !float
- ssd_kappa :: !float
- dsd_sigma :: !float
- dsd_delta :: !float
- dsd_lambda :: !float
}
data NewDamageParameters vec float = NDP {
- dp_gc_frac :: !float
- dp_mu :: !float
- dp_nu :: !float
- dp_alpha5 :: !(vec float)
- dp_beta5 :: !(vec float)
- dp_alpha :: !float
- dp_beta :: !float
- dp_alpha3 :: !(vec float)
- dp_beta3 :: !(vec float)
}
data GenDamageParameters vec float
- = UnknownDamage
- | OldDamage (DamageParameters float)
- | NewDamage (NewDamageParameters vec float)
type DamageModel = Bool -> Int -> Int -> Mat44D
bang :: Mat44D -> Subst -> Double
nudge :: MMat44D -> Subst -> Double -> IO ()
data Alignment = ALN {
- a_sequence :: !(Vector NPair)
- a_fragment_type :: !FragType
}
data FragType
- = Complete
- | Leading
- | Trailing
data Subst = Nucleotide :-> Nucleotide
data NPair
npair :: Nucleotides -> Nucleotides -> NPair
fst_np :: NPair -> Nucleotides
snd_np :: NPair -> Nucleotides
noDamage :: DamageModel
univDamage :: DamageParameters Double -> DamageModel
empDamage :: NewDamageParameters Vector Double -> DamageModel
newtype Mat44D = Mat44D (Vector Double)
newtype MMat44D = MMat44D (IOVector Double)
scalarMat :: Double -> Mat44D
complMat :: Mat44D -> Mat44D
freezeMats :: MMat44D -> MMat44D -> IO Mat44D
bwa_cal_maxdiff :: Double -> Int -> Int

Documentation

data DmgStats a Source #

Collected "traditional" statistics:

Base composition near 5' end and near 3' end. Each consists of five vectors of counts of A,C,G,T, and everything else. basecompo5 begins with context bases to the left of the 5' end, basecompo3 ends with context bases to the right of the 3' end.
Substitutions. Counted from the reconstructed alignment, once around the 5' end and once around the 3' end. For a total of 2*4*4 different substitutions. Positions where the query has a gap are skipped.
Substitutions at CpG motifs. Also counted from the reconstructed alignment, and a CpG site is simply the sequence CG in the reference. Gaps may confuse that definition, so that CpHpG still counts as CpG, because the H is gapped. That might actually be desirable.
Conditional substitutions. The 5' and 3' ends count as damaged if the very last position has a C-to-T substitution. With that in mind, substs5d5, substs5d3, substs5dd are like substs5, but counting only reads where the 5' end is damaged, where the 3' end is damaged, and where both ends are damaged, respectively.

Constructors

DmgStats

Fields

Instances

Show a => Show (DmgStats a) Source #
Instance details Defined in Bio.Adna Methods showsPrec :: Int -> DmgStats a -> ShowS # show :: DmgStats a -> String # showList :: [DmgStats a] -> ShowS #
Semigroup a => Semigroup (DmgStats a) Source #
Instance details Defined in Bio.Adna Methods (<>) :: DmgStats a -> DmgStats a -> DmgStats a # sconcat :: NonEmpty (DmgStats a) -> DmgStats a # stimes :: Integral b => b -> DmgStats a -> DmgStats a #
Monoid a => Monoid (DmgStats a) Source #
Instance details Defined in Bio.Adna Methods mempty :: DmgStats a # mappend :: DmgStats a -> DmgStats a -> DmgStats a # mconcat :: [DmgStats a] -> DmgStats a #

type CompositionStats = [(Maybe Nucleotide, Vector Int)] Source #

type SubstitutionStats = [(Subst, Vector Int)] Source #

addFragType :: BamMeta -> Enumeratee [BamRaw] [(BamRaw, FragType)] m b Source #

damagePatternsIter :: MonadIO m => Int -> Int -> Iteratee [Alignment] m b -> Iteratee [(BamRec, FragType, Vector Word8, Vector NPair)] m (DmgStats b) Source #

Common logic for statistics. The function get_ref_and_aln reconstructs reference sequence and alignment from a Bam record. It is expected to construct the alignment with respect to the forwards strand of the reference; we reverse-complement it if necessary.

damagePatternsIterMD :: MonadIO m => Int -> Iteratee [Alignment] m b -> Iteratee [(BamRaw, FragType)] m (DmgStats b) Source #

Enumeratee (almost) that computes some statistics from plain BAM with a valid MD field. The Alignment is also reconstructed and passed downstream. The result of any downstream processing is available in the stats_more field of the result.

Reconstruct the alignment from CIGAR, SEQ, and MD.
Filter the alignment to get the reference sequence, accumulate it.
Accumulate everything over the alignment.

The argument is the amount of context inside desired.

For Complete fragments, we cut the read in the middle, so the 5' and 3' plots stay clean from each other's influence. Leading and Trailing fragments count completely towards the appropriate end.

damagePatternsIter2Bit :: MonadIO m => Refs -> TwoBitFile -> Int -> Int -> Iteratee [Alignment] m b -> Iteratee [(BamRaw, FragType)] m (DmgStats b) Source #

Enumeratee (almost) that computes some statistics from plain BAM (no MD field needed) and a 2bit file. The Alignment is also reconstructed and passed downstream. The result of any downstream processing is available in the stats_more field of the result.

Get the reference sequence including both contexts once. If this includes invalid sequence (negative coordinate), pad suitably.
Accumulate counts for the valid parts around 5' and 3' ends as appropriate from flags and config.
Combine the part that was aligned to (so no context) with the read to reconstruct the alignment.

Arguments are the table of reference names, the 2bit file with the reference, the amount of context outside the alignment desired, and the amount of context inside desired.

For Complete fragments, we cut the read in the middle, so the 5' and 3' plots stay clean from each other's influence. Leading and Trailing fragments count completely towards the appropriate end.

alnFromMd :: Vector_Nucs_half Nucleotides -> Vector Cigar -> [MdOp] -> Vector NPair Source #

Reconstructs the alignment from query, cigar, and md. Only positions where the query is not gapped are produced.

data DamageParameters float Source #

Parameters for the universal damage model.

We assume the correct model is either no damage, or single strand damage, or double strand damage. Each of them comes with a probability. It turns out that blending them into one is simply accomplished by multiplying these probabilities onto the deamination probabilities.

For single stranded library prep, only one kind of damage occurs (C frequency (ssd_sigma) in single stranded parts, and the overhang length is distributed exponentially with parameter ssd_lambda at the 5' end and ssd_kappa at the 3' end. (Without UDG treatment, those will be equal. With UDG, those are much smaller and in fact don't literally represent overhangs.)

For double stranded library prep, we get C->T damage at the 5' end and G->A at the 3' end with rate dsd_sigma and both in the interior with rate dsd_delta. Everything is symmetric, and therefore the orientation of the aligned read doesn't matter either. Both overhangs follow a distribution with parameter dsd_lambda.

Constructors

DP
Fields ssd_sigma :: !float ssd_delta :: !float ssd_lambda :: !float ssd_kappa :: !float dsd_sigma :: !float dsd_delta :: !float dsd_lambda :: !float

Instances

Read float => Read (DamageParameters float) Source #
Instance details Defined in Bio.Adna Methods readsPrec :: Int -> ReadS (DamageParameters float) # readList :: ReadS [DamageParameters float] # readPrec :: ReadPrec (DamageParameters float) # readListPrec :: ReadPrec [DamageParameters float] #
Show float => Show (DamageParameters float) Source #
Instance details Defined in Bio.Adna Methods showsPrec :: Int -> DamageParameters float -> ShowS # show :: DamageParameters float -> String # showList :: [DamageParameters float] -> ShowS #
Generic (DamageParameters float) Source #
Instance details Defined in Bio.Adna Associated Types type Rep (DamageParameters float) :: * -> * # Methods from :: DamageParameters float -> Rep (DamageParameters float) x # to :: Rep (DamageParameters float) x -> DamageParameters float #
type Rep (DamageParameters float) Source #
Instance details Defined in Bio.Adna type Rep (DamageParameters float) = D1 (MetaData "DamageParameters" "Bio.Adna" "biohazard-1.0.3-9lbnRlyNTeQ50qXuojKpKS" False) (C1 (MetaCons "DP" PrefixI True) ((S1 (MetaSel (Just "ssd_sigma") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float) :: (S1 (MetaSel (Just "ssd_delta") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float) :: S1 (MetaSel (Just "ssd_lambda") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float))) :: ((S1 (MetaSel (Just "ssd_kappa") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float) :: S1 (MetaSel (Just "dsd_sigma") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float)) :: (S1 (MetaSel (Just "dsd_delta") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float) :: S1 (MetaSel (Just "dsd_lambda") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float)))))

data NewDamageParameters vec float Source #

Constructors

NDP
Fields dp_gc_frac :: !float dp_mu :: !float dp_nu :: !float dp_alpha5 :: !(vec float) dp_beta5 :: !(vec float) dp_alpha :: !float dp_beta :: !float dp_alpha3 :: !(vec float) dp_beta3 :: !(vec float)

Instances

(Read float, Read (vec float)) => Read (NewDamageParameters vec float) Source #
Instance details Defined in Bio.Adna Methods readsPrec :: Int -> ReadS (NewDamageParameters vec float) # readList :: ReadS [NewDamageParameters vec float] # readPrec :: ReadPrec (NewDamageParameters vec float) # readListPrec :: ReadPrec [NewDamageParameters vec float] #
(Show float, Show (vec float)) => Show (NewDamageParameters vec float) Source #
Instance details Defined in Bio.Adna Methods showsPrec :: Int -> NewDamageParameters vec float -> ShowS # show :: NewDamageParameters vec float -> String # showList :: [NewDamageParameters vec float] -> ShowS #
Generic (NewDamageParameters vec float) Source #
Instance details Defined in Bio.Adna Associated Types type Rep (NewDamageParameters vec float) :: * -> * # Methods from :: NewDamageParameters vec float -> Rep (NewDamageParameters vec float) x # to :: Rep (NewDamageParameters vec float) x -> NewDamageParameters vec float #
type Rep (NewDamageParameters vec float) Source #
Instance details Defined in Bio.Adna type Rep (NewDamageParameters vec float) = D1 (MetaData "NewDamageParameters" "Bio.Adna" "biohazard-1.0.3-9lbnRlyNTeQ50qXuojKpKS" False) (C1 (MetaCons "NDP" PrefixI True) (((S1 (MetaSel (Just "dp_gc_frac") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float) :: S1 (MetaSel (Just "dp_mu") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float)) :: (S1 (MetaSel (Just "dp_nu") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float) :: S1 (MetaSel (Just "dp_alpha5") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 (vec float)))) :: ((S1 (MetaSel (Just "dp_beta5") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 (vec float)) :: S1 (MetaSel (Just "dp_alpha") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float)) :: (S1 (MetaSel (Just "dp_beta") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 float) :: (S1 (MetaSel (Just "dp_alpha3") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 (vec float)) :: S1 (MetaSel (Just "dp_beta3") NoSourceUnpackedness SourceStrict DecidedStrict) (Rec0 (vec float)))))))

data GenDamageParameters vec float Source #

Constructors

UnknownDamage
OldDamage (DamageParameters float)
NewDamage (NewDamageParameters vec float)

Instances

(Read float, Read (vec float)) => Read (GenDamageParameters vec float) Source #
Instance details Defined in Bio.Adna Methods readsPrec :: Int -> ReadS (GenDamageParameters vec float) # readList :: ReadS [GenDamageParameters vec float] # readPrec :: ReadPrec (GenDamageParameters vec float) # readListPrec :: ReadPrec [GenDamageParameters vec float] #
(Show float, Show (vec float)) => Show (GenDamageParameters vec float) Source #
Instance details Defined in Bio.Adna Methods showsPrec :: Int -> GenDamageParameters vec float -> ShowS # show :: GenDamageParameters vec float -> String # showList :: [GenDamageParameters vec float] -> ShowS #
Generic (GenDamageParameters vec float) Source #
Instance details Defined in Bio.Adna Associated Types type Rep (GenDamageParameters vec float) :: * -> * # Methods from :: GenDamageParameters vec float -> Rep (GenDamageParameters vec float) x # to :: Rep (GenDamageParameters vec float) x -> GenDamageParameters vec float #
type Rep (GenDamageParameters vec float) Source #
Instance details Defined in Bio.Adna type Rep (GenDamageParameters vec float) = D1 (MetaData "GenDamageParameters" "Bio.Adna" "biohazard-1.0.3-9lbnRlyNTeQ50qXuojKpKS" False) (C1 (MetaCons "UnknownDamage" PrefixI False) (U1 :: * -> *) :+: (C1 (MetaCons "OldDamage" PrefixI False) (S1 (MetaSel (Nothing :: Maybe Symbol) NoSourceUnpackedness NoSourceStrictness DecidedLazy) (Rec0 (DamageParameters float))) :+: C1 (MetaCons "NewDamage" PrefixI False) (S1 (MetaSel (Nothing :: Maybe Symbol) NoSourceUnpackedness NoSourceStrictness DecidedLazy) (Rec0 (NewDamageParameters vec float)))))

type DamageModel = Bool -> Int -> Int -> Mat44D Source #

A DamageModel is a function that gives substitution matrices for each position in a read. The DamageModel can depend on whether the alignment is reversed, the length of the read and the position. (In practice, we should probably memoize precomputed damage models somehow.)

bang :: Mat44D -> Subst -> Double infix 8 Source #

Convenience function to access a substitution matrix that has a mnemonic reading.

nudge :: MMat44D -> Subst -> Double -> IO () Source #

data Alignment Source #

Alignment record. The reference sequence is filled with Ns if missing.

Constructors

ALN
Fields a_sequence :: !(Vector NPair) a_fragment_type :: !FragType

data FragType Source #

Constructors

Complete
Leading
Trailing

Instances

Eq FragType Source #
Instance details Defined in Bio.Adna Methods (==) :: FragType -> FragType -> Bool # (/=) :: FragType -> FragType -> Bool #
Show FragType Source #
Instance details Defined in Bio.Adna Methods showsPrec :: Int -> FragType -> ShowS # show :: FragType -> String # showList :: [FragType] -> ShowS #

data Subst Source #

Constructors

Nucleotide :-> Nucleotide infix 9

Instances

Eq Subst Source #
Instance details Defined in Bio.Adna Methods (==) :: Subst -> Subst -> Bool # (/=) :: Subst -> Subst -> Bool #
Ord Subst Source #
Instance details Defined in Bio.Adna Methods compare :: Subst -> Subst -> Ordering # (<) :: Subst -> Subst -> Bool # (<=) :: Subst -> Subst -> Bool # (>) :: Subst -> Subst -> Bool # (>=) :: Subst -> Subst -> Bool # max :: Subst -> Subst -> Subst # min :: Subst -> Subst -> Subst #
Show Subst Source #
Instance details Defined in Bio.Adna Methods showsPrec :: Int -> Subst -> ShowS # show :: Subst -> String # showList :: [Subst] -> ShowS #
Ix Subst Source #
Instance details Defined in Bio.Adna Methods range :: (Subst, Subst) -> [Subst] # index :: (Subst, Subst) -> Subst -> Int # unsafeIndex :: (Subst, Subst) -> Subst -> Int inRange :: (Subst, Subst) -> Subst -> Bool # rangeSize :: (Subst, Subst) -> Int # unsafeRangeSize :: (Subst, Subst) -> Int

data NPair Source #

Compact storage of a pair of ambiguous Nucleotides. Used to represent alignments in a way that is accessible even to assembly code. The first and sencond field are stored in the low and high nybble, respectively. See fst_np, snd_np, npair.

Instances

Eq NPair Source #
Instance details Defined in Bio.Adna Methods (==) :: NPair -> NPair -> Bool # (/=) :: NPair -> NPair -> Bool #
Ord NPair Source #
Instance details Defined in Bio.Adna Methods compare :: NPair -> NPair -> Ordering # (<) :: NPair -> NPair -> Bool # (<=) :: NPair -> NPair -> Bool # (>) :: NPair -> NPair -> Bool # (>=) :: NPair -> NPair -> Bool # max :: NPair -> NPair -> NPair # min :: NPair -> NPair -> NPair #
Show NPair Source #
Instance details Defined in Bio.Adna Methods showsPrec :: Int -> NPair -> ShowS # show :: NPair -> String # showList :: [NPair] -> ShowS #
Storable NPair Source #
Instance details Defined in Bio.Adna Methods sizeOf :: NPair -> Int # alignment :: NPair -> Int # peekElemOff :: Ptr NPair -> Int -> IO NPair # pokeElemOff :: Ptr NPair -> Int -> NPair -> IO () # peekByteOff :: Ptr b -> Int -> IO NPair # pokeByteOff :: Ptr b -> Int -> NPair -> IO () # peek :: Ptr NPair -> IO NPair # poke :: Ptr NPair -> NPair -> IO () #

npair :: Nucleotides -> Nucleotides -> NPair Source #

fst_np :: NPair -> Nucleotides Source #

snd_np :: NPair -> Nucleotides Source #

noDamage :: DamageModel Source #

DamageModel for undamaged DNA. The likelihoods follow directly from the quality score. This needs elaboration to see what to do with amibiguity codes (even though those haven't actually been observed in the wild).

univDamage :: DamageParameters Double -> DamageModel Source #

empDamage :: NewDamageParameters Vector Double -> DamageModel Source #

newtype Mat44D Source #

We represent substitution matrices by the type Mat44D. Internally, this is a vector of packed vectors. Conveniently, each of the packed vectors represents all transitions into the given nucleotide.

Constructors

Mat44D (Vector Double)

Instances

Show Mat44D Source #
Instance details Defined in Bio.Adna Methods showsPrec :: Int -> Mat44D -> ShowS # show :: Mat44D -> String # showList :: [Mat44D] -> ShowS #
Generic Mat44D Source #
Instance details Defined in Bio.Adna Associated Types type Rep Mat44D :: * -> * # Methods from :: Mat44D -> Rep Mat44D x # to :: Rep Mat44D x -> Mat44D #
type Rep Mat44D Source #
Instance details Defined in Bio.Adna type Rep Mat44D = D1 (MetaData "Mat44D" "Bio.Adna" "biohazard-1.0.3-9lbnRlyNTeQ50qXuojKpKS" True) (C1 (MetaCons "Mat44D" PrefixI False) (S1 (MetaSel (Nothing :: Maybe Symbol) NoSourceUnpackedness NoSourceStrictness DecidedLazy) (Rec0 (Vector Double))))

newtype MMat44D Source #

Constructors

MMat44D (IOVector Double)

scalarMat :: Double -> Mat44D Source #

complMat :: Mat44D -> Mat44D Source #

freezeMats :: MMat44D -> MMat44D -> IO Mat44D Source #

Adds the two matrices of a mutable substitution model (one for each strand) appropriately, normalizes the result (to make probabilities from pseudo-counts), and freezes that into one immutable matrix. We add a single count everywhere to avoid getting NaNs from bizarre data.

bwa_cal_maxdiff :: Double -> Int -> Int Source #

Number of mismatches allowed by BWA. bwa_cal_maxdiff thresh len returns the number of mismatches bwa aln -n $tresh would allow in a read of length len. For reference, here is the code from BWA that computes it (we assume err = 0.02, just like BWA):

int bwa_cal_maxdiff(int l, double err, double thres)
  {
     double elambda = exp(-l * err);
     double sum, y = 1.0;
     int k, x = 1;
     for (k = 1, sum = elambda; k < 1000; ++k) {
         y *= l * err;
         x *= k;
         sum += elambda * y / x;
         if (1.0 - sum < thres) return k;
     }
     return 2;
  }