bio-0.4.7: A bioinformatics library

Bio.Sequence.SFF_filters

Description

This module implements a number of filters used in the Titanium pipeline

Discarding filters

type DiscardFilter = ReadBlock -> Bool

A DiscardFilter determines whether a read is to be retained or discarded

filter_length :: Int -> DiscardFilter

  1. 2.2.1.2 The dots filter discards sequences where the last positive flow is before flow 84, and sequences with >5% dots (i.e. three successive noise values) before the last positive flow. (Interpreted as: 5% of the called sequence length is Ns?)
  2. 2.2.1.3 The mixed filter discards sequences with more than 70% positive flows. It also discards sequences with 30% noise, 20% middle (0.45..0.75) or <30% positive flows.

Discard a read if the number of untrimmed flows is less than n (n=186 for Titanium)
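
As a minimal sketch of how a length filter of this kind could look, the code below uses a simplified stand-in for ReadBlock. The type SimpleRead, its fields, and the helpers untrimmedFlows, lengthFilter and retain are hypothetical names chosen for illustration and are not part of the library. The sketch also assumes that True means "discard"; the documentation above does not say which way the Bool points.

```haskell
-- Hypothetical, simplified stand-in for ReadBlock: only what the sketch needs.
data SimpleRead = SimpleRead
  { flowValues :: [Double]  -- one signal value per flow
  , clipFlow   :: Int       -- number of flows kept after trimming (0 = not trimmed)
  }

-- Assumption: True means "discard this read".
type DiscardFilter = SimpleRead -> Bool

-- Number of flows that survive trimming.
untrimmedFlows :: SimpleRead -> Int
untrimmedFlows r
  | clipFlow r == 0 = length (flowValues r)
  | otherwise       = min (clipFlow r) (length (flowValues r))

-- Discard a read that has fewer than n untrimmed flows.
lengthFilter :: Int -> DiscardFilter
lengthFilter n r = untrimmedFlows r < n

-- Keep only the reads that none of the filters wants to discard.
retain :: [DiscardFilter] -> [SimpleRead] -> [SimpleRead]
retain fs = filter (\r -> not (any ($ r) fs))
```

With n = 186, the value quoted above for Titanium, `retain [lengthFilter 186] reads` would drop every read with fewer than 186 untrimmed flows.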

Trimming filters

type TrimFilter = ReadBlock -> ReadBlock

TrimFilters modify the read, typically trimming it for quality

sigint :: ReadBlock -> Int

  1. 2.2.1.4 Signal intensity trim - trim back until <3% borderline flows (0.5..0.7). Then trim borderline values or dots from the end (use a window).
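
The two steps described above can be sketched on a bare list of flow values. This is an illustration under stated assumptions, not the library's implementation: it works on [Double] rather than ReadBlock, it ignores dots, and it trims trailing borderline values one at a time instead of using a window. The names borderline, lowBorderlinePrefix and sigintSketch are hypothetical.

```haskell
import Data.List (dropWhileEnd)

-- A flow value is "borderline" if it lies in 0.5..0.7 (range taken from the text).
borderline :: Double -> Bool
borderline v = v >= 0.5 && v <= 0.7

-- Step 1: length of the longest prefix of flows in which fewer than 3% of
-- the values are borderline. Returns 0 if no non-empty prefix qualifies.
lowBorderlinePrefix :: [Double] -> Int
lowBorderlinePrefix flows =
  case [ n | (n, b) <- reverse (zip [1 ..] counts), ok n b ] of
    (n : _) -> n
    []      -> 0
  where
    -- The k-th element is the number of borderline flows among the first k.
    counts :: [Int]
    counts = drop 1 (scanl step 0 flows)
    step acc v = if borderline v then acc + 1 else acc
    ok :: Int -> Int -> Bool
    ok n b = (fromIntegral b :: Double) < 0.03 * fromIntegral n

-- Step 2 (simplified, no window): drop borderline values from the end of
-- the prefix found in step 1 and return the number of flows kept.
sigintSketch :: [Double] -> Int
sigintSketch flows =
  length (dropWhileEnd borderline (take (lowBorderlinePrefix flows) flows))
```

Returning a flow count mirrors the Int result of sigint; turning that count into a sequence position is a separate step.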

qual20 :: ReadBlock -> Int

  1. 2.2.1.7 Quality score trim - trim using a 10-base window until a Q20 average is found.
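
One way to read this rule is sketched below on a bare list of per-base quality scores. The sketch assumes that trimming proceeds from the 3' end and that a read with no qualifying window is trimmed to nothing; neither detail is stated above. The name qual20Sketch is hypothetical.

```haskell
-- Right clip position (number of bases kept). Assumption: trimming proceeds
-- from the 3' end, sliding a 10-base window leftwards one base at a time
-- until the mean quality inside the window is at least 20.
qual20Sketch :: [Int] -> Int
qual20Sketch quals = go (length quals)
  where
    window = 10
    go end
      | end < window = 0  -- no full window reaches Q20: nothing is kept
      | mean (take window (drop (end - window) quals)) >= 20 = end
      | otherwise = go (end - 1)
    mean xs = fromIntegral (sum xs) / fromIntegral (length xs) :: Double
```

For 50 bases of quality 30 followed by 20 bases of quality 5, the sketch keeps 54 bases: the window covering bases 45 to 54 is the rightmost one whose mean reaches 20.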

Utility functions

dlength :: [a] -> Double

List length as a double (eliminates many instances of fromIntegral)

avg :: Integral a => [a] -> Double

Calculate average of a list
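
Definitions consistent with these two signatures could look like the following. This is a plausible sketch only and may differ from the library's source.

```haskell
-- List length as a Double.
dlength :: [a] -> Double
dlength = fromIntegral . length

-- Arithmetic mean of a list of integral values.
-- Note: an empty list yields NaN (0/0), which callers may want to guard against.
avg :: Integral a => [a] -> Double
avg xs = fromIntegral (sum xs) / dlength xs
```

For example, `avg [1, 2, 3, 4 :: Int]` evaluates to 2.5.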

clipFlows :: ReadBlock -> Int -> ReadBlock

Translate a number of flows to position in sequence, and update clipping data accordingly
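
The flow-to-position translation can be sketched as follows, assuming the per-base flow index is available as a list of increments (in the SFF format, flow_index_per_base stores for each base the offset from the previous base's flow position, with 1-based flow numbering). The function name flowsToSeqPos is hypothetical, and updating the clipping data is left out.

```haskell
-- Number of bases called within the first n flows, given the per-base flow
-- index increments. The running sum of the increments is the flow position
-- of each base; bases in a homopolymer run after the first have increment 0.
flowsToSeqPos :: [Int] -> Int -> Int
flowsToSeqPos increments n =
  length (takeWhile (<= n) (scanl1 (+) increments))
```

For increments [1, 2, 0, 3, 1] the flow positions of the bases are 1, 3, 3, 6 and 7, so the first 5 flows cover 3 bases and the first 6 flows cover 4.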

clipSeq :: ReadBlock -> Int -> ReadBlock

Update clip_qual_right if more severe than previous value
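
The "only if more severe" rule can be sketched on a minimal record. The type Clip and the function tightenRight are hypothetical stand-ins for the real ReadBlock update, and the sketch assumes that a value of 0 means "no clip set", as in the SFF read header convention.

```haskell
-- Hypothetical clip record; only the right quality clip is modelled.
newtype Clip = Clip { clipQualRight :: Int } deriving (Eq, Show)

-- Tighten the right clip to position p, but never loosen it.
-- Assumptions: 0 means "no clip set", and p is expected to be >= 1.
tightenRight :: Clip -> Int -> Clip
tightenRight c p
  | old == 0 || p < old = Clip p
  | otherwise           = c
  where
    old = clipQualRight c
```

Because a looser position is ignored, trimming filters applied in any order end up with the same right clip.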