bio-0.4.8: A bioinformatics library

Bio.Sequence.SFF

Description

Read and write the SFF file format used by Roche/454 sequencing to store flowgram data.

A flowgram is a series of values (intensities) representing homopolymer runs of A, G, C, and T in a fixed cycle, usually displayed as a histogram.
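As a rough illustration of how a flowgram maps to bases, the sketch below rounds each intensity to a whole number of bases and emits that many copies of the nucleotide for that flow. The helper name `callBases` is hypothetical and not part of this module; the flow order is passed in as a parameter, and the assumption that intensities are stored in hundredths of a base (100 is roughly one base) should be checked against your data.

```haskell
-- Hypothetical helper, not part of Bio.Sequence.SFF.
-- Assumes flow values are in hundredths of a base (100 ~ one base).
callBases :: String -> [Int] -> String
callBases flowOrder flows
  | null flowOrder = ""
  | otherwise = concat (zipWith call (cycle flowOrder) flows)
  where
    -- Round to the nearest whole number of bases; negative counts give "".
    call nuc v = replicate ((v + 50) `div` 100) nuc
```

For example, `callBases "TACG" [102, 3, 198, 95, 0, 110]` yields `"TCCGA"`: the near-zero flows contribute nothing, and the flow near 200 contributes a two-base homopolymer run.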

The Staden Package contains an io_lib, with a C routine for parsing this format. According to comments in the sources, the io_lib implementation is based on a file called getsff.c, which I've been unable to track down.

It is believed that all values are stored big endian.
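To make the byte order concrete, here is a minimal sketch that decodes a big-endian 32-bit value from the start of a ByteString and compares it with the magic number documented for the common header. The names `be32`, `sffMagic`, and `hasSffMagic` are hypothetical helpers for illustration, not functions exported by this module.

```haskell
import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as B
import Data.Word (Word32)

-- Hypothetical helper: read the first four bytes as a big-endian Word32.
be32 :: B.ByteString -> Maybe Word32
be32 bs
  | B.length bs < 4 = Nothing
  | otherwise = Just (B.foldl' step 0 (B.take 4 bs))
  where
    -- Most significant byte comes first, so shift before adding each byte.
    step acc byte = (acc `shiftL` 8) .|. fromIntegral byte

-- The magic number 0x2e736666, i.e. the bytes of the string ".sff".
sffMagic :: Word32
sffMagic = 0x2e736666

-- True when the input starts with the SFF magic number.
hasSffMagic :: B.ByteString -> Bool
hasSffMagic bs = be32 bs == Just sffMagic
```

A little-endian reader would see 0x6666732e for the same bytes, so a check like this is a cheap way to confirm the byte order on a real file.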

Documentation

data SFF

The data structure storing the contents of an SFF file (modulo the index)

Constructors

SFF !CommonHeader [ReadBlock] 

data CommonHeader

SFF has a 31-byte common header.

Todo: remove items that are derivable (counters, magic, etc.).

cheader_lenght points to the first read header. Also, the format allows the index to appear anywhere between reads, so we should really keep count and check for each read. In practice, it seems to be placed after the reads.

The following two fields are considered part of the header, but as they are static, they are not part of the data structure:

    magic   :: Word32 -- ^ 0x2e736666, i.e. the string .sff
    version :: Word32 -- ^ 0x00000001

Constructors

CommonHeader 

data ReadBlock

This contains the actual flowgram for a single read.

writeSFF :: FilePath -> SFF -> IO ()

Write an SFF to the specified file name

writeSFF' :: FilePath -> SFF -> IO Int

Write an SFF to the specified file name, but go back and update the read count. Useful if you want to output a lazy stream of ReadBlocks. Returns the number of reads written.
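The "write, then go back and update the count" idea can be sketched on a toy format. The example below is not the SFF layout and not this library's implementation: it assumes a made-up file consisting of a 4-byte big-endian record count followed by the raw records, and the names `putBE32` and `writeCounted` are hypothetical.

```haskell
import Data.Bits (shiftR)
import qualified Data.ByteString as B
import Data.Word (Word32)
import System.IO

-- Encode a Word32 as four big-endian bytes.
putBE32 :: Word32 -> B.ByteString
putBE32 w = B.pack [fromIntegral (w `shiftR` s) | s <- [24, 16, 8, 0]]

-- Toy format (an assumption for illustration, not the SFF layout):
-- a 4-byte record count followed by the records themselves.
writeCounted :: FilePath -> [B.ByteString] -> IO Int
writeCounted path records =
  withBinaryFile path WriteMode $ \h -> do
    B.hPut h (putBE32 0)                -- placeholder count
    n <- go h 0 records                 -- stream records, counting as we go
    hSeek h AbsoluteSeek 0              -- go back to the count field
    B.hPut h (putBE32 (fromIntegral n)) -- overwrite the placeholder
    pure n
  where
    go _ n [] = pure n
    go h n (r : rs) = B.hPut h r >> (go h $! n + 1) rs
```

Because the records are consumed one at a time and the count is kept strictly, the input list can be produced lazily without holding all records in memory, which is the use case the description above mentions.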

trim :: ReadBlock -> ReadBlock

Trim a read according to clipping information
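As a sketch of what trimming by clipping information can look like, the code below combines quality and adapter clip points into one range and applies it to a sequence. It assumes the usual reading of the SFF read header, where clip points are 1-based and inclusive and 0 means "not set"; `clipRange` and `clipSeq` are hypothetical helpers working on plain values, not this module's `trim`.

```haskell
-- Hypothetical helper. Clip points are 1-based and inclusive, and 0 means
-- "not set" (an assumption about the read header convention).
clipRange :: Int -> (Int, Int) -> (Int, Int) -> (Int, Int)
clipRange numBases (qualLeft, qualRight) (adapterLeft, adapterRight) =
  (left, right)
  where
    -- The later of the two left clips, but never before the first base.
    left = maximum [1, qualLeft, adapterLeft]
    -- The earlier of the two right clips, treating 0 as "end of read".
    right = min (orEnd qualRight) (orEnd adapterRight)
    orEnd 0 = numBases
    orEnd r = min r numBases

-- Keep only the bases inside the clip range.
clipSeq :: (Int, Int) -> String -> String
clipSeq (left, right) = take (right - left + 1) . drop (left - 1)
```

With a 10-base read, `clipRange 10 (3, 8) (0, 0)` gives `(3, 8)`, and `clipSeq (3, 8) "ACGTACGTAC"` gives `"GTACGT"`.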

trimFromTo :: Integral i => i -> i -> ReadBlock -> ReadBlock

Trim a read to specific sequence positions. The current implementation has the unintended side effect of always trimming the flowgram down to a basecalled position.

baseToFlowPos :: Integral i => ReadBlock -> i -> Int

Convert a sequence position to the corresponding flow position

flowToBasePos :: Integral i => ReadBlock -> i -> Int

Convert a flow position to the corresponding sequence position
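One way to picture these two conversions is with a list of per-base flow increments, where each entry says how many flows to advance before the next base is called (0 for a repeated base within the same homopolymer run). The sketch below assumes that layout and 1-based positions; `baseToFlow` and `flowToBase` are hypothetical stand-ins on plain lists, and the library's own functions may index differently.

```haskell
-- Hypothetical helpers over a plain list of per-base flow increments
-- (an assumption about how the flow index is laid out).

-- Flow position (1-based) of the n-th base (1-based).
baseToFlow :: [Int] -> Int -> Int
baseToFlow increments n = sum (take n increments)

-- Number of bases called at or before the given flow position.
flowToBase :: [Int] -> Int -> Int
flowToBase increments f = length (takeWhile (<= f) (scanl1 (+) increments))
```

For increments `[1, 2, 0, 1]` the bases sit at flows 1, 3, 3, and 4, so `baseToFlow [1, 2, 0, 1] 3` is 3, `flowToBase [1, 2, 0, 1] 3` is 3, and `flowToBase [1, 2, 0, 1] 2` is 1.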

test :: FilePath -> IO ()

Test serialization by printing the header and first two reads of an SFF file, and the same again after a decode + encode cycle.

convert :: FilePath -> IO ()

Convert a file by decoding it and re-encoding it. This will lose the index (which isn't really necessary).

packFlows :: [Flow] -> ByteString

Pack a list of flows into the corresponding binary structure (the flow_data field)

unpackFlows :: ByteString -> [Flow]

Unpack the flow_data field into a list of flow values
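A minimal sketch of this kind of packing, assuming two big-endian bytes per flow value as suggested by the byte-order note in the module description. The names `packFlowsBE` and `unpackFlowsBE` are hypothetical and this is not necessarily how the library's own functions are written.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import qualified Data.ByteString as B
import Data.Int (Int16)
import Data.Word (Word16)

-- Two big-endian bytes per flow value.
packFlowsBE :: [Int16] -> B.ByteString
packFlowsBE = B.pack . concatMap split
  where
    split v =
      let w = fromIntegral v :: Word16
       in [fromIntegral (w `shiftR` 8), fromIntegral w]

-- Inverse of packFlowsBE; a trailing odd byte is ignored.
unpackFlowsBE :: B.ByteString -> [Int16]
unpackFlowsBE = go . B.unpack
  where
    go (hi : lo : rest) = join hi lo : go rest
    go _ = []
    join hi lo =
      let w = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo :: Word16
       in fromIntegral w
```

The two functions round-trip: `unpackFlowsBE (packFlowsBE xs)` returns `xs` for any list of Int16 values, which is a convenient property to test against real flow data.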

type Flow = Int16

The type of flowgram value

type Qual = Word8

Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
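For orientation, a Phred quality Q corresponds to an error probability p through Q = -10 log10 p. The sketch below applies that standard relation to a Word8 quality value and adds a threshold check at 20; `errorProb` and `isGood` are hypothetical helpers, not part of this module.

```haskell
import Data.Word (Word8)

-- Standard Phred relation: Q = -10 * log10 p, so p = 10 ** (-Q / 10).
errorProb :: Word8 -> Double
errorProb q = 10 ** (negate (fromIntegral q) / 10)

-- Hypothetical threshold check using the Q20 convention mentioned above.
isGood :: Word8 -> Bool
isGood q = q >= 20
```

Under this relation a quality of 20 means roughly a 1 in 100 chance that the base call is wrong, which is why it is a common cutoff.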

type SeqData = ByteString

The basic data type used in Sequences

type QualData = ByteString

Quality data is a Qual vector, currently implemented as a ByteString.

data ReadName

Read names encode various information, as per this struct.

Constructors

ReadName 

Fields

date :: (Int, Int, Int)
 
time :: (Int, Int, Int)
 
region :: Int
 
x_loc :: Int
 
y_loc :: Int
 
