bio-0.5: A bioinformatics library



Read and write the SFF file format used by Roche/454 sequencing to store flowgram data.

A flowgram is a series of values (intensities) representing homopolymer runs of A, G, C, and T in a fixed cycle; it is usually displayed as a histogram.
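To make the definition concrete, here is a minimal sketch of how a flowgram could be turned back into homopolymer runs. It assumes that flow values are stored as hundredths (so 100 is roughly one base) and takes the nucleotide cycle as a parameter. The names runLength and callBases, and the rounding rule, are illustrative and not part of this library.

```haskell
import Data.Int (Int16)

-- Same representation as the Flow type documented on this page.
type Flow = Int16

-- Hypothetical helper: estimate the homopolymer length for one flow value,
-- assuming intensities are stored as hundredths (100 ~ one base).
-- Rounds to the nearest whole base and never goes below zero.
runLength :: Flow -> Int
runLength v = max 0 ((fromIntegral v + 50) `div` 100)

-- Hypothetical helper: pair each flow with its nucleotide in the repeating
-- cycle and emit that nucleotide once per estimated base.
-- The cycle must be non-empty.
callBases :: String -> [Flow] -> String
callBases cycleOrder flows =
  concat (zipWith (\c v -> replicate (runLength v) c) (cycle cycleOrder) flows)
```

For example, with the cycle "TACG" the flows [104, 12, 198, 95] would be read as one T, no A, two Cs, and one G.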

This file is based on information in the Roche FLX manual. Other sources of information about the format include the Staden Package, whose io_lib contains a C routine for parsing it. According to comments in the sources, the io_lib implementation is based on a file called getsff.c, which I've been unable to track down. Other software that parses SFF files includes QIIME, sff_extract, and Celera's sffToCa.

All values are believed to be stored big-endian.



data SFF

The data structure storing the contents of an SFF file (modulo the index).


SFF !CommonHeader [ReadBlock] 


data CommonHeader

SFF has a 31-byte common header.

The format allows the index to appear anywhere between reads, so we should really keep count and check for it at each read. In practice, it seems to be placed after the reads.

The following two fields are considered part of the header, but as they are static, they are not part of the data structure:

     magic   :: Word32   -- 0x2e736666, i.e. the string ".sff"
     version :: Word32   -- 0x00000001
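
As a quick check of the magic value, the sketch below splits a 32-bit word into its bytes in big-endian order and reads them as ASCII. The names word32ToAscii and sffMagic are made up for this illustration and are not part of the library.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (chr)
import Data.Word (Word32)

-- Split a 32-bit value into its four bytes, most significant first
-- (big-endian order), and interpret each byte as an ASCII character.
word32ToAscii :: Word32 -> String
word32ToAscii w =
  [ chr (fromIntegral ((w `shiftR` s) .&. 0xff)) | s <- [24, 16, 8, 0] ]

-- The magic value from the header description above.
sffMagic :: Word32
sffMagic = 0x2e736666
```

Evaluating word32ToAscii sffMagic yields the string ".sff".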



data ReadHeader

Each Read has a fixed read header, containing various information.

data ReadBlock

This contains the actual flowgram for a single read.


readSFF :: FilePath -> IO SFF

Read an SFF file.

writeSFF :: FilePath -> SFF -> IO ()

Write an SFF to the specified file name.

writeSFF' :: FilePath -> SFF -> IO Int

Write an SFF to the specified file name, but go back and update the read count. Useful if you want to output a lazy stream of ReadBlocks. Returns the number of reads written.

recoverSFF :: FilePath -> IO SFF

Read an SFF file, but be resilient against errors.

sffToSequence :: SFF -> [Sequence Nuc]

Extract the sequences from an SFF data structure.

rbToSequence :: ReadBlock -> Sequence Nuc

Extract the sequence information from a ReadBlock.

trim :: ReadBlock -> ReadBlock

Trim a read according to clipping information.
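
The SFF format description combines the quality and adapter clip points of a read header into one range of bases to keep. The sketch below follows that rule; whether trim implements exactly this is an assumption. The Clip record, its field names, clipRange, and clipList are hypothetical and not part of this library.

```haskell
-- Hypothetical record holding the four clip points of a read header;
-- field names follow the SFF format description, not this library.
data Clip = Clip
  { qualLeft, qualRight, adapterLeft, adapterRight :: Int }

-- Effective 1-based inclusive range of bases to keep in a read of n bases.
-- A clip value of 0 means "not set".
clipRange :: Int -> Clip -> (Int, Int)
clipRange n c = (left, right)
  where
    left  = max 1 (max (qualLeft c) (adapterLeft c))
    right = min (orEnd (qualRight c)) (orEnd (adapterRight c))
    orEnd 0 = n
    orEnd r = r

-- Apply an inclusive 1-based range to any list-like sequence.
clipList :: (Int, Int) -> [a] -> [a]
clipList (l, r) = take (r - l + 1) . drop (l - 1)
```

With a 10-base read, a quality clip starting at 3 and an adapter clip from 5 to 8, clipRange gives (5, 8), and clipList keeps bases five through eight.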

trimFromTo :: Integral i => i -> i -> ReadBlock -> ReadBlock

Trim a read to specific sequence positions, with inclusive bounds. The current implementation has the unintended side effect of always trimming the flowgram down to a basecalled position. Note that you can't (easily) write trimmed ReadBlocks to a file, since they need to have the same number of flows as given in the CommonHeader.

trimKey :: CommonHeader -> Sequence Nuc -> Maybe (Sequence Nuc)

Extract the read without the initial (TCAG) key.
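
The Maybe in the result type suggests that the key may fail to match. A minimal sketch of that idea on plain Strings follows. The function stripKey is a hypothetical stand-in, and returning Nothing on a mismatch is an assumption about trimKey, not something this page confirms.

```haskell
import Data.List (stripPrefix)

-- Hypothetical stand-in for trimKey working on plain Strings: return the
-- read without its leading key, or Nothing when the read does not start
-- with the key. The comparison is case-sensitive.
stripKey :: String -> String -> Maybe String
stripKey = stripPrefix
```

Here stripKey "TCAG" "TCAGACGT" gives Just "ACGT", while a read that does not begin with the key gives Nothing.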

baseToFlowPos :: Integral i => ReadBlock -> i -> Int

Convert a sequence position to the corresponding flow position.

flowToBasePos :: Integral i => ReadBlock -> i -> Int

Convert a flow position to the corresponding sequence position.
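
One way to picture these two conversions is through the per-base flow index, which in SFF stores for each base the increment from the previous base's flow. The sketch below assumes 1-based positions for both bases and flows, and works on a plain list of increments rather than a ReadBlock. The names flowPositions, baseToFlow, and flowToBase are hypothetical, and the library's own conventions may differ.

```haskell
-- Absolute (1-based) flow position of each base, from relative increments.
flowPositions :: [Int] -> [Int]
flowPositions = scanl1 (+)

-- Flow position of the n-th base (1-based); Nothing if n is out of range.
baseToFlow :: [Int] -> Int -> Maybe Int
baseToFlow idx n
  | n >= 1 && n <= length idx = Just (flowPositions idx !! (n - 1))
  | otherwise                 = Nothing

-- Number of bases called at or before flow f, i.e. the position of the
-- last base whose flow position does not exceed f (0 if there is none).
flowToBase :: [Int] -> Int -> Int
flowToBase idx f = length (takeWhile (<= f) (flowPositions idx))
```

An increment of 0 means the base came from the same flow as the previous one, which is how homopolymer runs appear in the index.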

trimFlows :: Integral i => i -> ReadBlock -> ReadBlock

Trim a ReadBlock limiting the number of flows. If writing to an SFF file, make sure you update the CommonHeader accordingly. See examples/Flx.hs for how to use this.

test :: FilePath -> IO ()

Test serialization by printing the header and first two reads of an SFF file, and the same after a decode + encode cycle.

convert :: FilePath -> IO ()

Convert a file by decoding it and re-encoding it. This will lose the index (which isn't really necessary).

flowgram :: ReadBlock -> [Flow]

Helper function to access the flowgram.

masked_bases :: ReadBlock -> SeqData

Extract the sequence with masked bases in lower case.

cumulative_index :: ReadBlock -> [Int]

Extract the index as absolute coordinates, not relative.
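
Turning relative increments into absolute coordinates is a running sum. The sketch below assumes the relative index is available as a list of bytes. The name cumulativeIndex is hypothetical and this is not the library's implementation.

```haskell
import Data.Word (Word8)

-- Running sum of byte-sized increments. Each byte is widened to Int
-- before adding, so the total cannot wrap around at 255.
cumulativeIndex :: [Word8] -> [Int]
cumulativeIndex = scanl1 (+) . map fromIntegral
```

Widening first matters: summing [200, 100] as bytes would wrap around, whereas this version yields [200, 300].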

packFlows :: [Flow] -> ByteString

Pack a list of flows into the corresponding binary structure (the flow_data field).

unpackFlows :: ByteString -> [Flow]

Unpack the flow_data field into a list of flow values.
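
The round trip between flows and bytes can be sketched as follows, assuming each flow occupies two bytes in big-endian order, in line with the note on endianness at the top of this page. The names packFlowsSketch and unpackFlowsSketch are hypothetical, and the library's own functions may be implemented differently.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Int (Int16)
import Data.Word (Word8, Word16)

-- Same representation as the Flow type documented on this page.
type Flow = Int16

-- Encode each flow as two bytes, most significant byte first.
packFlowsSketch :: [Flow] -> B.ByteString
packFlowsSketch = B.pack . concatMap split
  where
    split f =
      let w = fromIntegral f :: Word16
      in  [fromIntegral (w `shiftR` 8), fromIntegral w]

-- Decode consecutive byte pairs back into flows.
-- A trailing odd byte, if any, is ignored.
unpackFlowsSketch :: B.ByteString -> [Flow]
unpackFlowsSketch = go . B.unpack
  where
    go (hi : lo : rest) = combine hi lo : go rest
    go _                = []
    combine :: Word8 -> Word8 -> Flow
    combine hi lo =
      fromIntegral ((fromIntegral hi `shiftL` 8 .|. fromIntegral lo) :: Word16)
```

Unpacking the result of packing a list gives back the same list, including negative values.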

type Flow = Int16

The type of flowgram value.

type Qual = Word8

Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
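
The threshold of 20 corresponds to an error probability of one in a hundred under the standard Phred definition, p = 10^(-q/10). A small sketch, with errorProb as a hypothetical helper that is not part of this library:

```haskell
import Data.Word (Word8)

-- Same representation as the Qual type documented on this page.
type Qual = Word8

-- Error probability implied by a Phred quality score: p = 10^(-q/10).
errorProb :: Qual -> Double
errorProb q = 10 ** (negate (fromIntegral q) / 10)
```

A score of 20 thus maps to roughly 0.01, and each additional 10 points divides the error probability by ten.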

type SeqData = ByteString

The basic data type used in Sequences.

type QualData = ByteString

Quality data is a Qual vector, currently implemented as a ByteString.

data ReadName

Read names encode various information, as per this struct.




date :: (Int, Int, Int)
time :: (Int, Int, Int)
region :: Int
x_loc :: Int
y_loc :: Int