Bio.Sequence.SFF

Description

Read and write the SFF file format used by Roche/454 sequencing to store flowgram data.

A flowgram is a series of values (intensities) representing homopolymer runs of A, G, C, and T in a fixed cycle, usually displayed as a histogram.
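As a rough illustration of how a flowgram relates to bases, the sketch below rounds each flow value to a homopolymer length and expands it into nucleotides. It assumes the usual SFF convention that values are stored in hundredths of a base (100 means one base), and it takes the flow order as an argument rather than fixing one. The names `runLength` and `callBases` are hypothetical and are not part of this module; real base callers are considerably more careful.

```haskell
import Data.Int (Int16)

-- | Flow values, ASSUMED to be in hundredths of a base (100 = one base).
type Flow = Int16

-- | Estimate the homopolymer length for one flow by rounding to the
-- nearest whole base; negative signals are treated as zero.
runLength :: Flow -> Int
runLength f = max 0 ((fromIntegral f + 50) `div` 100)

-- | Expand a flowgram into bases, given the nucleotide tested in each flow.
-- The flow order is supplied by the caller (hypothetical helper).
callBases :: String -> [Flow] -> String
callBases flowChars flows =
  concat (zipWith (\c f -> replicate (runLength f) c) flowChars flows)
```

For example, `callBases (cycle "TACG") [104, 12, 198, 95]` evaluates to `"TCCG"`: one T, no A, two C, one G.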

The Staden Package contains an io_lib, with a C routine for parsing this format. According to comments in the sources, the io_lib implementation is based on a file called getsff.c, which I've been unable to track down.

All values are believed to be stored big-endian.

Documentation

data SFF

The data structure storing the contents of an SFF file (modulo the index)

Constructors

SFF !CommonHeader [ReadBlock]

Instances

Show SFF
Binary SFF

data CommonHeader

SFF has a 31-byte common header. Todo: remove items that are derivable (counters, magic, etc). cheader_length points to the first read header. Also, the format allows the index to appear anywhere between reads, so we should really keep count and check for each read. In practice, it seems to be placed after the reads.

The following two fields are considered part of the header, but as they are static, they are not part of the data structure:

  magic :: Word32 -- ^ 0x2e736666, i.e. the string ".sff"
  version :: Word32 -- ^ 0x00000001
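To make the magic value concrete, here is a minimal sketch that assembles four bytes in big-endian order and compares them against 0x2e736666. It works on plain `[Word8]` lists to stay self-contained; `word32BE`, `hasSffMagic`, and `asciiBytes` are hypothetical helpers, not functions exported by this module.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (ord)
import Data.Word (Word32, Word8)

-- | Combine exactly four bytes, most significant first, into a Word32.
word32BE :: [Word8] -> Maybe Word32
word32BE bs
  | length bs == 4 =
      Just (foldl (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0 bs)
  | otherwise = Nothing

-- | The magic number quoted in the header documentation.
sffMagic :: Word32
sffMagic = 0x2e736666

-- | True if the first four bytes spell ".sff" when read big-endian.
hasSffMagic :: [Word8] -> Bool
hasSffMagic bytes = word32BE (take 4 bytes) == Just sffMagic

-- | Helper for the example: the bytes of an ASCII string.
asciiBytes :: String -> [Word8]
asciiBytes = map (fromIntegral . ord)
```

With these definitions, `hasSffMagic (asciiBytes ".sff")` is `True`, while input shorter than four bytes yields `False`.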

Constructors

CommonHeader
Fields
  index_offset :: Int64 -- Points to a text(?) section
  index_length :: Int32
  num_reads :: Int32
  key_length :: Int16
  flow_length :: Int16
  flowgram_fmt :: Word8
  flow :: ByteString
  key :: ByteString
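The offset of the first read header can be derived from these fields. The sketch below assumes the layout described in the published SFF format: 31 fixed bytes, followed by one byte per flow character and one per key character, zero-padded to a multiple of eight bytes. `padTo8` and `commonHeaderLength` are hypothetical names used only for illustration.

```haskell
-- | Round a byte count up to the next multiple of eight.
padTo8 :: Int -> Int
padTo8 n = ((n + 7) `div` 8) * 8

-- | Total size of the common header, ASSUMING 31 fixed bytes plus the
-- flow and key strings, padded to an 8-byte boundary.
commonHeaderLength :: Int -> Int -> Int
commonHeaderLength flowLength keyLength =
  padTo8 (31 + flowLength + keyLength)
```

For a file with 400 flows and a 4-character key, `commonHeaderLength 400 4` gives 440 (435 bytes of content plus 5 bytes of padding).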

Instances

Show CommonHeader
Binary CommonHeader

data ReadHeader

Each Read has a fixed read header

Constructors

ReadHeader
Fields
  name_length :: Int16
  num_bases :: Int32
  clip_qual_left :: Int16
  clip_qual_right :: Int16
  clip_adapter_left :: Int16
  clip_adapter_right :: Int16
  read_name :: ByteString

Instances

Show ReadHeader
Binary ReadHeader

data ReadBlock

This contains the actual flowgram for a single read.

Constructors

ReadBlock
Fields
  read_header :: !ReadHeader
  flow_data :: !ByteString
  flow_index :: !ByteString
  bases :: !SeqData
  quality :: !QualData

Instances

Show ReadBlock

readSFF :: FilePath -> IO SFF

writeSFF :: FilePath -> SFF -> IO ()

Write an SFF to the specified file name

writeSFF' :: FilePath -> SFF -> IO Int

Write an SFF to the specified file name, but go back and update the read count. Useful if you want to output a lazy stream of ReadBlocks. Returns the number of reads written.

recoverSFF :: FilePath -> IO SFF

sffToSequence :: SFF -> [Sequence Nuc]

rbToSequence :: ReadBlock -> Sequence Nuc

trim :: ReadBlock -> ReadBlock

Trim a read according to clipping information
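How the four clip fields of a read header might combine into a single range can be sketched as follows. This follows the convention given in the SFF format description (1-based, inclusive positions, with 0 meaning "no clip on this side"); whether `trim` implements exactly this rule is an assumption. `clipRange` and `clipBases` are hypothetical helpers operating on a plain `String`.

```haskell
-- | Combine quality and adapter clip points into one 1-based, inclusive
-- range. A clip value of 0 is ASSUMED to mean "no clipping on this side".
clipRange :: Int -> Int -> Int -> Int -> Int -> (Int, Int)
clipRange numBases qualLeft qualRight adapterLeft adapterRight = (left, right)
  where
    left = max 1 (max qualLeft adapterLeft)
    right = min (orEnd qualRight) (orEnd adapterRight)
    -- An unset right clip falls back to the end of the read.
    orEnd 0 = numBases
    orEnd r = r

-- | Keep only the bases inside a 1-based, inclusive range.
clipBases :: (Int, Int) -> String -> String
clipBases (left, right) = take (right - left + 1) . drop (left - 1)
```

For a 10-base read with a quality clip of 3..8 and no adapter clip, `clipRange 10 3 8 0 0` gives `(3, 8)`, and `clipBases (3, 8) "ACGTACGTAC"` gives `"GTACGT"`.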

trimFromTo :: Integral i => i -> i -> ReadBlock -> ReadBlock

Trim a read to a specific range of sequence positions. The current implementation has the unintended side effect of always trimming the flowgram down to a basecalled position.

trimKey :: CommonHeader -> Sequence Nuc -> Maybe (Sequence Nuc)

baseToFlowPos :: Integral i => ReadBlock -> i -> Int

Convert a sequence position to the corresponding flow position

flowToBasePos :: Integral i => ReadBlock -> i -> Int

Convert a flow position to the corresponding sequence position
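The two conversions can be illustrated with a running sum over the flow index. The sketch assumes that each entry of flow_index is the increment from the previous base's flow position, so the cumulative sum gives a 1-based flow position per base. The indexing conventions chosen here (0-based sequence positions, 1-based flow positions) are assumptions and may differ from what `baseToFlowPos` and `flowToBasePos` use; all three function names below are hypothetical.

```haskell
import Data.Word (Word8)

-- | Running sum of per-base increments: the 1-based flow position of
-- each base (ASSUMED interpretation of the flow index).
cumulativeIndex :: [Word8] -> [Int]
cumulativeIndex = scanl1 (+) . map fromIntegral

-- | Flow position of the base at a 0-based sequence position, if any.
baseToFlow :: [Word8] -> Int -> Maybe Int
baseToFlow idx i
  | i < 0 = Nothing
  | otherwise = case drop i (cumulativeIndex idx) of
      (p : _) -> Just p
      [] -> Nothing

-- | Number of bases called at or before the given 1-based flow position.
flowToBase :: [Word8] -> Int -> Int
flowToBase idx f = length (takeWhile (<= f) (cumulativeIndex idx))
```

With the index `[1, 2, 0, 1]` the cumulative positions are `[1, 3, 3, 4]`, so `baseToFlow [1, 2, 0, 1] 2` is `Just 3` and `flowToBase [1, 2, 0, 1] 3` is `3`. An increment of 0 marks a second base called from the same flow.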

test :: FilePath -> IO ()

Test serialization by outputting the header and first two reads of an SFF file, then the same after a decode + encode cycle.

convert :: FilePath -> IO ()

Convert a file by decoding it and re-encoding it. This will lose the index (which isn't really necessary).

flowgram :: ReadBlock -> [Flow]

masked_bases :: ReadBlock -> SeqData

cumulative_index :: ReadBlock -> [Int]

packFlows :: [Flow] -> ByteString

Pack a list of flows into the corresponding binary structure (the flow_data field)

unpackFlows :: ByteString -> [Flow]

Unpack the flow_data field into a list of flow values
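The packed representation can be pictured as two bytes per flow in big-endian order. The following sketch shows one way to do the round trip on plain `[Word8]` lists instead of a ByteString, to keep it dependency-free. `packFlowBytes` and `unpackFlowBytes` are hypothetical stand-ins, not the module's own `packFlows` and `unpackFlows`, whose implementation may differ.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Int (Int16)
import Data.Word (Word16, Word8)

type Flow = Int16

-- | Serialize each flow as two bytes, most significant byte first.
packFlowBytes :: [Flow] -> [Word8]
packFlowBytes = concatMap split
  where
    split f =
      let w = fromIntegral f :: Word16
       in [fromIntegral (w `shiftR` 8), fromIntegral w]

-- | Rebuild flows from byte pairs; a trailing odd byte is ignored.
unpackFlowBytes :: [Word8] -> [Flow]
unpackFlowBytes (hi : lo : rest) = fromIntegral w : unpackFlowBytes rest
  where
    w = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo :: Word16
unpackFlowBytes _ = []
```

Here `packFlowBytes [100, 258]` yields `[0, 100, 1, 2]`, and `unpackFlowBytes` applied to that list returns `[100, 258]` again.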

type Flow = Int16

The type of flowgram value

type Qual = Word8

Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
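To put the threshold of 20 in context, a Phred score Q corresponds to an error probability of 10^(-Q/10), so 20 means roughly one error in a hundred calls. A minimal sketch of that relation, with the hypothetical helpers `errorProb` and `isGood`:

```haskell
import Data.Word (Word8)

type Qual = Word8

-- | Error probability implied by a Phred score: p = 10^(-Q/10).
errorProb :: Qual -> Double
errorProb q = 10 ** (negate (fromIntegral q) / 10)

-- | True if a score meets the given threshold (20 in the text above).
isGood :: Qual -> Qual -> Bool
isGood threshold q = q >= threshold
```

`errorProb 20` is about 0.01 and `errorProb 30` about 0.001; `filter (isGood 20)` keeps only the scores on the good side of the line.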

type Index = Word8

type SeqData = ByteString

The basic data type used in Sequences

type QualData = ByteString

Quality data is a Qual vector, currently implemented as a ByteString.

data ReadName

Read names encode various information, as per this struct.

Constructors

ReadName
Fields
  date :: (Int, Int, Int)
  time :: (Int, Int, Int)
  region :: Int
  x_loc :: Int
  y_loc :: Int
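As an illustration of how location fields such as x_loc and y_loc could be recovered from a name, the sketch below decodes a base-36 string and splits it into two coordinates. Everything about the encoding here is an assumption made for the example: the digit alphabet ('A'..'Z' for 0..25, then '0'..'9' for 26..35) and the packing x * 4096 + y are not stated on this page, and `digit36`, `fromBase36`, and `decodeXY` are hypothetical helpers rather than part of `decodeReadName`.

```haskell
import Data.Char (ord)

-- | Value of one digit, ASSUMING 'A'..'Z' = 0..25 and '0'..'9' = 26..35.
digit36 :: Char -> Maybe Int
digit36 c
  | c >= 'A' && c <= 'Z' = Just (ord c - ord 'A')
  | c >= '0' && c <= '9' = Just (ord c - ord '0' + 26)
  | otherwise = Nothing

-- | Interpret a string as a base-36 number, most significant digit first.
-- Returns Nothing if any character is outside the assumed alphabet.
fromBase36 :: String -> Maybe Int
fromBase36 = foldl step (Just 0)
  where
    step acc c = (\a d -> a * 36 + d) <$> acc <*> digit36 c

-- | Split a packed location into (x, y), ASSUMING packed = x * 4096 + y.
decodeXY :: String -> Maybe (Int, Int)
decodeXY s = (`divMod` 4096) <$> fromBase36 s
```

Under these assumptions, `decodeXY "AADF4"` gives `Just (1, 2)`, since the string decodes to 4098 = 1 * 4096 + 2.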

Instances

Show ReadName

decodeReadName :: ByteString -> Maybe ReadName

encodeReadName :: ReadName -> ByteString