bio-0.3.5: A bioinformatics library
Bio.Sequence.SFF
Description

Read (and write?) the SFF file format used by Roche/454 sequencing to store flowgram data.

A flowgram is a series of values (intensities) representing homopolymer runs of A, G, C, and T in a fixed cycle, usually displayed as a histogram.
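As a rough illustration of how flow intensities map to homopolymer runs, the sketch below calls bases from a list of flow values. It is not part of this module: the name flowsToBases is hypothetical, the flow order is passed in by the caller, and it assumes that flow values are intensities scaled by 100 (so a value near 100 means one base, near 200 two bases, and so on).

```haskell
import Data.Int (Int16)

-- Hypothetical helper, not exported by Bio.Sequence.SFF.
-- Assumption: each flow value is an intensity scaled by 100.
-- Assumption: cycleOrder is the non-empty, repeating nucleotide flow order.
flowsToBases :: String -> [Int16] -> String
flowsToBases cycleOrder flows =
  concat (zipWith call (cycle cycleOrder) flows)
  where
    -- Emit as many copies of the base as the rounded run length.
    call base v = replicate (runLength v) base
    -- Round the scaled intensity to the nearest whole number of bases.
    runLength :: Int16 -> Int
    runLength v = (fromIntegral v + 50) `div` 100
```

For example, flowsToBases "TACG" [102, 5, 198, 0, 95] yields "TCCT": one T, no A, two C, no G, then one T from the next cycle.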

The Staden Package contains an io_lib, with a C routine for parsing this format. According to comments in the sources, the io_lib implementation is based on a file called getsff.c, which I've been unable to track down.

It is believed that all values are stored big endian.

Synopsis
data SFF = SFF !CommonHeader [ReadBlock]
data CommonHeader = CommonHeader {
index_offset :: Int64
index_length :: Int32
num_reads :: Int32
key_length :: Int16
flow_length :: Int16
flowgram_fmt :: Word8
flow :: ByteString
key :: ByteString
}
data ReadHeader = ReadHeader {
name_length :: Int16
num_bases :: Int32
clip_qual_left :: Int16
clip_qual_right :: Int16
clip_adapter_left :: Int16
clip_adapter_right :: Int16
read_name :: ByteString
}
data ReadBlock = ReadBlock {
read_header :: ReadHeader
flowgram :: [Flow]
flow_index :: ByteString
bases :: ByteString
quality :: ByteString
}
readSFF :: FilePath -> IO SFF
writeSFF :: FilePath -> SFF -> IO ()
sffToSequence :: SFF -> [Sequence]
test :: FilePath -> IO ()
convert :: FilePath -> IO ()
type Flow = Int16
type Qual = Word8
type Index = Word8
Documentation
data SFF
The data structure storing the contents of an SFF file (modulo the index)
Constructors
SFF !CommonHeader [ReadBlock]
data CommonHeader

SFF has a 31-byte common header. Todo: remove items that are derivable (counters, magic, etc.). cheader_length points to the first read header. Also, the format allows the index to sit anywhere between reads, so we should really keep count and check for each read. In practice, it seems to be placed after the reads.

The following two fields are considered part of the header, but as they are static, they are not part of the data structure:
magic :: Word32 -- ^ 0x2e736666, i.e. the string ".sff"
version :: Word32 -- ^ 0x00000001
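A minimal sketch of checking these two static fields, assuming (as the description above suggests) that both are stored big endian. The names sffMagic, sffVersion, getMagicAndVersion and looksLikeSFF are hypothetical and are not exported by this module; the sketch uses Data.Binary.Get from the binary package and is not necessarily how readSFF does it.

```haskell
import Data.Binary.Get (Get, getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

-- The static magic number: the ASCII bytes ".sff".
sffMagic :: Word32
sffMagic = 0x2e736666

-- The static format version.
sffVersion :: Word32
sffVersion = 0x00000001

-- Read the two static fields as big-endian 32-bit words.
getMagicAndVersion :: Get (Word32, Word32)
getMagicAndVersion = do
  m <- getWord32be
  v <- getWord32be
  return (m, v)

-- True when the input starts with the expected magic number and version.
-- Inputs shorter than eight bytes are rejected first, because runGet
-- would otherwise fail with an exception on truncated input.
looksLikeSFF :: BL.ByteString -> Bool
looksLikeSFF bs =
  BL.length bs >= 8
    && runGet getMagicAndVersion bs == (sffMagic, sffVersion)
```

A caller could apply looksLikeSFF to the result of BL.readFile before attempting a full parse.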

Constructors
CommonHeader
index_offset :: Int64 -- ^ Points to a text(?) section
index_length :: Int32
num_reads :: Int32
key_length :: Int16
flow_length :: Int16
flowgram_fmt :: Word8
flow :: ByteString
key :: ByteString
data ReadHeader
Each Read has a fixed read header
Constructors
ReadHeader
name_length :: Int16
num_bases :: Int32
clip_qual_left :: Int16
clip_qual_right :: Int16
clip_adapter_left :: Int16
clip_adapter_right :: Int16
read_name :: ByteString
data ReadBlock
This contains the actual flowgram for a single read.
Constructors
ReadBlock
read_header :: ReadHeader
flowgram :: [Flow]
flow_index :: ByteString
bases :: ByteString
quality :: ByteString
readSFF :: FilePath -> IO SFF
writeSFF :: FilePath -> SFF -> IO ()
sffToSequence :: SFF -> [Sequence]
test :: FilePath -> IO ()
Test serialization by outputting the header and first two reads of an SFF file, and the same after a decode + encode cycle.
convert :: FilePath -> IO ()
Convert a file by decoding it and re-encoding it. This will lose the index (which isn't really necessary).
type Flow = Int16
The type of flowgram value
type Qual = Word8
Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
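To make the threshold concrete, the sketch below converts a Phred score Q to its error probability using the standard definition p = 10^(-Q/10), under which a score of 20 corresponds to a 1% chance of a wrong base call. Both function names are hypothetical and are not part of this module.

```haskell
import Data.Word (Word8)

-- Hypothetical helper: error probability for a Phred score Q,
-- computed as p = 10^(-Q/10).
phredToErrorProb :: Word8 -> Double
phredToErrorProb q = 10 ** (negate (fromIntegral q) / 10)

-- Hypothetical filter using the threshold of 20 mentioned above.
isGoodQual :: Word8 -> Bool
isGoodQual q = q >= 20
```

Since Qual is a synonym for Word8, both functions accept Qual values directly.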
type Index = Word8
Produced by Haddock version 2.4.2