Read and write the SFF file format used by Roche/454 sequencing to store flowgram data.
A flowgram is a series of values (intensities) representing homopolymer runs of A, G, C, and T in a fixed cycle, usually displayed as a histogram.
The Staden Package contains an io_lib, with a C routine for parsing this format. According to comments in the sources, the io_lib implementation is based on a file called getsff.c, which I've been unable to track down.
It is believed that all values are stored big endian.
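To make the flowgram idea concrete, here is a minimal sketch of naive base calling from a list of flow values. It assumes the common convention that each flow value is the signal scaled by 100 (so roughly 100 per base in a homopolymer run) and that nucleotides are flowed in a repeating, non-empty cycle such as TACG. The names `runLength` and `callBases` and the rounding rule are hypothetical and not part of this module.

```haskell
import Data.Int (Int16)

-- | One flow value; assumed here to be the signal intensity scaled by 100.
type Flow = Int16

-- | Estimate the homopolymer length for one flow by rounding to the
-- nearest multiple of 100 (an assumption of this sketch).
runLength :: Flow -> Int
runLength f = max 0 ((fromIntegral f + 50) `div` 100)

-- | Naive base calling: pair each flow with the nucleotide flowed in
-- that step and emit that nucleotide 'runLength' times.
-- The flow order must be non-empty.
callBases :: String -> [Flow] -> String
callBases order flows =
  concat (zipWith (\n f -> replicate (runLength f) n) (cycle order) flows)
```

For instance, with flow order "TACG" the flows [102, 8, 197, 95, 3, 110] would be read as one T, no A, two C, one G, no T, and one A.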
- data SFF = SFF !CommonHeader [ReadBlock]
- data CommonHeader = CommonHeader {
- index_offset :: Int64
- index_length :: Int32
- num_reads :: Int32
- key_length :: Int16
- flow_length :: Int16
- flowgram_fmt :: Word8
- flow :: ByteString
- key :: ByteString
- data ReadHeader = ReadHeader {}
- data ReadBlock = ReadBlock {
- read_header :: !ReadHeader
- flow_data :: !ByteString
- flow_index :: !ByteString
- bases :: !SeqData
- quality :: !QualData
- readSFF :: FilePath -> IO SFF
- writeSFF :: FilePath -> SFF -> IO ()
- writeSFF' :: FilePath -> SFF -> IO Int
- recoverSFF :: FilePath -> IO SFF
- sffToSequence :: SFF -> [Sequence Nuc]
- rbToSequence :: ReadBlock -> Sequence Nuc
- trim :: ReadBlock -> ReadBlock
- trimFromTo :: Integral i => i -> i -> ReadBlock -> ReadBlock
- trimKey :: CommonHeader -> Sequence Nuc -> Maybe (Sequence Nuc)
- baseToFlowPos :: Integral i => ReadBlock -> i -> Int
- flowToBasePos :: Integral i => ReadBlock -> i -> Int
- test :: FilePath -> IO ()
- convert :: FilePath -> IO ()
- flowgram :: ReadBlock -> [Flow]
- masked_bases :: ReadBlock -> SeqData
- cumulative_index :: ReadBlock -> [Int]
- packFlows :: [Flow] -> ByteString
- unpackFlows :: ByteString -> [Flow]
- type Flow = Int16
- type Qual = Word8
- type Index = Word8
- type SeqData = ByteString
- type QualData = ByteString
- data ReadName = ReadName {}
- decodeReadName :: ByteString -> Maybe ReadName
- encodeReadName :: ReadName -> ByteString
Documentation
The data structure storing the contents of an SFF file (modulo the index)
data CommonHeader
SFF has a 31-byte common header. Todo: remove items that are derivable (counters, magic, etc). cheader_length points to the first read header. Also, the format is open to having the index anywhere between reads, so we should really keep count and check for each read. In practice, it seems to be placed after the reads.
The following two fields are considered part of the header, but as they are static, they are not part of the data structure:
magic :: Word32 -- ^ 0x2e736666, i.e. the string ".sff"
version :: Word32 -- ^ 0x00000001
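As an illustration of how the two static fields could be checked, the following sketch reads them as big-endian 32-bit words with Data.Binary.Get from the binary package. The names `getMagicAndVersion` and `looksLikeSFF` are hypothetical, and this is not necessarily how the module itself validates its input.

```haskell
import Data.Binary.Get (Get, getWord32be, runGetOrFail)
import qualified Data.ByteString.Lazy as L
import Data.Word (Word32)

-- | Expected constants, taken from the field comments above.
sffMagic, sffVersion :: Word32
sffMagic   = 0x2e736666  -- the bytes ".sff"
sffVersion = 0x00000001

-- | Read the two static big-endian fields and fail on a mismatch.
getMagicAndVersion :: Get ()
getMagicAndVersion = do
  magic   <- getWord32be
  version <- getWord32be
  if magic == sffMagic && version == sffVersion
    then return ()
    else fail "not an SFF file (bad magic or version)"

-- | Hypothetical helper: does this input start like an SFF file?
-- Input shorter than eight bytes is rejected as well.
looksLikeSFF :: L.ByteString -> Bool
looksLikeSFF bytes = case runGetOrFail getMagicAndVersion bytes of
  Right _ -> True
  Left _  -> False
```

A real parser would continue in the same Get action with the remaining header fields instead of stopping after the version.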
data ReadHeader
Each read has a fixed read header.
A ReadBlock contains the actual flowgram for a single read.
recoverSFF :: FilePath -> IO SFF
sffToSequence :: SFF -> [Sequence Nuc]
trimFromTo :: Integral i => i -> i -> ReadBlock -> ReadBlock
Trim a read to a specific range of sequence positions. The current implementation has the unintended side effect of always trimming the flowgram down to a basecalled position.
baseToFlowPos :: Integral i => ReadBlock -> i -> Int
Convert a sequence position to the corresponding flow position
flowToBasePos :: Integral i => ReadBlock -> i -> Int
Convert a flow position to the corresponding sequence position
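The two conversions can be pictured with a small sketch over the raw flow_index bytes. It assumes that flow_index stores one byte per called base, holding the increment in flow position since the previous base, so that a running sum yields 1-based flow positions. The functions below are hypothetical stand-ins that take the index bytes directly rather than a ReadBlock, and their conventions (0-based sequence positions, Maybe results) may differ from those of baseToFlowPos and flowToBasePos.

```haskell
import qualified Data.ByteString as B

-- | Running sum of the per-base increments: element i is the flow
-- position (1-based, by assumption) at which base i was called.
cumulativeIndex :: B.ByteString -> [Int]
cumulativeIndex = scanl1 (+) . map fromIntegral . B.unpack

-- | Flow position of the base at 0-based sequence position p, if any.
baseToFlow :: B.ByteString -> Int -> Maybe Int
baseToFlow idx p = case drop p (cumulativeIndex idx) of
  (f : _) | p >= 0 -> Just f
  _                -> Nothing

-- | Number of bases called at or before flow position f.
flowToBase :: B.ByteString -> Int -> Int
flowToBase idx f = length (takeWhile (<= f) (cumulativeIndex idx))
```

An increment of 0 means the base belongs to the same flow as the previous one, which is how a homopolymer run of length two or more shows up in the index.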
test :: FilePath -> IO ()
Test serialization by outputting the header and the first two reads of an SFF file, and the same after a decode + encode cycle.
convert :: FilePath -> IO ()
Convert a file by decoding it and re-encoding it. This will lose the index (which isn't really necessary).
cumulative_index :: ReadBlock -> [Int]
packFlows :: [Flow] -> ByteString
Pack a list of flows into the corresponding binary structure (the flow_data field)
unpackFlows :: ByteString -> [Flow]
Unpack the flow_data field into a list of flow values
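A minimal sketch of such a pack/unpack pair is shown below, assuming each flow is stored as a 16-bit big-endian value, in line with the note that all values are believed to be big endian. The names `packFlowsBE` and `unpackFlowsBE` are hypothetical, and the module's own implementation may differ.

```haskell
import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as L
import Data.Int (Int16)
import Data.Word (Word16)

type Flow = Int16

-- | Serialize each flow as two bytes, most significant byte first.
packFlowsBE :: [Flow] -> B.ByteString
packFlowsBE = L.toStrict . BB.toLazyByteString . foldMap BB.int16BE

-- | Inverse of 'packFlowsBE'; a trailing odd byte is ignored.
unpackFlowsBE :: B.ByteString -> [Flow]
unpackFlowsBE = go . B.unpack
  where
    go (hi : lo : rest) = combine hi lo : go rest
    go _                = []
    -- Rebuild the 16-bit word, then reinterpret it as a signed value.
    combine hi lo =
      fromIntegral ((fromIntegral hi `shiftL` 8 .|. fromIntegral lo) :: Word16)
```

Round-tripping a list through `packFlowsBE` and then `unpackFlowsBE` should return the original list, which is a convenient property to test.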
Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
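For reference, a Phred score Q relates to the estimated error probability p by Q = -10 * log10 p, so the threshold of 20 corresponds to a 1% chance that the base call is wrong. The helpers below are a hypothetical sketch of that relationship and are not part of this module.

```haskell
import Data.Word (Word8)

type Qual = Word8

-- | Probability that a base call is wrong, by the Phred definition
-- Q = -10 * log10 p.
errorProbability :: Qual -> Double
errorProbability q = 10 ** (negate (fromIntegral q) / 10)

-- | Hypothetical filter using the Q20 threshold mentioned above.
isGoodQuality :: Qual -> Bool
isGoodQuality = (>= 20)
```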
type SeqData = ByteString
The basic data type used in Sequences.
type QualData = ByteString
Quality data is a Qual vector, currently implemented as a ByteString.
Read names encode various information, as per this struct.