Safe Haskell: None
Bio.Sequence.SFF
Description
Read and write the SFF file format used by Roche/454 sequencing to store flowgram data.
A flowgram is a series of values (intensities) representing homopolymer runs of A,G,C, and T in a fixed cycle, and usually displayed as a histogram.
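To make the relationship between flow values and bases concrete, here is a minimal sketch of naive base calling from a flowgram. It assumes that each flow value is an intensity scaled by 100 (so a value near 100 means one base) and that the flow characters come from the header's flow field; the name callBasesSketch is hypothetical and not part of this module.

```haskell
-- | Naive base calling (hypothetical helper, not part of Bio.Sequence.SFF).
-- Assumption: flow values are intensities in hundredths, so 198 is roughly
-- a homopolymer run of two bases.
callBasesSketch :: String -> [Int] -> String
callBasesSketch flowChars flows = concat (zipWith call flowChars flows)
  where
    -- Round the intensity to the nearest whole number of bases.
    call c v = replicate ((v + 50) `div` 100) c
```

For instance, with the flow characters "TACG" and one value per flow, a value of 3 contributes no bases and a value of 198 contributes two.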
This file is based on information in the Roche FLX manual. Another source of information about the format is The Staden Package, which contains an io_lib with a C routine for parsing this format. According to comments in the sources, the io_lib implementation is based on a file called getsff.c, which I've been unable to track down. Other software that parses SFF files includes QIIME, sff_extract, and Celera's sffToCa.
All values are believed to be stored in big-endian byte order.
- data SFF = SFF !CommonHeader [ReadBlock]
- data CommonHeader = CommonHeader {
- index_offset :: Int64
- index_length :: Int32
- num_reads :: Int32
- key_length :: Int16
- flow_length :: Int16
- flowgram_fmt :: Word8
- flow :: ByteString
- key :: ByteString
- data ReadHeader = ReadHeader {}
- data ReadBlock = ReadBlock {
- read_header :: !ReadHeader
- flow_data :: !ByteString
- flow_index :: !ByteString
- bases :: !SeqData
- quality :: !QualData
- readSFF :: FilePath -> IO SFF
- writeSFF :: FilePath -> SFF -> IO ()
- writeSFF' :: FilePath -> SFF -> IO Int
- recoverSFF :: FilePath -> IO SFF
- trim :: ReadBlock -> ReadBlock
- trimFromTo :: Integral i => i -> i -> ReadBlock -> ReadBlock
- trimKey :: ReadBlock -> ReadBlock
- trimAdapter :: ReadBlock -> ReadBlock
- baseToFlowPos :: Integral i => ReadBlock -> i -> Int
- flowToBasePos :: Integral i => ReadBlock -> i -> Int
- trimFlows :: Integral i => i -> ReadBlock -> ReadBlock
- test :: FilePath -> IO ()
- convert :: FilePath -> IO ()
- flowgram :: ReadBlock -> [Flow]
- masked_bases :: ReadBlock -> SeqData
- cumulative_index :: ReadBlock -> [Int]
- packFlows :: [Flow] -> ByteString
- unpackFlows :: ByteString -> [Flow]
- type Flow = Int16
- data Qual
- type Index = Word8
- data SeqData
- data QualData
- data ReadName = ReadName {}
- decodeReadName :: ByteString -> Maybe ReadName
- encodeReadName :: ReadName -> ByteString
- putRB :: Int -> ReadBlock -> Put
- getRB :: Int -> ReadHeader -> Get ReadBlock
Documentation
data SFF
The data structure storing the contents of an SFF file (modulo the index).
Constructors
SFF !CommonHeader [ReadBlock]
data CommonHeader
SFF has a 31-byte common header.
The format allows the index to appear anywhere between reads, so we should really keep count and check for it at each read. In practice, it seems to be placed after the reads.
The following two fields are considered part of the header, but as they are static, they are not part of the data structure:
magic :: Word32 -- 0x2e736666, i.e. the string ".sff"
version :: Word32 -- 0x00000001
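As an illustration of how these two static fields could be checked, the following sketch reads them as big-endian 32-bit words using the binary package. The helper names (getMagicAndVersion, looksLikeSFF) are hypothetical and are not exported by this module; the actual parser may check these fields differently.

```haskell
import Data.Binary.Get (Get, getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

-- | Magic number: the ASCII string ".sff" read as a big-endian Word32.
sffMagic :: Word32
sffMagic = 0x2e736666

-- | Format version as given in the header description.
sffVersion :: Word32
sffVersion = 0x00000001

-- | Read the two static header fields (hypothetical helper).
getMagicAndVersion :: Get (Word32, Word32)
getMagicAndVersion = do
  m <- getWord32be
  v <- getWord32be
  return (m, v)

-- | True if the first eight bytes carry the expected magic and version.
looksLikeSFF :: BL.ByteString -> Bool
looksLikeSFF bs
  | BL.length bs < 8 = False  -- too short to hold both fields
  | otherwise        = runGet getMagicAndVersion bs == (sffMagic, sffVersion)
```

The length check comes first because runGet fails with an error on truncated input.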
Constructors
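CommonHeader
Fields
index_offset :: Int64
index_length :: Int32
num_reads :: Int32
key_length :: Int16
flow_length :: Int16
flowgram_fmt :: Word8
flow :: ByteString
key :: ByteString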
data ReadHeader
Each Read has a fixed read header, containing various information.
Constructors
ReadHeader
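data ReadBlock
This contains the actual flowgram for a single read.
Constructors
ReadBlock
Fields
read_header :: !ReadHeader
flow_data :: !ByteString
flow_index :: !ByteString
bases :: !SeqData
quality :: !QualData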
recoverSFF :: FilePath -> IO SFF
Read an SFF file, but be resilient against errors.
trimFromTo :: Integral i => i -> i -> ReadBlock -> ReadBlock
Trim a read to a specific range of sequence positions, with inclusive bounds.
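To illustrate what inclusive bounds mean for the base sequence alone, here is a minimal sketch on a lazy bytestring. It assumes zero-based positions, which this documentation does not confirm, and it ignores the flow data, index, and quality values that the real function also has to trim. The name sliceInclusive is hypothetical.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BLC

-- | Keep positions from..to, both included (hypothetical helper).
-- Assumption: positions are zero-based; the library may count differently.
sliceInclusive :: Integral i => i -> i -> BLC.ByteString -> BLC.ByteString
sliceInclusive from to = BLC.take (t - f + 1) . BLC.drop f
  where
    f = max 0 (fromIntegral from)  -- clamp negative start positions to 0
    t = fromIntegral to
```

Because both bounds are included, the result has to - from + 1 characters when the range lies inside the sequence, and is empty when to is smaller than from.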
trimAdapter :: ReadBlock -> ReadBlock
Trim adapters from a read
baseToFlowPos :: Integral i => ReadBlock -> i -> Int
Convert a sequence position to the corresponding flow position
flowToBasePos :: Integral i => ReadBlock -> i -> Int
Convert a flow position to the corresponding sequence position
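The two conversions can be pictured with a list of absolute flow positions, one per base. The sketch below works on such a plain list instead of a ReadBlock; the function names are hypothetical, base positions are assumed to be zero-based, and the library's own offset conventions may differ.

```haskell
-- | Flow position of the base at (zero-based) position i, if it exists.
-- Hypothetical helper operating on absolute flow positions, one per base.
baseToFlowSketch :: [Int] -> Int -> Maybe Int
baseToFlowSketch absIdx i =
  case drop i absIdx of
    (x : _) | i >= 0 -> Just x
    _                -> Nothing

-- | Number of bases called at or before flow position f.
-- Relies on the absolute positions being non-decreasing.
flowToBaseSketch :: [Int] -> Int -> Int
flowToBaseSketch absIdx f = length (takeWhile (<= f) absIdx)
```

Several bases can share one flow position (a homopolymer run), so the flow-to-base direction returns a count of bases rather than a single matching base.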
trimFlows :: Integral i => i -> ReadBlock -> ReadBlock
Trim a ReadBlock, limiting the number of flows. If writing to an SFF file, make sure you update the CommonHeader accordingly. See examples/Flx.hs for how to use this.
test :: FilePath -> IO ()
Test serialization by printing the header and the first two reads of an SFF file, then printing the same after a decode and encode cycle.
convert :: FilePath -> IO ()
Convert a file by decoding it and re-encoding it. This will lose the index (which isn't really necessary).
masked_bases :: ReadBlock -> SeqData
Extract the sequence with masked bases in lower case
cumulative_index :: ReadBlock -> [Int]
Extract the index as absolute coordinates, not relative.
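Turning relative index values into absolute coordinates amounts to a running sum. The sketch below shows this on a strict bytestring of Word8 increments; the name cumulativeSketch is hypothetical, and the library may apply an additional offset that is not documented here.

```haskell
import qualified Data.ByteString as B

-- | Running sum of per-base flow increments (hypothetical helper).
-- Each input byte says how many flows to advance from the previous base.
cumulativeSketch :: B.ByteString -> [Int]
cumulativeSketch = scanl1 (+) . map fromIntegral . B.unpack
```

An increment of 0 means the base belongs to the same flow as the previous one, which is how homopolymer runs appear in the index.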
packFlows :: [Flow] -> ByteString
Pack a list of flows into the corresponding binary structure (the flow_data field)
unpackFlows :: ByteString -> [Flow]
Unpack the flow_data field into a list of flow values
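The following sketch shows one way such a pack and unpack pair could work, assuming each flow is stored as a 16-bit big-endian value (consistent with the note on byte order above, but not confirmed here). The names packFlowsSketch and unpackFlowsSketch are hypothetical, and a strict bytestring is used for simplicity.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import qualified Data.ByteString as B
import Data.Int (Int16)
import Data.Word (Word8, Word16)

type Flow = Int16

-- | Serialize each flow as two bytes, most significant byte first.
packFlowsSketch :: [Flow] -> B.ByteString
packFlowsSketch = B.pack . concatMap split
  where
    split f =
      let w = fromIntegral f :: Word16
      in [fromIntegral (w `shiftR` 8), fromIntegral w]

-- | Inverse of packFlowsSketch; a trailing odd byte is ignored.
unpackFlowsSketch :: B.ByteString -> [Flow]
unpackFlowsSketch = go . B.unpack
  where
    go (hi : lo : rest) = combine hi lo : go rest
    go _                = []
    combine :: Word8 -> Word8 -> Flow
    combine hi lo =
      fromIntegral ((fromIntegral hi `shiftL` 8 .|. fromIntegral lo) :: Word16)
```

Under these assumptions the two functions round-trip any list of flows, and the packed data is twice as long in bytes as the list of flows.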
data Qual
A quality value is in the range 0..255.
data SeqData
Sequence data are lazy bytestrings of ASCII characters.
data ReadName
Read names encode various information, as per this struct.
Constructors
ReadName