biosff-0.3.7.1: Library and executables for working with SFF files

Safe Haskell: None

Bio.Sequence.SFF

Description

Read and write the SFF file format used by Roche/454 sequencing to store flowgram data.

A flowgram is a series of values (intensities) representing homopolymer runs of A,G,C, and T in a fixed cycle, and usually displayed as a histogram.
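
As a rough illustration of how a flowgram relates to a base sequence, the sketch below rounds each flow value to the nearest whole run length and emits that many copies of the nucleotide flowed at that step. It assumes flow values are stored as hundredths of a signal unit, and it takes the flow order as a parameter. The name `callBases` and the `"TACG"` order used in the usage note are illustrative assumptions, not part of this module, and real base calling is more involved than simple rounding.

```haskell
-- Naive base calling from a flowgram (illustrative sketch only).
-- Assumption: each flow value is the signal multiplied by 100,
-- so 100 is roughly one base and 200 a homopolymer run of two.
callBases :: String -> [Int] -> String
callBases flowOrder flows
  | null flowOrder = ""
  | otherwise      = concat (zipWith emit (cycle flowOrder) flows)
  where
    -- Round to the nearest whole run length; negative values count as 0.
    emit base v = replicate (max 0 ((v + 50) `div` 100)) base
```

With these assumptions, `callBases "TACG" [102, 3, 197, 10, 5, 98, 0, 301]` yields `"TCCAGGG"`: one T, two Cs, one A, and three Gs, with the near-zero flows contributing nothing.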

This file is based on information in the Roche FLX manual. Other sources of information about the format include the Staden Package, whose io_lib contains a C routine for parsing it. According to comments in the sources, the io_lib implementation is based on a file called getsff.c, which I've been unable to track down. Other software that parses SFF files includes QIIME, sff_extract, and Celera's sffToCa.

All values are believed to be stored in big-endian byte order.

Documentation

data SFF

The data structure storing the contents of an SFF file (excluding the index).

Constructors

SFF !CommonHeader [ReadBlock] 

data CommonHeader

SFF has a 31-byte common header.

The format allows the index to appear anywhere between reads, so we should really keep count and check for it at each read. In practice, it seems to be placed after the reads.

The following two fields are considered part of the header, but as they are static, they are not part of the data structure:

        
     magic   :: Word32   -- 0x2e736666, i.e. the string ".sff"
     version :: Word32   -- 0x00000001
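
To make the magic number concrete, the following sketch splits a 32-bit word into its four bytes in big-endian order and renders them as characters. The helper name `magicToString` is hypothetical and not part of this module.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (chr)
import Data.Word (Word32)

-- Split a 32-bit word into four bytes, most significant first
-- (big-endian order), and render each byte as a character.
-- `magicToString` is an illustrative helper, not a library function.
magicToString :: Word32 -> String
magicToString w =
  [ chr (fromIntegral ((w `shiftR` s) .&. 0xff)) | s <- [24, 16, 8, 0] ]
```

Evaluating `magicToString 0x2e736666` gives `".sff"`, since 0x2e, 0x73, and 0x66 are the ASCII codes for `.`, `s`, and `f`.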

Constructors

CommonHeader 

data ReadHeader

Each read has a fixed read header containing various information.

data ReadBlock

This contains the actual flowgram for a single read.

readSFF :: FilePath -> IO SFF

Read an SFF file.

writeSFF :: FilePath -> SFF -> IO ()

Write an SFF to the specified file name.

writeSFF' :: FilePath -> SFF -> IO Int

Write an SFF to the specified file name, but go back and update the read count. Useful if you want to output a lazy stream of ReadBlocks. Returns the number of reads written.
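
The "write first, patch the count afterwards" idea can be sketched on a toy file layout: a 4-byte big-endian record count followed by the records themselves. This is not the real SFF layout (in an SFF file the read count sits inside the common header rather than at offset 0), and `be32` and `writeCounted` are hypothetical names used only for illustration.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftR)
import Data.Word (Word32)
import System.IO

-- Encode a 32-bit count as four big-endian bytes.
be32 :: Word32 -> B.ByteString
be32 w = B.pack [fromIntegral (w `shiftR` s) | s <- [24, 16, 8, 0]]

-- Toy layout (assumption): a 4-byte count, then the records.
-- Write a placeholder count, stream the records one by one,
-- then seek back to the start and overwrite the placeholder.
writeCounted :: FilePath -> [B.ByteString] -> IO Int
writeCounted path records = withBinaryFile path WriteMode $ \h -> do
  B.hPut h (be32 0)                 -- placeholder, patched below
  n <- go h 0 records
  hSeek h AbsoluteSeek 0
  B.hPut h (be32 (fromIntegral n))  -- the real count
  return n
  where
    go _ n []       = return n
    go h n (r : rs) = B.hPut h r >> (go h $! n + 1) rs
```

Because the records are consumed one at a time and the count is only known at the end, the input list never has to be held in memory in full, which is the point of streaming lazily produced reads.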

recoverSFF :: FilePath -> IO SFF

Read an SFF file, but be resilient against errors.

trim :: ReadBlock -> ReadBlock

Trim a read according to clipping information
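
For orientation, the SFF read header carries quality and adapter clip points as 1-based positions, with 0 meaning "not set". The sketch below combines them into a single inclusive range and applies it to a plain `String`. Whether `trim` follows exactly this rule is an assumption, and `clipRange` and `applyClip` are hypothetical helpers rather than functions of this module.

```haskell
-- Combine quality and adapter clip points into one inclusive,
-- 1-based range. A clip value of 0 means "not set".
-- Arguments: read length, (qual left, qual right), (adapter left, adapter right).
clipRange :: Int -> (Int, Int) -> (Int, Int) -> (Int, Int)
clipRange n (qualL, qualR) (adapL, adapR) = (first, lastPos)
  where
    first   = max 1 (max qualL adapL)
    lastPos = min (orEnd qualR) (orEnd adapR)
    orEnd 0 = n   -- an unset right clip defaults to the read length
    orEnd r = r

-- Keep only the bases inside the inclusive, 1-based range.
applyClip :: (Int, Int) -> String -> String
applyClip (first, lastPos) = take (lastPos - first + 1) . drop (first - 1)
```

For a 10-base read with quality clips (3, 8) and no adapter clips, `clipRange 10 (3, 8) (0, 0)` is `(3, 8)`, and `applyClip (3, 8) "ACGTACGTAC"` gives `"GTACGT"`.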

trimFromTo :: Integral i => i -> i -> ReadBlock -> ReadBlock

Trim a read to the given sequence positions (inclusive bounds).

trimKey :: ReadBlock -> ReadBlock

Trim the key (i.e. the first four bases).

trimAdapter :: ReadBlock -> ReadBlock

Trim adapters from a read.

baseToFlowPos :: Integral i => ReadBlock -> i -> Int

Convert a sequence position to the corresponding flow position.

flowToBasePos :: Integral i => ReadBlock -> i -> Int

Convert a flow position to the corresponding sequence position
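
Both conversions can be pictured as lookups in a list that gives, for each base, the absolute flow in which it was called. The sketch below works on such a list directly and assumes 1-based positions for both bases and flows. The library's own indexing convention is not stated here, so treat the offsets as an assumption; `baseToFlow` and `flowToBase` are hypothetical stand-ins, not the functions above.

```haskell
-- absIdx holds, for each base in order, the absolute (1-based) flow
-- position in which that base was called.

-- Flow position of the b-th base (1-based), if the base exists.
baseToFlow :: [Int] -> Int -> Maybe Int
baseToFlow absIdx b
  | b < 1     = Nothing
  | otherwise = case drop (b - 1) absIdx of
      (f : _) -> Just f
      []      -> Nothing

-- Number of bases called at or before flow f, i.e. the 1-based
-- position of the last base produced by flows 1..f (0 if none).
flowToBase :: [Int] -> Int -> Int
flowToBase absIdx f = length (takeWhile (<= f) absIdx)
```

For the index `[1, 3, 3, 6, 8, 8, 8]`, `baseToFlow` maps base 4 to `Just 6`, and `flowToBase` maps flow 5 to 3, because three bases had been called by the end of flow 5.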

trimFlows :: Integral i => i -> ReadBlock -> ReadBlock

Trim a ReadBlock limiting the number of flows. If writing to an SFF file, make sure you update the CommonHeader accordingly. See examples/Flx.hs for how to use this.

test :: FilePath -> IO ()

Test serialization by printing the header and the first two reads of an SFF file, then printing the same after a decode and encode cycle.

convert :: FilePath -> IO ()

Convert a file by decoding it and re-encoding it. This will lose the index (which isn't really necessary).

flowgram :: ReadBlock -> [Flow]

Helper function to access the flowgram.

masked_bases :: ReadBlock -> SeqData

Extract the sequence with masked bases in lower case
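
A minimal sketch of this kind of masking, on a plain `String` instead of `SeqData`: bases inside a given inclusive, 1-based range stay upper case and everything outside it is lower-cased. That the masked region is exactly the part outside the clip points is an assumption here, and `maskOutside` is a hypothetical helper.

```haskell
import Data.Char (toLower, toUpper)

-- Upper-case the bases inside the inclusive, 1-based range and
-- lower-case everything outside it.
maskOutside :: (Int, Int) -> String -> String
maskOutside (first, lastPos) = zipWith mask [1 ..]
  where
    mask i c
      | i >= first && i <= lastPos = toUpper c
      | otherwise                  = toLower c
```

For example, `maskOutside (3, 8) "ACGTACGTAC"` gives `"acGTACGTac"`.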

cumulative_index :: ReadBlock -> [Int]

Extract the index as absolute coordinates, not relative.
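
In an SFF file the per-base flow index is stored as increments relative to the previous base, so turning it into absolute coordinates amounts to a running sum. The sketch below shows that step and its inverse on plain `Int` lists; `toAbsolute` and `toRelative` are illustrative names, not library functions.

```haskell
-- Turn per-base increments into absolute flow positions (running sum).
toAbsolute :: [Int] -> [Int]
toAbsolute = scanl1 (+)

-- Inverse: recover the increments from absolute positions.
toRelative :: [Int] -> [Int]
toRelative xs = zipWith (-) xs (0 : xs)
```

For example, the increments `[1, 2, 0, 3, 2, 0, 0]` become `[1, 3, 3, 6, 8, 8, 8]`. An increment of 0 means the base belongs to the same flow as the previous one, i.e. it is part of a homopolymer run.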

packFlows :: [Flow] -> ByteString

Pack a list of flows into the corresponding binary structure (the flow_data field).

unpackFlows :: ByteString -> [Flow]

Unpack the flow_data field into a list of flow values
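
To show what packing and unpacking involve, here is a self-contained sketch that encodes each 16-bit flow value as two bytes, most significant first. It assumes big-endian storage, consistent with the note in the module description, and uses strict `Data.ByteString`; the library's own `ByteString` flavour may differ. `packFlowsSketch` and `unpackFlowsSketch` are stand-in names, not the functions above.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Int (Int16)
import Data.Word (Word16)

type Flow = Int16

-- Encode each flow as two bytes, high byte first (big-endian).
packFlowsSketch :: [Flow] -> B.ByteString
packFlowsSketch = B.pack . concatMap split
  where
    split f =
      let w = fromIntegral f :: Word16
      in [fromIntegral (w `shiftR` 8), fromIntegral w]

-- Decode pairs of bytes back into flows; a trailing odd byte is ignored.
unpackFlowsSketch :: B.ByteString -> [Flow]
unpackFlowsSketch = go . B.unpack
  where
    go (hi : lo : rest) =
      fromIntegral ((fromIntegral hi `shiftL` 8 .|. fromIntegral lo) :: Word16)
        : go rest
    go _ = []
```

Under these assumptions the two functions are inverses on any list of flows: `unpackFlowsSketch (packFlowsSketch xs) == xs`.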

type Flow = Int16

The type of a flowgram value.

data Qual

A quality value is in the range 0..255.

data SeqData

Sequence data are lazy bytestrings of ASCII characters.

data QualData

Quality data are lazy bytestrings of Quals.

data ReadName

Read names encode various information, as per this struct.

Constructors

ReadName 

Fields

date :: (Int, Int, Int)
 
time :: (Int, Int, Int)
 
region :: Int
 
x_loc :: Int
 
y_loc :: Int
 

putRB :: Int -> ReadBlock -> Put

A ReadBlock can't be an instance of Binary directly, since it depends on information from the CommonHeader.

getRB :: Int -> ReadHeader -> Get ReadBlock

Helper function for decoding a ReadBlock.