This module defines common data structures for biosequences, i.e. data that represents nucleotide or protein sequences.
Basically, anything resembling or wrapping a sequence should implement the BioSeq class (and BioSeqQual if quality information is available).
The data types are mostly wrappers around lazy bytestrings from Data.ByteString.Lazy and Data.ByteString.Lazy.Char8.
- newtype Qual = Qual {}
- newtype Offset = Offset {}
- newtype SeqData = SeqData { unSD :: ByteString }
- newtype SeqLabel = SeqLabel { unSL :: ByteString }
- newtype QualData = QualData { unQD :: ByteString }
- class BioSeq s where
- class BioSeq sq => BioSeqQual sq where
- toFasta :: BioSeq s => s -> ByteString
- toFastaQual :: BioSeqQual s => s -> ByteString
- toFastQ :: BioSeqQual s => s -> ByteString
Data definitions
A quality value is in the range 0..255.
An Offset is a zero-based index into a sequence.
Sequence data (SeqData) and sequence labels (SeqLabel) are lazy bytestrings of ASCII characters.
Quality data are lazy bytestrings of Quals.
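To make the data definitions above concrete, here is a minimal, self-contained sketch using local stand-ins for the wrappers. The representations chosen for Qual (a Word8) and Offset (an Int64) are assumptions that fit the documented range and indexing; the module's actual definitions may differ. The helper residueAt is hypothetical.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Int (Int64)
import Data.Word (Word8)

-- Local stand-ins for the module's newtypes. The payload types of
-- Qual and Offset are assumptions, not taken from the module.
newtype Qual     = Qual Word8     deriving (Show, Eq)  -- range 0..255
newtype Offset   = Offset Int64   deriving (Show, Eq)  -- zero-based index
newtype SeqData  = SeqData  { unSD :: BL.ByteString } deriving (Show, Eq)
newtype QualData = QualData { unQD :: BL.ByteString } deriving (Show, Eq)

-- Sequence data: one ASCII character per residue.
exampleSeq :: SeqData
exampleSeq = SeqData (BLC.pack "ACGT")

-- Quality data: one byte per residue, each byte holding a quality value.
exampleQual :: QualData
exampleQual = QualData (BL.pack [30, 31, 32, 40])

-- Hypothetical helper: the residue and its quality at a zero-based Offset.
-- Like BL.index, it fails if the offset is out of range.
residueAt :: SeqData -> QualData -> Offset -> (Char, Qual)
residueAt (SeqData s) (QualData q) (Offset i) =
  (BLC.index s i, Qual (BL.index q i))
```

Because both wrappers hold one byte per residue, the same Offset addresses a residue and its quality value.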
Class definitions
The BioSeq class models sequence data, and any data object that represents a biological sequence should implement it.
class BioSeq sq => BioSeqQual sq where
The BioSeqQual class extends BioSeq with quality data. Any corresponding data object should be an instance; this allows output of Fasta-formatted quality data (via toFastaQual) as well as the combined FastQ format (via toFastQ).
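The relationship between the two classes can be illustrated with a small sketch. This page does not list the class methods, so the classes SketchSeq and SketchSeqQual below are hypothetical stand-ins with assumed method names, and ShortRead is an invented record type. Only the superclass structure mirrors the declaration above.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

-- Local copies of the wrappers listed in the synopsis.
newtype SeqLabel = SeqLabel { unSL :: BL.ByteString }
newtype SeqData  = SeqData  { unSD :: BL.ByteString }
newtype QualData = QualData { unQD :: BL.ByteString }

-- Hypothetical stand-ins for BioSeq and BioSeqQual; the method names
-- are assumptions, not the module's documented interface.
class SketchSeq s where
  sketchLabel :: s -> SeqLabel
  sketchData  :: s -> SeqData

class SketchSeq s => SketchSeqQual s where
  sketchQual :: s -> QualData

-- A hypothetical record for one sequencing read.
data ShortRead = ShortRead
  { readName  :: BL.ByteString
  , readBases :: BL.ByteString
  , readQuals :: BL.ByteString
  }

instance SketchSeq ShortRead where
  sketchLabel = SeqLabel . readName
  sketchData  = SeqData . readBases

instance SketchSeqQual ShortRead where
  sketchQual = QualData . readQuals

exampleRead :: ShortRead
exampleRead = ShortRead (BLC.pack "read1") (BLC.pack "ACGT") (BL.pack [30, 31, 32, 40])

-- Generic code needs only the class constraint: here, a check that
-- there is exactly one quality byte per residue.
qualsMatchBases :: SketchSeqQual s => s -> Bool
qualsMatchBases s =
  BL.length (unSD (sketchData s)) == BL.length (unQD (sketchQual s))
```

Functions written against the class constraint, such as qualsMatchBases, work for any type that provides the instances, which is the point of having sequence-like types implement the classes.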
Helper functions
toFasta :: BioSeq s => s -> ByteString
Any BioSeq can be formatted as Fasta, with sequence lines of 60 characters.
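The 60-character line layout can be sketched as follows. This is not the library's implementation: fastaRecord and chunksOf are hypothetical helpers that operate on plain lazy bytestrings rather than on a BioSeq instance, and the header is assumed to be a single line without the leading '>'.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Int (Int64)

-- Split a lazy bytestring into chunks of at most n bytes (n > 0).
chunksOf :: Int64 -> BLC.ByteString -> [BLC.ByteString]
chunksOf n bs
  | BLC.null bs = []
  | otherwise   = let (h, t) = BLC.splitAt n bs in h : chunksOf n t

-- Hypothetical formatter: a '>' header line followed by the sequence
-- broken into lines of at most 60 characters.
fastaRecord :: BLC.ByteString -> BLC.ByteString -> BLC.ByteString
fastaRecord header sq =
  BLC.unlines (BLC.cons '>' header : chunksOf 60 sq)
```

A 70-residue sequence therefore yields a header line, one full 60-character line, and a final line of 10 characters.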
toFastaQual :: BioSeqQual s => s -> ByteString
Output Fasta-formatted quality data (.qual files), where quality values are output as whitespace-separated integers.
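As an illustration of the whitespace-separated integer layout, here is a minimal sketch. The function qualRecord is hypothetical and works on raw bytestrings; it writes all values on a single line, whereas the library may wrap long records across several lines.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

-- Hypothetical formatter: a '>' header line followed by the quality
-- values as decimal integers separated by single spaces.
-- Assumption: no line wrapping is applied to the quality line.
qualRecord :: BLC.ByteString -> BL.ByteString -> BLC.ByteString
qualRecord header quals =
  BLC.unlines
    [ BLC.cons '>' header
    , BLC.pack (unwords (map show (BL.unpack quals)))
    ]
```

Each byte of the quality data becomes one integer in the output, so the number of integers equals the sequence length.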
toFastQ :: BioSeqQual s => s -> ByteString
Output FastQ-formatted data. For simplicity, only the Sanger quality format is supported, and only four lines per sequence (i.e. no line breaks in sequence or quality data).
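The four-line layout with Sanger encoding can be sketched as below. In the Sanger format, each quality value q is written as the ASCII character with code q + 33. The names sangerEncode and fastqRecord are hypothetical, and the sketch takes raw bytestrings instead of a BioSeqQual instance.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

-- Sanger encoding: shift every quality byte up by 33.
-- Values above 93 fall outside printable ASCII, and values above 222
-- wrap around in Word8 arithmetic; this sketch does not guard against either.
sangerEncode :: BL.ByteString -> BLC.ByteString
sangerEncode = BL.map (+ 33)

-- Hypothetical formatter producing exactly four lines per record:
-- '@' header, sequence, '+' separator, encoded qualities.
fastqRecord :: BLC.ByteString -> BLC.ByteString -> BL.ByteString -> BLC.ByteString
fastqRecord header sq quals =
  BLC.unlines
    [ BLC.cons '@' header
    , sq
    , BLC.pack "+"
    , sangerEncode quals
    ]
```

For example, the quality values 0, 30, 40 and 41 encode to the characters `!`, `?`, `I` and `J`.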