biocore-0.2: A bioinformatics library

Bio.Core.Sequence

Description

This module defines common data structures for biosequences, i.e. data that represents nucleotide or protein sequences.

Basically, anything resembling or wrapping a sequence should implement the BioSeq class (and BioSeqQual if quality information is available).

The data types are mostly wrappers around lazy bytestrings from Data.ByteString.Lazy and Data.ByteString.Lazy.Char8.

Data definitions

newtype Qual

A quality value is in the range 0..255.

Constructors

Qual 

Fields

unQual :: Word8

newtype Offset

An Offset is a zero-based index into a sequence.

Constructors

Offset 

Fields

unOff :: Int64

newtype SeqData

Sequence data are lazy bytestrings of ASCII characters.

Constructors

SeqData 

Fields

unSD :: ByteString

newtype SeqLabel

Sequence labels are lazy bytestrings of ASCII characters.

Constructors

SeqLabel 

Fields

unSL :: ByteString

newtype QualData

Quality data are lazy bytestrings of Quals.

Constructors

QualData 

Fields

unQD :: ByteString
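
As a minimal sketch of how these wrappers might be used, the block below redeclares the newtypes locally (so it compiles without the library) with the constructors and fields documented above. The helpers residueAt and qualsOf and the example values are hypothetical and are not part of Bio.Core.Sequence.

```haskell
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as LC
import Data.Int (Int64)
import Data.Word (Word8)

-- Local stand-ins mirroring the documented constructors and fields.
newtype Qual     = Qual     { unQual :: Word8 }         deriving (Show, Eq)
newtype Offset   = Offset   { unOff  :: Int64 }         deriving (Show, Eq)
newtype SeqData  = SeqData  { unSD   :: LC.ByteString } deriving (Show, Eq)
newtype SeqLabel = SeqLabel { unSL   :: LC.ByteString } deriving (Show, Eq)
newtype QualData = QualData { unQD   :: L.ByteString }  deriving (Show, Eq)

-- Hypothetical helper: the residue at a zero-based Offset, if in range.
residueAt :: SeqData -> Offset -> Maybe Char
residueAt (SeqData s) (Offset i)
  | i < 0 || i >= LC.length s = Nothing
  | otherwise                 = Just (LC.index s i)

-- Hypothetical helper: unpack quality data into individual Qual values.
qualsOf :: QualData -> [Qual]
qualsOf = map Qual . L.unpack . unQD

exampleSeq :: SeqData
exampleSeq = SeqData (LC.pack "ACGT")

exampleQual :: QualData
exampleQual = QualData (L.pack [30, 31, 40, 2])
```

Because the wrappers are newtypes, unwrapping with unSD or unQD gives direct access to the underlying lazy bytestring.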

Class definitions

class BioSeq s where

The BioSeq class models sequence data, and any data object that represents a biological sequence should implement it.

class BioSeq sq => BioSeqQual sq where

The BioSeqQual class extends BioSeq with quality data. Any corresponding data object should be an instance; this allows output of Fasta-formatted quality data (via toFastaQual) as well as the combined FastQ format (via toFastQ).

Methods

seqqual :: sq -> QualData
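
The following sketch illustrates what an instance of both classes might look like. The method list of BioSeq is not shown on this page, so the stand-in class below uses a single assumed method, seqdata; only seqqual is taken from the documentation above. The SimpleRead type and the wellFormed check are hypothetical.

```haskell
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as LC

newtype SeqData  = SeqData  { unSD :: LC.ByteString } deriving (Show, Eq)
newtype QualData = QualData { unQD :: L.ByteString }  deriving (Show, Eq)

-- Stand-in for BioSeq; the single method here is an assumption.
class BioSeq s where
  seqdata :: s -> SeqData

-- Mirrors the documented subclass and its seqqual method.
class BioSeq sq => BioSeqQual sq where
  seqqual :: sq -> QualData

-- Hypothetical record for a sequencing read with per-base qualities.
data SimpleRead = SimpleRead
  { readBases :: SeqData
  , readQuals :: QualData
  }

instance BioSeq SimpleRead where
  seqdata = readBases

instance BioSeqQual SimpleRead where
  seqqual = readQuals

-- Hypothetical sanity check: exactly one quality byte per base.
wellFormed :: BioSeqQual s => s -> Bool
wellFormed s = LC.length (unSD (seqdata s)) == L.length (unQD (seqqual s))

exampleRead :: SimpleRead
exampleRead =
  SimpleRead (SeqData (LC.pack "ACGT")) (QualData (L.pack [30, 30, 20, 10]))
```

Functions written against the class constraint, such as wellFormed here, work for any type that provides the two instances.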

Helper functions

toFasta :: BioSeq s => s -> ByteString

Any BioSeq can be formatted as Fasta, with the sequence data wrapped into 60-character lines.
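
To make the 60-character layout concrete, here is a rough sketch of Fasta formatting over plain lazy bytestrings. It is not the library's implementation; wrapAt and fastaRecord are hypothetical names, and the header is passed in directly instead of being taken from a BioSeq instance.

```haskell
import qualified Data.ByteString.Lazy.Char8 as LC
import Data.Int (Int64)

-- Split a bytestring into chunks of at most n characters (assumes n > 0).
wrapAt :: Int64 -> LC.ByteString -> [LC.ByteString]
wrapAt n s
  | LC.null s = []
  | otherwise = let (h, t) = LC.splitAt n s in h : wrapAt n t

-- One Fasta record: a '>' header line, then the sequence in 60-char lines.
fastaRecord :: LC.ByteString -> LC.ByteString -> LC.ByteString
fastaRecord header sq = LC.unlines (LC.cons '>' header : wrapAt 60 sq)
```

A 61-residue sequence would therefore produce three lines: the header, one full line of 60 characters, and a final line holding the remaining character.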

toFastaQual :: BioSeqQual s => s -> ByteString

Output Fasta-formatted quality data (.qual files), where quality values are output as whitespace-separated integers.
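
As an illustration of the whitespace-separated integer layout, the sketch below renders raw quality bytes as decimal numbers. The names qualLine and fastaQualRecord are hypothetical, and line wrapping of long quality rows is left out for brevity, so the real output may differ in layout.

```haskell
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as LC

-- Render each quality byte as a decimal integer, separated by spaces.
qualLine :: L.ByteString -> LC.ByteString
qualLine = LC.unwords . map (LC.pack . show) . L.unpack

-- One .qual record: a '>' header line followed by the quality values.
-- Assumption: all values go on a single line (no wrapping).
fastaQualRecord :: LC.ByteString -> L.ByteString -> LC.ByteString
fastaQualRecord header quals = LC.unlines [LC.cons '>' header, qualLine quals]
```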

toFastQ :: BioSeqQual s => s -> ByteString

Output FastQ-formatted data. For simplicity, only the Sanger quality format is supported, and only four lines per sequence (i.e. no line breaks in sequence or quality data).
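
The four-line layout with Sanger qualities can be sketched as follows. The Sanger format stores each quality value as the ASCII character with code value + 33. The names sangerEncode and fastqRecord are hypothetical, and the bare "+" separator line is an assumption; the library may repeat the header there.

```haskell
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Char8 as LC

-- Sanger encoding: shift each quality byte by 33 into printable ASCII.
-- Word8 arithmetic wraps around, so values above 222 would overflow;
-- Sanger qualities normally stay within 0..93.
sangerEncode :: L.ByteString -> LC.ByteString
sangerEncode = L.map (+ 33)

-- One FastQ record in exactly four lines: '@' header, sequence,
-- '+' separator (assumed bare), and the encoded quality string.
fastqRecord :: LC.ByteString -> LC.ByteString -> L.ByteString -> LC.ByteString
fastqRecord header sq quals =
  LC.unlines [LC.cons '@' header, sq, LC.pack "+", sangerEncode quals]
```

For example, the quality values 0, 10, 20 and 40 encode to the characters '!', '+', '5' and 'I'.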