Bio.Sequence

Contents

Description

This is a meta-module importing and re-exporting sequence-related stuff.

It encompasses the Bio.Sequence.SeqData, Bio.Sequence.Fasta, and Bio.Sequence.TwoBit modules.

Synopsis

Data structures etc (Bio.Sequence.SeqData)

A sequence consists of a header, the sequence data itself, and optional quality data. The type parameter is a phantom type to separate nucleotide and amino acid sequences

Constructors

Seq !SeqData !SeqData !(Maybe QualData)

header and actual sequence

Instances

Eq (Sequence t)
Show (Sequence a)	A more arranged show instance for Sequences reassembling the display of the fasta-format

data Unknown Source

type Offset = Int64 Source

An offset, index, or length of a SeqData

type SeqData = ByteString Source

The basic data type used in Sequences

type Qual = Word8 Source

Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.

type QualData = ByteString Source

Quality data is a Qual vector, currently implemented as a ByteString.

Accessor functions

seqlength :: Sequence a -> Offset Source

Return sequence length.

seqlabel :: Sequence a -> SeqData Source

Return sequence label (first word of header)

seqheader :: Sequence a -> SeqData Source

Return full header.

seqdata :: Sequence a -> SeqData Source

Return the sequence data.

seqqual :: Sequence a -> QualData Source

Return the quality data, or error if none exist. Use hasqual if in doubt.

(!) :: Sequence a -> Offset -> Char Source

Read the character at the specified position in the sequence.

appendHeader :: Sequence a -> String -> Sequence aSource

setHeader :: Sequence a -> String -> Sequence aSource

Modify the header by appending text, or by replacing all but the sequence label (i.e. first word).

Converting to and from String.

fromStr :: String -> SeqData Source

Convert a String to SeqData

toStr :: SeqData -> String Source

Convert a SeqData to a String

Nucleotide functionality.

compl :: Char -> Char Source

Complement a single character. I.e. identify the nucleotide it can hybridize with. Note that for multiple nucleotides, you usually want the reverse complement (see revcompl for that).

revcompl :: Sequence Nuc -> Sequence Nuc Source

Calculate the reverse complement. This is only relevant for the nucleotide alphabet, and it leaves other characters unmodified.

revcompl' :: SeqData -> SeqData Source

Calculate the reverse complent for SeqData only.

data Nuc Source

For type tagging sequences (protein sequences use Amino below)

castToNuc :: Sequence a -> Sequence Nuc Source

Protein sequence functionality

data Amino Source

Constructors

Ala
Arg
Asn
Asp
Cys
Gln
Glu
Gly
His
Ile
Leu
Lys
Met
Phe
Pro
Ser
Thr
Tyr
Trp
Val
STP
Asx
Glx
Xle
Xaa

Instances

Eq Amino
Show Amino

translate :: Sequence Nuc -> Offset -> [Amino]Source

Translate a nucleotide sequence into the corresponding protein sequence. This works rather blindly, with no attempt to identify ORFs or otherwise QA the result.

fromIUPAC :: SeqData -> [Amino]Source

Convert a sequence in IUPAC format to a list of amino acids.

toIUPAC :: [Amino] -> SeqData Source

Convert a list of amino acids to a sequence in IUPAC format.

castToAmino :: Sequence a -> Sequence Amino Source

Other utility functions

defragSeq :: Sequence t -> Sequence tSource

Returns a sequence with all internal storage freshly copied and with sequence and quality data present as a single chunk.

By freshly copying internal storage, defragSeq allows garbage collection of the original data source whence the sequence was read; otherwise, use of just a short sequence name can cause an entire sequence file buffer to be retained.

By compacting sequence data into a single chunk, defragSeq avoids linear-time traversal of sequence chunks during random access into sequence data.

seqmap :: ((Char, Qual) -> (Char, Qual)) -> Sequence t -> Sequence tSource

map over sequences, treating them as a sequence of (char,word8) pairs. This will work on sequences without quality, as long as the function doesn't try to examine it. The current implementation is not very efficient.

File IO

Generic sequence reading

readNuc :: FilePath -> IO [Sequence Nuc]Source

Read nucleotide sequences in any format - Fasta, SFF, FastQ, 2bit, PHD... Todo: detect Illumina vs Sanger FastQ, transparent compression

readProt :: FilePath -> IO [Sequence Amino]Source

Read protein sequences in any supported format (i.e. Fasta)

The Fasta file format (Bio.Sequence.Fasta)

readFasta :: FilePath -> IO [Sequence Unknown]Source

Lazily read sequences from a FASTA-formatted file

hReadFasta :: Handle -> IO [Sequence Unknown]Source

Lazily read sequence from handle

writeFasta :: FilePath -> [Sequence a] -> IO ()Source

Write sequences to a FASTA-formatted file. Line length is 60.

hWriteFasta :: Handle -> [Sequence a] -> IO ()Source

Write sequences in FASTA format to a handle.

Quality data

Not part of the Fasta format, and treated separately.

readQual :: FilePath -> IO [Sequence Unknown]Source

Read quality data for sequences to a file.

writeQual :: FilePath -> [Sequence a] -> IO ()Source

Write quality data for sequences to a file.

hWriteQual :: Handle -> [Sequence a] -> IO ()Source

readFastaQual :: FilePath -> FilePath -> IO [Sequence Unknown]Source

Read sequence and associated quality. Will error if the sequences and qualites do not match one-to-one in sequence.

writeFastaQual :: FilePath -> FilePath -> [Sequence a] -> IO ()Source

Write sequence and quality data simulatnously This may be more laziness-friendly.

hWriteFastaQual :: Handle -> Handle -> [Sequence a] -> IO ()Source

The FastQ format (Bio.Sequence.FastQ)

readFastQ :: FilePath -> IO [Sequence Nuc]Source

writeFastQ :: FilePath -> [Sequence Nuc] -> IO ()Source

hReadFastQ :: Handle -> IO [Sequence Nuc]Source

hWriteFastQ :: Handle -> [Sequence Nuc] -> IO ()Source

readSangerQ :: FilePath -> IO [Sequence Nuc]Source

writeSangerQ :: FilePath -> [Sequence Nuc] -> IO ()Source

hReadSangerQ :: Handle -> IO [Sequence Nuc]Source

hWriteSangerQ :: Handle -> [Sequence Nuc] -> IO ()Source

readIllumina :: FilePath -> IO [Sequence Nuc]Source

writeIllumina :: FilePath -> [Sequence Nuc] -> IO ()Source

hReadIllumina :: Handle -> IO [Sequence Nuc]Source

hWriteIllumina :: Handle -> [Sequence Nuc] -> IO ()Source

The phd file format (Bio.Sequence.Phd)

These contain base (nucleotide) calling information, and are generated by phred.

readPhd :: FilePath -> IO (Sequence Nuc)Source

Parse a .phd file, extracting the contents as a Sequence

hReadPhd :: Handle -> IO (Sequence Nuc)Source

Parse .phd contents from a handle

TwoBit file format support (Bio.Seqeunce.TwoBit)

Used by BLAT and related tools.

decode2Bit :: ByteString -> [Sequence Nuc]Source

Parse a (lazy) ByteString as sequences in the 2bit format.

read2Bit :: FilePath -> IO [Sequence Nuc]Source

Read sequences from a file in 2bit format and | unmarshall/deserialize into Sequence format.

hRead2Bit :: Handle -> IO [Sequence Nuc]Source

Read sequences from a file handle in the 2bit format and | unmarshall/deserialze into Sequence format.

Hashing functionality (Bio.Sequence.HashWord)

Packing words from sequences into integral data types

data HashF k Source

This is a struct for containing a set of hashing functions

Constructors

HF
Fields hash :: SeqData -> Offset -> Maybe k calculates the hash at a given offset in the sequence hashes :: SeqData -> [(k, Offset)] calculate all hashes from a sequence, and their indices ksort :: [k] -> [k] for sorting hashes

contigous :: Integral k => Int -> HashF kSource

Contigous constructs an int/eger from a contigous k-word.

rcontig :: Integral k => Int -> HashF kSource

Like contigous, but returns the same hash for a word and its reverse complement.

rcpacked :: Integral k => Int -> HashF kSource

Like rcontig, but ignoring monomers (i.e. arbitrarily long runs of a single nucelotide are treated the same a single nucleotide.

Entropy calculations

class KWords s whereSource

Methods

kwords :: Int -> s -> [s]Source

Instances

KWords [a]

entropy :: (Ord str, KWords str) => Int -> str -> Double Source

Ala
Arg
Asn
Asp
Cys
Gln
Glu
Gly
His
Ile
Leu
Lys
Met
Phe
Pro
Ser
Thr
Tyr
Trp
Val
STP
Asx
Glx
Xle
Xaa

Ala
Arg
Asn
Asp
Cys
Gln
Glu
Gly
His
Ile
Leu
Lys
Met
Phe
Pro
Ser
Thr
Tyr
Trp
Val
STP
Asx
Glx
Xle
Xaa

Ala
Arg
Asn
Asp
Cys
Gln
Glu
Gly
His
Ile
Leu
Lys
Met
Phe
Pro
Ser
Thr
Tyr
Trp
Val
STP
Asx
Glx
Xle
Xaa