bio-0.4.5: A bioinformatics librarySource codeContentsIndex
Bio.Sequence
Contents
Data structures etc (Bio.Sequence.SeqData)
Accessor functions
Converting to and from String.
Nucleotide functionality.
Protein sequence functionality
Other utility functions
File IO
Generic sequence reading
The Fasta file format (Bio.Sequence.Fasta)
Quality data
The FastQ format (Bio.Sequence.FastQ)
The phd file format (Bio.Sequence.Phd)
TwoBit file format support (Bio.Seqeunce.TwoBit)
Hashing functionality (Bio.Sequence.HashWord)
Entropy calculations
Description

This is a meta-module importing and re-exporting sequence-related stuff.

It encompasses the Bio.Sequence.SeqData, Bio.Sequence.Fasta, and Bio.Sequence.TwoBit modules.

Synopsis
data Sequence t = Seq !SeqData !SeqData !(Maybe QualData)
data Unknown
type Offset = Int64
type SeqData = ByteString
type Qual = Word8
type QualData = ByteString
seqlength :: Sequence a -> Offset
seqlabel :: Sequence a -> SeqData
seqheader :: Sequence a -> SeqData
seqdata :: Sequence a -> SeqData
seqqual :: Sequence a -> QualData
(!) :: Sequence a -> Offset -> Char
appendHeader :: Sequence a -> String -> Sequence a
setHeader :: Sequence a -> String -> Sequence a
fromStr :: String -> SeqData
toStr :: SeqData -> String
compl :: Char -> Char
revcompl :: Sequence Nuc -> Sequence Nuc
revcompl' :: SeqData -> SeqData
data Nuc
castToNuc :: Sequence a -> Sequence Nuc
data Amino
= Ala
| Arg
| Asn
| Asp
| Cys
| Gln
| Glu
| Gly
| His
| Ile
| Leu
| Lys
| Met
| Phe
| Pro
| Ser
| Thr
| Tyr
| Trp
| Val
| STP
| Asx
| Glx
| Xle
| Xaa
translate :: Sequence Nuc -> Offset -> [Amino]
fromIUPAC :: SeqData -> [Amino]
toIUPAC :: [Amino] -> SeqData
castToAmino :: Sequence a -> Sequence Amino
defragSeq :: Sequence t -> Sequence t
seqmap :: ((Char, Qual) -> (Char, Qual)) -> Sequence t -> Sequence t
readNuc :: FilePath -> IO [Sequence Nuc]
readProt :: FilePath -> IO [Sequence Amino]
readFasta :: FilePath -> IO [Sequence Unknown]
hReadFasta :: Handle -> IO [Sequence Unknown]
writeFasta :: FilePath -> [Sequence a] -> IO ()
hWriteFasta :: Handle -> [Sequence a] -> IO ()
readQual :: FilePath -> IO [Sequence Unknown]
writeQual :: FilePath -> [Sequence a] -> IO ()
hWriteQual :: Handle -> [Sequence a] -> IO ()
readFastaQual :: FilePath -> FilePath -> IO [Sequence Unknown]
writeFastaQual :: FilePath -> FilePath -> [Sequence a] -> IO ()
hWriteFastaQual :: Handle -> Handle -> [Sequence a] -> IO ()
readFastQ :: FilePath -> IO [Sequence Nuc]
writeFastQ :: FilePath -> [Sequence a] -> IO ()
hReadFastQ :: Handle -> IO [Sequence Nuc]
hWriteFastQ :: Handle -> [Sequence a] -> IO ()
readPhd :: FilePath -> IO (Sequence Nuc)
hReadPhd :: Handle -> IO (Sequence Nuc)
decode2Bit :: ByteString -> [Sequence Unknown]
read2Bit :: FilePath -> IO [Sequence Unknown]
hRead2Bit :: Handle -> IO [Sequence Unknown]
data HashF k = HF {
hash :: SeqData -> Offset -> Maybe k
hashes :: SeqData -> [(k, Offset)]
ksort :: [k] -> [k]
}
contigous :: Integral k => Int -> HashF k
rcontig :: Integral k => Int -> HashF k
rcpacked :: Integral k => Int -> HashF k
class KWords s where
kwords :: Int -> s -> [s]
entropy :: (Ord str, KWords str) => Int -> str -> Double
Data structures etc (Bio.Sequence.SeqData)
data Sequence t Source
A sequence consists of a header, the sequence data itself, and optional quality data. The type parameter is a phantom type to separate nucleotide and amino acid sequences
Constructors
Seq !SeqData !SeqData !(Maybe QualData)header and actual sequence
show/hide Instances
data Unknown Source
type Offset = Int64Source
An offset, index, or length of a SeqData
type SeqData = ByteStringSource
The basic data type used in Sequences
type Qual = Word8Source
Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
type QualData = ByteStringSource
Quality data is a Qual vector, currently implemented as a ByteString.
Accessor functions
seqlength :: Sequence a -> OffsetSource
Return sequence length.
seqlabel :: Sequence a -> SeqDataSource
Return sequence label (first word of header)
seqheader :: Sequence a -> SeqDataSource
Return full header.
seqdata :: Sequence a -> SeqDataSource
Return the sequence data.
seqqual :: Sequence a -> QualDataSource
Return the quality data, or error if none exist. Use hasqual if in doubt.
(!) :: Sequence a -> Offset -> CharSource
Read the character at the specified position in the sequence.
appendHeader :: Sequence a -> String -> Sequence aSource
setHeader :: Sequence a -> String -> Sequence aSource
Modify the header by appending text, or by replacing all but the sequence label (i.e. first word).
Converting to and from String.
fromStr :: String -> SeqDataSource
Convert a String to SeqData
toStr :: SeqData -> StringSource
Convert a SeqData to a String
Nucleotide functionality.
compl :: Char -> CharSource
Complement a single character. I.e. identify the nucleotide it can hybridize with. Note that for multiple nucleotides, you usually want the reverse complement (see revcompl for that).
revcompl :: Sequence Nuc -> Sequence NucSource
Calculate the reverse complement. This is only relevant for the nucleotide alphabet, and it leaves other characters unmodified.
revcompl' :: SeqData -> SeqDataSource
Calculate the reverse complent for SeqData only.
data Nuc Source
For type tagging sequences (protein sequences use Amino below)
castToNuc :: Sequence a -> Sequence NucSource
Protein sequence functionality
data Amino Source
Constructors
Ala
Arg
Asn
Asp
Cys
Gln
Glu
Gly
His
Ile
Leu
Lys
Met
Phe
Pro
Ser
Thr
Tyr
Trp
Val
STP
Asx
Glx
Xle
Xaa
show/hide Instances
translate :: Sequence Nuc -> Offset -> [Amino]Source
Translate a nucleotide sequence into the corresponding protein sequence. This works rather blindly, with no attempt to identify ORFs or otherwise QA the result.
fromIUPAC :: SeqData -> [Amino]Source
Convert a sequence in IUPAC format to a list of amino acids.
toIUPAC :: [Amino] -> SeqDataSource
Convert a list of amino acids to a sequence in IUPAC format.
castToAmino :: Sequence a -> Sequence AminoSource
Other utility functions
defragSeq :: Sequence t -> Sequence tSource

Returns a sequence with all internal storage freshly copied and with sequence and quality data present as a single chunk.

By freshly copying internal storage, defragSeq allows garbage collection of the original data source whence the sequence was read; otherwise, use of just a short sequence name can cause an entire sequence file buffer to be retained.

By compacting sequence data into a single chunk, defragSeq avoids linear-time traversal of sequence chunks during random access into sequence data.

seqmap :: ((Char, Qual) -> (Char, Qual)) -> Sequence t -> Sequence tSource
map over sequences, treating them as a sequence of (char,word8) pairs. This will work on sequences without quality, as long as the function doesn't try to examine it. The current implementation is not very efficient.
File IO
Generic sequence reading
readNuc :: FilePath -> IO [Sequence Nuc]Source
Read nucleotide sequences in any format - Fasta, SFF, FastQ, 2bit, PHD...
readProt :: FilePath -> IO [Sequence Amino]Source
Read protein sequences in any supported format (i.e. Fasta)
The Fasta file format (Bio.Sequence.Fasta)
readFasta :: FilePath -> IO [Sequence Unknown]Source
Lazily read sequences from a FASTA-formatted file
hReadFasta :: Handle -> IO [Sequence Unknown]Source
Lazily read sequence from handle
writeFasta :: FilePath -> [Sequence a] -> IO ()Source
Write sequences to a FASTA-formatted file. Line length is 60.
hWriteFasta :: Handle -> [Sequence a] -> IO ()Source
Write sequences in FASTA format to a handle.
Quality data
Not part of the Fasta format, and treated separately.
readQual :: FilePath -> IO [Sequence Unknown]Source
Read quality data for sequences to a file.
writeQual :: FilePath -> [Sequence a] -> IO ()Source
Write quality data for sequences to a file.
hWriteQual :: Handle -> [Sequence a] -> IO ()Source
readFastaQual :: FilePath -> FilePath -> IO [Sequence Unknown]Source
Read sequence and associated quality. Will error if the sequences and qualites do not match one-to-one in sequence.
writeFastaQual :: FilePath -> FilePath -> [Sequence a] -> IO ()Source
Write sequence and quality data simulatnously This may be more laziness-friendly.
hWriteFastaQual :: Handle -> Handle -> [Sequence a] -> IO ()Source
The FastQ format (Bio.Sequence.FastQ)
readFastQ :: FilePath -> IO [Sequence Nuc]Source
writeFastQ :: FilePath -> [Sequence a] -> IO ()Source
hReadFastQ :: Handle -> IO [Sequence Nuc]Source
hWriteFastQ :: Handle -> [Sequence a] -> IO ()Source
The phd file format (Bio.Sequence.Phd)
These contain base (nucleotide) calling information, and are generated by phred.
readPhd :: FilePath -> IO (Sequence Nuc)Source
Parse a .phd file, extracting the contents as a Sequence
hReadPhd :: Handle -> IO (Sequence Nuc)Source
Parse .phd contents from a handle
TwoBit file format support (Bio.Seqeunce.TwoBit)
Used by BLAT and related tools.
decode2Bit :: ByteString -> [Sequence Unknown]Source
Parse a (lazy) ByteString as sequences in the 2bit format.
read2Bit :: FilePath -> IO [Sequence Unknown]Source
Read sequences from a file in 2bit format and | unmarshall/deserialize into Sequence format.
hRead2Bit :: Handle -> IO [Sequence Unknown]Source
Read sequences from a file handle in the 2bit format and | unmarshall/deserialze into Sequence format.
Hashing functionality (Bio.Sequence.HashWord)
Packing words from sequences into integral data types
data HashF k Source
This is a struct for containing a set of hashing functions
Constructors
HF
hash :: SeqData -> Offset -> Maybe kcalculates the hash at a given offset in the sequence
hashes :: SeqData -> [(k, Offset)]calculate all hashes from a sequence, and their indices
ksort :: [k] -> [k]for sorting hashes
contigous :: Integral k => Int -> HashF kSource
Contigous constructs an int/eger from a contigous k-word.
rcontig :: Integral k => Int -> HashF kSource
Like contigous, but returns the same hash for a word and its reverse complement.
rcpacked :: Integral k => Int -> HashF kSource
Like rcontig, but ignoring monomers (i.e. arbitrarily long runs of a single nucelotide are treated the same a single nucleotide.
Entropy calculations
class KWords s whereSource
Methods
kwords :: Int -> s -> [s]Source
show/hide Instances
KWords [a]
entropy :: (Ord str, KWords str) => Int -> str -> DoubleSource
Produced by Haddock version 2.6.1