Safe Haskell | Safe-Inferred |
---|
Data structures for manipulating (biological) sequences.
Generally supports both nucleotide and protein sequences, some functions,
like revcompl
, only makes sense for nucleotides.
- data Sequence t = Seq !SeqData !SeqData !(Maybe QualData)
- type Offset = Int64
- type SeqData = ByteString
- type Qual = Word8
- type QualData = ByteString
- (!) :: Sequence a -> Offset -> Char
- seqlength :: Sequence a -> Offset
- seqlabel :: Sequence a -> SeqData
- seqheader :: Sequence a -> SeqData
- seqdata :: Sequence a -> SeqData
- (?) :: Sequence a -> Offset -> Qual
- hasqual :: Sequence a -> Bool
- seqqual :: Sequence a -> QualData
- appendHeader :: Sequence a -> String -> Sequence a
- setHeader :: Sequence a -> String -> Sequence a
- fromStr :: String -> SeqData
- toStr :: SeqData -> String
- defragSeq :: Sequence t -> Sequence t
- seqmap :: ((Char, Qual) -> (Char, Qual)) -> Sequence t -> Sequence t
- castSeq :: Sequence a -> Sequence b
- compl :: Char -> Char
- revcompl :: Sequence Nuc -> Sequence Nuc
- revcompl' :: SeqData -> SeqData
- data Nuc
- castToNuc :: Sequence a -> Sequence Nuc
- data Amino
- translate :: Sequence Nuc -> Offset -> [Amino]
- fromIUPAC :: SeqData -> [Amino]
- toIUPAC :: [Amino] -> SeqData
- castToAmino :: Sequence a -> Sequence Amino
- putSeqLn :: Sequence a -> Int -> Int -> [(Int, Int)] -> IO ()
- seqToStr :: Sequence a -> Int -> Int -> [(Int, Int)] -> [Char]
- data Unknown
Data structure
A sequence is a header, sequence data itself, and optional quality data. Sequences are type-tagged to identify them as nucleotide, amino acids, or unknown type. All items are lazy bytestrings. The Offset type can be used for indexing.
If you use overloaded strings (e.g., ghc -XOverloadedString), you can easily construct sequences from string literals.
A sequence consists of a header, the sequence data itself, and optional quality data. The type parameter is a phantom type to separate nucleotide and amino acid sequences
type SeqData = ByteStringSource
The basic data type used in Sequence
s
Quality data is normally associated with nucleotide sequences
Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
type QualData = ByteStringSource
Quality data is a Qual
vector, currently implemented as a ByteString
.
Accessor functions
(!) :: Sequence a -> Offset -> CharSource
Read the character at the specified position in the sequence.
seqqual :: Sequence a -> QualDataSource
Return the quality data, or error if none exist. Use hasqual if in doubt.
Adding information to header
appendHeader :: Sequence a -> String -> Sequence aSource
Modify the header by appending text, or by replacing all but the sequence label (i.e. first word).
setHeader :: Sequence a -> String -> Sequence aSource
Modify the header by appending text, or by replacing all but the sequence label (i.e. first word).
Converting to and from [Char]
It is probably better to use the IsString
class from String
for this.
Sequence utilities
defragSeq :: Sequence t -> Sequence tSource
Returns a sequence with all internal storage freshly copied and with sequence and quality data present as a single chunk.
By freshly copying internal storage, defragSeq
allows garbage
collection of the original data source whence the sequence was
read; otherwise, use of just a short sequence name can cause an
entire sequence file buffer to be retained.
By compacting sequence data into a single chunk, defragSeq
avoids
linear-time traversal of sequence chunks during random access into
sequence data.
seqmap :: ((Char, Qual) -> (Char, Qual)) -> Sequence t -> Sequence tSource
map over sequences, treating them as a sequence of (char,word8) pairs. This will work on sequences without quality, as long as the function doesn't try to examine it. The current implementation is not very efficient.
castSeq :: Sequence a -> Sequence bSource
Phantom type functionality, unchecked conversion between sequence types
Nucleotide functionality
Nucleotide sequences contain the alphabet [A,C,G,T]. IUPAC specifies an extended nucleotide alphabet with wildcards, but it is not supported at this point.
Complement a single character. I.e. identify the nucleotide it
can hybridize with. Note that for multiple nucleotides, you usually
want the reverse complement (see revcompl
for that).
revcompl :: Sequence Nuc -> Sequence NucSource
Calculate the reverse complement. This is only relevant for the nucleotide alphabet, and it leaves other characters unmodified.
Protein functionality
Proteins are chains of amino acids, represented by the IUPAC alphabet.
translate :: Sequence Nuc -> Offset -> [Amino]Source
Translate a nucleotide sequence into the corresponding protein sequence. This works rather blindly, with no attempt to identify ORFs or otherwise QA the result.
castToAmino :: Sequence a -> Sequence AminoSource
Display a nicely formated sequence.
putSeqLn :: Sequence a -> Int -> Int -> [(Int, Int)] -> IO ()Source
A simple function to display a sequence: we generate the sequence string and | call putStrLn
seqToStr :: Sequence a -> Int -> Int -> [(Int, Int)] -> [Char]Source
Returns a properly formatted and probably highlighted string | representation of a sequence. Highlighting is done using ANSI-Escape | sequences.