\n      !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~  Workaround, the current Data.ByteString.Lazy.Char8 contains a bug in  Data.ByteString.Lazy.Char8.lines. ,Break a list of bytestrings on a predicate. :Output (to stderr) progress while evaluating a lazy list. L Useful for generating output while (conceptually, at least) in pure code A lazier version of Control.Monad.sequence in  Control.Monad , needed by  above. 1Data structure for storing hierarchical clusters )Single linkage agglomerative clustering. T Cluster elements by slurping a sorted list of pairs with score (i.e. triples :-) 3 Keeps a set of contained elements at each branch's root, so O(n log n), ' and requires elements to be in Ord. Z For this to work, the triples must be sorted on score. Earlier scores in the list will W make up the lower nodes, so sort descending for similarity, ascending for distance.     A 7 may contain multiple separate matches (typcially when B an indel causes a frameshift that blastx is unable to bridge). >Each match between a query and a target sequence (or subject)  is a .  Each query sequence generates a  A  is the root of the hierarchy. (JThe Aux field in the BLAST output includes match information that depends R on the BLAST flavor (blastn, blastx, or blastp). This data structure captures  those variations. )blastx *blastn +The +B indicates the direction of the match, i.e. the plain sequence or  its reverse complement. .:The sequence id, i.e. the first word of the header field. %  !"#$%&'()*+,-.%.+-,(*) !"#$%&' %    !"#$%&' !"#$%&'(*))*+-,,-.///0"Parse BLAST results in XML format #breaks p = groupBy (const (not.p)) 0001GThe BlastFlat data structure contains information about a single match @SConvert BlastRecords into BlastFlats (representing a depth-first traversal of the  BlastRecord structure.)  !"#$%&'()*+,-123456789:;<=>?@123456789:;<=>?@ !"#$%&'(*)+-,1 23456789:;<=>23456789:;<=>?@A>Evidence codes describe the type of support for an annotation   -http://www.geneontology.org/GO.evidence.shtml BNot Recorded CTraceable Author Statement D.Inferred from Reviewed Computational Analysis ENo biological Data available FNon-traceable Author Statement G0Inferred from Sequence or Structural Similarity H#Inferred from Physical Interaction IInferred from Mutant Phenotype J"Inferred from Genetic Interaction KInferred from Genomic Context L!Inferred from Expression Pattern M$Inferred from Electronic Annotation NInferred from Direct Assay OInferred by Curator PRA GOA annotation, containing a UniProt identifier, a GoTerm and an evidence code. R=A UniProt identifier (short string of capitals and numbers). SA GoDef maps a GoTerm to a description and a GoClass. Y A GO term is a positive integer [SA list of Go definitions, with pointers to parent nodes. Read from the .obo file. U The user may construct the explicit hierachy by storing these in a Map or similar \XRead the GO hierarchy from the obo file. Note that this is not quite a tree structure. ]7Read the goa_uniprot file (warning: this one is huge!) ^9Read GO term definitions, from the GO.terms_and_ids file Parse a GoDef+ from a line in the GO.terms_and_ids file.  Reading an  Annotation& from a line in the association file. ?Read the evidence code from a ByteString (no error checking!). `JThe vast majority of GOA data is IEA, while the most reliable information L is manually curated. Filtering on this is useful to keep data set sizes  manageable, too. ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_` YZST[\^PQRUXWVAONMLKJIHGFEDCB]`_ AONMLKJIHGFEDCBBCDEFGHIJKLMNOPQQRSTTUXWVVWXYZZ[\]^_` !ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`aaa bcdbcdbccd VA sequence consists of a header, the sequence data itself, and optional quality data. header and actual sequence Quality data is a  $ vector, currently implemented as a  ByteString. HBasic type for quality data. Range 0..255. Typical Phred output is in N the range 6..50, with 20 as the line in the sand separating good from bad. The basic data type used in  s !An offset, index, or length of a   Convert a String to    Convert a   to a String >Read the character at the specified position in the sequence. Return sequence length. -Return sequence label (first word of header) Return full header. Return the sequence data. KReturn the quality data, or error if none exist. Use hasqual if in doubt. 8Check whether the sequence has associated quality data. 5Modify the header by appending text, or by replacing 1 all but the sequence label (i.e. first word). "Calculate the reverse complement. 7 This is only relevant for the nucleotide alphabet, 0 and it leaves other characters unmodified. AComplement a single character. I.e. identify the nucleotide it H can hybridize with. Note that for multiple nucleotides, you usually $ want the reverse complement (see   for that). ?Translate a nucleotide sequence into the corresponding protein J sequence. This works rather blindly, with no attempt to identify ORFs  or otherwise QA the result. =Convert a list of amino acids to a sequence in IUPAC format. =Convert a sequence in IUPAC format to a list of amino acids. 1efghijklmnopqrstuvwxyz{|}~1e~}|{zyxwvutsrqponmlkjihgf1e~}|{zyxwvutsrqponmlkjihgffghijklmnopqrstuvwxyz{|}~ 2Lazily read sequences from a FASTA-formatted file +Write sequences to a FASTA-formatted file.  Line length is 60. +Read quality data for sequences to a file. ,Write quality data for sequences to a file. 5Read sequence and associated quality. Will error if C the sequences and qualites do not match one-to-one in sequence. .Write sequence and quality data simulatnously ' This may be more laziness-friendly. !Lazily read sequence from handle -Write sequences in FASTA format to a handle. BConvert a list of FASTA-formatted lines into a list of sequences.  Blank lines are ignored.  Comment lines start with #. are allowed between sequences (and ignored).  Lines starting with > initiate a new sequence. &Split lines into blocks starting with  characters  Filter out # comments (but not semicolons?) ;Parse a (lazy) ByteString as sequences in the 2bit format. .Extract sequences from a file in 2bit format. 4Extract sequences in the 2bit format from a handle. ,Write sequences to file in the 2bit format. 0Write sequences to a handle in the 2bit format. 9Parse a .phd file, extracting the contents as a Sequence "Parse .phd contents from a handle The actual phd parser. 2Pack bytestring segments into a single bytestring 2 Allows the (rest of the) file contents to be GC'ed ;This is a struct for containing a set of hashing functions 6calculates the hash at a given offset in the sequence 8calculate all hashes from a sequence, and their indices for sorting hashes Adds a default hashes function to a HashF, when hash is defined. Contigous constructs an int/eger from a contigous k-word. Like C, but returns the same hash for a word and its reverse complement. Like rcontigK, but ignoring monomers (i.e. arbitrarily long runs of a single nucelotide - are treated the same a single nucleotide.  7A Selector consists of a zero element, and a funcition L that chooses a possible Edit operation, and generates an updated result. KA substitution matrix gives scores for replacing a character with another. $ Typically, it will be symmetric. %An alignment is a sequence of edits. /An Edit is either the insertion, the deletion, & or the replacement of a character. /The sequence element type, used in alignments. Gaps are coded as +s, this function removes them, and returns 6 the sequence along with the list of gap positions. &turn an alignment into sequences with  representing gaps " (for checking, filtering out the  characters should return " the original sequences, provided  isn't part of the sequence  alphabet) True if the Edit is a Repl. 2Evaluate an Edit based on SubstMx and gap penalty -Calculate a set of columns containing scores [ This represents the columns of the alignment matrix, but will only require linear space  for score calculation. :BLOSUM45 matrix, suitable for distantly related sequences The standard BLOSUM62 matrix. :BLOSUM80 matrix, suitable for closely related sequences. The standard PAM30 matrix The standard PAM70 matrix. 7Blast defaults, use with gap_open = -5 gap_extend = -3 G This should really check for valid nucleotides, and perhaps be more ( lenient in the case of Ns. Oh well. Construct a simple matrix from match score/mismatch penalty BCalculate global edit distance (Needleman-Wunsch alignment score) Scoring/(selection function for global alignment ?Calculate local edit distance (Smith-Waterman alignment score) Scoring/'selection funciton for local alignmnet Calculate alignments. -Minus infinity (or an approximation thereof) BCalculate global edit distance (Needleman-Wunsch alignment score) ?Calculate local edit distance (Smith-Waterman alignment score) DGeneric scoring and selection function for global and local scoring .Calculate global alignment (Needleman-Wunsch) +Calculate local alignmnet (Smith-Waterman) =Generic scoring and selection for global and local alignment  AThe selector must take into account the quality of the sequences  on Ins/FDel, the average of qualities surrounding the gap is (should be) used -Minus infinity (or an approximation thereof) BCalculate global edit distance (Needleman-Wunsch alignment score) ?Calculate local edit distance (Smith-Waterman alignment score) ?Calucalte best overlap score, where gaps at the edges are free K The starting point is like for local score (0 cost for initial indels), V the result is the maximum anywhere in the last column or bottom row of the matrix. DGeneric scoring and selection function for global and local scoring .Calculate global alignment (Needleman-Wunsch) +Calculate local alignment (Smith-Waterman)  (can we replace uncurry max'? with fst - a local alignment must always end on a subst, no?) ?Calucalte best overlap score, where gaps at the edges are free K The starting point is like for local score (0 cost for initial indels), V the result is the maximum anywhere in the last column or bottom row of the matrix. HVariant that retains indels to retain the entire sequence in the result =Generic scoring and selection for global and local alignment  The Parsec parser type !ACE header lines with parameters F The tokenizer (scanner) should convert input into a list of these, ) which in turn can be parsed by Parsec  'Parse a single token, primitive parser (Test parser p on a list of ACE elements  %Add SourcePoses to a stream of ACEs.  2Parse a complete ACE file as a set of assemblies.  parse the initial header  2parse the contig and quality information (CO, BQ) 'Read a list of Ints in the Maybe monad Given the CO info, get the AFS'es =Parse a list of AFS, followed by actual read, and merge them ' afs :: Sequence -> AceParser [Sequence] -- plus some auxiliary info? parse each read (RD, QA, DS) Y Vector NTI appears to insert solitary RDs, sometimes even without any sequence data!? ( This is not supported at this point. Reading an ACE file. Ibcdefghijklmnopqrstuvwxyz{|}~Ie~}|{zyxwvutsrqponmlkjihgfbcd Progressive multiple alignment. > Calculate a tree from agglomerative clustering, then align G at each branch going bottom up. Returns a list of columns (rows?). ODerive alignments indirectly, i.e. calculate A|C using alignments A|B and B|C.  This is central for Coffee5 evaluation of alignments, and T-Coffee construction  of alignments.  !"#$%&&'()*+,-.//012334567789:;<=>?@ABCDEFGHII4501'()*+,-.HJKLMNOPQRSTUVWXYZ[\]]^_`abcdefghi j k l m n o p q r s t u v w x y z { | } ~                        bio-0.3.3.4Bio.Util.ParsexBio.UtilBio.ClusteringBio.Alignment.BlastDataBio.Alignment.BlastBio.Alignment.BlastXMLBio.Alignment.BlastFlatBio.Sequence.GeneOntologyBio.Sequence.GOABio.Sequence.EntropyBio.Sequence.SeqDataBio.Sequence.FastaBio.Sequence.TwoBitBio.Sequence.PhdBio.Sequence.HashWordBio.Alignment.AlignDataBio.Alignment.MatricesBio.Alignment.SAlignBio.Alignment.AAlignBio.Alignment.QAlignBio.Alignment.ACEBio.Alignment.MultipleBio.Sequence.KEGGbaseData.OrdPrelude Bio.SequencelazyManylinesmylines splitWhencountIO sequence' ClusteredLeafBranch cluster_sl BlastMatchbitse_validentityq_fromq_toh_fromh_toauxBlastHitsubjectslengthmatches BlastRecordqueryqlengthhits BlastResult blastprogram blastversion blastdateblastreferencesdatabase dbsequencesdbcharsresultsAuxFrameStrandsStrandMinusPlusSeqIdparsereadXML BlastFlatflatten EvidenceCodeNRTASRCANDNASISSIPIIMPIGIIGCIEPIEAIDAIC AnnotationAnn UniProtAccGoDefGoClassCompProcFuncGoTermGO GoHierarchyreadOboreadGOA readTerms decomment isCuratedreadGOKWordskwordsentropyAminoXaaXleGlxAsxSTPValTrpTyrThrSerProPheMetLysLeuIleHisGlyGluGlnCysAspAsnArgAlaSequenceSeqQualDataQualSeqDataOffsetfromStrtoStr!? seqlengthseqlabel seqheaderseqdataseqqualhasqual appendHeader setHeaderrevcomplcompl translatetoIUPAC fromIUPAC readFasta writeFastareadQual writeQual readFastaQualwriteFastaQualhWriteFastaQual hReadFasta hWriteFasta hWriteQualmkSeqs countSeqs decode2Bitread2Bit hRead2BitreadPhdhReadPhdShapeHashFHFhashhashesksortgenkeys contigousrcontigcompactrcpackedgappedisNn2kn2i'k2nvalunval complementSelectorSubstMxEditListEditReplDelInsChr AlignmentGapsDirRevFwd extractGaps insertGaps toStringsisReplevalcolumnsblosum45blosum62blosum80pam30pam70blastn_defaultsimpleMx global_score local_score global_align local_alignqualMx overlap_score overlap_aligntestAssemblyAsmcontig fragmentsreadsptestreadACEwriteACE progressiveindirectbreaksmkGoDefmkAnngetECblocks GHC.Classes> write2Bit hWrite2BitmkPhdk2n'GHC.Num*- showalignong_scorel_scoreminf score_select align_select QSelectoroverlap_align' AceParserACEparse1sourceaceasctgreadIntsasmafrd