Workaround: the current Data.ByteString.Lazy.Char8 contains a bug in Data.ByteString.Lazy.Char8.lines.
Break a list of bytestrings on a predicate.
Output (to stderr) progress while evaluating a lazy list. Useful for generating output while (conceptually, at least) in pure code.
A lazier version of Control.Monad.sequence in Control.Monad, needed by countIO above.

Data structure for storing hierarchical clusters.
Single linkage agglomerative clustering. Cluster elements by slurping a sorted list of pairs with score (i.e. triples :-). Keeps a set of contained elements at each branch's root, so O(n log n), and requires elements to be in Ord. For this to work, the triples must be sorted on score. Earlier scores in the list will make up the lower nodes, so sort descending for similarity, ascending for distance.

A BlastHit may contain multiple separate matches (typically when an indel causes a frameshift that blastx is unable to bridge).
Each match between a query and a target sequence (or subject) is a BlastHit.
Each query sequence generates a BlastRecord.
A BlastResult is the root of the hierarchy.
The Aux field in the BLAST output includes match information that depends on the BLAST flavor (blastn, blastx, or blastp). This data structure captures those variations.
  Frame: blastx
  Strands: blastn
The Strand indicates the direction of the match, i.e. the plain sequence or its reverse complement.
The sequence id, i.e. the first word of the header field.
Parse BLAST results in XML format.
breaks p = groupBy (const (not.p))

The BlastFlat data structure contains information about a single match.
Convert BlastRecords into BlastFlats (representing a depth-first traversal of the BlastRecord structure).

Evidence codes describe the type of support for an annotation: http://www.geneontology.org/GO.evidence.shtml
  NR   Not Recorded
  TAS  Traceable Author Statement
  RCA  Inferred from Reviewed Computational Analysis
  ND   No biological Data available
  NAS  Non-traceable Author Statement
  ISS  Inferred from Sequence or Structural Similarity
  IPI  Inferred from Physical Interaction
  IMP  Inferred from Mutant Phenotype
  IGI  Inferred from Genetic Interaction
  IGC  Inferred from Genomic Context
  IEP  Inferred from Expression Pattern
  IEA  Inferred from Electronic Annotation
  IDA  Inferred from Direct Assay
  IC   Inferred by Curator
A GOA annotation, containing a UniProt identifier, a GoTerm and an evidence code.
A UniProt identifier (short string of capitals and numbers).
A GoDef maps a GoTerm to a description and a GoClass.
A GO term is a positive integer.
A list of Go definitions, with pointers to parent nodes. Read from the .obo file. The user may construct the explicit hierarchy by storing these in a Map or similar.
Read the GO hierarchy from the obo file. Note that this is not quite a tree structure.
Read the goa_uniprot file (warning: this one is huge!).
Read GO term definitions from the GO.terms_and_ids file.
Parse a GoDef from a line in the GO.terms_and_ids file.
Reading an Annotation from a line in the association file.
Read the evidence code from a ByteString (no error checking!).
The vast majority of GOA data is IEA, while the most reliable information is manually curated. Filtering on this is useful to keep data set sizes manageable, too.
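The evidence-code filtering described above can be illustrated with a minimal sketch. The types here are simplified stand-ins (a tuple instead of the library's Annotation, String instead of ByteString), and the rule that "curated" means "anything except IEA" is an assumption made for illustration; the library's own isCurated may draw the line differently.

```haskell
-- Simplified, hypothetical stand-ins for the library's types.
data EvidenceCode
  = NR | TAS | RCA | ND | NAS | ISS | IPI | IMP | IGI | IGC | IEP | IEA | IDA | IC
  deriving (Eq, Show)

-- (UniProt accession, GO term number, evidence code)
type Annotation = (String, Int, EvidenceCode)

-- Assumption: an annotation counts as curated unless it was
-- inferred electronically (IEA).
isCuratedCode :: EvidenceCode -> Bool
isCuratedCode = (/= IEA)

-- Keep only the curated annotations, preserving their order.
curatedOnly :: [Annotation] -> [Annotation]
curatedOnly = filter (\(_, _, ec) -> isCuratedCode ec)
```

Because filter is lazy, a function of this shape can be applied to a lazily read annotation list without holding the whole file in memory.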
Most KEGG files that contain associations have one association per line, consisting of two items separated by whitespace. This is a generalized reader function.
Convert UniProt IDs (up:xxxxxx) to the UniProtAcc type.
Convert KO IDs (ko:xxxxx) to the KO data type.
KEGG uses strings with an identifying prefix for IDs. This helper function checks and removes the prefix to construct native values.

A sequence consists of a header, the sequence data itself, and optional quality data.
header and actual sequence
Quality data is a Qual vector, currently implemented as a ByteString.
Basic type for quality data. Range 0..255. Typical Phred output is in the range 6..50, with 20 as the line in the sand separating good from bad.
The basic data type used in Sequences.
An offset, index, or length of a SeqData.
Convert a String to SeqData.
Convert a SeqData to a String.
Read the character at the specified position in the sequence.
Return sequence length.
Return sequence label (first word of header).
Return full header.
Return the sequence data.
Return the quality data, or error if none exist. Use hasqual if in doubt.
Check whether the sequence has associated quality data.
Modify the header by appending text, or by replacing all but the sequence label (i.e. first word).
Calculate the reverse complement. This is only relevant for the nucleotide alphabet, and it leaves other characters unmodified.
Complement a single character, i.e. identify the nucleotide it can hybridize with. Note that for multiple nucleotides, you usually want the reverse complement (see revcompl for that).
Translate a nucleotide sequence into the corresponding protein sequence. This works rather blindly, with no attempt to identify ORFs or otherwise QA the result.
Convert a list of amino acids to a sequence in IUPAC format.
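As an illustration of the complement and reverse complement described above, here is a minimal sketch. It works on plain String rather than the library's ByteString-based sequence data, handles only upper- and lower-case A, C, G and T, and the names complChar and revComplStr are hypothetical, chosen to avoid implying they are the library's compl and revcompl.

```haskell
-- Complement a single nucleotide; any other character
-- (N, gaps, amino acid letters) is returned unchanged.
complChar :: Char -> Char
complChar c = case c of
  'A' -> 'T'
  'T' -> 'A'
  'C' -> 'G'
  'G' -> 'C'
  'a' -> 't'
  't' -> 'a'
  'c' -> 'g'
  'g' -> 'c'
  _   -> c

-- Reverse complement: complement every position, then reverse.
revComplStr :: String -> String
revComplStr = reverse . map complChar
```

Applying revComplStr twice returns the original string, which makes a convenient sanity check.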
Convert a sequence in IUPAC format to a list of amino acids.

Lazily read sequences from a FASTA-formatted file.
Write sequences to a FASTA-formatted file. Line length is 60.
Read quality data for sequences from a file.
Write quality data for sequences to a file.
Read sequence and associated quality. Will error if the sequences and qualities do not match one-to-one in sequence.
Write sequence and quality data simultaneously. This may be more laziness-friendly.
Lazily read sequence from handle.
Write sequences in FASTA format to a handle.
Convert a list of FASTA-formatted lines into a list of sequences. Blank lines are ignored. Comment lines starting with # are allowed between sequences (and ignored). Lines starting with > initiate a new sequence.
Split lines into blocks starting with > characters. Filter out # comments (but not semicolons?).

Parse one FastQ entry, suitable for use in unfoldr over lines from a file.

Parse a (lazy) ByteString as sequences in the 2bit format.
Extract sequences from a file in 2bit format.
Extract sequences in the 2bit format from a handle.
Write sequences to file in the 2bit format.
Write sequences to a handle in the 2bit format.

Parse a .phd file, extracting the contents as a Sequence.
Parse .phd contents from a handle.
The actual phd parser.
Pack bytestring segments into a single bytestring. Allows the (rest of the) file contents to be GC'ed.

This is a struct for containing a set of hashing functions.
calculates the hash at a given offset in the sequence
calculate all hashes from a sequence, and their indices
for sorting hashes
Adds a default hashes function to a HashF, when hash is defined.
contigous constructs an integer from a contiguous k-word.
Like contigous, but returns the same hash for a word and its reverse complement.
Like rcontig, but ignoring monomers (i.e. arbitrarily long runs of a single nucleotide are treated the same as a single nucleotide).
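The k-word hashing described above can be sketched as a base-4 encoding. This is an illustration under stated assumptions, not the library's implementation: the helper names (nucCode, kwordHash, rcHash) are hypothetical, the input is a plain String, and the reverse-complement-invariant hash is taken to be the smaller of the two strand hashes, which is one common convention.

```haskell
import Data.List (foldl')

-- Two-bit code for a nucleotide; anything else (e.g. N) has no code.
nucCode :: Char -> Maybe Integer
nucCode c = case c of
  'A' -> Just 0
  'C' -> Just 1
  'G' -> Just 2
  'T' -> Just 3
  _   -> Nothing

-- Read a k-word as a base-4 number, most significant digit first.
-- Returns Nothing if the word contains a non-ACGT character.
kwordHash :: String -> Maybe Integer
kwordHash = foldl' step (Just 0)
  where
    step acc c = do
      h <- acc
      d <- nucCode c
      return (4 * h + d)

-- Complement of a single upper-case nucleotide.
complementNt :: Char -> Char
complementNt c = case c of
  'A' -> 'T'
  'T' -> 'A'
  'C' -> 'G'
  'G' -> 'C'
  _   -> c

-- Strand-independent hash: the smaller of the hashes of the word
-- and of its reverse complement (an assumed convention).
rcHash :: String -> Maybe Integer
rcHash w = min <$> kwordHash w <*> kwordHash (reverse (map complementNt w))
```

For example, kwordHash "ACGT" evaluates to Just 27, and rcHash gives the same value for "AAC" and for its reverse complement "GTT".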
This contains the actual flowgram for a single read.
Each Read has a fixed read header.
SFF has a 31-byte common header. Todo: remove items that are derivable (counters, magic, etc). cheader_lenght points to the first read header. Also, the format is open to having the index anywhere between reads; we should really keep count and check for each read. In practice, it seems to be placed after the reads.
The following two fields are considered part of the header, but as they are static, they are not part of the data structure:
  magic :: Word32 -- ^ 0x2e736666, i.e. the string .sff
  version :: Word32 -- ^ 0x00000001
Points to a text(?) section.
The data structure storing the contents of an SFF file (modulo the index).
The type of flowgram value.
Test serialization by outputting the header and first two reads in an SFF, and the same after a decode + encode cycle.
Convert a file by decoding it and re-encoding it. This will lose the index (which isn't really necessary).
Generalized function for padding.
Generalized function to skip padding.
A ReadBlock can't be an instance of Binary directly, since it depends on information from the CommonHeader.
What the name and type says.

A Selector consists of a zero element, and a function that chooses a possible Edit operation, and generates an updated result.
A substitution matrix gives scores for replacing a character with another. Typically, it will be symmetric.
An alignment is a sequence of edits.
An Edit is either the insertion, the deletion, or the replacement of a character.
The sequence element type, used in alignments.
Gaps are coded as '*'s; this function removes them, and returns the sequence along with the list of gap positions.
Turn an alignment into sequences with '*' representing gaps (for checking, filtering out the '*' characters should return the original sequences, provided '*' isn't part of the sequence alphabet).
True if the Edit is a Repl.
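A minimal sketch of the gap extraction and re-insertion described above. It assumes gaps are coded as '*', works on String, and records gap positions as 0-based indices into the gapped string; the names and the position convention are assumptions made for illustration, and the library's own extractGaps and insertGaps may represent gaps differently.

```haskell
-- Remove '*' gap characters and report where they were
-- (0-based indices into the original, gapped string).
extractGapsStr :: String -> (String, [Int])
extractGapsStr s =
  ( filter (/= '*') s
  , [i | (i, c) <- zip [0 ..] s, c == '*'] )

-- Inverse of extractGapsStr: put '*' back at the recorded positions.
-- The positions must be in ascending order.
insertGapsStr :: (String, [Int]) -> String
insertGapsStr (s, gaps) = go 0 s gaps
  where
    go :: Int -> String -> [Int] -> String
    go _ rest [] = rest
    go i rest (g : gs)
      | i == g = '*' : go (i + 1) rest gs
    go i (c : cs) gs = c : go (i + 1) cs gs
    go _ [] _ = []  -- positions past the end of the sequence are dropped
```

Under these assumptions, insertGapsStr (extractGapsStr s) returns s for any gapped string s, which mirrors the round-trip check mentioned in the description.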
Evaluate an Edit based on SubstMx and gap penalty.
Calculate a set of columns containing scores. This represents the columns of the alignment matrix, but will only require linear space for score calculation.

BLOSUM45 matrix, suitable for distantly related sequences.
The standard BLOSUM62 matrix.
BLOSUM80 matrix, suitable for closely related sequences.
The standard PAM30 matrix.
The standard PAM70 matrix.
Blast defaults, use with gap_open = -5 gap_extend = -3. This should really check for valid nucleotides, and perhaps be more lenient in the case of Ns. Oh well.
Construct a simple matrix from match score/mismatch penalty.

Calculate global edit distance (Needleman-Wunsch alignment score).
Scoring/selection function for global alignment.
Calculate local edit distance (Smith-Waterman alignment score).
Scoring/selection function for local alignment.
Calculate alignments.

Minus infinity (or an approximation thereof).
Calculate global edit distance (Needleman-Wunsch alignment score).
Calculate local edit distance (Smith-Waterman alignment score).
Generic scoring and selection function for global and local scoring.
Calculate global alignment (Needleman-Wunsch).
Calculate local alignment (Smith-Waterman).
Generic scoring and selection for global and local alignment.

The selector must take into account the quality of the sequences; on Ins/Del, the average of qualities surrounding the gap is (should be) used.
Minus infinity (or an approximation thereof).
Calculate global edit distance (Needleman-Wunsch alignment score).
Calculate local edit distance (Smith-Waterman alignment score).
Calculate best overlap score, where gaps at the edges are free. The starting point is like for local score (0 cost for initial indels), the result is the maximum anywhere in the last column or bottom row of the matrix.
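The column-by-column score calculation described above can be sketched as follows. This is an illustrative Needleman-Wunsch score with a linear gap penalty and a simple match/mismatch scheme, not the library's selector-based implementation; the function name and parameter order are hypothetical. Only one column of the matrix is kept at a time, which is what gives linear space.

```haskell
import Data.List (foldl')

-- Global alignment score of xs against ys.
-- match and mismatch are substitution scores; gap is the (negative)
-- score added for each inserted or deleted character.
globalScoreSketch :: Int -> Int -> Int -> String -> String -> Int
globalScoreSketch match mismatch gap xs ys =
  last (foldl' nextCol firstCol xs)
  where
    -- Column for the empty prefix of xs: 0, gap, 2*gap, ...
    firstCol = take (length ys + 1) (iterate (+ gap) 0)

    score x y = if x == y then match else mismatch

    -- Build the next column from the previous one. The new column
    -- refers to its own earlier cells ("up"), which laziness permits.
    nextCol prev x = new
      where
        new = (head prev + gap)
              : zipWith3 cell ys (zip prev (tail prev)) new
        cell y (diag, left) up =
          maximum [diag + score x y, left + gap, up + gap]
```

With match 1, mismatch -1 and gap -2, aligning "ACGT" with itself scores 4, and aligning "ACGT" with "AGT" scores 1 (three matches and one gap). A local or overlap score would differ mainly in how the first column is initialised and where the maximum is read off.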
Generic scoring and selection function for global and local scoring.
Calculate global alignment (Needleman-Wunsch).
Calculate local alignment (Smith-Waterman) (can we replace uncurry max' with fst - a local alignment must always end on a subst, no?).
Calculate best overlap score, where gaps at the edges are free. The starting point is like for local score (0 cost for initial indels), the result is the maximum anywhere in the last column or bottom row of the matrix.
Variant that retains indels to retain the entire sequence in the result.
Generic scoring and selection for global and local alignment.

The Parsec parser type.
ACE header lines with parameters. The tokenizer (scanner) should convert input into a list of these, which in turn can be parsed by Parsec.
Parse a single token, primitive parser.
Test parser p on a list of ACE elements.
Add SourcePoses to a stream of ACEs.
Parse a complete ACE file as a set of assemblies.
parse the initial header
parse the contig and quality information (CO, BQ)
Read a list of Ints in the Maybe monad.
Given the CO info, get the AFS'es.
Parse a list of AFS, followed by actual read, and merge them.
  afs :: Sequence -> AceParser [Sequence] -- plus some auxiliary info?
parse each read (RD, QA, DS). Vector NTI appears to insert solitary RDs, sometimes even without any sequence data!? This is not supported at this point.
Reading an ACE file.

For benchmarking, fixed lengths.
For testing, variable lengths.
Take time (CPU and wall clock) and report it.
Print a CPUTime difference.
Shamelessly stolen from FPS.
Constrained position generators.

Anything, such as a location or a sequence, which lies on a strand and can thus be reverse complemented.
Sequence strand.

Position in a sequence.
0-based index of the position.
Optional strand of the position.
Slide a position by an offset.

Contiguous set of positions in a sequence.
5' end of region on target sequence, 0-based index.
Length of region on target sequence.
Strand of region.
Create a ContigLoc from 0-based starting and ending positions. When start is less than end the position will be on the Fwd Strand, otherwise it will be on the RevCompl strand.
Create a ContigLoc from a Pos.Pos defining the start (ContigLoc 5 prime end) position on the sequence and the length.
The bounds of a ContigLoc, a pair of the lowest and highest sequence indices covered by the region, which ignores the strand of the ContigLoc. The first element of the pair will always be lower than the second.
0-based starting (5' in the region orientation) position.
0-based ending (3' in the region orientation) position.
Move a ContigLoc region by a specified offset.
Subsequence SeqData for a ContigLoc, provided that the region is entirely within the sequence.
Subsequence SeqData for a ContigLoc, padded as needed with Ns.
For a Pos and a ContigLoc on the same sequence, find the corresponding Pos relative to the ContigLoc, provided it is within the ContigLoc.
For a Pos specified relative to a ContigLoc, find the corresponding Pos relative to the outer sequence, provided that the Pos is within the bounds of the ContigLoc.
ContigLoc extended on the 5' and 3' ends.
For a Pos and a ContigLoc on the same sequence, is the Pos within the ContigLoc?
For a pair of ContigLoc regions on the same sequence, indicates if they overlap at all.

General (disjoint) sequence region consisting of a concatenated set of contiguous regions.
Length of the region.
The bounds of a Loc, consisting of the lowest and highest sequence indices lying within the region. The first element of the pair will always be lower than the second.
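The contiguous-region operations described above can be illustrated with a small stand-alone sketch. The record below is a hypothetical simplification, not the library's ContigLoc: it assumes both end positions are inclusive, that the stored offset is the lowest index covered regardless of strand, and that all regions lie on the same sequence.

```haskell
data StrandSketch = Fwd | RevCompl
  deriving (Eq, Show)

-- Hypothetical simplified region: lowest covered index, length, strand.
data Region = Region
  { regionLow    :: Int
  , regionLen    :: Int
  , regionStrand :: StrandSketch
  } deriving (Eq, Show)

-- Build a region from 0-based, inclusive start and end positions.
-- start < end gives a forward region; otherwise reverse complement.
fromStartEndSketch :: Int -> Int -> Region
fromStartEndSketch start end
  | start < end = Region start (end - start + 1) Fwd
  | otherwise   = Region end (start - end + 1) RevCompl

-- Lowest and highest indices covered, ignoring the strand.
boundsSketch :: Region -> (Int, Int)
boundsSketch (Region low len _) = (low, low + len - 1)

-- 5' position in the region's own orientation.
startPosSketch :: Region -> Int
startPosSketch r = case regionStrand r of
  Fwd      -> fst (boundsSketch r)
  RevCompl -> snd (boundsSketch r)

-- Two regions overlap if their index ranges share a position.
overlapsSketch :: Region -> Region -> Bool
overlapsSketch a b = loA <= hiB && loB <= hiA
  where
    (loA, hiA) = boundsSketch a
    (loB, hiB) = boundsSketch b
```

For example, fromStartEndSketch 10 5 yields a reverse-complement region whose bounds are (5, 10) and whose 5' start position is 10.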
0-based starting (5' in the region orientation) offset of the region on its sequence.
0-based ending (3' in the region orientation) offset of the region on its sequence.
Subsequence SeqData for a Loc, provided that the region is entirely within the sequence.
For a Pos and a Loc region on the same sequence, find the corresponding Pos relative to the region, if the Pos is within the region. If the Loc region has redundant positions for a given sequence position, the first is returned.
For a Loc region on a sequence and a Pos relative to the region, find the corresponding Pos on the sequence, provided that the position is within the bounds of the region.
Extend a Loc region by incorporating contiguous nucleotide regions of the specified lengths on the 5' and 3' ends.
For a Pos and a Loc on the same sequence, does the position fall within the Loc region?
For a pair of Loc regions on the same sequence, do they overlap at all?

Collection mapping a collection of Loc locations, possibly overlapping, binned for efficient lookup by position.
Create an empty LocMap with a specified position bin size.
Create a LocMap from an associated list.
Add an object with an associated Loc sequence region.
Find the (possibly empty) list of sequence regions and associated objects that contain a Pos position, in the sense of withinLoc.
Find the (possibly empty) list of sequence regions and associated objects that overlap a Loc region, in the sense of overlapsLoc.
Remove a region / object association from the map, if it is present. If it is present multiple times, only the first occurrence will be deleted.
Remove the first region / object association satisfying a predicate function.
Read nt in reference strand orientation.
Reference nt in reference strand orientation.
Offset from reference strand 5' end in reference strand orientation.
Quality score of read nt.
Alignment output from SOAP.
Reference strand orientation sequence.
Reference strand orientation quality data.
1-based index, as output by SOAP, of reference strand 5' end.

Progressive multiple alignment. Calculate a tree from agglomerative clustering, then align at each branch going bottom up. Returns a list of columns (rows?).
Derive alignments indirectly, i.e. calculate A|C using alignments A|B and B|C. This is central for Coffee evaluation of alignments, and T-Coffee construction of alignments.

bio-0.3.5
Bio.GFF3.Escape Bio.Util.Parsex Bio.Util Bio.Clustering Bio.Alignment.BlastData Bio.Alignment.Blast Bio.Alignment.BlastXML Bio.Alignment.BlastFlat Bio.Sequence.GeneOntology Bio.Sequence.KEGG Bio.Sequence.GOA Bio.Sequence.Entropy Bio.Sequence.SeqData Bio.Sequence.Fasta Bio.Sequence.FastQ Bio.Sequence.TwoBit Bio.Sequence.Phd Bio.Sequence.HashWord Bio.Sequence.SFF Bio.Alignment.AlignData Bio.Alignment.Matrices Bio.Alignment.SAlign Bio.Alignment.AAlign Bio.Alignment.QAlign Bio.Alignment.ACE Bio.Util.TestBase Bio.Location.Strand Bio.Location.Position Bio.Location.ContigLocation Bio.Location.Location Bio.Location.LocMap Bio.Location.OnSeq Bio.Location.SeqLocation Bio.Alignment.Soap Bio.Location.SeqLocMap Bio.GFF3.Feature Bio.GFF3.FeatureHier Bio.GFF3.FeatureHierSequences Bio.GFF3.SGD Bio.Alignment.Multiple
base Data.Ord
Data.Listbytestring-0.9.1.4Data.ByteString.Lazy.Char8Prelude Bio.SequenceunEscapeByteStringescapeByteString escapeAllBut escapeAllOflazyManylinesmylines splitWhencountIO sequence' ClusteredLeafBranch cluster_sl BlastMatchbitse_validentityq_fromq_toh_fromh_toauxBlastHitsubjectslengthmatches BlastRecordqueryqlengthhits BlastResult blastprogram blastversion blastdateblastreferencesdatabase dbsequencesdbcharsresultsAuxFrameStrandsStrandMinusPlusSeqIdparsereadXML BlastFlatflatten EvidenceCodeNRTASRCANDNASISSIPIIMPIGIIGCIEPIEAIDAIC AnnotationAnn UniProtAccGoDefGoClassCompProcFuncGoTermGO GoHierarchyreadOboreadGOA readTerms decomment isCuratedKO genReadKeggdecodeUPdecodeKO removePrefixreadGOKWordskwordsentropyAminoXaaXleGlxAsxSTPValTrpTyrThrSerProPheMetLysLeuIleHisGlyGluGlnCysAspAsnArgAlaSequenceSeqQualDataQualSeqDataOffsetfromStrtoStr!? seqlengthseqlabel seqheaderseqdataseqqualhasqual appendHeader setHeaderrevcomplcompl translatetoIUPAC fromIUPAC readFasta writeFastareadQual writeQual readFastaQualwriteFastaQualhWriteFastaQual hReadFasta hWriteFasta hWriteQualmkSeqs countSeqs readFastQ hReadFastQ writeFastQ hWriteFastQunparse decode2Bitread2Bit hRead2BitreadPhdhReadPhdShapeHashFHFhashhashesksortgenkeys contigousrcontigcompactrcpackedgappedisNn2kn2i'k2nvalunval complement ReadBlock read_headerflowgram flow_indexbasesquality ReadHeader name_length num_basesclip_qual_leftclip_qual_rightclip_adapter_leftclip_adapter_right read_name CommonHeader index_offset index_length num_reads key_length flow_length flowgram_fmtflowkeySFFIndexFlowreadSFF sffToSequencewriteSFFtestconvertSelectorSubstMxEditListEditReplDelInsChr AlignmentGapsDirRevFwd extractGaps insertGaps toStringsisReplevalcolumnsblosum45blosum62blosum80pam30pam70blastn_defaultsimpleMx global_score local_score global_align local_alignqualMx overlap_score overlap_alignAssemblyAsmcontig fragmentsreadsptestreadACEwriteACEEST_setESetEST_longEL EST_shortESProteinPESTqEqESTEQualityQ 
NucleotideNTestTfromNfromQtimeshowTintegralRandomR genOffsetgenNonNegOffsetgenPositiveOffsetStrandedrevComplRevComplstrandedPosoffsetstrandslideseqNt seqNtPaddeddisplay ContigLocoffset5length fromStartEnd fromPosLenboundsstartPosendPosseqData seqDataPaddedposIntoposOutofextendisWithinoverlapsLocLocMapdefaultZonesizemkLocMapfromListinsert lookupWithinlookupOverlapsdeletedeleteBycheckInvariantsOnSeqsOnSeq onSeqNameonSeqObjSeqName withSeqData andSameSeq onSameSeqperSeq perSeqUpdatewithNameAndSeqSeqLoc ContigSeqLocSeqPos displaySeqPoswithinContigSeqLocdisplayContigSeqLocSoapAlignMismatchSAMreadntrefntqualnt SoapAlignSAnameseququalnhitpairendrefnamerefstart nmismatch mismatchesmismatchSeqPos refSeqPos refCSeqLoc refSeqLoc parseMismatchunparseMismatchgroup SeqLocMapemptyFeatureseqidsourceftypestartendscorephase attributesGFFAttrattrTag attrValuesparseWithFasta attrByTagids parentIds contigLoclocseqLoc FeatureHierfeatureslookupIdlookupIdChildrenparentschildrenparentsM childrenMFeatureHierSequences fromLists sequences getSequencefeatureSequencerunGFFrunGFFIOasksGFF chromosomesgenesrRNAs geneSequence geneSeqLoc geneCDSesnoncodingSequencenoncodingSeqLocnoncodingExons sortExonsnamedSLM geneCDS_SLM progressiveindirectbreaksmkGoDefmkAnngetECblocks GHC.Classes>unfoldr write2Bit hWrite2BitmkPhdk2n'padskipputRB decodeArrayGHC.Num*- showalignong_scorel_scoreminf score_select align_select QSelectoroverlap_align' AceParserACEparse1aceasctgreadIntsasmafrd