Bio.PriorityQueue

A Priority Queue that can fall back to external storage. Note that such a Priority Queue automatically gives rise to an external sorting algorithm: enqueue everything, dequeue until empty.

Whatever is to be stored in this queue needs to be in Binary, because it may need to be moved to external storage on demand. We also need a way to estimate the memory consumption of an enqueued object. When constructing the queue, the maximum amount of RAM to consume is set. Note that open input streams use memory for buffering, too.

Enqueued objects are kept in an in-memory heap until the memory consumption becomes too high. At that point, the whole heap is sorted and dumped to external storage. If necessary, the file to do so is created and kept open. The newly created stream is added to a heap so that dequeueing objects amounts to performing a merge sort on multiple external streams. To conserve file descriptors, we concatenate multiple streams into a single file, then use pread(2) on that as appropriate. If too many streams are open (how do we set that limit?), we do exactly that: merge-sort all streams and the in-memory heap into a single new stream. One file is created for each generation of streams, so that merging handles streams of roughly equal length.

XXX Truth be told, this queue isn't backed externally, and ignores all limits. It *is* a Priority Queue, though!

XXX May want to add callbacks for significant events (new file, massive merge, deletion of file?).

XXX Need to track memory consumption of input buffers.

XXX Need a way to decide when too many streams are open. That point is reached when seeking takes about as much time as reading (which depends on buffer size and system characteristics), so that an additional merge pass becomes economical.

XXX These will be useful: unix-bytestring:System.Posix.IO.ByteString.fdPread, temporary:System.IO.Temp.openBinaryTempFile, lz4:Codec.Compression.LZ4.

PQ_Conf holds the memory limit and the path to temporary files (a directory will be created); functions to report progress go here, too.

makePQ creates a priority queue. Note that the priority queue creates files, which will only be cleaned up if deletePQ is called. deletePQ deletes the priority queue and all associated temporary files.

enqueuePQ enqueues an element. This operation may result in the creation of a file or in an enormous merge of already created files.

dequeuePQ removes the minimum element from the queue. If the queue is already empty, nothing happens. As a result, it is possible that one or more files become empty and are deleted.

peekMinPQ and getMinPQ return the minimum element from the queue: Nothing if the queue is empty, else the minimum element currently in the queue.
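Below is a minimal sketch of the external sort this queue gives rise to: enqueue everything, then dequeue until empty. The names come from this module, but the exact signatures of withPQ, enqueuePQ, peekMinPQ and dequeuePQ (and the full constraint set, which should include Binary and Sizeable) are assumptions here, not checked against the implementation.

    import Bio.PriorityQueue   -- signatures below are assumed

    -- External sort: enqueue everything, dequeue until empty.  Spilling
    -- sorted runs to temporary files would happen inside the queue
    -- whenever the configured memory limit is exceeded.
    externalSort :: Ord a => PQ_Conf -> [a] -> IO [a]
    externalSort conf xs =
        withPQ conf $ \pq -> do
            mapM_ (enqueuePQ pq) xs            -- may create or merge files
            let drain acc = peekMinPQ pq >>= \mm -> case mm of
                    Nothing -> return (reverse acc)
                    Just m  -> dequeuePQ pq >> drain (m : acc)
            drain []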
Bio.Align

Mode is the mode argument for myersAlign; it determines where free gaps are allowed. Globally aligns globally, without gaps at either end; HasPrefix aligns so that the second sequence is a prefix of the first; IsPrefix aligns so that the first sequence is a prefix of the second.

myersAlign aligns two strings: 'myersAlign maxd seqA mode seqB' tries to align seqA to seqB, which will work as long as no more than maxd gaps or mismatches are incurred. The mode argument determines if either of the sequences is allowed to have an overhanging tail. The result is the triple of the actual distance (gaps + mismatches) and the two padded sequences; these are the original sequences with dashes inserted for gaps.

The algorithm is the O(nd) algorithm by Myers, implemented in C. A gap and a mismatch score the same. The strings are supposed to code for DNA; the code understands IUPAC ambiguity codes. Two characters match iff there is at least one nucleotide both can code for. Note that N is a wildcard, while X matches nothing.

showAligned nicely prints an alignment. An alignment is simply a list of strings with inserted gaps to make them align. We split them into manageable chunks, stack them vertically and add a line showing asterisks in every column where all aligned strings agree. The result is almost the Clustal format.

Bio.Util

Random useful stuff I didn't know where to put.

wilson calculates the Wilson score interval. If (l,m,h) = wilson c x n, then m is the binomial proportion and (l,h) its c-confidence interval for x positive examples out of n observations. c is typically something like 0.05.

estimateComplexity tries to estimate the complexity of a whole from a sample. Suppose we sampled 'total' things and among those 'singles' occurred only once. How many different things are there?

Let the total number be m. The copy number follows a Poisson distribution with parameter lambda. Let z := e^lambda; then we have:

    P(0)    = e^(-lambda)         = 1/z
    P(1)    = lambda e^(-lambda)  = ln z / z
    P(>=1)  = 1 - e^(-lambda)     = 1 - 1/z

    singles = m ln z / z
    total   = m (1 - 1/z)
    D      := total / singles = (1 - 1/z) z / ln z
    f(z)   := z - 1 - D ln z = 0

To get z, we solve using Newton iteration and then substitute to get m:

    f'(z) = 1 - D/z
    z'    = z - z (z - 1 - D ln z) / (z - D)
    m     = singles * z / ln z

It converges as long as the initial z is large enough; 10 D (used for the initial z in the implementation) appears to work well.

phredplus computes -10 * log_10 (10 ** (-x/10) + 10 ** (-y/10)) without losing precision. Used to add numbers on "the Phred scale", otherwise known as (deci-)bans.

phredminus computes -10 * log_10 (10 ** (-x/10) - 10 ** (-y/10)) without losing precision. Used to subtract numbers on "the Phred scale", otherwise known as (deci-)bans.

phredsum computes -10 * log_10 (sum [10 ** (-x/10) | x <- xs]) without losing precision.

phredconverse computes 1-p without leaving the "Phred scale".
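As a stand-alone illustration of addition on the Phred scale, one can factor out the smaller Phred value (the larger probability) and use log1p — like the one documented below — so nothing is lost when the two values are far apart. This is an independent sketch of the idea, not necessarily the implementation used here.

    import GHC.Float (log1p)   -- base >= 4.9; this module rolls its own

    -- -10 * log10 (10**(-x/10) + 10**(-y/10)), computed stably.
    phredPlus :: Double -> Double -> Double
    phredPlus x y = m - 10 / log 10 * log1p (10 ** ((m - n) / 10))
      where m = min x y   -- the more probable of the two summands
            n = max x y

    -- phredPlus 10 10  ==  10 - 10 * logBase 10 2  ~  6.99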
log1p computes log (1+x) to a relative precision of 10^-8 even for very small x. Stolen from http://www.johndcook.com/cpp_log_one_plus_x.html

expm1 computes exp x - 1 to a relative precision of 10^-10 even for very small x. Stolen from http://www.johndcook.com/cpp_expm1.html

choose is the binomial coefficient: n `choose` k == n! / ((n-k)! k!)

float2mini converts to 0.4.4 format minifloat: this minifloat fits into a byte. It has no sign, four bits of precision, and the range is from 0 to 63488, initially in steps of 1/8. Nice to store quality scores with reasonable precision and range. mini2float converts from 0.4.4 format minifloat, see float2mini.

Bio.Base

Nucleotide is a nucleotide base. We only represent A,C,G,T.

Nucleotides is a nucleotide base in an alignment. Experience says we're dealing with Ns and gaps all the time, so purity be damned, they are included as if they were real bases. To allow Nucleotides to be unpacked and incorporated into containers, we choose to represent them the same way as the BAM file format: as a 4 bit wide field. Gaps are encoded as 0 where they make sense, N is 15.

Qual: qualities are stored in deciban, also known as the Phred scale. To represent a value p, we store -10 * log_10 p. Operations work directly on the "Phred" value, as the name suggests. The same goes for the Ord instance: greater quality means a higher "Phred" score, which means a lower error probability.

Prob is a positive value stored in log domain. We store the natural logarithm (makes computation easier), but allow conversions to the familiar "Phred" scale used for Qual values.

Range represents ranges in genomes. We combine a position with a length. In 'Range pos len', pos is always the start of a stretch of length len. Positions therefore move in the opposite direction on the reverse strand. To get the same stretch on the reverse strand, shift r_pos by r_length, then reverse direction (or call reverseRange).

Position represents coordinates in a genome. The position is zero-based, no questions about it. Think of the position as pointing to the crack between two bases: looking forward you see the next base to the right, looking in the reverse direction you see the complement of the first base to the left. To encode the strand, we (virtually) reverse-complement any sequence and prepend it to the normal one. That way, reversed coordinates have a negative sign and automatically make sense. Position 0 could either be the beginning of the sequence or the end on the reverse strand... that ambiguity shouldn't really matter. The fields are p_seq, the sequence (e.g. some chromosome), and p_start, the zero-based offset.

Seqid: sequence identifiers are ASCII strings. Since we tend to store them for a while, we use strict byte strings. If you get a lazy bytestring from somewhere, use 'shelve' to convert it for storage. Use unpackSeqid and packSeqid to avoid the import of Data.ByteString.

unpackSeqid unpacks a Seqid into a String; packSeqid packs a String into a Seqid and only works for the ASCII subset.

toNucleotide and toNucleotides convert a character into a Nucleotide or Nucleotides, respectively. The usual codes for A,C,G,T and U are understood, '-' and '*' become gaps, and everything else is an N.

isBase tests if a Nucleotides is a base; it returns True for everything but gaps. isProperBase tests if a Nucleotides is a proper base; it returns True for A,C,G,T only. isGap tests if a Nucleotides is a gap and returns True only for the gap.

compl complements a Nucleotide; compls complements a Nucleotides.

shiftPosition moves a Position. The position is moved forward according to the strand; negative indexes move backward accordingly. shiftRange moves a Range; this is just shiftPosition lifted.

reverseRange reverses a Range to give the same Range on the opposite strand.

extendRange extends a range; the length of the range is simply increased.

insideRange expands a subrange: (range1 `insideRange` range2) interprets range1 as a subrange of range2 and computes its absolute coordinates. The sequence name of range1 is ignored.

wrapRange wraps a range to a region. This simply normalizes the start position to be in the interval [0,n), which only makes sense if the Range is to be mapped onto a circular genome. This works on both strands and the strand information is retained.

findAuxFile finds a file by searching the environment variable BIOHAZARD like a PATH.
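A hypothetical re-implementation of the search just described — try the name as given, then each directory listed in $BIOHAZARD, PATH-style — might look as follows. Whether the real function checks the bare name first is an assumption.

    import System.Directory   (doesFileExist)
    import System.Environment (lookupEnv)
    import System.FilePath    (splitSearchPath, (</>))

    -- Search $BIOHAZARD like a PATH for the named file.
    findAux :: FilePath -> IO (Maybe FilePath)
    findAux f = do
        dirs <- maybe [] splitSearchPath <$> lookupEnv "BIOHAZARD"
        firstExisting (f : map (</> f) dirs)
      where
        firstExisting []     = return Nothing
        firstExisting (p:ps) = do e <- doesFileExist p
                                  if e then return (Just p) else firstExisting ps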
Bio.Bam.Header

Refs is a list of reference sequences.

BamSQ is a reference sequence in Bam. Bam enumerates the reference sequences and then sorts by index. We need to track that index if we want to reproduce the sorting order.

BamSorting lists the possible sorting orders from the bam header. Thanks to samtools, which doesn't declare sorted files properly, we have to have the stupid Unknown state, too.

BamKey is exactly two characters, for the "named" fields in bam.

isValidRefseq tests whether a reference sequence is valid; it returns true unless the argument equals invalidRefseq. invalidRefseq is the invalid Refseq, which Bam uses to encode a missing reference sequence.

invalidPos is the invalid position, which Bam uses to encode a missing position. isValidPos tests whether a position is valid; it returns true unless the argument equals invalidPos.

noRefs is the empty list of references. Needed for BAM files that don't really store alignments.

compareNames compares two sequence names the way samtools does. samtools sorts by "strnum_cmp":

- if both strings start with a digit, parse the initial sequence of digits and compare numerically; if equal, continue behind the numbers;
- else compare the first characters (possibly NUL); if equal, continue behind them;
- else both strings ended, and the shorter one counts as smaller (and that part is stupid).

readMd and showMd normalize a series of MdOps and encode them in the way BAM and SAM expect it.

distinctBin computes the "distinct bin" according to the BAM binning scheme. If an alignment starts at pos and its CIGAR implies a length of len on the reference, then it goes into bin 'distinctBin pos len'.
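For reference, this is the binning function from the SAM specification; a call 'distinctBin pos len' presumably corresponds to 'reg2bin pos (pos+len)', though that correspondence is an assumption.

    import Data.Bits (shiftL, shiftR)

    -- reg2bin from the SAM spec: bin number for a zero-based,
    -- half-open interval [beg,end).
    reg2bin :: Int -> Int -> Int
    reg2bin beg end' = go
      where
        end = end' - 1    -- make the end inclusive, as the spec does
        go | beg `shiftR` 14 == end `shiftR` 14 = ((1 `shiftL` 15) - 1) `div` 7 + beg `shiftR` 14
           | beg `shiftR` 17 == end `shiftR` 17 = ((1 `shiftL` 12) - 1) `div` 7 + beg `shiftR` 17
           | beg `shiftR` 20 == end `shiftR` 20 = ((1 `shiftL`  9) - 1) `div` 7 + beg `shiftR` 20
           | beg `shiftR` 23 == end `shiftR` 23 = ((1 `shiftL`  6) - 1) `div` 7 + beg `shiftR` 23
           | beg `shiftR` 26 == end `shiftR` 26 = ((1 `shiftL`  3) - 1) `div` 7 + beg `shiftR` 26
           | otherwise                          = 0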
Bio.Bam.Regions

Subsequence is a mostly contiguous subset of a sequence, stored as a set of non-overlapping intervals in an IntMap from start position to end position (half-open intervals, naturally).

Regions is a subset of a genome. The idea is to map the reference sequence (represented by its number) to a Subsequence.

Bio.Iteratee

groupStreamOn groups on Iteratees: 'groupStreamOn proj inner outer' executes 'inner (proj e)', where e is the first input element, to obtain an Iteratee i, then passes elements e to i as long as 'proj e' produces the same result. If 'proj e' changes or the input ends, the pair of 'proj e' and the result of 'run i' is passed to outer. At end of input, the resulting outer is returned.

groupStreamBy groups on Iteratees, too: 'groupStreamBy cmp inner outer' executes inner to obtain an Iteratee i, then passes elements e to i as long as 'cmp e0 e', where e0 is some preceding element, is true. Else, the result of 'run i' is passed to outer and groupStreamBy restarts. At end of input, the resulting outer is returned.

takeStream takes a prefix of a stream, the equivalent of take. headStream takes the first element of a stream or fails.

There is a combinator that runs an Iteratee while collecting the input; when the Iteratee finishes, it returns the result along with *all* the input, which effectively allows lookahead. Be careful: this will eat memory if the Iteratee doesn't return speedily. Another combinator collects a string of a given length; don't use this for long strings, use takeStream instead.

mBind lifts a monadic action and combines it with a continuation: 'mBind m f' is the same as 'lift m >>= f', but does not require a Nullable constraint on the stream type. mBind_ does the same, ignoring the result. Likewise, ioBind lifts an IO action and combines it with a continuation: 'ioBind m f' is the same as 'liftIO m >>= f', but does not require a Nullable constraint on the stream type; ioBind_ ignores the result.

($==) composes an Enumerator' with an Enumeratee, giving a new Enumerator'.

mergeEnums' merges two Enumerator's into one; its arguments are the inner enumerator, the outer enumerator, and the merging enumeratee. The header provided by the inner Enumerator' is passed to the output iteratee, the header provided by the outer Enumerator' is passed to the merging iteratee. XXX Something about those headers is unsatisfactory... there should be an unobtrusive way to combine headers.

concatMapStream applies a function to the elements of a stream and concatenates the results into a stream; no giant intermediate list is produced. concatMapStreamM is the same with a monadic function.

filterStream applies a filter predicate to an Iteratee; filterStreamM applies a monadic filter predicate.

mapStreamM maps a monadic function over an Iteratee; mapStreamM_ does the same, discarding the results. foldStreamM folds a monadic function over an Iteratee; foldStream folds a pure function.

defaultBufSize is the default buffer size in elements. This is 1024 in Data.Iteratee, which is obviously too small. Since we want to merge many files, a read should take more time than a seek; this sets the sensible buffer size to more than about one MB.

parMapChunksIO is a parallel map of an IO action over the elements of a stream. This Enumeratee applies an IO action to every chunk of the input stream; these IO actions are run asynchronously in a limited parallel way. Don't forget to evaluate.

protectTerm protects the terminal from binary junk. If i is an Iteratee that might write binary to stdout, then 'protectTerm i' is the same Iteratee, but it will abort if stdout is a terminal device.

progressNum is a simple progress indicator that prints the number of records.

parserToIteratee converts attoparsec Parsers into Iteratees.

stream2vectorN is equivalent to 'joinI $ takeStream n $ stream2vector', but more efficient. stream2vector reads the whole stream into a Vector.

Bio.Bam.Rec

Extensions is a collection of extension fields. The key is actually only two Chars, but that proved impractical. (Hmm... we could introduce a Key type that is a 16 bit int, then give it an instance IsString... practical?)

BamRaw is a Bam record in its native encoding along with its virtual address; the virtual offset is kept for indexing purposes.

There is a mutable vector type and an immutable vector type that pack two Nucleotides into one byte, just like Bam does.

Cigar is a cigar line in BAM coding. Bam encodes an operation and a length into a single integer; we keep those integers in an array.

alignedLength extracts the aligned length from a cigar line. This gives the length of an alignment as measured on the reference, which is different from the length on the query or the length of the alignment.
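A sketch of the computation alignedLength performs, stated over a textual CIGAR for clarity (the module keeps packed op/length integers instead): M, D and N (and =, X) consume reference, while I, S, H and P do not. The sketch assumes a well-formed CIGAR string.

    import Data.Char (isDigit)

    -- Aligned length on the reference, from a textual CIGAR.
    alignedLength :: String -> Int
    alignedLength = go 0
      where
        go acc [] = acc
        go acc s  = case span isDigit s of
            (num@(_:_), op:rest)
                | op `elem` "MDN=X" -> go (acc + read num) rest  -- consumes reference
                | otherwise         -> go acc rest               -- I, S, H, P
            _                       -> acc                       -- malformed tail: stop

    -- alignedLength "10M2I3D5S"  ==  13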
bamRaw is a smart constructor; it makes sure we got at least a full record.

deleteE deletes all occurrences of some extension field. insertE blindly inserts an extension field; this can create duplicates (and there is no telling how other tools react to that). updateE deletes all occurrences of an extension field, then inserts it with a new value; this is safer than insertE, but also more expensive. adjustE adjusts a named extension by applying a function.

progressPos is a simple progress indicator that prints sequence id and position.

Bio.Bam.Fastq

A parser for FastA/FastQ style files, based on Data.Attoparsec, and written such that it is compatible with module Bio.Bam.Rec. This gives import of FastA/FastQ while respecting some local conventions.

parseFastq reads DNA (not protein) sequences in FastA and FastQ. We read everything vaguely looking like FastA or FastQ, then shoehorn it into a BAM record. We strive to extract information following more or less established conventions from the header, but we won't support everything under the sun. The recognized syntactical warts are converted into appropriate flags and removed. Only the canonical variant of FastQ is supported (qualities stored as raw bytes with base 33).

Supported additional conventions:

- A name suffix of /1 or /2 is turned into the first mate or second mate flag and the read is flagged as paired. Same for name prefixes of F_ or R_, respectively.
- A name prefix of M_ flags the sequence as unpaired and merged; a name prefix of T_ flags the sequence as unpaired and trimmed.
- A name prefix of C_, either before or after any of the other prefixes, is turned into the extra flag XP:i:-1 (result of duplicate removal with unknown duplicate count).
- A collection of tags separated from the name by an octothorpe is removed and put into the fields XI and XJ as text.
- In one of the parser variants only, if the first word of the description has at least four colon separated subfields, the first is used to flag first/second mate, the second is the "QC failed" flag, and the fourth is the index sequence.

Everything before the first sequence header is ignored. Headers can start with > or @; we treat both equally. The first word of the header becomes the read name, the remainder of the header is ignored. The sequence can be split across multiple lines; whitespace, dashes and dots are ignored, IUPAC ambiguity codes are accepted as bases, anything else causes an error. The sequence ends at a line that is either a header or starts with +; in the latter case, that line is ignored and must be followed by quality scores. There must be exactly as many Q-scores as there are bases, followed immediately by a header or end-of-file. Whitespace is ignored.

parseFastq' is the same as parseFastq, but a custom function can be applied to the description string (the part of the header after the sequence name), which can modify the parsed record. Note that the quality field can end up empty.

Bio.Bam.Filter

QualFilter: a quality filter is simply a transformation on BamRecs. By convention, quality filters should set flagFailsQC; a further step can then remove the failed reads. Filtering of individual reads tends to result in mate pairs with inconsistent flags, which in turn will result in lone mates and all sorts of trouble with programs that expect non-broken BAM files. It is therefore recommended to use pairFilter with suitable predicates to do the post processing. (The quality filters here are adapted from the old pipeline.)

pairFilter is a filter/transformation applied to pairs of reads. We supply a predicate to be applied to single reads and one to be applied to pairs; the latter can get incomplete pairs, too, if mates have been separated or filtered asymmetrically.

There is a simple complexity filter aka "Nancy Filter": a read is considered not-sufficiently-complex if the most common base accounts for greater than the cutoff fraction of all non-N bases. Another filter works on order zero empirical entropy: the entropy per base must be greater than the cutoff (see the sketch after this section).

A further filter works on average quality; reads without a quality string pass. And in 'qualityMinimum n q', a read passes if it has no more than n bases with quality less than q; again, reads without a quality string pass.

Finally, there are conversions of quality scores from the old Illumina scale (different formula and offset 64 in FastQ) and from the new Illumina scale (standard formula but offset 64 in FastQ).
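The quantity the entropy filter compares against its cutoff is the order-zero empirical entropy per base. An illustrative computation (assuming a non-empty read):

    import Data.List (group, sort)

    -- Order-zero empirical entropy in bits per base.
    entropyPerBase :: String -> Double
    entropyPerBase s = negate . sum $ map h freqs
      where
        n     = fromIntegral (length s)
        freqs = [ fromIntegral (length g) / n | g <- group (sort s) ]
        h p   = p * logBase 2 p

    -- entropyPerBase "AAAA" == 0.0,  entropyPerBase "ACGT" == 2.0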
Bio.Bam.Trim

trim_3' trims from the 3' end of a sequence: 'trim_3' p b' trims the 3' end of the sequence in b at the earliest position such that p evaluates to true on every suffix that was trimmed off. Note that the 3' end may be the beginning of the sequence if it happens to be stored in reverse-complemented form. Also note that trimming from the 3' end may not make sense for reads that were constructed by merging paired end data (but we cannot take care of that here). Further note that trimming may break dependent information, notably the "mate" information of the mate and many optional fields.

TODO: The MD field is currently removed. It should be repaired instead. Many other fields should be trimmed if present.

trim_low_quality is a trim predicate to get rid of low quality sequence: 'trim_low_quality q ns qs' evaluates to true if all qualities in qs are smaller (i.e. worse) than q.

Bio.Bam.Evan

One fixup deals with abuse of the flags valued 0x800 and 0x1000. We used them for low quality and low complexity, but they have since been redefined. If set, we clear them and store them into the ZD field. It also fixes abuse of the combination of the paired, 1st mate and 2nd mate flags used to indicate merging or trimming; these are canonicalized and stored into the FF field. This function is unsafe on BAM files of unclear origin!

Another fixup repairs typical inconsistencies produced by Bwa: sometimes "mate unmapped" should be set, and we can see it, because we match the mate's coordinates; sometimes "properly paired" should not be set, because one mate is unmapped. This function is generally safe, but needs to be called only on the output of affected (older?) versions of Bwa.

A third function removes syntactic warts from old read names or the read names used in FastQ files.

Bio.Iteratee.Bgzf

Block is one BGZF block: virtual offset and contents. It could also be a block of an uncompressed file, if we want to support indexing of uncompressed BAM or some silliness like that.

decompressPlain "decompresses" a plain file. What's actually happening is that the offset in the input stream is tracked and added to the ByteStrings, giving Blocks. This results in the same interface as decompressing actual Bgzf.

There are decompressors that turn a BGZF stream into a stream of ByteStrings or into a stream of Blocks, optionally working np-fold parallel. A helper decodes a BGZF block header and returns the block size if successful.

isBgzf tests whether a stream is in BGZF format; it does not consume any input. isGzip tests whether a stream is in GZip format; it also returns True on a Bgzf stream, which is technically a special case of GZip.

maxBlockSize is the maximum block size for Bgzf: 64k with some room for headers and uncompressible stuff.

bgzfEofMarker is the EOF marker for BGZF files. This is just an empty string compressed as BGZF; it is appended to BAM files to indicate their end.

Decompressing a collection of strings into a single BGZF block: ideally, we receive one decode chunk from a BGZF file, decompress it, and return it, in the process attaching the virtual address. But we might actually get more than one chunk, depending on the internals of the Iteratees used. If so, we concatenate them; the first gets to assign the address. We then allocate space for the uncompressed data, decompress the chunks we got, compute the CRC for each and check it, and finally convert to a ByteString and emit. We could probably get away with unsafePerformIO'ing everything in here, but then again, we only do this when we're writing output anyway. Hence, run in IO.

Compressing a collection of strings into a single BGZF block: okay, performance was lacking... let's do it again, in a more direct style. We build our block manually.
First we check if the compressed data is going to fit---if not, that's a bug. Then we allocate a buffer, fill it with a dummy header, allocate a ZStream, and compress the pieces we were handed one at a time. We calculate the CRC32, finalize the header, construct a byte string, and return it. We could probably get away with unsafePerformIO'ing everything in here, but then again, we only do this when we're writing output anyway. Hence, run in IO.

getOffset gets the current virtual offset. The virtual address in a BGZF stream contains the offset of the current block in the upper 48 bits and the current offset into that block in the lower 16 bits. This scheme is compatible with the way BAM files are indexed.

There is a runner for Iteratees over ByteStrings used when decompressing BGZF; it adds internal bookkeeping.

compressBgzf' compresses a stream of ByteStrings into a stream of BGZF blocks, in parallel. A helper breaks the stream into chunks suitable to be compressed individually; each chunk on output is represented as a list of ByteStrings, and each list must be reversed and concatenated to be compressed (compressBgzf does that). compressBgzf is like compressBgzf', with sensible defaults.
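The virtual offset scheme described for getOffset is easy to state in code; this is an independent sketch of the arithmetic, not the module's own functions:

    import Data.Bits (shiftL, shiftR, (.&.), (.|.))
    import Data.Word (Word64)

    -- Block offset in the upper 48 bits, in-block offset in the lower 16.
    virtualOffset :: Word64 -> Word64 -> Word64
    virtualOffset block inblock = block `shiftL` 16 .|. (inblock .&. 0xffff)

    splitVirtualOffset :: Word64 -> (Word64, Word64)
    splitVirtualOffset v = (v `shiftR` 16, v .&. 0xffff)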
Bio.Iteratee.ZLib

CompressionStrategy tunes the compression algorithm but does not affect correctness. The default strategy is one option. The filtered strategy is for data produced by a filter (or predictor): such data consists mostly of small values with a somewhat random distribution, and the compression algorithm is tuned to compress them better; the effect is to force more Huffman coding and less string matching, somewhat intermediate between the default and Huffman-only strategies. The Huffman-only strategy forces Huffman encoding only (no string matching).

MemoryLevel specifies how much memory should be allocated for the internal state. It is a tradeoff between memory usage, speed and compression: using more memory allows faster and better compression. The memory used for the internal state, excluding the window, is 512 bits times 2 to the power of the memory level; the total amount of memory used depends on both the WindowBits and the MemoryLevel. The default memory level is 8. The smallest setting (memory level 1, i.e. 1024 bit or 128 B) is slow and reduces the compression ratio; the maximum (memory level 9, an internal state of 256 kibit or 32 KiB) gives optimal compression speed. A specific level has to be between 1 and 9.

WindowBits specifies the size of the compression window. Larger values result in better compression at the expense of higher memory usage. The compression window size is 2 to the power of the value of the window bits; the total memory used depends on the window bits and the MemoryLevel. The window bits have to be between 8 (which corresponds to 256 bit, i.e. 32 B) and 15 (which corresponds to 32 kibit, i.e. 4 KiB). The default window size is 4 KiB.

Method specifies the compression method; "deflate" is so far the only method supported.

CompressionLevel specifies the tradeoff between speed and compression. The default compression level is 6. There is a setting for no compression (just a block copy), one for the fastest compression (with less compression), and one for the best compression (which is slowest); a specific compression level is set by a number from 1 to 9.

Format specifies the format for compression and decompression:

- The gzip format is widely used and uses a header with a checksum and some optional metadata about the compressed file. It is intended primarily for compressing individual files, but is also used for network protocols such as HTTP. The format is described in RFC 1952 (http://www.ietf.org/rfc/rfc1952.txt).
- The zlib format uses a minimal header with a checksum but no other metadata. It is designed for use in network protocols. The format is described in RFC 1950 (http://www.ietf.org/rfc/rfc1950.txt).
- The "raw" format is just the DEFLATE compressed data stream without any additional headers. The format is described in RFC 1951 (http://www.ietf.org/rfc/rfc1951.txt).
- A fourth format decompresses either a zlib or a gzip stream.

DecompressParams is the set of parameters for decompression; for sane defaults see the default value. The window size has to be at least the size the stream was compressed with; the default is the maximum window size - please do not touch it unless you know what you are doing. The output buffer size is the size of the ByteStrings that will be emitted to the inner iteratee (except the last one).

CompressParams is the set of parameters for compression; for sane defaults use the default value. It also carries the size of the output buffer, i.e. the size of the ByteStrings that will be emitted to the inner iteratee (except the last one).

Flush denotes the flush that can be sent to the stream. A synchronising flush means all pending output is flushed and all available input is sent to the inner Iteratee. A full flush flushes all pending output and resets the compression state; this allows restarting from that point if the compressed data was damaged, but it can seriously affect the compression rate, and it may only be used during compression. A block flush, during compression, requests to stop when the next block is emitted; at the beginning it skips the header, if and only if one exists.

Errors in compression and decompression are reported as exceptions: decompression may require a user-supplied dictionary (not supported); there are buffer errors and file errors (denoting a library error), inconsistent stream state, corrupted input data, out-of-memory conditions, version errors, and unexpected or unknown errors (please report those as bugs). An incorrect state denotes an error in the library; incorrect user-supplied parameters (compression level, number of window bits, memory level) are reported separately.

enumDeflate compresses the input and sends it to the inner iteratee. enumInflate decompresses the input and sends it to the inner iteratee; if there is data after the end of the zlib stream, it is left unprocessed. A variant inflates if the GZip format is recognized and passes the data through otherwise.

enumSyncFlush enumerates a synchronising flush: it causes all pending output to be flushed and all available input to be sent to the inner Iteratee. enumFullFlush enumerates a full flush: it flushes all pending output and resets the compression, which allows restarting from this point if the compressed data was corrupted, but can affect the compression rate; it may only be used during compression. enumBlockFlush enumerates a block flush: if the enumerator is compressing, it allows finishing the current block; if it is decompressing, it forces a stop on the next block boundary.
These enumeratees take the format of the input and, where applicable, the compression parameters as arguments.

Bio.Bam.Reader

Parsers for BAM and SAM. We employ an Iteratee interface, and we strive to support everything possible in BAM. The implementation of nucleotides is somewhat lacking: the "=" symbol is not understood.

TO-NOT-DO: a reader for gzipped/bzipped/bgzf'ed SAM. Storing SAM is a bad idea, so why would anyone ever want to compress, much less index it?

decodeAnyBamOrSam checks if a file contains BAM in any of the common forms, then decompresses it appropriately. If the stream doesn't contain BAM at all, it is instead decoded as SAM. Since SAM is next to impossible to recognize reliably, we don't even try: any old junk is decoded as SAM and will fail later.

decodeSam is an Iteratee-style parser for SAM files, designed to be compatible with the BAM parsers. It parses plain uncompressed SAM, nothing else. Since it is supposed to work the same way as the BAM parser, it requires the presence of the SQ header lines; these are stripped from the header text and turned into the symbol table.

decodeSam' is a parser for SAM that doesn't look for a header. It has the advantage that it doesn't stall on a pipe that never delivers data, and the disadvantage that it never reads the header and therefore needs a list of allowed RNAMEs.

A family of functions tests if a data stream is a Bam file; they recognize plain Bam, gzipped Bam and bgzf'd Bam. If a file is recognized as Bam, a decoder (a suitable Enumeratee) for it is returned. These use lookahead internally, so they shouldn't consume anything from the stream.

decodeAnyBam checks if a file contains BAM in any of the common forms, then decompresses it appropriately. We support plain BAM, Bgzf'd BAM, and Gzip'ed BAM.

The recommendation for these functions is to use decodeAnyBam (or decodeAnyBamFile) for any code that can handle BamRaw input, but decodeAnyBamOrSam (or decodeAnyBamOrSamFile) for code that needs BamRec. That way, SAM is supported automatically, and seeking will be supported if possible.

decodeBam decodes a BAM stream into raw entries. Note that the entries can be unpacked using decodeBamEntry. Also note that this is an Enumeratee in spirit; only the BamMeta and Refs need to get passed separately.

Bio.Bam.Index

There is a decoder for only those reads that fall into one of several regions. The strategy: we scan the file mostly linearly, but only those regions that are actually needed.
We filter the decoded stuff so that it actually overlaps our regions.

From the binning index, we get a list of segments per requested region. Using the checkpoints, we prune them: if we have a checkpoint to the left of the beginning of the interesting region, we can move the start of each segment forward to the checkpoint. If that makes a segment empty, it can be dropped.

The resulting segment lists are merged, then traversed. We seek to the beginning of the earliest segment and start decoding. Once the virtual file position leaves the segment or the alignment position moves past the end of the requested region, we move to the next. Moving is a seek if it spans a sufficiently large gap or points backwards; else we just keep going.

A Segment has a start and an end offset, and an "end coordinate" from the originating region.

Checkpoints: each checkpoint is a position with the virtual offset where the first alignment crossing the position is found. In BAI, we get this from the ioffset vector; in CSI we get it from the loffset field: "Given a region [beg,end), we only need to visit chunks whose end file offset is larger than ioffset of the 16kB window containing beg." (Sounds like a marginal gain, though.)

The binning index maps bin numbers to vectors of clusters.

The full index unifies the BAI and CSI styles. In both cases, we have the binning scheme; its parameters are fixed in BAI, but variable in CSI. Checkpoints are created from the linear index in BAI or from the loffset field in CSI. The index carries the minshift and depth parameters from CSI, a best guess at where the unaligned records start, and room for extra stuff (needed for tabix). It holds the records for the binning index, where each bin has a list of segments belonging to it, and the known checkpoints of the form (pos, off), where off is the virtual offset of the first record crossing pos.

There is a merge of two lists of segments: the lists must be sorted, and the merge fuses overlapping segments into one (a sketch follows at the end of this section).

readBamIndex reads any index we can find for a file. If the file name has a .bai or .csi extension, we read it. Else we look for the index by adding such an extension and by replacing the extension with these two, and finally in the file itself. The first file that exists and can actually be parsed is used.

readBaiIndex reads an index in BAI or CSI format, recognized automatically. Note that TBI is supposed to be compressed using bgzip; it must be decompressed before being passed to this function.

readTabix reads a Tabix index. Note that tabix indices are compressed; this is taken care of.

A helper reads the list of segments from an index file and makes sure it is sorted.

eneeBamRefseq seeks to a given sequence in a Bam file and enumerates only those records aligning to that reference. We use the first checkpoint available for the sequence. This requires an appropriate index, and the file must have been opened in such a way as to allow seeking. It enumerates over the BamRaw records of the correct sequence only, and doesn't enumerate at all if the sequence isn't found.

eneeBamUnaligned seeks to the part of a Bam file that contains unaligned reads and enumerates those; it is sort of the dual to eneeBamRefseq. We use the best guess at where the unaligned stuff starts. If no such guess is available, we decode everything.

A helper enumerates one Segment: it seeks to the start offset, unless reading over the skipped part looks cheaper, and enumerates until we either cross the end offset or the maximum position.

Finally, there is random subsampling from a BAM file: if an index exists, this produces an infinite stream taken from random locations in the file.
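A sketch of the segment merge described above, reduced to plain (start, end) pairs — real segments also carry virtual offsets and a maximum coordinate:

    -- Merge two sorted lists of half-open segments, fusing overlaps.
    mergeSegs :: [(Int,Int)] -> [(Int,Int)] -> [(Int,Int)]
    mergeSegs xs ys = fuse (merge xs ys)
      where
        merge as [] = as
        merge [] bs = bs
        merge (a:as) (b:bs) | fst a <= fst b = a : merge as (b:bs)
                            | otherwise      = b : merge (a:as) bs
        fuse ((s1,e1):(s2,e2):rest)
            | s2 <= e1  = fuse ((s1, max e1 e2) : rest)   -- overlap: fuse
        fuse (x:rest)   = x : fuse rest
        fuse []         = []

    -- mergeSegs [(0,5),(10,12)] [(4,8)]  ==  [(0,8),(10,12)]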
Bio.Iteratee.Builder

The buffer used here is garbage collected, so we don't get leaks. Once it has grown to a practical size (and the initial 128k should be very practical), we don't get fragmentation either. We also avoid copies for the most part, since no intermediate ByteStrings, either lazy or strict, have to be allocated.

One function creates a buffer with an initial capacity of ~128k; another ensures a given amount of free space in the buffer by doubling its capacity if necessary. A third sets a mark, which can later be filled in with a record length (used to create BAM records); ending a record fills the length into the field that was previously marked. Terrible things will happen if this wasn't preceded by setting a corresponding mark.

Bio.Bam.Writer

Printers for BAM. We employ an Iteratee interface, and we strive to keep BAM records in their encoded form. This is most compact and often faster, since it saves the time for repeated decoding and encoding, if that's not strictly needed.

There is a writer for SAM format to stdout. This is useful for piping to other tools (say, AWK scripts) or for debugging. No convenience function to send SAM to a file exists, because that's a stupid idea.

Another writer encodes BAM records straight into a dynamic buffer, then BGZF's it. This should be fairly direct and perform well.

There is a writer for BAM encoded stuff to a file. XXX This should(!) write indexes on the side---a simple block index for MapReduce style slicing, a standard BAM index or a name index would be possible. When writing to a file, this makes even more sense than when writing to a Handle.

A writer for BAM encoded stuff to stdout sends uncompressed BAM; this is useful for piping to other tools.

A writer for BAM encoded stuff to a Handle generates BAM with dynamic blocks, then streams them out to the file. XXX This could write indexes on the side---a simple block index for MapReduce style slicing, a standard BAM index or a name index would be possible.

Data.Avro

Avro is the class of types we can embed into the Avro infrastructure. Right now, we can derive a schema, encode to the Avro binary format, and encode to the Avro JSON encoding.

toSchema produces the schema for this type. Schemas are represented as JSON values. The monad is used to keep a table of already defined types, so the schema can refer to them by name. (The concrete argument serves to specify the type; it is not actually used.)

toBin serializes a value to the binary representation. The schema is implied; serialization to related schemas is not supported. fromBin deserializes a value from the binary representation. Right now, no attempt at schema matching is done; the schema must match the expected one exactly.

toAvron serializes a value to the JSON representation. Note that even the JSON format needs a schema for successful deserialization, and here we support only the one implied schema.

Support for Avro: the current status is that we can generate schemas for certain Haskell values, serialize to binary and JSON representations, and write container files using the null codec. The C implementation likes some, but not all of these containers; it's unclear if that's the fault of the C implementation, though. Meanwhile, serialization works for nested sums-of-products, as long as the product uses record syntax and the top level is a plain record. The obvious primitives are supported.

Zig-zag coding is implemented like in Protocol Buffers and Avro, along with its reverse; words of any size are encoded using a variable-length "base 128" encoding, and ints of any size are encoded by combining the zig-zag coding with the base 128 encoding (decoding combines the two inverses).
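An independent sketch of the two codings just mentioned — not the module's own code, but the same scheme Avro and Protocol Buffers specify:

    import Data.Bits (finiteBitSize, shiftL, shiftR, xor, (.&.), (.|.))
    import Data.Word (Word8)

    -- Zig-zag: map signed ints to unsigned so small magnitudes stay small.
    zig :: Int -> Word
    zig x = fromIntegral ((x `shiftL` 1) `xor` (x `shiftR` (finiteBitSize x - 1)))

    zag :: Word -> Int
    zag w = fromIntegral (w `shiftR` 1) `xor` negate (fromIntegral (w .&. 1))

    -- Base-128 varint: little-endian groups of seven bits, the high bit
    -- set on every byte except the last.
    varint :: Word -> [Word8]
    varint w | w < 0x80  = [fromIntegral w]
             | otherwise = fromIntegral (w .&. 0x7f .|. 0x80) : varint (w `shiftR` 7)

    -- map (varint . zig) [0, -1, 1, -2, 2]  ==  [[0],[1],[2],[3],[4]]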
A helper repeatedly applies an Iteratee to a value until end of stream and returns the final value.

A map from Text becomes an Avro map; an unboxed vector becomes an Avro array; a generic vector becomes an Avro array; a list becomes an Avro array. The chunked encoding for lists may come in handy; how to select the chunk size is not obvious, though. The Avro "null" type is represented as the empty tuple.

Bio.Glf

Iterates over a GLF file. In 'get_glf_file per_seq per_file', the enumerator 'per_file genome_name', where genome_name is the name stored in the GLF header, is run once; then the enumeratee 'per_seq glfseq' is iterated over the records in each sequence.

This enumerates the contents of a GLF file and applies suitable Enumeratees to both sequences and records, resulting in an Enumerator of whatever, typically output Strings or records... This type is positively weird and I'm not entirely sure this is the right way to go about it.

Bio.Genocall.Adna

DamageParameters holds the parameters for the universal damage model.

We assume the correct model is either no damage, or single strand damage, or double strand damage. Each of them comes with a probability. It turns out that blending them into one is simply accomplished by multiplying these probabilities onto the deamination probabilities.

For single stranded library prep, only one kind of damage occurs (C to T). It occurs at a low frequency everywhere and at a high frequency in single stranded parts, and the overhang length is distributed exponentially with one parameter at the 5' end and another at the 3' end. (Without UDG treatment, those will be equal. With UDG, they are much smaller and in fact don't literally represent overhangs.)

For double stranded library prep, we get C->T damage at the 5' end and G->A at the 3' end at one rate, and both in the interior at another rate. Everything is symmetric, and therefore the orientation of the aligned read doesn't matter either. Both overhangs follow a distribution with a single parameter.
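A toy rendition of the single-strand deamination rate just described may help: C→T occurs at a low rate everywhere and at a high rate wherever the molecule is still single-stranded, and the chance that the 5' overhang reaches a given position falls off geometrically. The mixture below and the parameter names are illustrative assumptions, not the module's actual formula.

    -- Illustrative only: rate of C->T at position i from the 5' end,
    -- blending the interior rate ("delta") with the overhang rate
    -- ("sigma") by the probability that the overhang covers i.
    c2tRate :: Double  -- interior (double-stranded) rate, "delta"
            -> Double  -- single-stranded overhang rate, "sigma"
            -> Double  -- overhang decay parameter, 0 < lambda < 1
            -> Int     -- position, counted from the 5' end
            -> Double
    c2tRate delta sigma lambda i = delta + (sigma - delta) * pOverhang
      where pOverhang = lambda ^ (i + 1)   -- P(overhang extends past i), assumed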
This module covers things specific to ancient DNA, e.g. damage models. For aDNA, we need a substitution probability. We have three options: use an empirically determined PSSM, use an arithmetically defined PSSM based on the Johnson model, or use a context sensitive PSSM based on the Johnson model and an alignment. Using Dindel, actual substitutions relative to a called haplotype would be taken into account. Since we're not going to do that, taking alignments into account is difficult, somewhat approximate, and therefore not worth the hassle.

We represent substitution matrices by a dedicated type. Internally, this is a vector of packed vectors. Conveniently, each of the packed vectors represents all transitions into a given nucleotide. A convenience function gives access to a substitution matrix with a mnemonic reading.

A DamageModel is a function that gives substitution matrices for each position in a read. The DamageModel can depend on the length of the read and whether its alignment is reversed. In practice, we should probably memoize precomputed damage models somehow.

There is a DamageModel for undamaged DNA, where the likelihoods follow directly from the quality score. This needs elaboration to see what to do with ambiguity codes (even though those haven't actually been observed in the wild). A generic substitution matrix has C->T and G->A deamination as parameters; setting either rate to 0 as appropriate makes it apply to the single stranded or undamaged case.

Bio.Bam.Pileup

We need a simple priority queue; a skew heap (specialized to strict Int priorities) does the job.

The pileup logic keeps a current coordinate (just two integers) and two running queues: one of active reads that contribute to current genotype calling, and one of waiting reads that will contribute at a later point. A note on the state we drag along: the active queue is a simple stack. We add at the front when we encounter reads, which reverses them. When traversing it, we traverse reads backwards, but since we accumulate the pile, it gets reversed back. The new active queue, however, is no longer reversed (as it should be), so after the traversal, we reverse it again. (Yes, it is harder to understand than using a proper deque type, but it is cheaper. There may not be much point in the reversing, though.)

Oppan continuation passing style! Not only is the CPS version of the state monad (we have five distinct pieces of state) somewhat faster, we also need CPS to interact with the mechanisms of Iteratee. It makes implementing peek, bump and friends straightforward.

Running pileup results in a series of piles. A Pile has the basic statistics of a variant call, but no GL values and a pristine list of variants instead of a proper call. We emit one pile with two base piles (one for each strand) and one indel pile (the one immediately following) at a time.

Genotype likelihood values: a variant call consists of a position, some measure of qualities, genotype likelihood values, and a representation of variants. A note about the GL values: VCF would normalize them so that the smallest one becomes zero. We do not do that here, since we might want to compare raw values for a model test. We also store them in a vector to make arithmetic easier; normalization is appropriate when converting to VCF. If GL is given, we follow the same order used in VCF: "the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

There are also statistics about a genotype call. They are probably only useful for filtering (so not very useful), but we keep them because it's easy to track them.
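The VCF ordering just quoted is a one-liner to compute; a small sketch:

    -- Index of the diploid genotype with allele indices j <= k:
    -- F(j,k) = k*(k+1)/2 + j.
    glIndex :: Int -> Int -> Int
    glIndex j k | j <= k    = k * (k + 1) `div` 2 + j
                | otherwise = glIndex k j

    -- With alleles A=0, B=1, C=2 this yields AA,AB,BB,AC,BC,CC:
    -- map (uncurry glIndex) [(0,0),(0,1),(1,1),(0,2),(1,2),(2,2)] == [0..5]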
DamagedBase represents our knowledge about a certain base, which consists of the base itself (A,C,G,T, encoded as 0..3; no Ns), the quality score (anything that isn't A,C,G,T becomes A with quality 0), and a substitution matrix representing post-mortem but pre-sequencing substitutions. Unfortunately, none of this can be rolled into something more simple, because damage and sequencing error behave so differently.

The primitive pieces for genotype calling are a position, a base represented as four likelihoods, an inserted sequence, and the length of a deleted sequence. The logic is that we look at a base followed by some indel, and all those indels are combined into a single insertion and a single deletion. The constructors cover: a skip to a position (at the start or after an N operation); further chunks, each carrying four likelihoods, the map quality, the strand, and the number of bases to wait due to a deletion; an observed deletion and insertion between two bases; and the end of the record (nothing anymore).

decompose decomposes a BAM record into chunks suitable for piling up. We pick apart the CIGAR field and combine it with sequence and quality as appropriate. We ignore the MD field, even if it is present. Clipped bases are removed or skipped as appropriate. We also ignore the reference allele; in fact, we don't even know it, which nicely avoids any possible reference bias by construction. But we do apply a substitution matrix to each base, which must be supplied along with the read.

pileup is the pileup enumeratee: it takes BamRaws, decomposes them, interleaves the pieces appropriately, and generates Piles. The output will contain at most one base pile and one indel pile for each position; piles are sorted by position.

The top level driver receives the decomposed pieces. Unaligned reads and duplicates are skipped (but not those merely failing quality checks). Processing stops when the first read with an invalid br_rname is encountered or at end of file.

peek inspects the next input element, if any: it returns Just b if b is the next input element, or Nothing if no such element exists, and it waits for more input if nothing is available immediately. bump discards the next input element, if any; it does nothing if the input has already ended, and waits for input to discard if nothing is available immediately. The rest is the actual pileup algorithm.

Bio.Genocall

Simple indel calling. We don't bother with it too much, so here's the gist: we collect variants (simply different variants, details don't matter), so n variants give rise to (n+1)*n/2 GL values. (That's two out of (n+1); the reference allele, represented here as no deletion and no insertion, is there, too.) To assign these, we need a likelihood for an observed variant given an assumed genotype.

For variants of equal length, the likelihood is the sum of qualities of mismatching bases, but no higher than the mapping quality. That is roughly the likelihood of getting the observed sequence even though the real sequence is a different variant. For variants of different length, the likelihood is the map quality. This corresponds to the assumption that indel errors in sequencing are much less likely than mapping errors. Since this is hardly our priority, the approximations are declared good enough.

Naive SNP calling is essentially the GATK model: we create a function that computes a likelihood for a given base, then hand over to the simple call. Since everything is so straightforward, this works even in the face of damage.

Computing GL values for the simple case: the simple case is where we sample 'ploidy' alleles with equal probability and assume that errors occur independently from each other. The argument pls is a function that computes the likelihood for getting the current read, for every variant, assuming that variant was sampled. NOTE: this may warrant specialization to diploidy and four alleles (common SNPs) and diploidy and two alleles (common indels).

We can make a list of genotypes, each represented as a vector of allele probabilities, from the ploidy and four possible alleles. This makes the most sense for SNPs. The implied order of alleles is A,C,G,T, and the resulting genotype vectors can straightforwardly be multiplied with a substitution matrix to give a sensible result. (Something similar for indels could be imagined, but doesn't seem all that useful; we specialize for SNPs to get simpler types and efficient code.)
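A sketch of such a genotype list for the diploid case, using plain lists where the module presumably uses unboxed vectors:

    -- Diploid genotypes over the alleles A,C,G,T as allele-probability
    -- vectors, in VCF order (AA, AB, BB, AC, BC, CC, ...).
    diploidGenotypes :: [[Double]]
    diploidGenotypes = [ mix j k | k <- [0..3], j <- [0..k] ]
      where
        mix j k       = [ 0.5 * indicator j i + 0.5 * indicator k i | i <- [0..3] ]
        indicator a i = if a == i then 1 else 0

    -- head diploidGenotypes == [1,0,0,0]   (homozygous A)
    -- diploidGenotypes !! 1 == [0.5,0.5,0,0]   (heterozygous A/C)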
Again, "for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

There is also a SNP caller following the maq/samtools/bsnp model, where a matrix counts, approximately, how many errors we made.

Bio.Bam.Rmdup

Argh, this business with the CIGAR operations is a mess, and it gets worse when combined with MD. Okay, we will support CIGAR (no "=" and "X" operations) and MD. If we have MD on input, we generate it on output, too. And in between, we break everything into very small operations. (Yes, the two terminating constructors are a weird hack.)

rmdup removes duplicates from an aligned, sorted BAM stream. The incoming stream must be sorted by coordinate, and we check for violations of that assumption. We cannot assume that length was taken into account when sorting (samtools doesn't do so), so duplicates may be separated by reads that start at the same position but have different length or different strand.

We are looking at three different kinds of reads: paired reads, true single ended reads, and merged or trimmed reads. They are somewhat different, but here's the situation if we wanted to treat them separately. These conditions define a set of duplicates:

- Merged or trimmed: we compare the leftmost coordinates and the aligned length. If the library prep is strand-preserving, we also compare the strand.
- Paired: we compare both leftmost coordinates (b_pos and b_mpos). If the library prep is strand-preserving, only first mates can be duplicates of first mates; else a first mate can be the duplicate of a second mate. There may be pairs with one unmapped mate. This is not a problem, as they get assigned synthetic coordinates and will be handled smoothly.
- True singles: we compare only the leftmost coordinate. It does not matter if the library prep is strand-preserving; the strand always matters.

Across these classes, we can see more duplicates:

- Merged/trimmed and paired: these can be duplicates if the merging failed for the pair. We would need to compare the outer coordinates of the merged reads to the two 5' coordinates of the pair. However, since we don't have access to the mate, we cannot actually do anything right here. This case should be solved externally by merging those pairs that overlap in coordinate space.
- Single and paired: in the single case, we only have one coordinate to compare. This will inevitably lead to trouble, as we could find that the single might be the duplicate of two pairs, but those two pairs are definitely not duplicates of each other. We solve it by removing the single read(s).
- Single and merged/trimmed: same trouble as in the single+paired case. We remove the single to solve it.

In principle, we might want to allow some wiggle room in the coordinates. So far, this has not been implemented. It adds the complication that groups of separated reads can turn into a set of duplicates because of the appearance of a new read. Needs some thinking about... or maybe it's not too important.

Once a set of duplicates is collected, we perform a majority vote on the correct CIGAR line. Of all those reads that agree on this CIGAR line, a consensus is called, quality scores are adjusted and clamped to a maximum, the MD field is updated and the XP field is assigned the number of reads in the original cluster. The new MAPQ becomes the RMSQ of the map qualities of all reads.

Treatment of read groups: we generalize by providing a "label" function; only reads that have the same label are considered duplicates of each other. The typical label function would extract read groups, libraries or samples.
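A sketch of a typical label function in that sense: map each read group to its library, so reads only count as duplicates within one library. The record type and the accessor that extracts the RG field are stand-ins, not this module's API.

    import qualified Data.Map as M

    -- Label a read by its library, given an RG -> LB table (as found in
    -- the @RG header lines) and some way to read the RG field.
    labelByLibrary :: M.Map String String       -- read group -> library
                   -> (rec -> Maybe String)     -- hypothetical RG accessor
                   -> rec -> String
    labelByLibrary rg2lib getRG r =
        case getRG r >>= \rg -> M.lookup rg rg2lib of
            Just lib -> lib
            Nothing  -> "--unknown--"   -- unlabelled reads form one group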
The workhorse for duplicate removal works as follows.

Unmapped fragments should not be considered to be duplicates of mapped fragments. The unmapped flag can serve for that: while there are two classes of unmapped reads (those that are not mapped and those that are mapped to an invalid position), the two sets will always have different coordinates. (Unfortunately, correct duplicate removal now relies on correct "unmapped" and "mate unmapped" flags, and we don't get them from unmodified BWA. So correct operation requires patched BWA or a run of bam-fixpair.) Other definitions (e.g. lack of CIGAR) don't work, because that information won't be available for the mate. This would amount to making the unmapped flag part of the coordinate, but samtools is not going to take it into account when sorting. Instead, both flags become part of the mate-pos grouping criterion.

First mates should (probably) not be considered duplicates of second mates. This is unconditionally true for libraries with A/B-style adapters (definitely 454, probably Mathias' ds protocol) and the ss protocol; it is not true for fork adapter protocols (vanilla Illumina protocol). So it has to be an option, which would ideally be derived from header information.

This code ignores read groups, but it will do a majority vote on the RG field and call consensi for the index sequences. If you believe that duplicates across read groups are impossible, you must call it with an appropriately filtered stream.

Half-aligned pairs (meaning one known coordinate, while the validity of the alignments is immaterial) are rather complicated:

- Given that only one coordinate is known (the 5' end of the aligned mate), we want to treat them like true singles. But the unaligned mate should be kept if possible, though it should not contribute to a consensus sequence. We assume nothing about the unaligned mate, not even that it shouldn't be aligned, never mind the fact that it couldn't be. (The difference is in the finite abilities of real world aligners, naturally.)
- Therefore, aligned reads with unaligned mates go into the same potential duplicate set as true singletons. If at least one pair exists that might be a duplicate of those, all singletons and half-aligned mates are discarded. Else a consensus is computed and replaces the aligned mates.
- The unaligned mates end up in the same place in a BAM stream as the aligned mates (therefore we see them and can treat them locally). We cannot call a consensus, since these molecules may well have different lengths, so we select one. It doesn't really matter which one is selected, and since we're treating both mates at the same time, it doesn't even need to be reproducible without local information. The selected mate is made to be the mate of the consensus.

Merging information about true singles, merged singles, half-aligned pairs, and actually aligned pairs: we collect aligned reads with unaligned mates together with aligned true singles (the singles), and we collect the unaligned mates, which necessarily have the exact same alignment coordinates, separately (the unaligned ones). If we don't find a matching true pair (that case is already handled smoothly), we keep the highest quality unaligned mate, pair it with the consensus of the aligned mates and aligned singletons, and give it the lexically smallest name of the half-aligned pairs.
Merging of half-aligned reads. The first argument is a map of unaligned reads (their mates are aligned to the current position); the second is a list of reads that are aligned (their mates are not aligned).

So, suppose we're looking at a … that was passed through. We need to emit it along with its mate, which may be hidden inside a list. (Alternatively, we could force it to single, but that fails if we're passing everything along somehow.)

Suppose we're looking at a …. We could pair it with some mate (which we'd need to duplicate), or we could turn it into a singleton. Duplication is ugly, so in this case we force it to singleton.

Normalize a read's alignment to fall into the canonical region of [0..l]. Takes the name of the reference sequence and its length.

Wraps a read to be fully contained in the canonical interval [0..l]. If the read overhangs, it is duplicated and both copies are suitably masked.

Split an ECig into two at some position. The position is counted in terms of the reference (therefore, deletions count, insertions don't). The parts that would be skipped if we were splitting lists are replaced by soft masks.

Create an MD field from an extended CIGAR and place it in a record. We build it piecemeal (in go), call out to addNum, addRep, addDel to make sure the operations are not generated in a degenerate manner, and finally check whether we're even supposed to create an MD field.

File format for genotype calls.

To output a container file, we need to convert calls into a stream of sensible objects. To cut down on redundancy, each object has a header that names the reference sequence and the start, followed by calls. The calls themselves have contiguous coordinates; we start a new block if we have to skip, and we also start a new block when we feel the current one is getting too large.

Would you believe it? The 2bit format stores blocks of Ns in a table at the beginning of a sequence, then packs four bases into a byte. So it is neither possible nor necessary to store Ns in the main sequence, and you would think they aren't stored there, right? And they aren't. Instead, Ts are stored, which the reader has to replace with Ns.

The sensible way to treat these is probably to just say there are two kinds of implied annotation (repeats and large gaps for a typical genome), which can be interpreted in whatever way fits. And that's why we have … and ….

TODO: use binary search for the Int->Int mappings?

Brings a 2bit file into memory. The file is mmap'ed, so it will not work on streams that are not actual files. It's also unsafe if the file is modified in any way.

Repeats a monadic action n times. Returns the results in reverse(!) order.

Merge blocks of Ns and blocks of Ms into a single list of blocks with masking annotation. Gaps remain. Used internally only.

Extract a subsequence and apply masking. A TwoBit file can represent two kinds of masking (hard and soft), where hard masking is usually realized by replacing everything with Ns and soft masking is done by lowercasing. Here, we take a user-supplied function to apply masking.

Extract a subsequence without masking.

Extract a subsequence with typical masking: soft masking is ignored, hard-masked regions are replaced with Ns.

Extract a subsequence with masking for biologists: soft masking is done by lowercasing, hard masking by printing an N.

Limits a range to a position within the actual sequence.
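To make the byte packing described above concrete, here is a self-contained sketch of decoding one byte of the main sequence. It assumes the standard UCSC 2bit encoding (two bits per base, 0=T, 1=C, 2=A, 3=G, first base in the most significant bits):

> import Data.Bits (shiftR, (.&.))
> import Data.Word (Word8)
>
> -- Four bases per byte, most significant bits first. A stored N decodes
> -- as T here; the reader must patch it using the N-block table.
> unpack2bit :: Word8 -> String
> unpack2bit w = [ "TCAG" !! fromIntegral ((w `shiftR` s) .&. 3)
>                | s <- [6, 4, 2, 0] ]

The N-block and mask-block tables then supply the "implied annotation" that the masking functions apply on top of this raw decoding.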
Sample a piece of random sequence uniformly from the genome. Only pieces that are not hard-masked are sampled; soft masking is allowed, but not reported. On a 32-bit platform, this will fail for genomes larger than 1G bases. However, if you're running this code on a 32-bit platform, you have bigger problems to worry about.

Arguments: the 2bit file, the desired length, and an RNG. Returns the position, the sequence, and the new RNG.
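A self-contained sketch of this sampling strategy, with the genome reduced to a total length and a subsequence lookup function (both stand-ins for the real TwoBit interface): positions are drawn uniformly, draws that hit hard masking are rejected, and the RNG is threaded through exactly as in the documented return value.

> import System.Random (StdGen, randomR)
>
> -- Rejection sampling: hard-masked draws (containing 'N') are retried
> -- with the new RNG; soft masking (lowercase) is accepted as-is.
> sampleSeq :: Int -> (Int -> Int -> String) -> Int -> StdGen
>           -> ((Int, String), StdGen)
> sampleSeq total subseq len g0
>     | 'N' `elem` s = sampleSeq total subseq len g1   -- retry elsewhere
>     | otherwise    = ((p, s), g1)
>   where
>     (p, g1) = randomR (0, total - len) g0
>     s       = subseq p len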