biohazard

Calculates the Wilson Score interval. If (l,m,h) = wilson c x n, then m is the binary proportion and (l,h) its c-confidence interval for x positive examples out of n observations. c is typically something like 0.05.

Try to estimate complexity of a whole from a sample. Suppose we sampled total things and among those singles occurred only once. How many different things are there?

Let the total number be m. The copy number follows a Poisson distribution with parameter lambda.
Let z := e^{\lambda}, then we have:

P( 0 ) = e^{-\lambda} = \frac{1}{z} \\
P( 1 ) = \lambda e^{-\lambda} = \frac{\ln z}{z} \\
P(\ge 1) = 1 - e^{-\lambda} = 1 - \frac{1}{z} \\
\mbox{singles} = m \frac{\ln z}{z} \\
\mbox{total} = m \left( 1 - \frac{1}{z} \right) \\
D := \frac{\mbox{total}}{\mbox{singles}} = \left( 1 - \frac{1}{z} \right) \frac{z}{\ln z} \\
f := z - 1 - D \ln z = 0

To get z, we solve using Newton iteration and then substitute to get m:

df/dz = 1 - D/z \\
z' = z - \frac{ z (z - 1 - D \ln z) }{ z - D } \\
m = \mbox{singles} \cdot \frac{z}{\ln z}

It converges as long as the initial z is large enough, and 10D (in the line for zz below) appears to work well.

Computes \ln \left( e^x + e^y \right) without leaving the log domain and hence without losing precision.

Computes log (1+x) to a relative precision of 10^-8 even for very small x. Stolen from http://www.johndcook.com/cpp_log_one_plus_x.html

Computes e^x - 1 to a relative precision of 10^-10 even for very small x. Stolen from http://www.johndcook.com/cpp_expm1.html

Computes \ln (1 - e^x), following Martin Mächler.

Computes \ln (1 + e^x), following Martin Mächler.

Computes \ln ( \sum_i e^{x_i} ) sensibly. The list must be sorted in descending(!) order.

Computes \ln \left( c e^x + (1-c) e^y \right).

Binomial coefficient: \mbox{choose n k} = \frac{n!}{(n-k)! k!}

A strict pair.

Ranges in genomes. We combine a position with a length. In 'Range pos len', pos is always the start of a stretch of length len. Positions therefore move in the opposite direction on the reverse strand. To get the same stretch on the reverse strand, shift r_pos by r_length, then reverse direction (or call reverseRange).

Coordinates in a genome. The position is zero-based, no questions about it.
Think of the position as pointing to the crack between two bases: looking forward you see the next base to the right, looking in the reverse direction you see the complement of the first base to the left.

To encode the strand, we (virtually) reverse-complement any sequence and prepend it to the normal one. That way, reversed coordinates have a negative sign and automatically make sense. Position 0 could either be the beginning of the sequence or the end on the reverse strand... that ambiguity shouldn't really matter.

sequence (e.g. some chromosome)

offset, zero-based

Common way of using .

A positive floating point value stored in log domain. We store the natural logarithm (makes computation easier), but allow conversions to the familiar "Phred" scale used for quality values.

Qualities are stored in deciban, also known as the Phred scale. To represent a value p, we store -10 * log_10 p. Operations work directly on the "Phred" value, as the name suggests. The same goes for comparison: greater quality means higher "Phred" score, meaning lower error probability.

A nucleotide base in an alignment. Experience says we're dealing with Ns and gaps all the time, so purity be damned, they are included as if they were real bases. To allow Nucleotides to be unpacked and incorporated into containers, we choose to represent them the same way as the BAM file format: as a 4 bit wide field. Gaps are encoded as 0 where they make sense, N is 15. The contained value is guaranteed to be 0..15.

A nucleotide base. We only represent A,C,G,T. The contained value is guaranteed to be 0..3.

Converts a character into a nucleotide. The usual codes for A,C,G,T and U are understood, the gap characters become gaps and everything else is an N.

Tests if a nucleotide is a base. Returns true for everything but gaps.
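The relationship between the log-domain representation and the Phred scale described above can be sketched as follows. This is only an illustration under stated assumptions: the names LogProb, fromProb, toProb, toPhred, fromPhred and addLog are hypothetical and are not the library's API, and the addition uses the plain identity ln (e^x + e^y) = max x y + ln (1 + e^(-|x - y|)) rather than the more careful log(1+x) routine the documentation mentions.

```haskell
-- Hypothetical names throughout; this is a sketch, not the library's API.

-- | A probability stored as its natural logarithm.
newtype LogProb = LogProb Double deriving (Eq, Ord, Show)

-- | Converts a plain probability (0 < p <= 1) into the log domain.
fromProb :: Double -> LogProb
fromProb = LogProb . log

-- | Converts back to a plain probability.
toProb :: LogProb -> Double
toProb (LogProb l) = exp l

-- | Phred scale: -10 * log_10 p, computed from the stored natural log.
toPhred :: LogProb -> Double
toPhred (LogProb l) = (-10) * l / log 10

-- | Inverse of 'toPhred'.
fromPhred :: Double -> LogProb
fromPhred q = LogProb (negate q * log 10 / 10)

-- | Adds two probabilities without leaving the log domain, using
-- ln (e^x + e^y) = max x y + ln (1 + e^(-|x - y|)).
-- Assumes both arguments are finite.
addLog :: LogProb -> LogProb -> LogProb
addLog (LogProb x) (LogProb y) =
    LogProb (max x y + log (1 + exp (negate (abs (x - y)))))
```

Under these definitions an error probability of 0.001 corresponds to a Phred score of 30, and a smaller error probability yields a larger score, which matches the ordering described in the text.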
Tests if a nucleotide is a proper base. Returns true for A,C,G,T only.

Tests if a nucleotide is a gap. Returns true only for the gap.

Complements a Nucleotides.

Moves a Position. The position is moved forward according to the strand, negative indexes move backward accordingly.

Moves a Range. This is just shiftPosition lifted.

Reverses a Range to give the same Range on the opposite strand.

Extends a range. The length of the range is simply increased.

Expands a subrange: interprets range1 as a subrange of range2 and computes its absolute coordinates. The sequence name of range1 is ignored.

Wraps a range to a region. This simply normalizes the start position to be in the interval '[0,n)', which only makes sense if the Range is to be mapped onto a circular genome. This works on both strands and the strand information is retained.

Class of things that can be unpacked into s. Kind of the opposite of G.

Converts Bytes into Text. This uses UTF8, but if there is an error, it pretends it was Latin1. Evil as this is, it tends to Just Work on files where nobody ever wasted a thought on encodings.

Converts Text into Bytes. This uses UTF8.

Decompresses Gzip or Bgzf and passes everything else on. In reality, it simply decompresses Gzip, and when done, looks for another Gzip stream. Since there is a small chance to attempt decompression of an uncompressed stream, the original data is returned in case of an error.
Brings a 2bit file into memory. The file is mmap'ed, so it will not work on streams that are not actual files. It's also unsafe if the file is modified in any way.

Repeat monadic action n times. Returns the result in reverse(!) order, but doesn't build a huge list of thunks in memory.

Merge blocks of Ns and blocks of Ms into a single list of blocks with masking annotation. Gaps remain. Used internally only.

Extract a subsequence and apply masking. A TwoBit file can represent two kinds of masking (hard and soft), where hard masking is usually realized by replacing everything by Ns and soft masking is done by lowercasing.
Here, we take a user supplied function to apply masking.

Works only in forward direction.

Extract a subsequence without masking.

Extract a subsequence with typical masking: soft masking is ignored, hard masked regions are replaced with Ns.

Extract a subsequence with masking for biologists: soft masking is done by lowercasing, hard masking by printing an N.

Limits a range to a position within the actual sequence.

Sample a piece of random sequence uniformly from the genome. Only pieces that are not hard masked are sampled, soft masking is allowed, but not reported. On a 32bit platform, this will fail for genomes larger than 1G bases. However, if you're running this code on a 32bit platform, you have bigger problems to worry about.

Gets a fragment from a 2bit file. The result always has the desired length; if necessary, it is padded with Ns. Be careful about the unconventional encoding: 0..4 == TCAGN

2bit file

desired length

draw random int below limit

RNG

position, sequence, new RNG

Equivalent to (stream2vector . Streaming.Prelude.take n), but terminates early and is thereby more efficient.

Reads the whole stream into a vector.

(c) Don Stewart 2006
(c) Duncan Coutts 2006-2011
(c) Michael Thompson 2015
(c) Udo Stenzel 2018
BSD-style
u.stenzel@web.de
experimental
portable

A space-efficient representation of a succession of byte vectors, supporting many efficient operations. An effectful byte stream contains 8-bit bytes, or by using certain operations can be interpreted as containing 8-bit characters. It also contains an offset, which will be needed to track the virtual offsets in the BGZF decode.

Smart constructor for .

Yield-style smart constructor for .
Reconceive an effect that results in an effectful bytestring as an effectful bytestring. Compare Streaming.mwrap. The closest equivalent of

    Streaming.wrap :: f (Stream f m r) -> Stream f m r

is here consChunk. mwrap is the smart constructor for the internal Go constructor.

Construct a succession of chunks from its Church encoding (compare GHC.Exts.build).

Resolve a succession of chunks into its Church encoding; this is not a safe operation; it is equivalent to exposing the constructors.

O(n) Concatenate a stream of byte streams.

Perform the effects contained in an effectful bytestring, ignoring the bytes.

O(1) The empty byte stream, i.e. return (). Note that ByteStream m w is generally a monoid for monoidal values of w, like ().

O(1) Yield a single byte as a minimal byte stream.

O(c) Converts a byte stream into a stream of individual strict bytestrings. This of course exposes the internal chunk structure.

O(c) Converts a stream of strict bytestrings into a byte stream.

O(n) Convert a monadic byte stream into a single strict bytestring, retaining the return value of the original pair. This operation is for use with mapped.

    mapped R.toStrict :: Monad m => Stream (ByteStream m) m r -> Stream (Of ByteStream) m r

It is subject to all the objections one makes to the Data.ByteStream.Lazy equivalent; all of these are devastating.

O(c) Transmute a pseudo-pure lazy bytestring to its representation as a monadic stream of chunks.

    >>> Q.putStrLn $ Q.fromLazy "hi"
    hi
    >>> Q.fromLazy "hi"
    Chunk "hi" (Empty (())) -- note: a 'show' instance works in the identity monad
    >>> Q.fromLazy $ BL.fromChunks ["here", "are", "some", "chunks"]
    Chunk "here" (Chunk "are" (Chunk "some" (Chunk "chunks" (Empty (())))))

O(n) Convert an effectful byte stream into a single lazy bytestring with the same internal chunk structure, retaining the original return value.

This is the canonical way of breaking streaming (toStrict and the like are far more demonic).
Essentially one is dividing the interleaved layers of effects and bytes into one immense layer of effects, followed by the memory of the succession of bytes.

Because one preserves the return value, toLazy is a suitable argument for mapped:

    B.mapped Q.toLazy :: Stream (ByteStream m) m r -> Stream (Of LazyBytes) m r

    >>> Q.toLazy "hello"
    "hello" :> ()
    >>> B.toListM $ traverses Q.toLazy $ Q.lines "one\ntwo\nthree\nfour\nfive\n"
    ["one","two","three","four","five",""] -- [LazyBytes]

O(1) Analogous to '(:)' for lists.

O(1) Extract the head and tail of a byte stream, or its return value if it is empty. This is the 'natural' uncons for an effectful byte stream.

O(n/c) drop n xs returns the suffix of xs after the first n elements, or [] if xs is shorter than n.

    >>> Q.putStrLn $ Q.drop 6 "Wisconsin"
    sin
    >>> Q.putStrLn $ Q.drop 16 "Wisconsin"

O(n/c) splitAt n xs is equivalent to the pair of the first n elements of xs and drop n xs.

    >>> rest <- Q.putStrLn $ Q.splitAt 3 "therapist is a danger to good hyphenation, as Knuth notes"
    the
    >>> Q.putStrLn $ Q.splitAt 19 rest
    rapist is a danger

Strictly splits off a piece. This breaks streaming, so reserve its use for small strings or when conversion to a strict bytestring is needed anyway.

 p xs returns the suffix remaining after  p xs.

 p is equivalent to  ( . p).

Read entire handle contents lazily into a byte stream. Chunks are read on demand, in at most k-sized chunks. It does not block waiting for a whole k-sized chunk, so if less than k bytes are available then they will be returned immediately as a smaller chunk.

The handle is closed on EOF. Note: the handle should be placed in binary mode for this to work correctly.

Read entire handle contents lazily into a byte stream. Chunks are read on demand, using the default chunk size. Note: the handle should be placed in binary mode for this to work correctly.

Writes a byte stream to a file. Actually writes to a temporary file and renames it on successful completion. The filename "-" causes it to write to stdout instead.
Outputs a byte stream to the specified handle.

A variant of findIndex that returns the length of the string if no element is found, rather than Nothing.

Take a builder and convert it to a genuine streaming bytestring, using a specific allocation strategy.

Turns a ByteStream into a connected stream of ByteStreams that divide at newline characters. The resulting strings do not contain newlines. This is the genuinely streaming variant, which only breaks chunks, and thus never increases the use of memory.

Because byte streams are usually read in binary mode, with no line ending conversion, this function recognizes both \n and \r\n endings (regardless of the current platform).

Turns a byte stream into a stream of strict bytestrings that divide at newline characters. The resulting strings do not contain newlines. This will cost memory if the lines are very long, and it does not recognize DOS line endings.

Decompresses GZip if present. If any GZip stream is found, all such streams are decompressed and any remaining data is discarded. Else, the input is returned unchanged. If the input is BGZF, the result will contain meaningful virtual offsets. If the input contains exactly one GZip stream, the result will have meaningful offsets into the uncompressed data. Else, the offsets will be bogus.

Checks if the input is GZip at all, and runs gunzip if it is. If it isn't, it runs k on the input.

Decompresses a gzip stream. If the leftovers look like another gzip stream, it recurses (some files, notably those produced by bgzip, contain multiple streams). Otherwise, the leftovers are discarded (some compressed HETFA files appear to have junk at the end).

Compresses a byte stream using GZip with default parameters.

A list of reference sequences.

Reference sequence in Bam. Bam enumerates the reference sequences and then sorts by index.
We need to track that index if we want to reproduce the sorting order.

Possible sorting orders from bam header. Thanks to samtools, which doesn't declare sorted files properly, we have to have the stupid undeclared state, too.

undeclared sort order

definitely not sorted

grouped by query name

sorted by query name

sorted by coordinate

Exactly two characters, for the "named" fields in bam.

Adds a new program line to a header. The new entry is (arbitrarily) prepended to the first existing chain, or forms a new singleton chain if none exists.

Combines two bam headers into one.

The overarching goal is to combine headers in such a way that no information is lost, but redundant information is removed. In particular, we sometimes "merge" headers with the same references, at other times we "meld" headers with entirely different references. In the former case, we want to keep the reference list as is; in the latter case, we must concatenate the reference lists.

If both headers have a version number, the result is the smaller of the two.

The resulting sort order is the most specific one compatible with both input sort orders. The stupid undeclared state is compatible with everything.

Reference sequences are appended and deduplicated. The numbering of references may thus change, which has to be dealt with in an appropriate way, see concatInputs, mergeInputsOn, and "bam-meld" for details. (It is also possible that different sequences are left with the same name. We cannot solve this right here, and there is no reliable way to do it in general.)

Comments are appended and deduplicated. This should work in most cases, and if it doesn't, someone needs to "samtools reheader" the file anyway.

Program chains are just collected, but when formatting, they are (effectively) deduplicated and are potentially assigned new unique identifiers.

Fixes a bam header after parsing.
It turns accumulated lists into vectors, and it handles the program lines. Program lines come in as an arbitrary graph. It should be a linear chain, but this isn't guaranteed in practice. We decompose the graph into chains by tracing from nodes with no predecessor, or from an arbitrary node if all nodes have predecessors. Tracing stops once it would form a cycle.

Creates the textual form of Bam meta data.

Formatting is straightforward, only program lines are a bit involved. Our multiple chains may lead to common nodes, and we do not want to print multiple identical lines. At the same time, we may need to print multiple different lines that carry the same id. The solution is to memoize printed lines, and to reuse their identity if an identical line is needed. When printing a line, it gets its preferred identifier, but if it's already taken, a new identifier is made up by first removing any trailing number and then by appending numeric suffixes.

Tests whether a reference sequence is valid. Returns true unless the argument equals invalidRefseq.

The invalid Refseq. Bam uses this value to encode a missing reference sequence.

The invalid position. Bam uses this value to encode a missing position.

Tests whether a position is valid. Returns true unless the argument equals invalidPos.

Compares two sequence names the way samtools does. samtools sorts by "strnum_cmp":

if both strings start with a digit, parse the initial sequence of digits and compare numerically, if equal, continue behind the numbers

else compare the first characters (possibly NUL), if equal continue behind them

else both strings ended and the shorter one counts as smaller (and that part is stupid)

Normalizes a series of values and encodes them in the way BAM and SAM expect.

Computes the "distinct bin" according to the BAM binning scheme.
If an alignment starts at pos and its CIGAR implies a length of len on the reference, then it goes into bin distinctBin pos len.

Default buffer size in elements.

Since we often want to merge many files, a read should take more time than a seek. Assuming a rotating hard drive, this sets the sensible buffer size to somewhat more than one MB. A smaller buffer size would surely work on SSDs, but the large buffer doesn't hurt either.

Reads stdin if the filename is "-", else reads the named file.

Reads multiple inputs in sequence.

Only one file is opened at a time, so they must also be consumed in sequence. The filename "-" refers to stdin; if no filenames are given, stdin is read.

Protects the terminal from binary junk.

If s is a stream, then protectTerm s first throws an error if the output is a terminal device, and is otherwise followed by the same s. This is most usefully composed with functions that might otherwise write binary data to an interactive terminal.

A general progress indicator that prints some message after a set number of records have passed through.

A simple progress indicator that prints the number of records.

A simple progress indicator that prints a position every set number of passed records.

A tiny stream that can be afforded to incrementally.

The streaming abstraction works fine if multiple sources feed into a small constant number of functions, but fails if there is an unpredictable number of such consumers. In that case, each consumer should be turned into a furrow. It's then possible to incrementally afford stuff to each furrow in a collection in a simple loop. The final value is then extracted from each furrow.

Turns a function that consumes a stream into a furrow.
Idea and some code stolen from "streaming-eversion".

Things we are able to encode. Taking inspiration from binary-serialise-cbor, we define these as a lazy list-like thing and consume it in an interpreter.

We manage a large buffer (multiple megabytes), of which we fill an initial portion. We remember the size, the used part, and two marks where we later fill in sizes for the length prefixed BAM or BCF records. We move the buffer down when we yield a piece downstream, and when we run out of space, we simply move to a new buffer. Garbage collection should take care of the rest. Unused marks must be set to (maxBound::Int) so they don't interfere with flushing.

Decompresses a bgzip stream. Individual chunks are decompressed in parallel. Leftovers are discarded (some compressed HETFA files appear to have junk at the end).

Creates a buffer.

Creates a new buffer, copying the active content from an old one, with higher capacity. The size of the new buffer is twice the free space in the old buffer, but at least minsz.

Expand a chain of tokens into a buffer, sending finished pieces downstream as soon as possible.

The EOF marker for BGZF files. This is just an empty string compressed as BGZF. Appended to BAM files to indicate their end.

A mostly contiguous subset of a sequence, stored as a set of non-overlapping intervals in a map from start position to end position (half-open intervals, naturally).

A subset of a genome. The idea is to map the reference sequence (represented by its number) to a Subsequence.

A collection of extension fields. A key is actually two ASCII characters.

Bam record in its native encoding along with virtual address.

A mutable vector that packs two nucleotides into one byte, just like Bam does.
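The packing of two 4-bit nucleotide codes into one byte can be sketched on plain lists. This is a minimal illustration, not the library's vector type: the function names packNybbles and unpackNybbles are hypothetical, and the layout (first code in the high nybble, odd lengths padded with a zero low nybble) is the one the BAM specification uses for its sequence field.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Hypothetical helpers; a sketch of the layout, not the library's API.

-- | Packs 4-bit codes (each 0..15) two to a byte, first code in the
-- high nybble.  An odd-length input is padded with a zero low nybble.
packNybbles :: [Word8] -> [Word8]
packNybbles (a:b:rest) = ((a .&. 15) `shiftL` 4 .|. (b .&. 15)) : packNybbles rest
packNybbles [a]        = [(a .&. 15) `shiftL` 4]
packNybbles []         = []

-- | Unpacks the first n codes again.  The length must be supplied,
-- because a padded final byte is indistinguishable from a real gap code.
unpackNybbles :: Int -> [Word8] -> [Word8]
unpackNybbles n = take n . concatMap (\w -> [w `shiftR` 4, w .&. 15])
```

Since the packed form cannot tell padding from a genuine code 0, any real container along these lines has to store the element count separately.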
A vector that packs two nucleotides into one byte, just like Bam does.

internal representation of a BAM record

virtual offset for indexing purposes

Cigar line in BAM coding. Bam encodes an operation and a length into a single integer, we keep those integers in an array.

Extracts the aligned length from a cigar line. This gives the length of an alignment as measured on the reference, which is different from the length on the query or the length of the alignment.

Smart constructor. Makes sure we got at least a full record.

Deletes all occurrences of some extension field.

Blindly inserts an extension field. This can create duplicates (and there is no telling how other tools react to that).

Deletes all occurrences of an extension field, then inserts it with a new value. This is safer than blind insertion, but also more expensive.

Adjusts a named extension by applying a function.

Writes in SAM format to stdout.

This is useful for piping to other tools (say, AWK scripts) or for debugging. No convenience functions to send SAM to a file or to compress it exist, because these are stupid ideas.

Encodes BAM records straight into a dynamic buffer, then BGZF's it. Should be fairly direct and perform well.

Writes BAM encoded stuff to a file. In reality, it cleverly writes to a temporary file and renames it when done.

Write BAM encoded stuff to stdout. This sends uncompressed(!) BAM to stdout. Useful for piping to other tools. The output is still wrapped in a BGZF stream, because that's what all tools expect; but the individual blocks are not compressed.

Writes BAM encoded stuff to a handle.

Extended CIGAR. This subsumes both the CIGAR string and the optional MD field. If we have MD on input, we generate it on output, too. And in between, we break everything into very small operations.
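To illustrate the cigar encoding and the aligned length mentioned above, here is a small sketch on plain lists. It assumes the layout from the BAM specification (length in the upper 28 bits, operation code in the lower 4 bits, operations numbered in the order MIDNSHP=X); the names encodeCigarOp and alignedLengthSketch are hypothetical and do not come from the library.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word32)

-- Hypothetical helpers; a sketch of the encoding, not the library's API.

-- | Encodes one CIGAR element the way BAM does: length in the upper
-- 28 bits, operation code in the lower 4 bits.  Unknown operation
-- characters yield Nothing.
encodeCigarOp :: Char -> Int -> Maybe Word32
encodeCigarOp c len = do
    op <- lookup c (zip "MIDNSHP=X" [0 ..])
    pure (fromIntegral len `shiftL` 4 .|. op)

-- | Length on the reference: sums the lengths of all operations that
-- consume reference bases (M, D, N, = and X, i.e. codes 0, 2, 3, 7, 8).
alignedLengthSketch :: [Word32] -> Int
alignedLengthSketch = sum . map refLen
  where
    refLen w
        | (w .&. 15) `elem` [0, 2, 3, 7, 8] = fromIntegral (w `shiftR` 4)
        | otherwise                          = 0
```

For the cigar 5S90M2I3D5M, only 90M, 3D and 5M consume reference bases, so the aligned length is 98, whereas the query length (S, M and I operations) is 102.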
Removes duplicates from an aligned, sorted BAM stream.

The incoming stream must be sorted by coordinate, and we check for violations of that assumption. We cannot assume that length was taken into account when sorting (samtools doesn't do so), so duplicates may be separated by reads that start at the same position but have different length or different strand.

We are looking at three different kinds of reads: paired reads, true single ended reads, merged or trimmed reads. They are somewhat different, but here's the situation if we wanted to treat them separately. These conditions define a set of duplicates:

Merged or trimmed: We compare the leftmost coordinates and the aligned length. If the library prep is strand-preserving, we also compare the strand.

Paired: We compare both left-most coordinates (b_pos and b_mpos). If the library prep is strand-preserving, only first-mates can be duplicates of first-mates. Else a first-mate can be the duplicate of a second-mate. There may be pairs with one unmapped mate. This is not a problem as they get assigned synthetic coordinates and will be handled smoothly.

True singles: We compare only the leftmost coordinate. It does not matter if the library prep is strand-preserving, the strand always matters.

Across these classes, we can see more duplicates:

Merged/trimmed and paired: these can be duplicates if the merging failed for the pair. We would need to compare the outer coordinates of the merged reads to the two 5' coordinates of the pair. However, since we don't have access to the mate, we cannot actually do anything right here. This case should be solved externally by merging those pairs that overlap in coordinate space.

Single and paired: in the single case, we only have one coordinate to compare. This will inevitably lead to trouble, as we could find that the single might be the duplicate of two pairs, but those two pairs are definitely not duplicates of each other.
We solve it by removing the single read(s).

Single and merged/trimmed: same trouble as in the single+paired case. We remove the single to solve it.

In principle, we might want to allow some wiggle room in the coordinates. So far, this has not been implemented. It adds the complication that groups of separated reads can turn into a set of duplicates because of the appearance of a new read. Needs some thinking about... or maybe it's not too important.

Once a set of duplicates is collected, we perform a majority vote on the correct CIGAR line. Of all those reads that agree on this CIGAR line, a consensus is called, quality scores are adjusted and clamped to a maximum, the MD field is updated and the XP field is assigned the number of reads in the original cluster. The new MAPQ becomes the RMSQ of the map qualities of all reads.

Treatment of Read Groups: We generalize by providing a "label" function; only reads that have the same label are considered duplicates of each other. The typical label function would extract read groups, libraries or samples.

Workhorse for duplicate removal.

Unmapped fragments should not be considered to be duplicates of mapped fragments. The unmapped flag can serve for that: while there are two classes of unmapped reads (those that are not mapped and those that are mapped to an invalid position), the two sets will always have different coordinates. (Unfortunately, correct duplicate removal now relies on correct unmapped and mate unmapped flags, and we don't get them from unmodified BWA. So correct operation requires patched BWA or a run of bam-fixpair.)

Other definitions (e.g. lack of CIGAR) don't work, because that information won't be available for the mate.
This would amount to making the unmapped flag part of the coordinate, but samtools is not going to take it into account when sorting.

Instead, both flags become part of the mate pos grouping criterion.

First Mates should (probably) not be considered duplicates of Second Mates. This is unconditionally true for libraries with A/B-style adapters (definitely 454, probably Mathias' ds protocol) and the ss protocol, it is not true for fork adapter protocols (vanilla Illumina protocol). So it has to be an option, which would ideally be derived from header information.

This code ignores read groups, but it will do a majority vote on the RG field and call consensus sequences for the index sequences. If you believe that duplicates across read groups are impossible, you must call it with an appropriately filtered stream.

Half-Aligned Pairs (meaning one known coordinate, while the validity of the alignments is immaterial) are rather complicated:

Given that only one coordinate is known (5' of the aligned mate), we want to treat them like true singles. But the unaligned mate should be kept if possible, though it should not contribute to a consensus sequence. We assume nothing about the unaligned mate, not even that it shouldn't be aligned, never mind the fact that it couldn't be. (The difference is in the finite abilities of real world aligners, naturally.)

Therefore, aligned reads with unaligned mates go to the same potential duplicate set as true singletons. If at least one pair exists that might be a duplicate of those, all singletons and half-aligned mates are discarded. Else a consensus is computed and replaces the aligned mates.

The unaligned mates end up in the same place in a BAM stream as the aligned mates (therefore we see them and can treat them locally). We cannot call a consensus, since these molecules may well have different length, so we select one.
It doesn't really matter which one is selected, and since we're treating both mates at the same time, it doesn't even need to be reproducible without local information. This is made to be the mate of the consensus.

See the next entry for how it's actually done.

Merging information about true singles, merged singles, half-aligned pairs, actually aligned pairs.

We collected aligned reads with unaligned mates together with aligned true singles (singles). We collected the unaligned mates, which necessarily have the exact same alignment coordinates, separately (unaligned). If we don't find a matching true pair (that case is already handled smoothly), we keep the highest quality unaligned mate, pair it with the consensus of the aligned mates and aligned singletons, and give it the lexically smallest name of the half-aligned pairs.

Merging of half-aligned reads. The first argument is a map of unaligned reads (their mates are aligned to the current position), the second is a list of reads that are aligned (their mates are not aligned).

So, suppose we're looking at a read that was passed through. We need to emit it along with its mate, which may be hidden inside a list. (Alternatively, we could force it to single, but that fails if we're passing everything along somehow.)

Suppose instead we're looking at a newly called consensus. We could pair it with some mate (which we'd need to duplicate), or we could turn it into a singleton. Duplication is ugly, so in this case, we force it to singleton.

Normalize a read's alignment to fall into the canonical region of [0..l]. Takes the name of the reference sequence and its length. Returns Left x if the coordinate decreased so the result is out of order now, Right x if the coordinate is unchanged.

Wraps a read to be fully contained in the canonical interval [0..l]. If the read overhangs, it is duplicated and both copies are suitably masked.
A piece with changed coordinate that is now out of order is returned as Left x; if the order is fine, it is returned as Right x.

Splits its argument into two at some position. The position is counted in terms of the reference (therefore, deletions count, insertions don't). The parts that would be skipped if we were splitting lists are replaced by soft masks.

Create an MD field from an extended CIGAR and place it in a record. We build it piecemeal (in go), call out to addNum, addRep, addDel to make sure the operations are not generated in a degenerate manner, and finally check if we're even supposed to create an MD field.

Trims from the 3' end of a sequence. trim_3' p b trims the 3' end of the sequence in b at the earliest position such that p evaluates to true on every suffix that was trimmed off. Note that the 3' end may be the beginning of the sequence if it happens to be stored in reverse-complemented form. Also note that trimming from the 3' end may not make sense for reads that were constructed by merging paired end data (but we cannot take care of that here). Further note that trimming may break dependent information, notably the "mate" information of the mate and many optional fields.

Trim predicate to get rid of low quality sequence. trim_low_quality q ns qs evaluates to true if all qualities in qs are smaller (i.e. worse) than q.

Finds the merge point. Input is list of forward adapters, list of reverse adapters, sequence1, quality1, sequence2, quality2; output is merge point and two qualities (YM, YN).

Overlap-merging of read pairs. We shall compute the likelihood for every possible overlap, then select the most likely one (unless it looks completely random), compute a quality from the second best merge, then merge and clamp the quality accordingly.
(We could try looking for chimaera after completing the merge, if only we knew which ones to expect?)

Two reads go in, with two adapter lists. We return nothing if all merges looked mostly random. Else we return the two original reads, flagged as eflagVestigial, *and* the merged version, flagged with one flag and optionally a second. All reads contain the computed qualities (in YM and YN), which we also return.

The merging automatically limits quality scores some of the time. We additionally impose a hard limit of 63 to avoid difficulties representing the result, and even that is ridiculous. Sane people would further limit the returned quality! (In practice, map quality later imposes a limit anyway, so no worries...)

Finds the trimming point. Input is list of forward adapters, sequence, quality; output is trim point and two qualities (YM, YN).

Trimming for a single read: we need one adapter only (the one coming after the read), here provided as a list of options, and then we merge with an empty second read. Results in up to two reads (the original, possibly flagged, and the trimmed one, definitely flagged) and two qualities.

For merging, we don't need the complete adapters (length around 70!), only a sufficient prefix. Taking only the more-or-less constant part (length around 30), there aren't all that many different adapters in the world. To deal with pretty much every library, we only need the following forward adapters, which will be the default (defined here in the direction they would be sequenced in): Genomic R2, Multiplex R2, Fraft P7.

Like the forward adapters, these are the few adapters needed for the reverse read (defined in the direction they would be sequenced in as part of the second read): Genomic R1, CL 72.

Computes overlap score for two reads (with qualities) assuming an insert length.

Decodes either BAM or SAM.

The input can be plain, gzip'ed or bgzf'd and either BAM or SAM.
BAM is reliably recognized, anything else is treated as SAM. The offsets stored in BAM records make sense only for uncompressed or bgzf'd BAM.

Reads multiple bam files.

A continuation is run on the list of headers and streams. Since no attempt is made to unify the headers, this will work for completely unrelated bam files. All files are opened at the same time, which might run into the file descriptor limit given some ridiculous workflows.

Streaming parser for SAM files.

It parses plain uncompressed SAM and returns a result compatible with that of the BAM parser. Since it is supposed to work the same way as the BAM parser, it requires a symbol table for the reference names. This is extracted from the @SQ lines in the header. Note that reading SAM tends to be inefficient; if you care about performance at all, use BAM.

Reads multiple bam inputs in sequence.

Only one file is opened at a time, so they must also be consumed in sequence. If you can afford to open all inputs simultaneously, you probably want to use the merging reader instead. The filename "-" refers to stdin; if no filenames are given, stdin is read. Since we can't look ahead into further files, the header of the first input is used for the result, and an exception is thrown if one of the subsequent headers is incompatible with the first one.

Reads multiple bam files and merges them.

If the inputs are all sorted by the thing being merged on, the output will be sorted, too. The headers are all merged sensibly, even if their reference lists differ. However, for performance reasons, we don't want to change the rname and mrnm fields in potentially all records. So instead of allowing arbitrary reference lists to be merged, we throw an exception unless every input is compatible with the effective reference list.

We need a simple priority queue. Here's a skew heap (specialized to strict priorities and values).

The things we drag along during pileup.
Notes:

* The active queue is a simple stack. We add at the front when we encounter reads, which reverses them. When traversing it, we traverse reads backwards, but since we accumulate the pile of bases, it gets reversed back. The new active queue, however, is no longer reversed (as it should be). So after the traversal, we reverse it again. (Yes, it is harder to understand than using a proper deque type, but it is cheaper. There may not be much point in the reversing, though.)

The pileup logic keeps a current coordinate (just two integers) and two running queues: one of active entries that contribute to current genotype calling and one of waiting entries that will contribute at a later point.

This is the CPS version of multiple state and environment monads. It is somewhat faster than direct style and gives more control over when evaluation happens.

Raw pile. Bases and indels are piled separately on forward and backward strands.

Running pileup results in a series of piles. A pile has the basic statistics of a VarCall, but no likelihood values and a pristine list of variants instead of a proper call. We emit one pile with two piles of bases (one for each strand) and one pile of indels (the one immediately following) at a time.

Map quality and a list of encountered indel variants. The deletion has the reference sequence, if known, an insertion has the inserted sequence with damage information.

Map quality and a list of encountered bases, with damage information and reference base if known.

Statistics about a genotype call.
Probably only useful for filtering (so not very useful), but we keep them because it's easy to track them.

Represents our knowledge about a certain base, which consists of the base itself (A,C,G,T, encoded as 0..3; no Ns), the quality score (anything that isn't A,C,G,T becomes A with quality 0), and a substitution matrix representing post-mortem but pre-sequencing substitutions.

Unfortunately, none of this can be rolled into something simpler, because damage and sequencing error behave so differently.

Damage information is polymorphic. We might run with a simple version (a matrix) for calling, but we need more (a matrix and a mutable matrix, I think) for estimation.

reference base from MD field
called base
quality of called base
damage information
damage information
more chunks
number of bases to wait due to a deletion
map quality

The primitive pieces for genotype calling: A position, a base represented as four likelihoods, an inserted sequence, and the length of a deleted sequence. The logic is that we look at a base followed by some indel, and all those indels are combined into a single insertion and a single deletion.

skip to position (at start or after N operation)
observed deletion and insertion between two bases
nothing anymore

Decomposes a BAM record into chunks suitable for piling up. We pick apart the CIGAR and MD fields, and combine them with sequence and quality as appropriate. Clipped bases are removed/skipped as needed. We also apply a substitution matrix to each base, which must be supplied along with the read.

The pileup enumeratee takes reads, dissects them, interleaves the pieces appropriately, and generates piles. The output will contain at most one pile of bases and one pile of indels for each position; piles are sorted by position.

This top level driver receives the input records.
Unaligned reads and duplicates are skipped (but not those merely failing quality checks). Processing stops when the first read with invalid br_rname is encountered or at end of file.

The actual pileup algorithm. If active contains something, continue here. Else find the coordinate to continue from, which is the minimum of the next waiting coordinate and the next coordinate in input; if found, continue there, else we're all done.

Feeds input as long as it starts at the current position.

Checks the waiting queue. If there is anything waiting for the current position, moves it to the active queue.

Separately scans the two active queues and makes one pile of bases from each. Also sees what's next in the chunks: observed indels contribute to two separate piles of indels, skips to a later position are pushed back to the waiting queue, pieces with nothing left are removed, and everything else is added to two fresh active queues.

Decode only those reads that fall into one of several regions. Strategy: We will scan the file mostly linearly, but only those regions that are actually needed. We filter the decoded stuff so that it actually overlaps our regions.

From the binning index, we get a list of segments per requested region. Using the checkpoints, we prune them: if we have a checkpoint to the left of the beginning of the interesting region, we can move the start of each segment forward to the checkpoint. If that makes the segment empty, it can be dropped.

The resulting segment lists are merged, then traversed. We seek to the beginning of the earliest segment and start decoding. Once the virtual file position leaves the segment or the alignment position moves past the end of the requested region, we move to the next.
Moving is a seek if it spans a sufficiently large gap or points backwards, else we just keep going.

A segment has a start and an end offset, and an "end coordinate" from the originating region.

Checkpoints. Each checkpoint is a position with the virtual offset where the first alignment crossing the position is found. In BAI, we get this from the ioffset vector, in CSI we get it from the loffset field: "Given a region [beg,end), we only need to visit chunks whose end file offset is larger than ioffset of the 16kB window containing beg." (Sounds like a marginal gain, though.)

Mapping from bin number to vector of clusters.

Full index, unifying BAI and CSI style. In both cases, we have the binning scheme; parameters are fixed in BAI, but variable in CSI. Checkpoints are created from the linear index in BAI or from the loffset field in CSI.

Minshift parameter from CSI
Depth parameter from CSI
Best guess at where the unaligned records start
Room for stuff (needed for tabix)
Records for the binning index, where each bin has a list of segments belonging to it.
Known checkpoints of the form (pos,off) where off is the virtual offset of the first record crossing pos.

Merges two lists of segments. Lists must be sorted, the merge sort merges overlapping segments into one.

Reads any index we can find for a file.

If the file name has a .bai or .csi extension, optionally followed by .gz, we read it. Else we look for the index by adding such an extension and by replacing the extension with these two, and finally try the file itself. The first file that exists is used.

Reads an index in BAI or CSI format, recognized automatically. The index can be compressed, even though this isn't standard.

Reads a Tabix index.
Note that tabix indices are compressed; this is taken care of automatically.

Reads the list of segments from an index file and makes sure it is sorted.

Seeks to a virtual offset in a BGZF file and streams from there.

If the optional end offset is supplied, streaming stops when it is reached. Else, streaming goes on to the end of file.

Streams one reference from a bam file.

Seeks to a given sequence in a Bam file and enumerates only those records aligning to that reference. We use the first checkpoint available for the sequence, which requires an appropriate index. Streams the records of the correct reference sequence only, and produces an empty stream if the sequence isn't found.

Reads from a Bam file the part with unaligned reads.

Sort of the dual to streaming a single reference. Since the index does not actually point to the unaligned part at the end, we use a best guess at where the unaligned stuff might start, then skip over any aligned records. Our "fallback guess" is to decode from the current position; this only works if something else already consumed the Bam header.

Streams one segment.

Takes a handle, a segment and a stream coming from that handle. If skipping ahead in the stream looks cheap enough, that is done. Else we seek the handle to the start offset and stream from it. Either way, the part of the stream before it crosses either the end offset or the max position is returned, and the remaining stream after it is returned in its functorial value so it can be passed to another invocation of, e.g., the same function. Note that the stream passed in becomes unusable.

A quality filter is simply a transformation on BamRecs. By convention, quality filters should set flagFailsQC; a further step can then remove the failed reads. Filtering of individual reads tends to result in mate pairs with inconsistent flags, which in turn will result in lone mates and all sorts of trouble with programs that expect non-broken BAM files.
It is therefore recommended to use pairFilter with suitable predicates to do the post processing.

A filter/transformation applied to pairs of reads. We supply a predicate to be applied to single reads and one to be applied to pairs; the latter can get incomplete pairs, too, if mates have been separated or filtered asymmetrically. This fails spectacularly if the input isn't grouped by name.

Simple complexity filter aka "Nancy Filter". A read is considered not-sufficiently-complex if the most common base accounts for greater than the cutoff fraction of all non-N bases.

Filter on order zero empirical entropy. Entropy per base must be greater than cutoff.

Filter on average quality. Reads without quality string pass.

Filter on minimum quality. In qualityMinimum n q, a read passes if it has no more than n bases with quality less than q. Reads without quality string pass.

Convert quality scores from old Illumina scale (different formula and offset 64 in FastQ).

Convert quality scores from new Illumina scale (standard formula but offset 64 in FastQ).

Reader for DNA (not protein) sequences in FastA and FastQ. We read everything vaguely looking like FastA or FastQ, then shoehorn it into a BAM record. We strive to extract information following more or less established conventions from the header, but don't aim for completeness. The recognized syntactical warts are converted into appropriate flags and removed.
Only the canonical variant of FastQ is supported (qualities stored as raw bytes with offset 33).

Supported additional conventions:

* A name suffix of /1 or /2 is turned into the first mate or second mate flag and the read is flagged as paired.
* Same for name prefixes of F_ or R_, respectively.
* A name prefix of M_ flags the sequence as unpaired and merged.
* A name prefix of T_ flags the sequence as unpaired and trimmed.
* A name prefix of C_, optionally before or after any of the other prefixes, is turned into the extra flag XP:i:-1 (result of duplicate removal with unknown duplicate count).
* A collection of tags separated from the name by an octothorpe is removed and put into the fields XI and XJ as text.

Everything before the first sequence header is ignored. Headers can start with > or @, we treat both equally. The first word of the header becomes the read name, the remainder of the header is ignored. The sequence can be split across multiple lines; whitespace, dashes and dots are ignored, IUPAC-IUB ambiguity codes are accepted as bases, anything else causes an error. The sequence ends at a line that is either a header or starts with +; in the latter case, that line is ignored and must be followed by quality scores. There must be exactly as many Q-scores as there are bases, followed immediately by a header or end-of-file. Whitespace is ignored.

Like the reader above, but additionally:

* If the first word of the description has at least four colon separated subfields, the first is used to flag first/second mate, the second is the "QC failed" flag, and the fourth is the index sequence.

Same again, but a custom function can be applied to the description string (the part of the header after the sequence name), which can modify the parsed record. Note that the quality field can end up empty.

Fixes abuse of flags valued 0x800 and 0x1000.
We used them for low quality and low complexity, but they have since been redefined. If set, we clear them and store them into the ZQ field. Also fixes abuse of the combination of the paired, 1st mate and 2nd mate flags used to indicate merging or trimming. These are canonicalized and stored into the FF field. This function is unsafe on BAM files of unclear origin!

Fixes typical inconsistencies produced by Bwa: sometimes, 'mate unmapped' should be set, and we can see it, because we match the mate's coordinates. Sometimes 'properly paired' should not be set, because one mate is unmapped. This function is generally safe, but needs to be called only on the output of affected (older?) versions of Bwa.

Removes syntactic warts from old read names or the read names used in FastQ files.

Mode argument for myersAlign, determines where free gaps are allowed.

align globally, without gaps at either end
align so that the second sequence is a prefix of the first
align so that the first sequence is a prefix of the second

Align two strings. myersAlign maxd seqA mode seqB tries to align seqA to seqB, which will work as long as no more than maxd gaps or mismatches are incurred. The mode argument determines if either of the sequences is allowed to have an overhanging tail.

The result is the triple of the actual distance (gaps + mismatches) and the two padded sequences. These sequences are the original sequences with dashes inserted for gaps.

The algorithm is the O(nd) algorithm by Myers, implemented in C. A gap and a mismatch score the same. The strings are supposed to code for DNA; the code understands IUPAC-IUB ambiguity codes. Two characters match iff there is at least one nucleotide both can code for. Note that N is a wildcard, while X matches nothing.

Nicely print an alignment. An alignment is simply a list of strings with inserted gaps to make them align.
We split them into manageable chunks, stack them vertically and add a line showing asterisks in every column where all aligned strings agree. The result is almost the Clustal format.

Alignment record. The reference sequence is filled with Ns if missing.

Compact storage of a pair of ambiguous bases. Used to represent alignments in a way that is accessible even to assembly code. The first and second field are stored in the low and high nybble, respectively.

Collected "traditional" statistics:

* Base composition near 5' end and near 3' end. Each consists of five vectors of counts of A,C,G,T, and everything else. The 5' composition begins with context bases to the left of the 5' end, the 3' composition ends with context bases to the right of the 3' end.
* Substitutions. Counted from the reconstructed alignment, once around the 5' end and once around the 3' end. For a total of 2*4*4 different substitutions. Positions where the query has a gap are skipped.
* Substitutions at CpG motifs. Also counted from the reconstructed alignment, and a CpG site is simply the sequence CG in the reference. Gaps may confuse that definition, so that CpHpG still counts as CpG, because the H is gapped. That might actually be desirable.
* Conditional substitutions. The 5' and 3' ends count as damaged if the very last position has a C-to-T substitution. With that in mind, three further tables are like the plain substitution counts, but counting only reads where the 5' end is damaged, where the 3' end is damaged, and where both ends are damaged, respectively.

Parameters for the universal damage model.

We assume the correct model is either no damage, or single strand damage, or double strand damage. Each of them comes with a probability.
It turns out that blending them into one is simply accomplished by multiplying these probabilities onto the deamination probabilities.

For single stranded library prep, only one kind of damage occurs, at some frequency in single stranded parts, and the overhang length is distributed exponentially with one parameter at the 5' end and another at the 3' end. (Without UDG treatment, those will be equal. With UDG, those are much smaller and in fact don't literally represent overhangs.)

For double stranded library prep, we get C->T damage at the 5' end and G->A at the 3' end with one rate and both in the interior with another rate. Everything is symmetric, and therefore the orientation of the aligned read doesn't matter either. Both overhangs follow a distribution with a single shared parameter.

A damage model is a function that gives substitution matrices for each position in a read. The matrix can depend on whether the alignment is reversed, the length of the read and the position. (In practice, we should probably memoize precomputed damage models somehow.)

We represent substitution matrices by a dedicated type. Internally, this is a vector of packed vectors. Conveniently, each of the packed vectors represents all transitions into the given nucleotide.

Convenience function to access a substitution matrix that has a mnemonic reading.

Adds the two matrices of a mutable substitution model (one for each strand) appropriately, normalizes the result (to make probabilities from pseudo-counts), and freezes that into one immutable matrix. We add a single count everywhere to avoid getting NaN from bizarre data.

Model for undamaged DNA. The likelihoods follow directly from the quality score. This needs elaboration to see what to do with ambiguity codes (even though those haven't actually been observed in the wild).

Generic substitution matrix, has C->T and G->A deamination as parameters. Setting p or q to 0 as appropriate makes this apply to the single stranded or undamaged case.
Stream transformer that computes some statistics from plain BAM (no MD field needed) and a 2bit file. The alignment is also reconstructed and passed downstream. The final value of the source stream ends up in the stats_more field of the result.

* Get the reference sequence including both contexts once. If this includes invalid sequence (negative coordinate), pad suitably.
* Accumulate counts for the valid parts around 5' and 3' ends as appropriate from flags and config.
* Combine the part that was aligned to (so no context) with the read to reconstruct the alignment.

Arguments are the table of reference names, the 2bit file with the reference, the amount of context outside the alignment desired, and the amount of context inside desired.

For one kind of fragment, we cut the read in the middle, so the 5' and 3' plots stay clean from each other's influence. The other two kinds of fragments count completely towards the appropriate end.

Stream transformer that computes some statistics from plain BAM with a valid MD field. The alignment is also reconstructed and passed downstream. The final value of the source stream becomes the stats_more field of the result.

* Reconstruct the alignment from CIGAR, SEQ, and MD.
* Filter the alignment to get the reference sequence, accumulate it.
* Accumulate everything over the alignment.

The argument is the amount of context inside desired.

For one kind of fragment, we cut the read in the middle, so the 5' and 3' plots stay clean from each other's influence. The other two kinds of fragments count completely towards the appropriate end.

Common logic for statistics. The function get_ref_and_aln reconstructs reference sequence and alignment from a Bam record. It is expected to construct the alignment with respect to the forwards strand of the reference; we reverse-complement it if necessary.

Reconstructs the alignment from reference, query, and cigar. Only positions where the query is not gapped are produced.

Reconstructs the alignment from query, cigar, and md.
Only positions where the query is not gapped are produced.

Number of mismatches allowed by BWA. bwa_cal_maxdiff thresh len returns the number of mismatches bwa aln -n $thresh would allow in a read of length len. For reference, here is the code from BWA that computes it (we assume err = 0.02, just like BWA):

    int bwa_cal_maxdiff(int l, double err, double thres)
    {
        double elambda = exp(-l * err);
        double sum, y = 1.0;
        int k, x = 1;
        for (k = 1, sum = elambda; k < 1000; ++k) {
            y *= l * err;
            x *= k;
            sum += elambda * y / x;
            if (1.0 - sum < thres) return k;
        }
        return 2;
    }
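As a rough illustration of what the BWA routine computes, here is a small Python sketch. It is not part of biohazard or BWA: the function name is reused for readability only, the default threshold of 0.04 is simply the value chosen for this example, and each Poisson term is accumulated incrementally (term_k = term_{k-1} * lambda / k) instead of keeping the power and the factorial in separate variables. That is arithmetically the same sum, but avoids overflowing the factorial.

```python
import math


def bwa_cal_maxdiff(length, err=0.02, thres=0.04):
    """Smallest k such that a Poisson(length * err) variable exceeds k
    with probability below thres; mirrors the loop in the BWA code."""
    lam = length * err
    term = math.exp(-lam)      # P(0 errors)
    total = term               # running P(at most k errors)
    for k in range(1, 1000):
        term *= lam / k        # P(exactly k errors)
        total += term
        if 1.0 - total < thres:
            return k
    return 2                   # same fallback as the original loop


if __name__ == "__main__":
    for n in (30, 50, 100):
        print(n, bwa_cal_maxdiff(n))   # prints 2, 3 and 5 mismatches
```

Longer reads are allowed more mismatches because the expected number of errors, length * err, grows with the read length while the tail probability threshold stays fixed.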
biohazard-2.0-Ek3b5b8UGYN2HSi1h8Bift Bio.PreludeBio.Base Bio.Streaming Bio.Util.Text Bio.Util.MMap Bio.Util.NubBio.Util.NumericBio.Util.Storable Bio.TwoBitBio.Streaming.VectorBio.Streaming.BytesBio.Streaming.ParseBio.Bam.HeaderBio.Streaming.FurrowBio.Streaming.BgzfBio.Bam.Regions Bio.Bam.RecBio.Bam.Writer Bio.Bam.Rmdup Bio.Bam.TrimBio.Bam.ReaderBio.Bam.Pileup Bio.Bam.IndexBio.Bam.Filter Bio.Bam.Fastq Bio.Bam.Evan Bio.AlignBio.Adna Streamingmapped System.IOhSetBinaryModeBio.BambaseGHC.Base++ghc-primGHC.PrimseqGHC.Listfilterzip GHC.Stable newStablePtrprint Data.Tuplefstsnd otherwiseassert GHC.MagiclazyGHC.IO.Exception assertError Debug.TracetraceinlinemapGHC.Exts groupWith$GHC.Num fromInteger-GHC.Real fromRationalGHC.EnumenumFrom enumFromThen enumFromToenumFromThenTo GHC.Classes==>=negatefail>>=>>fmapreturnControl.Monad.Fixmfix Control.Arrowarrapp|||loop Data.String fromString fromIntegral realToFrac toInteger toRational Control.Monadguard Data.DynamictoDyn<>memptymappendmconcatjoin<*>pure*>BoundedEnumEq GHC.FloatFloating FractionalIntegralMonad Data.DataDataFunctorNumOrdGHC.ReadReadReal RealFloatRealFracGHC.ShowShowGHC.ArrIxData.Typeable.InternalTypeableMonadFixIsString Applicative Data.FoldableFoldableData.Traversable Traversable GHC.GenericsGenericGeneric1 SemigroupMonoid GHC.TypesBoolCharDoubleFloatIntGHC.IntInt8Int16Int32Int64 integer-gmpGHC.Integer.TypeInteger GHC.MaybeMaybeOrderingRatioRational RealWorld StablePtrIOWordGHC.WordWord8Word16Word32Word64GHC.PtrPtrFunPtr Data.EitherEitherRep1FalseNothingJustTrueLeftRightLTEQGTTyConGHC.ForeignPtr 
ForeignPtrGHC.IO.Handle.TypesHandleGHC.STST rangeSizeinRangeindexrangeuntangle ioException heapOverflow stackOverflowcannotCompactMutablecannotCompactPinnedcannotCompactFunctionallocationLimitExceededblockedIndefinitelyOnSTMblockedIndefinitelyOnMVar InterruptedResourceVanished TimeExpiredUnsupportedOperation HardwareFaultInappropriateTypeInvalidArgument OtherError ProtocolError SystemErrorUnsatisfiedConstraints UserErrorPermissionDeniedIllegalOperationResourceExhausted ResourceBusy NoSuchThing AlreadyExistsunsupportedOperation ioe_filename ioe_errnoioe_description ioe_locationioe_type ioe_handleIOError Data.Complexphase magnitudepolarcismkPolar conjugateimagPartrealPart:+Complex Data.Fixed showFixedmod'divMod'div'MkFixedFixed resolution HasResolutionE0UniE1DeciE2CentiE3MilliE6MicroE9NanoE12PicoData.Functor.Compose getComposeComposeData.Functor.SumInRInLSumsortWith tyconModule tyconUQname isNorepType mkNoRepType mkCharConstr mkRealConstrmkIntegralConstr mkCharType mkFloatType mkIntTypemaxConstrIndex constrIndex indexConstr isAlgType readConstr showConstr constrFixity constrFieldsdataTypeConstrsmkConstr mkDataType repConstr constrRep constrType dataTypeRep dataTypeName fromConstrM fromConstrB fromConstrgmapMogmapMpgmapMgmapQigmapQgmapQrgmapQlgmapT dataCast2 dataCast1 dataTypeOftoConstrgunfoldgfoldlDataTypeConstrNoRepCharRepFloatRepIntRepAlgRepDataRep CharConstr FloatConstr IntConstr AlgConstr ConstrRepConIndexInfixPrefixFixitySystem.TimeouttimeoutControl.ConcurrentthreadWaitWriteSTMthreadWaitReadSTMthreadWaitWritethreadWaitReadrunInUnboundThreadrunInBoundThreadisCurrentThreadBoundforkOSWithUnmaskforkOS forkFinallyrtsSupportsBoundThreadsControl.Concurrent.ChanwriteList2ChangetChanContentsdupChanreadChan writeChannewChanChanControl.Concurrent.QSem signalQSemwaitQSemnewQSemQSemControl.Concurrent.QSemN signalQSemN waitQSemNnewQSemNQSemNData.Bifunctorsecondfirstbimap BifunctorControl.Monad.IO.ClassliftIOMonadIO Data.RatioapproxRational Data.STRef modifySTRef' modifySTRef 
Data.Unique hashUnique newUniqueUniqueGHC.StableName eqStableNamehashStableNamemakeStableName StableNameSystem.EnvironmentgetEnvironment withProgNamewithArgsunsetEnvsetEnv lookupEnvgetEnv getProgNamegetArgs!System.Environment.ExecutablePathgetExecutablePath System.Exitdie exitSuccess exitFailureexitWith System.Mem performGCperformMajorGCperformMinorGC Text.PrintfhPrintfprintfmfilter<$!>unless replicateM_ replicateMfoldM_foldM zipWithM_zipWithM mapAndUnzipMforever<=<>=>filterM Data.Version makeVersion parseVersion showVersion versionTags versionBranchVersion traceMarkerIO traceMarker traceEventIO traceEvent traceStack traceShowMtraceM traceShowId traceShowtraceId putTraceMsgtraceIO Data.ListisSubsequenceOffoldMapDefault fmapDefault mapAccumR mapAccumLforMforsequencemapM sequenceAtraverseControl.Applicativeoptional unwrapMonad WrapMonad WrappedMonad unwrapArrow WrapArrow WrappedArrow getZipListZipListleftApp^<<<<^>>^^>>returnA&&&***Arrow runKleisliKleisli zeroArrow ArrowZero<+> ArrowPlus+++rightleft ArrowChoice ArrowApply ArrowMonad ArrowLoopData.Functor.Identity runIdentityIdentitywithBinaryFilehPrintreadIOreadLn appendFile writeFilereadFileinteract getContentsgetLinegetCharputStrLnputStrputChar GHC.IO.HandlehSeekhCloseGHC.IO.Handle.FDopenBinaryFilestderrstdinGHC.IO.Handle.Text hPutStrLnhPutStr GHC.Conc.IO registerDelay threadDelay closeFdWithioManagerCapabilitiesChangedensureIOManagerIsRunningGHC.Conc.Signal runHandlers setHandlerSignal HandlerFunControl.Concurrent.MVar mkWeakMVaraddMVarFinalizermodifyMVarMaskedmodifyMVarMasked_ modifyMVar modifyMVar_withMVarMaskedwithMVarswapMVarSystem.IO.Unsafe unsafeFixIOControl.ExceptionallowInterruptControl.Monad.ST.ImpfixSTSystem.IO.ErrorannotateIOError modifyIOErrorioeSetFileName ioeSetHandleioeSetLocationioeSetErrorStringioeSetErrorTypeioeGetFileName 
ioeGetHandleioeGetLocationioeGetErrorStringioeGetErrorTypeisUserErrorTypeisPermissionErrorTypeisIllegalOperationErrorTypeisEOFErrorTypeisFullErrorTypeisAlreadyInUseErrorTypeisDoesNotExistErrorTypeisAlreadyExistsErrorType userErrorTypepermissionErrorTypeillegalOperationErrorType eofErrorType fullErrorTypealreadyInUseErrorTypedoesNotExistErrorTypealreadyExistsErrorType isUserErrorisPermissionErrorisIllegalOperation isEOFError isFullErrorisAlreadyInUseErrorisDoesNotExistErrorisAlreadyExistsError mkIOError tryIOErrorControl.Exception.Base mapExceptionPatternMatchFail RecSelError RecConError RecUpdError NoMethodError TypeErrorNonTerminationNestedAtomically GHC.Conc.SyncgetUncaughtExceptionHandlersetUncaughtExceptionHandler reportErrorreportStackOverflow writeTVarreadTVar readTVarIO newTVarIOnewTVarcatchSTMthrowSTMorElseretry atomically unsafeIOToSTMnewStablePtrPrimMVarmkWeakThreadIdthreadCapability threadStatus runSparksparpseq labelThreadyield myThreadIdthrowTo killThread childHandler numSparksgetNumProcessorssetNumCapabilitiesgetNumCapabilitiesnumCapabilitiesforkOnWithUnmaskforkOnforkIOWithUnmaskforkIOdisableAllocationLimitenableAllocationLimitgetAllocationCountersetAllocationCounterreportHeapOverflowThreadIdBlockedOnOtherBlockedOnForeignCall BlockedOnSTMBlockedOnExceptionBlockedOnBlackHole BlockedOnMVar BlockReason ThreadDied ThreadBlockedThreadFinished ThreadRunning ThreadStatusPrimMVarSTMTVar dynTypeRepdynAppdynApply fromDynamicfromDynDynamicioErrorasyncExceptionFromExceptionasyncExceptionToExceptionBlockedIndefinitelyOnMVarBlockedIndefinitelyOnSTMDeadlockAllocationLimitExceededCompactionFailedAssertionFailedSomeAsyncException UserInterrupt ThreadKilled HeapOverflow StackOverflowAsyncExceptionUndefinedElementIndexOutOfBoundsArrayExceptionFixIOException ExitFailure ExitSuccessExitCode IOErrorTypehFlushstdout GHC.IO.Device SeekFromEnd RelativeSeek AbsoluteSeekSeekMode Data.IORefatomicWriteIORefatomicModifyIORef'atomicModifyIORef modifyIORef' modifyIORef 
mkWeakIORefForeign.ForeignPtr.ImpmallocForeignPtrArray0mallocForeignPtrArraynewForeignPtrEnvwithForeignPtr newForeignPtrfinalizeForeignPtrplusForeignPtrcastForeignPtrtouchForeignPtrnewForeignPtr_addForeignPtrFinalizerEnvaddForeignPtrFinalizermallocForeignPtrBytesmallocForeignPtr FinalizerPtrFinalizerEnvPtr GHC.IORef writeIORef readIORefnewIORefIORefGHC.IOevaluategetMaskingState interruptiblethrowIOstToIOFilePathMaskedUninterruptibleMaskedInterruptibleUnmasked MaskingState userError IOException GHC.Exceptionthrow ErrorCallErrorCallWithLocationGHC.Exception.Type SomeExceptiondisplayException fromException toException ExceptionRatioZeroDenominatorDenormal DivideByZeroLossOfPrecision UnderflowOverflowArithException Data.TypeabletypeOf7typeOf6typeOf5typeOf4typeOf3typeOf2typeOf1 rnfTypeReptypeRepFingerprint typeRepTyCon typeRepArgs splitTyConAppmkFunTy funResultTygcast2gcast1gcasteqTcast showsTypeReptypeReptypeOfTypeReprnfTyContyConFingerprint tyConName tyConModule tyConPackageData.Functor.ConstgetConstConstfindnotElem minimumBy maximumByallanyorand concatMapconcatmsumasum sequence_ sequenceA_forM_mapM_for_ traverse_foldlMfoldrMproductsumminimummaximumelemlengthnulltoListfoldl1foldr1foldl'foldlfoldr'foldrfoldMapfold Data.MonoidgetFirstFirstgetLastLastgetApApData.Semigroup.InternalgetDualDualappEndoEndogetAllAllgetAnyAnygetSum getProductProductgetAltAltto1from1 Unsafe.Coerce unsafeCoerce Data.OldListunwordswordsunlineslinesunfoldrsortOnsortBysort permutations subsequencestailsinitsgroupBygroupdeleteFirstsByunzip7unzip6unzip5unzip4zipWith7zipWith6zipWith5zipWith4zip7zip6zip5zip4genericReplicate genericIndexgenericSplitAt genericDrop genericTake genericLengthinsertByinsert partition transpose intercalate intersperse intersectBy intersectunionByunion\\deleteBydeletenubBynub isInfixOf isSuffixOf isPrefixOf findIndices findIndex elemIndices elemIndex stripPrefix dropWhileEnd Data.Char isSeparatorisNumberisMarkisLetter digitToInt Text.Readread readMaybe readEitherreads 
fromRightfromLeftisRightisLeftpartitionEithersrightsleftseitherData.Ord comparingDown Data.ProxyProxyControl.Category>>><<<.idCategoryData.Type.EqualityRefl:~:HRefl:~~: Foreign.Ptr intPtrToPtr ptrToIntPtr wordPtrToPtr ptrToWordPtrfreeHaskellFunPtrWordPtrIntPtr GHC.IO.IOMode ReadWriteMode AppendMode WriteModeReadModeIOModeForeign.Storablepokepeek pokeByteOff peekByteOff pokeElemOff peekElemOff alignmentsizeOfStorablecastPtrToStablePtrcastStablePtrToPtrdeRefStablePtr freeStablePtrcastPtrToFunPtrcastFunPtrToPtr castFunPtr nullFunPtrminusPtralignPtrplusPtrcastPtrnullPtrNumericshowOctshowHex showIntAtBase showHFloat showGFloatAlt showFFloatAlt showGFloat showFFloat showEFloatshowInt readSigned readFloatreadHexreadDecreadOctreadInt lexDigits readLitChar lexLitCharlex readParen readListPrecreadPrecreadList readsPrecText.ParserCombinators.ReadPrec readS_to_Prec readPrec_to_S readP_to_Prec readPrec_to_PReadPrecText.ParserCombinators.ReadP readS_to_P readP_to_SReadSReadPfromRat floatToDigits showFloatatanhacoshasinhtanhcoshsinhatanacosasintancossinlogBase**sqrtlogexppiatan2isIEEEisNegativeZeroisDenormalized isInfiniteisNaN scaleFloat significandexponent encodeFloat decodeFloat floatRange floatDigits floatRadix byteSwap64 byteSwap32 byteSwap16 GHC.UnicodetoTitletoUppertoLowerisLowerisUpperisPrint isControl isAlphaNumisAlphaisSymbol isPunctuation isHexDigit isOctDigitisDigitisSpace isAsciiUpper isAsciiLowerisLatin1isAsciigeneralCategory NotAssigned PrivateUse SurrogateFormatControlParagraphSeparator LineSeparatorSpace OtherSymbolModifierSymbolCurrencySymbol MathSymbolOtherPunctuation FinalQuote InitialQuoteClosePunctuationOpenPunctuationDashPunctuationConnectorPunctuation OtherNumber LetterNumber DecimalNumber EnclosingMarkSpacingCombiningMarkNonSpacingMark OtherLetterModifierLetterTitlecaseLetterLowercaseLetterUppercaseLetterGeneralCategory GHC.STRef writeSTRef readSTRefnewSTRefSTRefrunST Data.BitstoIntegralSizedpopCountDefaulttestBitDefault bitDefaultpopCountrotateRrotateL 
unsafeShiftRshiftR unsafeShiftLshiftLisSignedbitSize bitSizeMaybetestBit complementBitclearBitsetBitbitzeroBitsrotateshift complementxor.|..&.BitscountTrailingZeroscountLeadingZeros finiteBitSize FiniteBits Data.Boolbool Data.Function&onfix Data.Functorvoid$><&><$>lcmgcd^^^oddeven showSigned denominator numerator%divModquotRemmoddivremquotrecip/floorceilingroundtruncateproperFractionmaxBoundminBoundfromEnumtoEnumpredsuccGHC.Charchr intToDigit showLitChar showParen showStringshowCharshowsShowSshowListshow showsPrecunzip3unzipzipWith3zipWithzip3!!lookupreversebreakspansplitAtdroptake dropWhile takeWhilecycle replicaterepeatiterate'iteratescanr1scanrscanl'scanl1scanlfoldl1'initlasttailunconshead Data.MaybemapMaybe catMaybes listToMaybe maybeToList fromMaybefromJust isNothingisJustmaybeswapuncurrycurry GHC.IO.UnsafeunsafeInterleaveIOunsafeDupablePerformIOunsafePerformIOGHC.MVar isEmptyMVar tryReadMVar tryPutMVar tryTakeMVarputMVarreadMVartakeMVarnewMVar newEmptyMVarMVarsubtractsignumabs*+asTypeOfuntil$!flipconstordapliftM5liftM4liftM3liftM2liftMwhen=<<liftA3liftA<**>stimessconcat<$<*liftA2manysome<|>empty Alternativemplusmzero MonadPlus:|NonEmptyStringGHC.Err undefinederrorWithoutStackTraceerror/=<=compare&&||not<>maxminbytestring-0.10.8.2Data.ByteString.Internalc2ww2ccontainers-0.6.0.1Data.IntMap.InternalIntMapData.IntSet.InternalIntSettransformers-0.5.5.0Control.Monad.Trans.Classlift MonadTrans text-1.2.3.1Data.Text.InternalTextunsafeMMapFile'hashable-1.2.7.0-2SI038axTEd7AEZJ275kpiData.Hashable.ClassHashablehash hashWithSalt4unordered-containers-0.2.10.0-LgoTL3wbBEY5bZIDJiyxW4Data.HashSet.BaseHashSet Hashable1nubHash nubHashByliftHashWithSalt Hashable2liftHashWithSalt2Data.HashMap.BaseHashMapwilsonshowNumshowOOM invnormcdfestimateComplexity<#>log1pexpm1log1mexplog1pexplsumllerpchoosePair:!:Ranger_posr_lengthPositionPosp_seqp_startProbProb'PrunPrQualQunQ NucleotidesNsunNs NucleotideNunN nucToNucstoQualfromQualfromQualRaisedpowtoProbfromProb qualToProb 
probToQualnucAnucCnucGnucTgapnucsAnucsCnucsGnucsTnucsN p_is_reverse toNucleotide toNucleotidesisBase isProperBase properBasesisGapshowNucleotideshowNucleotidescomplcompls shiftPosition shiftRange reverseRange extendRange insideRange wrapRange$fReadNucleotide$fShowNucleotide$fBoundedNucleotide$fReadNucleotides$fShowNucleotides$fBoundedNucleotides $fShowQual$fFractionalProb' $fNumProb' $fShowProb'$fEqNucleotide$fOrdNucleotide$fEnumNucleotide$fIxNucleotide$fStorableNucleotide$fEqNucleotides$fOrdNucleotides$fEnumNucleotides$fIxNucleotides$fStorableNucleotides$fEqQual $fOrdQual$fStorableQual $fBoundedQual $fEqProb' $fOrdProb'$fStorableProb'$fShowPosition $fEqPosition $fOrdPosition $fShowRange $fEqRange $fOrdRange$fEqPair $fOrdPair $fShowPair $fReadPair $fBoundedPair$fIxPair peekWord8peekUnalnWord16LEpeekUnalnWord32LEpokeUnalnWord32LEpeekUnalnWord16BEpeekUnalnWord32BEUnpackunpack decodeBytes encodeBytesdecompressGzip $fUnpack[] $fUnpackText$fUnpackByteString'exceptions-0.10.0-KStaZHFhmg9WV0B4Gib1EControl.Monad.Catch MonadMaskLazyText LazyBytesBytesMaskNoneSoftHardBothTwoBitSequenceTBS tbs_n_blocks tbs_m_blockstbs_dna_offset tbs_dna_size TwoBitFileTBFtbf_rawtbf_seqs openTwoBit takeOverlapgetFwdSubseqWith mergeBlocks getSubseqWith getLazySubseq getSubseqgetSubseqMaskedgetSubseqAscii getSeqnameslookupSequence getSeqLength clampPosition getRandomSeq getFragment getFwdSubseqV$fEqMask $fOrdMask $fEnumMask $fShowMask(streaming-0.2.2.0-6BC4Zeul6ZYBVIvxn7NEdFStreaming.InternalchunksOfconcatscutoff decomposedelaysdestroy distributeeffectexpand expandPostgroupshoistUnexposedinspect intercalates interleavesiterTiterTMmapsmapsM mapsMPostmapsM_mapsPostneverrepeatsrepeatsM replicatesrunseparatesplitsAt streamBuild streamFoldtakesunfold unseparate untilJustunzipswrapyieldszipszipsWith zipsWith'Streaming.Preludelazily mappedPoststrictly#mmorph-1.1.2-LvQ57aLXvK27t9kijWwtlOControl.Monad.MorphMFunctorhoistMMonadembedData.Functor.OfOf:>Streamstream2vectorN stream2vectoreach 
ByteStreamEmptyChunkGo consChunk consChunkOffchunkmwrapcopyeffects singleton mapChunksM_ fromChunkstoStrictfromLazytoLazyconsnextByte nextByteOff nextChunk nextChunkOffsplitAt'trim hGetContentsN hGetContentswithOutputFilehPut toByteStreamtoByteStreamWithconcatBuilderslines'gunzip gunzipWithgzip$fMonoidByteStream$fSemigroupByteStream$fShowByteStream$fIsStringByteStream$fMonadTransByteStream$fMonadIOByteStream$fApplicativeByteStream$fFunctorByteStream$fMonadByteStream EofException ParseError errorContexts errorMessageParserparseparseIOparseM abortParse isFinisheddropLinegetByte getString getWord32 getWord64isolateatto$fMonadThrowParser$fMonadTransParser$fMonadIOParser $fMonadParser$fApplicativeParser$fFunctorParser$fExceptionParseError$fExceptionEofException$fShowParseError$fShowEofExceptionMdOpMdNumMdRepMdDelRefsunRefsRefsequnRefseq BamOtherShit BamSortingUnknownUnsortedGrouped Queryname CoordinateBamSQsq_name sq_length sq_other_shit BamHeader hdr_version hdr_sortinghdr_other_shitBamKeyBamMetameta_hdr meta_refsmeta_pgsmeta_other_shit meta_commentaddPG parseBamMeta showBamMeta isValidRefseq invalidRefseq invalidPos isValidPos unknownMapq isKnownMapqgetRef flagPairedflagProperlyPaired flagUnmappedflagMateUnmapped flagReversedflagMateReversed flagFirstMateflagSecondMate flagAuxillary flagSecondary flagFailsQC flagDuplicateflagSupplementary eflagTrimmed eflagMergedeflagAlternativeeflagExactIndex compareNamesreadMdshowMd distinctBin $fShowBamKey$fIsStringBamKey $fHashableFix $fShowFix$fEqFix$fSemigroupBamSorting$fHashable1BamPG$fHashableBamSQ$fSemigroupBamHeader$fMonoidBamHeader $fEnumRefseq $fShowRefseq$fSemigroupRefs $fMonoidRefs$fMonoidBamMeta$fSemigroupBamMeta $fEqBamKey $fOrdBamKey$fHashableBamKey$fGenericBamKey$fShowBamSorting$fEqBamSorting $fShowBamPG $fEqBamPG$fGeneric1BamPG $fShowBamSQ $fEqBamSQ$fGenericBamSQ$fShowBamHeader $fEqBamHeader $fEqRefseq $fOrdRefseq $fIxRefseq$fBoundedRefseq $fShowRefs $fShowBamMeta$fGenericBamMeta $fShowMdOp streamFile streamHandle 
streamInput streamInputs protectTerm psequence mergeStreamsmergeStreamsOnmergeStreamsBy progressGen progressNum progressPosFurrowafforddrain evertStream$fMonadThrowFurrow$fFunctorFurrow$fApplicativeFurrow $fMonadFurrow$fMonadTransFurrow$fMonadIOFurrow$fMFunctorFurrow$fMMonadFurrowBclArgsBclSpecialType BclNucsBin BclNucsAsc BclNucsAscRev BclNucsWide BclQualsBin BclQualsAscBclQualsAscRev BgzfTokensTkWord32TkWord16TkWord8TkFloatTkDoubleTkString TkDecimal TkSetMark TkEndRecordTkEndRecordPart1TkEndRecordPart2TkEnd TkBclSpecial TkLowLevelBBbuffersizeoffusedmarkmark2bgunzip getBgzfHdr newBuffer expandBuffer encodeBgzf fillBuffer loop_dec_intloop_bcl_special SubsequenceRegionsRegionrefseqstartendfromListoverlaps $fEqRegion $fOrdRegion $fShowRegion$fShowSubsequence $fShowRegionsExtBinIntArrFloatArr ExtensionsBamRaw virt_offsetraw_dataVector_Nucs_halfBamRecb_qnameb_flagb_rnameb_posb_mapqb_cigarb_mrnmb_mposb_isizeb_seqb_qualb_extsb_virtual_offsetCigOpMatInsDelNopSMaHMaPadCigar:* alignedLength nullBamRecgetMdbamRaw unpackBamdeleteEinsertEupdateEadjustEisPairedisProperlyPaired isUnmappedisMateUnmapped isReversedisMateReversed isFirstMate isSecondMate isAuxillary isSecondary isFailsQC isDuplicateisSupplementary isTrimmedisMerged isAlternative isExactIndex type_maskextAsInt extAsString setQualFlag$fStorableCigar $fShowCigar$fShowVector_Nucs_half%$fMVectorMVector_Nucs_halfNucleotides#$fVectorVector_Nucs_halfNucleotides $fEqCigOp $fOrdCigOp $fEnumCigOp $fShowCigOp$fBoundedCigOp $fIxCigOp $fEqCigar $fOrdCigar $fShowExt$fEqExt$fOrdExt $fShowBamRecIsBamRecpushBam unpackBamRec pipeSamOutput encodeBamWith writeBamFile pipeBamOutputwriteBamHandlepackBam$fIsBamRecEither$fIsBamRecBamRec$fIsBamRecBamRawECigWithMD WithoutMDMat'Rep'Ins'Del'Nop'SMa'HMa'Pad'Collapse cons_collapsecons_collapse_keepcheap_collapsecheap_collapse_keeprmdup check_sort normalizeTowrapTotoECigtoCigarsetMDtrim_3'trim_3trim_low_quality find_mergemergeBam merged_seq merged_qual 
find_trimtrimBamdefault_fwd_adaptersdefault_rev_adapterstwoMins mergeTrimBam decodeBam decodeBamFiledecodeBamFilesdecodePlainBam getBamMeta getBamRawdecodePlainSam getSamRecguardRefCompat concatInputs mergeInputsOn coordinatesqnames$fShowShortRecord$fExceptionShortRecord$fShowIncompatibleRefs$fExceptionIncompatibleRefsPilePile'p_refseqp_pos p_snp_stat p_snp_pile p_indel_stat p_indel_pile IndelPileBasePile IndelVariant deleted_basesinserted_basesV_NucsV_Nuc CallStats read_depth reads_mapq0sum_mapqsum_mapq_squaredDmgToken fromDmgToken DamagedBaseDBdb_calldb_qual db_dmg_tk db_dmg_posdb_ref PosPrimChunksPrimBaseBase_pb_wait_pb_base_pb_mapq _pb_chunks PrimChunksSeekIndel EndOfReaddissectpileup$fShowDamagedBase$fSemigroupCallStats$fMonoidCallStats $fMonadPileM$fApplicativePileM$fFunctorPileM$fShowPrimChunks$fShowPrimBase$fShowCallStats $fEqCallStats$fGenericCallStats $fEqV_Nuc $fOrdV_Nuc $fShowV_Nuc $fEqV_Nucs $fOrdV_Nucs $fShowV_Nucs$fEqIndelVariant$fOrdIndelVariant$fShowIndelVariant$fGenericIndelVariant $fShowPile'BamIndexminshiftdepth unaln_off extensions refseq_binsrefseq_ckpoints readBamIndex readBaiIndexwithIndexedBam readTabixstreamBamRefseqstreamBamUnalignedstreamBamSubseqstreamBamRegions$fShowIndexFormatError$fExceptionIndexFormatError$fShowBamIndex $fShowSegment$fShowTabFormat $fShowTabMeta QualFilter filterPairs complexSimplecomplexEntropyqualityAveragequalityMinimumqualityFromOldIlluminaqualityFromNewIllumina parseFastqparseFastqCassavaparseFastqWithfixupFlagAbuse fixupBwaFlags removeWartsModeGlobally HasPrefixIsPrefix myersAlign showAligned $fEnumMode AlignmentALN a_sequencea_fragment_typeNPairFragTypeCompleteLeadingTrailingSubstitutionStatsCompositionStatsDmgStats basecompo5 basecompo3substs5substs3 substs5d5 substs3d5 substs5d3 substs3d3 substs5dd substs3dd substs5cpg substs3cpgGenDamageParameters UnknownDamage OldDamage NewDamageNewDamageParametersNDP dp_gc_fracdp_mudp_nu dp_alpha5dp_beta5dp_alphadp_beta dp_alpha3dp_beta3DamageParametersDP ssd_sigma 
ssd_delta ssd_lambda ssd_kappa dsd_sigma dsd_delta dsd_lambdaSubst:-> DamageModelMMat44DMat44Dbangnudge scalarMatcomplMat freezeMatsnoDamage univDamage empDamagenpairfst_npsnd_np addFragTypedamagePatternsIter2BitdamagePatternsIterMDdamagePatternsIter alnFromMdbwa_cal_maxdiff$fSemigroupDmgStats$fMonoidDmgStats $fShowNPair$fStorableNPair $fShowMat44D$fGenericMat44D $fEqSubst $fOrdSubst $fIxSubst $fShowSubst$fReadDamageParameters$fShowDamageParameters$fGenericDamageParameters$fReadNewDamageParameters$fShowNewDamageParameters$fGenericNewDamageParameters$fShowGenDamageParameters$fGenericGenDamageParameters$fReadGenDamageParameters$fShowDmgStats$fShowFragType $fEqFragType $fEqNPair $fOrdNPairbracketbracketOnErrorbracket_catchAll catchIOErrorcatchIf catchJustcatchesfinallyhandle handleAll handleIOErrorhandleIf handleJustmask_onError onExceptiontrytryJustuninterruptibleMask_ExitCase ExitCaseAbortExitCaseExceptionExitCaseSuccessHandler MonadCatchcatchmaskuninterruptibleMaskgeneralBracket MonadThrowthrowMrepM&vector-0.12.0.2-H1Eu1OCXL0L9y980iV8EwUData.Vector.Generic.BaseVector materialize dematerializetoChunksfindIndexOrEnd gunzipLoopcombineBamMeta fixupMetadefaultBufSize bgzfEofMarkerMVector_Nucs_halfdo_rmdup merge_singles merge_halvesRepresentative Consensus split_ecig match_readsHeapPileFPileMpileup' p'feed_inputp'check_waiting p'scan_activeSegmentCkpointsBins~~getSegmentArray streamBgzfstreamBamSegment genSubstMat aln_from_ref