twobitreader-1.0
Module Bio.TwoBit (Safe-Inferred)

TwoBitSequence'
    This is a (piece of a) reference sequence. It consists of stretches with uniform masking.

    The offset is stored as a Word. This is done because on a 32 bit platform, every bit counts. This limits the genome to approximately four gigabases, which would be a file of about one gigabyte. That's just about enough to work with the human genome. On a 64 bit platform, the file format itself imposes a limit of four gigabytes, or about 16 gigabases in total.

    If length is zero, the piece is empty and the mask, pointer, and offset fields may not be valid. If length is positive, ptr+offset points at the first base of the piece. If length is negative, ptr+offset points just past the end of the piece, ptr+offset+length points to the first base of the piece, and the sequence is meant to be reverse complemented.

    In a TwoBitSequence, length must not be negative. In a TwoBitSequence' Bidirectional, length can be positive or negative.

Masking
    2bit supports two kinds of masking, typically rendered as lowercase letters (MaskSoft) and Ns (MaskHard). They can overlap (MaskBoth), and even the hard masking has underlying sequence (which is normally ignored).

tbc_fwd_seq
    Lazily generated sequence in forward direction; the argument is the offset of the first base.

tbc_rev_seq
    Lazily generated sequence in reverse direction; the argument is the offset of the first base to the right of the beginning. (The first base generated is the complement of the base found at offset-1.)

findChrom
    Finds a named scaffold in the reference. If it doesn't find the exact name, it will try to compensate for the crazy naming differences between NCBI and UCSC. This doesn't work in general, but is good enough in the common case. In particular, "1" maps to "chr1" and back, "GL000192.1" to "chr1_gl000192_random" and back, and "chrM" to "MT" and back.

openTwoBit
    Brings a 2bit file into memory.
    The file is mmap'ed, so it will not work on streams that are not actual files. It's also unsafe if the file is concurrently modified in any way.

parseTwoBit
    Parses a 2bit file. The FilePath argument is only used in error messages; what is really parsed is the memory block, typically from mmapping the file.

    The workhorse in here is the construction of the tbc_fwd_seq and tbc_rev_seq functions. When called, they first run a binary search on the mask lists, then produce a list of blocks with uniform masking. Both parts of the algorithm are fast and directly use the on-disk data structures.

    In theory, there could be 2bit files in big endian format out there. We nominally support them, but since I've never seen one in the wild, this may well fail in a spectacular way.

unpackRSRaw
    Unpacks a reference sequence into a (very long) list of bytes. Each byte contains the nucleotide in bits 0 and 1 with values 0..3 corresponding to TCAG, and the soft and hard mask bits in bits 2 and 3, respectively.

unpackRS
    Unpacks a reference sequence into a (very long) list of ASCII characters. Hard masked nucleotides become the letter N, others become TCAG.

unpackRSMasked
    Unpacks a reference sequence into a list of ASCII characters, interpreting masking in the customary way. Specifically, hard masking produces Ns, soft masking produces lower case letters, and dual masking produces lower case Ns.

peekUnalnWord32
    Reads a 32 bit word from an address, which doesn't need to be aligned. The byte order used is unspecified.

peekUnalnWord32Swap
    Equivalent to peekUnalnWord32 followed by a byte swap.

Field documentation
    how is it masked?
    primitive bases in 2bit encoding: [0..3] = TCAG
    offset in bases(!)
    length in bases

Module Bio.TwoBit.Tool (Safe-Inferred)

Accu
    A way to accumulate bytes. If the accumulated bytes will hang around in memory, this has much lower overhead than Builder.
    If it has a short lifetime, Builder is much more convenient.

Cdna
    A cDNA or mRNA or transcript (these are all synonymous), with some metainformation collected from the annotation. Whatever the input was called, we call it cdna in the transcriptome.

vcfToTwoBit
    Extracts the reference from a VCF. This assumes the presence of at least one record per site. The VCF must be sorted by position. When writing out, we try to match the order of the contigs as listed in the header. Unlisted contigs follow at the end with their order preserved; contigs without data are not written at all.

growBytes
    Appends bytes to a collection of M in such a way that the M keep doubling in size. This ensures O(n) time and space complexity and fairly low overhead.

collect_Ns
    Collects stretches of Ns by looking at one character at a time. In reality, anything that isn't one of "ACGT" is treated as an N.

collect_ms
    Collects stretches of masked DNA by looking at one letter at a time. Anything lowercase is considered masked.

collect_bases
    Collects bases in 2bit format. It accumulates 4 bases in one word, then collects bytes in an Accu. From the 2bit spec:

    packedDna - the DNA packed to two bits per base, represented as so: T - 00, C - 01, A - 10, G - 11. The first base is in the most significant 2-bit byte; the last base is in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011.

parseAnno
    Parses annotations in GFF format. We want to turn an annotation and a 2bit file into a FastA of the transcriptome (one sequence per annotated transcript), that looks like the stuff Lior Pachter feeds into Kallisto. Annotations come in two dialects of GFF, either GFF3 or GTF. We autodetect and understand both.

parseStuff
    Parses the random stuff in GFF into a hash table.
    Returns 'Just (Left _)' if the file uses assignment style ("foo=bar;"), returns 'Just (Right _)' if the file uses statement style ("foo "bar";"), otherwise returns Nothing.

Argument documentation
    length
    list of N stretches
    list of mask stretches
    accumulated bases

Names in this interface
    Package: twobitreader-1.0-2XkAlGA7Nho5GM6BESlxI8
    Modules: Bio.TwoBit, Bio.TwoBit.Tool

    TwoBitSequence, Bidirectional, Unidrectional, TwoBitSequence', SomeSeq, RefEnd, Masking, TwoBitChromosome, TBC, tbc_raw, tbc_name, tbc_index, tbc_dna_offset, tbc_dna_size, tbc_fwd_seq, tbc_rev_seq, TwoBitFile, TBF, tbf_raw, tbf_size, tbf_path, tbf_chroms, tbf_chrmap, tbf_chrnames, findChrom, openTwoBit, isSoftMasked, isHardMasked, noneMasked, softMasked, hardMasked, bothMasked, unpackRSRaw, unpackRS, unpackRSMasked

    $fExceptionTwoBitError, $fBoundedMasking, $fEnumMasking, $fMonoidMasking, $fSemigroupMasking, $fReadMasking, $fShowMasking, $fShowTwoBitSequence', $fEqMasking, $fOrdMasking, $fShowBlock, $fEqBlock, $fOrdBlock, $fShowTwoBitError

    EncodeProgress, Encoded, ep_seqname, ep_position, ep_hardmasked, ep_softmasked, ep_enclength, ep_tail, formatCdna, buildFasta, twoBitToFa, faToTwoBit, vcfToTwoBit, parseAnno

    $fExceptionGffError, $fShowGffError, $fShowGffErrorDetail, $fShowCdna, $fShowRange

    ghc-prim GHC.Types Word

    parseTwoBit, peekUnalnWord32, peekUnalnWord32Swap, Accu, Cdna, growBytes, collect_Ns, collect_ms, collect_bases, parseStuff, encode_seq
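
The 2bit spec excerpt quoted for collect_bases and the byte layout described for unpackRSRaw and unpackRSMasked can be illustrated with a small, self-contained sketch. This is not the package's implementation: the names baseCode, packBases, unpackBases and renderMasked are hypothetical, the code works on plain lists rather than on the mmap'ed file, and padding a final partial byte with zero bits is an assumption of the sketch.

```haskell
module TwoBitSketch where

import Data.Bits (shiftL, shiftR, testBit, (.&.), (.|.))
import Data.Char (toLower, toUpper)
import Data.Word (Word8)

-- | 2-bit code of a base, as in the quoted spec: T=0, C=1, A=2, G=3.
-- Mapping every other character to 0 is an assumption of this sketch.
baseCode :: Char -> Word8
baseCode c = case toUpper c of
    'C' -> 1
    'A' -> 2
    'G' -> 3
    _   -> 0

-- | Packs four bases per byte, first base in the most significant two bits.
-- A final partial group is padded with zero bits (sketch assumption).
packBases :: String -> [Word8]
packBases [] = []
packBases s  = packFour (take 4 s) : packBases (drop 4 s)
  where
    packFour bs =
        foldl (\acc b -> (acc `shiftL` 2) .|. baseCode b) 0
              (take 4 (bs ++ repeat 'T'))

-- | Inverse of 'packBases': recovers the first n bases from packed bytes.
unpackBases :: Int -> [Word8] -> String
unpackBases n = take n . concatMap unpackByte
  where
    unpackByte w =
        [ "TCAG" !! fromIntegral ((w `shiftR` s) .&. 3) | s <- [6, 4, 2, 0] ]

-- | Renders one byte in the layout described for unpackRSRaw:
-- bits 0-1 hold the nucleotide, bit 2 the soft mask, bit 3 the hard mask.
-- Hard masking gives N, soft masking gives lower case, both give n.
renderMasked :: Word8 -> Char
renderMasked w = (if soft then toLower else id) (if hard then 'N' else base)
  where
    base = "TCAG" !! fromIntegral (w .&. 3)
    soft = testBit w 2
    hard = testBit w 3
```

For example, packBases "TCAG" yields the single byte 0x1B (binary 00011011), which matches the example in the spec excerpt, and map renderMasked [2, 6, 9, 15] yields "AaNn".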