biohazard-2.1: bioinformatics support library

Bio.Bam.Evan

Description

This module contains code for conventions local to MPI EVAN. The code is needed regularly, but it can be harmful when applied to BAM files that follow different conventions. Most importantly, no program should call these functions by default.

Documentation

Fixes abuse of the flags valued 0x800 and 0x1000. We used them to mark low quality and low complexity, but they have since been redefined. If either is set, it is cleared and recorded in the ZQ field. Also fixes abuse of the combination of the paired, 1st mate and 2nd mate flags, which was used to indicate merging or trimming. These are canonicalized and stored in the FF field. This function is unsafe on BAM files of unclear origin!
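The first half of the flag cleanup described above can be sketched as plain bit manipulation. This is a minimal illustration, not the library's code: the names legacyLowQuality, legacyLowComplexity and splitLegacyFlags are hypothetical, the flag word is modelled as a bare Int, and the sketch neither encodes the extracted bits into a ZQ field nor canonicalizes the paired/mate combination, because the source does not spell out those details.

```haskell
import Data.Bits ((.&.), (.|.), complement)

-- Hypothetical names for the two legacy bits named in the text.
legacyLowQuality, legacyLowComplexity :: Int
legacyLowQuality    = 0x800
legacyLowComplexity = 0x1000

-- | Split a flag word into (flags with both legacy bits cleared,
--   the legacy bits that were set). How the second component would be
--   written to the ZQ field is left open here (assumption).
splitLegacyFlags :: Int -> (Int, Int)
splitLegacyFlags flags =
    (flags .&. complement legacyMask, flags .&. legacyMask)
  where
    legacyMask = legacyLowQuality .|. legacyLowComplexity
```

For example, a flag word of 0x841 would be split into the cleaned flags 0x41 and the extracted legacy bit 0x800. All other bits pass through untouched, which is why the operation is only meaningful on files known to use the old convention.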

Fixes typical inconsistencies produced by Bwa. Sometimes 'mate unmapped' should be set, which can be detected because the read's coordinates match those of its mate. Sometimes 'properly paired' should not be set, because one mate is unmapped. This function is generally safe, but it only needs to be called on the output of affected (older?) versions of Bwa.
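The two repairs described above might look roughly like the following sketch. It is one plausible reading, not the library's exact rule: the Rec type and fixPairFlags are hypothetical, and the test "a mapped, paired read whose mate coordinates equal its own" is an assumption about how the missing 'mate unmapped' flag is detected. The flag values (0x1 paired, 0x2 properly paired, 0x4 unmapped, 0x8 mate unmapped) are the standard SAM ones.

```haskell
import Data.Bits ((.&.), (.|.), complement)

-- Hypothetical, stripped-down record: just the fields the heuristic needs.
data Rec = Rec
    { recFlags   :: Int
    , recRef     :: Int
    , recPos     :: Int
    , recMateRef :: Int
    , recMatePos :: Int
    } deriving (Eq, Show)

flagPaired, flagProperlyPaired, flagUnmapped, flagMateUnmapped :: Int
flagPaired         = 0x1
flagProperlyPaired = 0x2
flagUnmapped       = 0x4
flagMateUnmapped   = 0x8

hasFlag :: Int -> Rec -> Bool
hasFlag f r = recFlags r .&. f /= 0

-- | First set 'mate unmapped' where the coordinates suggest it,
--   then clear 'properly paired' if either mate is unmapped.
fixPairFlags :: Rec -> Rec
fixPairFlags r = r { recFlags = clearProper (setMateUnmapped (recFlags r)) }
  where
    sameCoordinates   = recRef r == recMateRef r && recPos r == recMatePos r
    -- Assumption: only a mapped, paired read is inspected this way.
    mateLooksUnmapped = hasFlag flagPaired r
                     && not (hasFlag flagUnmapped r)
                     && sameCoordinates
    setMateUnmapped fl
        | mateLooksUnmapped = fl .|. flagMateUnmapped
        | otherwise         = fl
    clearProper fl
        | fl .&. (flagUnmapped .|. flagMateUnmapped) /= 0
                    = fl .&. complement flagProperlyPaired
        | otherwise = fl
```

Note that a pair in which both mates genuinely map to the same position would be misclassified by this heuristic, which is one reason to restrict such a repair to the output of affected aligner versions.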

Removes syntactic warts from old read names or the read names used in FastQ files. Supported conventions:

• A name suffix of /1 or /2 is turned into the first mate or second mate flag and the read is flagged as paired.
• Same for name prefixes of F_ or R_, respectively.
• A name prefix of M_ flags the sequence as unpaired and merged.
• A name prefix of T_ flags the sequence as unpaired and trimmed.
• A name prefix of C_, optionally before or after any of the other prefixes, is turned into the extra flag XP:i:-1 (result of duplicate removal with unknown duplicate count).
• A collection of tags separated from the name by an octothorpe (#) is removed and put into the fields XI and XJ as text.
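The naming conventions listed above can be illustrated with a small parser over plain strings. This is a hedged sketch only: Mate, Parsed, parseName and the helper functions are hypothetical names, the result is a simple record rather than a BAM record with flags and extension fields, and the tag text is kept as one string because the source does not say how it is divided between XI and XJ. Checking the /1 or /2 suffix before splitting at the octothorpe is also an assumption.

```haskell
import Data.List (isSuffixOf)

-- Hypothetical classification of what the name says about the read.
data Mate = FirstMate | SecondMate | Merged | Trimmed
    deriving (Eq, Show)

data Parsed = Parsed
    { cleanName  :: String        -- name with all warts removed
    , mate       :: Maybe Mate    -- Nothing if the name says nothing
    , dupRemoved :: Bool          -- True if a C_ prefix was seen
    , tagText    :: Maybe String  -- raw text after '#', if any
    } deriving (Eq, Show)

parseName :: String -> Parsed
parseName raw = stripPrefixes (Parsed base suffixMate False tags)
  where
    (unsuffixed, suffixMate) = splitMateSuffix raw
    (base, rest)             = break (== '#') unsuffixed
    tags = case rest of
        '#' : t -> Just t
        _       -> Nothing

-- Recognize a trailing /1 or /2.
splitMateSuffix :: String -> (String, Maybe Mate)
splitMateSuffix s
    | "/1" `isSuffixOf` s = (dropLast2 s, Just FirstMate)
    | "/2" `isSuffixOf` s = (dropLast2 s, Just SecondMate)
    | otherwise           = (s, Nothing)
  where
    dropLast2 xs = take (length xs - 2) xs

-- Strip known prefixes repeatedly, so C_ may come before or after the others.
stripPrefixes :: Parsed -> Parsed
stripPrefixes p = case cleanName p of
    'C':'_':n -> stripPrefixes p { cleanName = n, dupRemoved = True }
    'F':'_':n -> stripPrefixes p { cleanName = n, mate = Just FirstMate }
    'R':'_':n -> stripPrefixes p { cleanName = n, mate = Just SecondMate }
    'M':'_':n -> stripPrefixes p { cleanName = n, mate = Just Merged }
    'T':'_':n -> stripPrefixes p { cleanName = n, mate = Just Trimmed }
    _         -> p
```

Under these assumptions, parseName "C_M_read7#ACGT" yields the clean name "read7", marks the read as merged and duplicate-removed, and keeps "ACGT" as tag text. A real read name that happens to begin with one of these prefixes would be mangled, which illustrates why such cleanup should not be applied by default.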