phybin-0.3: Utility for clustering phylogenetic trees in Newick format based on Robinson-Foulds distance.

Bio.Phylogeny.PhyBin.RFDistance

Synopsis

Types

Dense sets of taxa, aka Bipartitions or BiPs We assume that taxa labels have been mapped onto a dense, contiguous range of integers [0,N).

Rule: Bipartitions are really two disjoint sets. But as long as the parent set (the union of the partitions, aka all taxa) then a bipartition can be represented just by *one* subset. Yet we must choose WHICH subset for consistency. We use the rule that we always choose the SMALLER. Thus the DenseLabelSet should always be half the size or less, compared to the total number of taxa.

A set that is more than a majority of the taxa can be normalized by flipping, i.e. taking the taxa that are NOT in that set.

Bipartition (Bip) utilities

Get all non-singleton BiPs implied by a tree.

Print a BiPartition in a pretty form

consensusTree :: Int -> [NewickTree a] -> NewickTree ()Source

Take only the bipartitions that are agreed on by all trees.

Convert from bipartitions BACK to a single tree.

filterCompatible :: NewickTree a -> [NewickTree b] -> [NewickTree b]Source

Which of a set of trees are compatible with a consensus?

`compatibleWith consensus tree` -- Is a tree compatible with a consensus? This is more efficient if partially applied then used repeatedly.

Note, tree compatibility is not the same as an exact match. It's like (<=) rather than (==). The star topology is consistent with the all trees, because it induces the empty set of bipartitions.

Assume that total taxa are 0..N-1, invert membership:

Methods for computing distance matrices

Returns a triangular distance matrix encoded as a vector. Also return the set-of-BIPs representation for each tree.

This uses a naive method, directly computing the pairwise distance between each pair of trees.

This method is TOLERANT of differences in the laba/taxa sets between two trees. It simply prunes to the intersection before doing the distance comparison. Other scoring methods may be added in the future. (For example, penalizing for missing taxa.)

hashRF :: Int -> [NewickTree a] -> DistanceMatrixSource

This version slices the problem a different way. A single pass over the trees populates the table of bipartitions. Then the table can be processed (locally) to produce (non-localized) increments to a distance matrix.