ConClusion: Cluster algorithms, PCA, and chemical conformer analysis

[ agpl, chemistry, library, program, statistics ]

Please see the README on GitLab at https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster


Versions 0.0.1, 0.0.2, 0.1.0
Change log Changelog.md
Dependencies aeson (==1.5.*), attoparsec (>=0.13.0.0 && <0.15), base (>=4.7 && <4.16), cmdargs (>=0.10.0 && <0.11), ConClusion, containers (>=0.6.0.0 && <0.7), formatting (>=7.1.0 && <7.2), hmatrix (>=0.20.0 && <0.21), massiv (>=1.0.0.0 && <1.1), optics (>=0.3 && <0.5), psqueues (>=0.2.7.0 && <0.3), rio (>=0.1.13.0 && <0.2), text (>=1.2.0.0 && <1.3)
License AGPL-3.0-only
Copyright 2021 Phillip Seeber
Author Phillip Seeber
Maintainer phillip.seeber@googlemail.com
Category Statistics, Chemistry
Bug tracker https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster/-/issues
Source repo head: git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster
Uploaded by phillipseeber at 2021-09-15T08:51:23Z
Distributions NixOS:0.0.2
Executables conclusion
Downloads 165 total (61 in the last 30 days)


Readme for ConClusion-0.1.0

ConClusion

ConClusion provides principal component analysis (PCA), hierarchical clustering and DBScan in Haskell. It also includes a command line interface for processing CREST conformer trajectories, hence the name: CONformer CLUStering. The analysis of conformer data has three steps:

  1. Read the trajectory and calculate a set of features for each conformer. The features can include the energy, bond lengths, bond angles and dihedral angles in arbitrary combination. Together, these descriptors form the feature matrix.
  2. Optionally, perform a principal component analysis of the feature matrix to reduce the number of dimensions and remove redundancies.
  3. Cluster the (potentially PCA-processed) feature matrix. Different distance measures are available, and either DBScan or hierarchical clustering can be used to group the conformers.

While the command line interface only covers the workflow described above, the underlying clustering algorithms and PCA are implemented in a general way and can be used independently as a library.
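
To make the PCA step of the procedure more concrete, here is a minimal, library-independent sketch in plain Python. It is not ConClusion's Haskell implementation. It centres a small feature matrix, finds the leading principal components by power iteration with deflation, and reports the fraction of variance lost by truncation. The function names pca and power_iteration, the layout with one row per conformer, and the error measure are assumptions made for illustration; ConClusion may organise the matrix and define its printed PCA error differently.

```python
import math


def power_iteration(matrix, iterations=1000):
    """Dominant (eigenvalue, unit eigenvector) of a symmetric matrix."""
    n = len(matrix)
    # Deliberately non-uniform start vector, normalised in the first step.
    vec = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # nothing left to extract (zero matrix)
            return 0.0, vec
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue belonging to vec.
    mv = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * m for v, m in zip(vec, mv)), vec


def pca(samples, keep):
    """Return (scores, components, lost variance fraction).

    samples: one row per conformer, one column per feature (assumed layout).
    keep:    number of principal components to retain.
    """
    n, dim = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(dim)]
    centred = [[row[j] - means[j] for j in range(dim)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    total_var = sum(cov[i][i] for i in range(dim))

    components, kept_var = [], 0.0
    for _ in range(keep):
        val, vec = power_iteration(cov)
        components.append(vec)
        kept_var += val
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - val * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]

    scores = [[sum(r[j] * c[j] for j in range(dim)) for c in components]
              for r in centred]
    lost = 1.0 - kept_var / total_var if total_var > 0 else 0.0
    return scores, components, lost


# Two perfectly correlated features collapse onto one principal component.
scores, components, lost = pca(
    [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]], keep=1)
print(components[0], lost)
```

Power iteration is used here only to keep the sketch free of dependencies; a production implementation would normally rely on a LAPACK eigenvalue or singular value decomposition.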

Installation

Bundled Archive

A self-contained executable archive is built for the main branch and for each release. It runs directly on any Linux system and only needs to be downloaded: go to the releases page, download an archive, and make it executable (e.g. chmod +x conclusion).

From Source

If you have the Haskell toolchain installed, and therefore a working Cabal and GHC, you can build ConClusion from source. This also requires working BLAS and LAPACK libraries on your system.

git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster.git ConClusion
cd ConClusion
cabal install --installdir="$PREFIX"

Set PREFIX to the directory in which the executable should be installed. $HOME/.local/bin/ is often a good choice.

If you would like to use ConClusion on systems where Nix is not available (Windows, BSD, ...) this is the way to go.

With Nix

If Nix is available on your system, everything can be built with Nix:

git clone https://gitlab.com/theoretical-chemistry-jena/quantum-chemistry/ConfoCluster.git ConClusion
cd ConClusion/nix
nix-build -A ConClusion.components.exes.conclusion

Usage

The command line interface of the conclusion executable offers full control over all three steps described above.

  • -x --xyz takes the path to the XYZ trajectory to be processed. The comment line of each structure must contain the energy (which is the case for CREST trajectories).

  • --dim specifies the characterising features, i.e. a set of internal coordinates and optionally the energy. Any number of features can be given as a comma-separated list with the following syntax:

    • e for the energy
    • b m n for a bond length between atoms m and n
    • a m n o for an angle between atoms m, n and o, measured at the central atom n
    • d m n o p for a dihedral angle defined by atoms m, n, o and p, i.e. the rotation around the bond between n and o. Dihedrals use a metric that respects the periodicity and direction of the rotation; each dihedral therefore contributes two rows to the feature matrix. See this paper.
    • All atom indices are 0-based.
  • -p --pca activates dimensionality reduction by principal component analysis. Give an integer to specify how many principal components are kept. During execution, ConClusion prints the error introduced by the PCA.

  • -c --cluster activates clustering of the results. The clustering algorithm is selected by giving either dbscan or hca.

  • --measure specifies the distance measure between conformers. The Euclidean distance is the default, but Manhattan and Mahalanobis distances as well as a general L_p norm are also available. If Mahalanobis distances are used, it may be worth disabling PCA.

  • --joinstrat controls how inter-cluster distances are calculated in hierarchical clustering. single might be the best choice to get dense groups of conformers.

  • --distance gives the search radius in DBScan or the dendrogram cut distance in hierarchical clustering.

  • --minsize is the minimum size of a cluster in DBScan. If --forcemin is given, the clusters obtained by HCA are also filtered for a minimum size.

  • --forcemin forces filtering of HCA clusters for their minimum size as given by --minsize. Disabled by default.
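
The internal coordinates behind --dim can be sketched in a few lines of plain Python. The helper names (bond, angle, dihedral, dihedral_features) are hypothetical and this is not ConClusion's code. In particular, mapping each dihedral φ to the pair (cos φ, sin φ) is an assumption: it is one common periodic encoding that would explain why a dihedral occupies two rows of the feature matrix, but the metric ConClusion actually uses is the one defined in the paper referenced in the --dim description.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def bond(xyz, m, n):
    """Distance between atoms m and n (0-based indices)."""
    return norm(sub(xyz[m], xyz[n]))


def angle(xyz, m, n, o):
    """Angle m-n-o in radians, measured at the central atom n."""
    u, v = sub(xyz[m], xyz[n]), sub(xyz[o], xyz[n])
    cosine = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding noise


def dihedral(xyz, m, n, o, p):
    """Signed dihedral m-n-o-p in radians, in the range (-pi, pi]."""
    b1 = sub(xyz[n], xyz[m])
    b2 = sub(xyz[o], xyz[n])
    b3 = sub(xyz[p], xyz[o])
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2))


def dihedral_features(phi):
    """Assumed periodic encoding: two values per dihedral."""
    return (math.cos(phi), math.sin(phi))


# Four atoms forming a 90 degree twist around the bond between atoms 1 and 2.
atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
phi = dihedral(atoms, 0, 1, 2, 3)
print(bond(atoms, 0, 1), angle(atoms, 0, 1, 2), dihedral_features(phi))
```

With such an encoding, dihedrals of +179° and -179° end up close together in feature space, whereas the raw angles would differ by 358°.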

Each processing step produces a Gnuplot-compatible file with space-separated columns: features.dat contains the plain feature matrix, pca.dat the results of the PCA, and cluster.dat the clustering results. The first column of cluster.dat is an integer giving the number of the cluster the point belongs to, which can be used for colour-coding in Gnuplot.
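
Besides plotting, cluster.dat can be post-processed with a short script. The sketch below (plain Python, hypothetical helper name lowest_index_per_cluster) groups rows by the cluster number in the first column and records the first row seen for each cluster. It assumes that rows appear in the same order as the conformers in the trajectory and that row numbering starts at 0; neither is stated here, so check this against your own output before relying on it.

```python
def lowest_index_per_cluster(lines):
    """Map each cluster number to the first data row in which it occurs.

    lines: iterable of text lines in the space-separated format described
           above, with the cluster number in the first column.
    Blank lines and lines starting with '#' are skipped and not counted.
    """
    first_seen = {}
    row = 0
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        cluster = int(fields[0])
        # setdefault keeps the earliest row index for every cluster.
        first_seen.setdefault(cluster, row)
        row += 1
    return first_seen


# In-memory stand-in for the contents of cluster.dat (made-up numbers).
example = [
    "0 0.10 0.20",
    "1 0.50 0.40",
    "0 0.10 0.30",
    "",
    "2 0.90 0.90",
    "1 0.50 0.50",
]
print(lowest_index_per_cluster(example))
```

For a real run, pass an open file handle instead of the list, e.g. `with open("cluster.dat") as handle: lowest_index_per_cluster(handle)`.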

Example

A perylene dye with four phenoxy groups has several conformers with different spectral properties. For solubility, the dye also carries some alkyl groups. CREST finds about 1400 conformers, most of which differ only in the alkyl side chains, which do not influence the spectral properties. With respect to the positions of the phenoxy groups, a much smaller number of distinct groups of conformers exists, and the lowest-energy conformer of each group is to be obtained. We therefore select eight dihedral angles and the energy as features, two dihedrals for each phenoxy group: one describes the rotation around the perylene-O bond, the other the rotation around the O-Ph bond. Because some orientations of the phenoxy groups are not possible, the dihedral angles are not independent of each other, and we use a PCA to reduce the dimensionality and remove redundancies. After the PCA, DBScan is used to obtain clusters of similar conformers. As CREST sorts conformers by energy, the lowest index in each cluster is also the lowest-energy conformer of that group.

conclusion \
  --xyz=crest_conformers.xyz \
  --pca=3 \
  --dim="e, d 19 18 2 56, d 18 2 56 65, d 16 15 1 45, d 15 1 45 46, d 11 13 0 34, d 13 0 34 43, d 29 31 3 67, d 31 3 67 76" \
  --measure=manhattan \
  --cluster=dbscan \
  --distance=0.3 \
  --minsize=5
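
For readers unfamiliar with DBScan, the roles of --measure, --distance and --minsize in the call above can be illustrated with a textbook implementation in plain Python. This is a generic O(n²) sketch, not ConClusion's parallel Haskell implementation. The function names and the label -1 for noise are assumptions made here, and ConClusion may interpret --minsize differently: this sketch uses it as the usual DBScan core-point threshold, counting the point itself.

```python
def manhattan(a, b):
    """Manhattan (L_1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dbscan(points, radius, min_size, distance=manhattan):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    noise = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # All points within the search radius of point i, including i.
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= radius]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_size:
            labels[i] = noise  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joined, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_size:  # j is a core point, keep growing
                queue.extend(more)
        cluster += 1
    return labels


# Two tight groups and one isolated point in a 2D feature space.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
        (9.0, 0.0)]
print(dbscan(data, radius=0.3, min_size=3))
```

A larger radius merges neighbouring groups, while a larger minimum size turns sparse groups into noise, which is why both values usually need some experimentation for a given feature set.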

Library/Haskell Package

ConClusion provides principal component analysis and the clustering algorithms DBScan and hierarchical clustering as a Haskell library. The algorithms are implemented on parallel arrays and perform quite well. For the API, see the Haddock documentation, which can be generated with:

cabal haddock