tesliper.datawork.geometry

Conformers geometry-related functions, primarily an RMSD sieve implementation.

This module provides an implementation of RMSD sieve, allowing for easy mathematical comparision of conformers’ geometry and filtering out similar ones, based on user-provided “threshold of similarity”.

Functions

`calc_rmsd`(a, b)	Compute RMSD (round-mean-square deviation) of two conformers (or sets of them).
`center`(a)	Zero-center all given conformers by subtracting their centroids.
`drop_atoms`(values, atoms, discarded)	Filters given values, returning those corresponding to atoms not specified as discarded.
`find_atoms`(atoms, find[, reverse])	Get indices of wanted atoms.
`fixed_windows`(series, size)	Simple, vectorized implementation of basic sliding window.
`get_triangular`(m)	Find mth triangular number.
`get_triangular_base`(n)	Find which mth triangular number n is.
`is_triangular`(n)	Checks if number n is triangular.
`kabsch_rotate`(a, b)	Minimize RMSD of conformers a and b by rotating molecule a onto b.
`pyramid_windows`(series)	Produces windows of shrinking sizes, from full sequence to last element only.
`rmsd_sieve`(geometry, windows[, threshold])	Compare conformers' geometry to keep only those that differ at least by a given threshold.
`select_atoms`(values, indices)	Filter given values to contain values only corresponding to atoms on given indices.
`stretching_windows`(values, size[, ...])	Implements a sliding window of a variable size, where values in each window are at most size bigger than the lowest value in given window.
`take_atoms`(values, atoms, wanted)	Filters given values, returning those corresponding to atoms specified as wanted.

tesliper.datawork.geometry.find_atoms(atoms: Union[Sequence[int], numpy.ndarray], find: Union[int, Iterable[int], numpy.ndarray], reverse: bool = False) → numpy.ndarray[source]

Get indices of wanted atoms.

Parameters

atoms (Sequence of int or numpy.ndarray) – List of atoms represented by their atomic numbers.
find (int, Sequence of int, or numpy.ndarray) – Element or list of elements, represented by their atomic numbers, which indices should be find in atoms array.
reverse (bool) – If True, indices of atoms NOT specified in find will be returned.

Returns

Indices of found elements.

Return type

numpy.ndarray

tesliper.datawork.geometry.select_atoms(values: Union[Sequence, numpy.ndarray], indices: Union[Sequence[int], numpy.ndarray]) → numpy.ndarray[source]: Filter given values to contain values only corresponding to atoms on given indices. Recognizes if given values are a list of values for one or for many conformers, but it must be in shape (A, N) or (C, A, N) respectively.

tesliper.datawork.geometry.take_atoms(values: Union[Sequence, numpy.ndarray], atoms: Union[Sequence[int], numpy.ndarray], wanted: Union[int, Iterable[int], numpy.ndarray]) → numpy.ndarray[source]

Filters given values, returning those corresponding to atoms specified as wanted. Roughly equivalent to: >>> numpy.take(values, numpy.nonzero(numpy.equal(atoms, wanted))[0], 1) but returns empty array, if no atom in atoms matches wanted atom. If wanted is list of elements, numpy.isin is used instead of numpy.equal.

Parameters

values (Sequence or numpy.ndarray) – array of values; it should be one-dimensional list of values or n-dimensional array of shape (conformers, values[, coordinates[, other]])
atoms (Sequence of int or numpy.ndarray) – list of atoms in molecule, given as atomic numbers; order should be the same as corresponding values for each conformer
wanted (int or Iterable of int or numpy.ndarray) – atomic number of wanted atom, or a list of those

Returns

values trimmed to corresponding to desired atoms only; preserves original dimension information

Return type

numpy.ndarray

tesliper.datawork.geometry.drop_atoms(values: Union[Sequence, numpy.ndarray], atoms: Union[Iterable[int], numpy.ndarray], discarded: Union[int, Iterable[int], numpy.ndarray]) → numpy.ndarray[source]

Filters given values, returning those corresponding to atoms not specified as discarded. Roughly equivalent to: >>> numpy.take(values, numpy.nonzero(~numpy.equal(atoms, discarded))[0], 1) If wanted is list of elements, numpy.isin is used instead of numpy.equal.

Parameters

values (Sequence or numpy.ndarray) – array of values; it should be one-dimensional list of values or n-dimensional array of shape (conformers, values[, coordinates[, other]])
atoms (Iterable of int or numpy.ndarray) – list of atoms in molecule, given as atomic numbers; order should be the same as corresponding values for each conformer
discarded (int or Iterable of int or numpy.ndarray) – atomic number of discarded atom, or a list of those

Returns

values trimmed to corresponding to desired atoms only; preserves original dimension information

Return type

numpy.ndarray

tesliper.datawork.geometry.is_triangular(n: int) → bool[source]

Checks if number n is triangular.

Notes

If n is the mth triangular number, then n = m*(m+1)/2. Solving for m using the quadratic formula: m = (sqrt(8n+1) - 1) / 2, so n is triangular if and only if 8n+1 is a perfect square.

Parameters: n (int) – number to check
Returns: True if number n is triangular, else False
Return type: bool

tesliper.datawork.geometry.get_triangular_base(n: int) → int[source]: Find which mth triangular number n is.

tesliper.datawork.geometry.get_triangular(m: int) → int[source]: Find mth triangular number.

tesliper.datawork.geometry.center(a: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]]) → Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]][source]: Zero-center all given conformers by subtracting their centroids. Accepts single molecule or list of conformers.

tesliper.datawork.geometry.kabsch_rotate(a: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]], b: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]]) → numpy.ndarray[source]

Minimize RMSD of conformers a and b by rotating molecule a onto b. Expects given representation of conformers to be zero-centered. Both a and b may be a single molecule or a set of conformers.

Parameters

a ([Sequence of ]Sequence of Sequence of float) – Set of points representing atoms, that will be rotated to best match reference.
b ([Sequence of ]Sequence of Sequence of float) – Set of points representing atoms of the reference molecule.

Returns

Rotated set of points a.

Return type

numpy.ndarray

Notes

Uses Kabsch algorithm, also known as Wahba’s problem. See: https://en.wikipedia.org/wiki/Kabsch_algorithm and https://en.wikipedia.org/wiki/Wahba%27s_problem

tesliper.datawork.geometry.calc_rmsd(a: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]], b: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]]) → numpy.ndarray[source]

Compute RMSD (round-mean-square deviation) of two conformers (or sets of them).

Parameters

a ([Sequence of ]Sequence of Sequence of float) – Set of points representing atoms or list thereof.
b ([Sequence of ]Sequence of Sequence of float) – Set of points representing atoms or list thereof.

Returns

Value of RMSD of two conformers or list of values, if list of conformers given.

Return type

float or numpy.ndarray

Notes

https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions

tesliper.datawork.geometry.fixed_windows(series: Sequence, size: int) → numpy.ndarray[source]

Simple, vectorized implementation of basic sliding window. Produces a list of windows of given size from given series.

Parameters

series (sequence) – Series of data, of which sliding window view is requested.
size (int) – Number of data points in the window. Must be a positive integer.

Returns

List of indices, corresponding to values in the original array, that form a window

Return type

numpy.ndarray

Raises

ValueError – if non-positive integer given as window size
TypeError – if non-integer value given as window size

Notes

Implementation inspired by https://towardsdatascience.com/fast-and-robust-sliding-window-vectorization-with-numpy-3ad950ed62f5

tesliper.datawork.geometry.stretching_windows(values: Sequence[float], size: Union[int, float], keep_hermits: bool = False, hard_bound: bool = False) → Iterator[numpy.ndarray][source]

Implements a sliding window of a variable size, where values in each window are at most size bigger than the lowest value in given window. Values yielded are np.ndarrays of indices of sorted values, that constitute each window.

When window reaches a border, that is an end of the values array or a gap between values that is larger than given size, it is “squeezed”, when pressed against the border, producing subsequences of the first view that touches a border. This is usefull, when one wants to form a window for each value in the original array.

>>> list(stretching_windows([1, 2, 3, 4, 7, 8], 3))
[[0, 1, 2], [1, 2, 3], [2, 3], [4, 5]]

This “soft” right bound may be “hardened” by passing hard_bound=True as a parameter to a function call. A window will than move immediately to the border’s other side.

>>> list(stretching_windows([1, 2, 3, 4, 7, 8], 3), hard_bound=True)
[[0, 1, 2], [1, 2, 3], [4, 5]]

Windows of size 1, called hermits, are by default ignored.

>>> arr = [1, 2, 10, 20, 22]
>>> list(stretching_windows(arr, 5))
[[0, 1], [3, 4]]

If such behavior is not desired, it may be turned off with keep_hermits = True. One must remember that, when a bound is “soft”, the last window is always a hermit.

>>> list(stretching_windows(arr, 5, keep_hermits=True))
[[0, 1], [1], [2], [3, 4], [4]]
>>> list(stretching_windows(arr, 5, keep_hermits=True, hard_bound=True))
[[0, 1], [2], [3, 4]]

Parameters

values (Sequence of float) – List of values, on which sliding window view is requested.
size (int or float) – Maximum difference of smallest and largest values inside each window.
keep_hermits (bool) – If windows of size one should be yielded (True) or omitted (False). False by default.
hard_bound (bool) – How window should behave close to borders. With hard bound (True) it will move to the other side of border as soon, as it is reached. With soft bound (False) it will “squeeze” when pressed against the border, producing subsequences of the first view that includes border value. False by default.

Yields

np.array of int – List of indices, corresponding to sorted values in the original array, that form a window.

Raises

ValueError – If given size is not a positive number.

tesliper.datawork.geometry.pyramid_windows(series: Sequence) → Iterator[numpy.ndarray][source]

Produces windows of shrinking sizes, from full sequence to last element only.

This function yields numpy.ndarrays with indices that may be used to index an original sequence (assuming original sequence is numpy.ndarray as well). The first window yielded represents a whole series sequence and each consecutive window is reduced by the first element, leaving only the last element in the final window. This allows for easy setup of efficient calculations in symmetric each-to-each relationship.

>>> series = [3, 6, 3, 5, 7]
>>> for window in pyramid_windows(series):
...     print(window)
[0 1 2 3 4]
[1 2 3 4]
[2 3 4]
[3 4]
[4]

Parameters: series (sequence) – Sequence of elements, for which windows should be generated.
Yields: np.ndarray(dtype=int) – Windows as np.ndarray of indices.

tesliper.datawork.geometry.rmsd_sieve(geometry: Sequence[Sequence[Sequence[float]]], windows: Iterable[Sequence[int]], threshold: float = 1) → numpy.ndarray[source]

Compare conformers’ geometry to keep only those that differ at least by a given threshold.

This function calculates how similar conformers are one to another, using a RMSD measure, that is is a root-mean-square deviation of atomic positions, and signalizes which of the conformers are duplicates, according to a given similarity threshold. Returned array of booleans may be treated as “originality” indicators for each conformer: True means given conformer has distinct structure, False means given conformer is similar to some other conformer marked as “original”.

The measure of conformers’ similarity, the threshold parameter, is a minimum value of RMSD needed to consider two conformers different. In other words, if two conformers give a RMSD value that is lower then threshold, one of them will be marked as similar, producing a False in the output array.

To lower a computational expense, similarity measurement is performed in “chunks”, using a sliding window technique. Windows consist of a portion of conformers from the original data, or more precisely, indices of conformers that should be included in the particular window. First item from the window is compared to all the others that are in the same window, and if any of them is similar to the reference item, it is marked as duplicate (not “original”). The process is repeated for each window.

The windows itself should be provided by user as windows parameter. This provides a flexibility in the process: you may choose to sacrifice accuracy to lower necessary computational time or vice versa. You may also choose a different moving window strategy or reject it alltogether, and calculate one-to-each similarity in the whole set. Iterables of windows accepted by this function may be generated with one of the dedicated moving window funcions: stretching_windows(), fixed_windows(), or pyramid_windows(). Refer to their documentation for more information.

Parameters

geometry (sequence of sequence of sequence of float) – A list of conformers, where each conformer is represented by a sequence of coordinates in 3-dimensional space. It is assumed that order of atoms in each conformers’ representation is identical.
windows (iterable of sequence of int) – An iterable of windows, where each window is a list of indices. Comparision of RMSD values will be performed inside each window.
threshold (float) – Minimum RMSD value to consider two compared conformers different.

Returns

Array of booleans for each conformer: True if conformer’s structure is “original” and should be kept, False if it is a duplicate of other, “original” structure (at least according to threshold given), and should be discarded.

Return type

np.ndarray(dtype=bool)