tesliper.datawork.geometry
Conformers geometry-related functions, primarily an RMSD sieve implementation.
This module provides an implementation of an RMSD sieve, allowing for easy mathematical comparison of conformers’ geometry and filtering out similar ones, based on a user-provided “threshold of similarity”.
Functions
calc_rmsd | Compute RMSD (root-mean-square deviation) of two conformers (or sets of them).
center | Zero-center all given conformers by subtracting their centroids.
drop_atoms | Filters given values, returning those corresponding to atoms not specified as discarded.
find_atoms | Get indices of wanted atoms.
fixed_windows | Simple, vectorized implementation of basic sliding window.
| Find mth triangular number.
get_triangular_base | Find which mth triangular number n is.
is_triangular | Checks if number n is triangular.
kabsch_rotate | Minimize RMSD of conformers a and b by rotating molecule a onto b.
pyramid_windows | Produces windows of shrinking sizes, from full sequence to last element only.
rmsd_sieve | Compare conformers' geometry to keep only those that differ at least by a given threshold.
select_atoms | Filter given values to contain values only corresponding to atoms on given indices.
stretching_windows | Implements a sliding window of a variable size, where values in each window are at most size bigger than the lowest value in given window.
take_atoms | Filters given values, returning those corresponding to atoms specified as wanted.
- tesliper.datawork.geometry.find_atoms(atoms: Union[Sequence[int], numpy.ndarray], find: Union[int, Iterable[int], numpy.ndarray], reverse: bool = False) numpy.ndarray[source]
Get indices of wanted atoms.
- Parameters
atoms (Sequence of int or numpy.ndarray) – List of atoms represented by their atomic numbers.
find (int, Sequence of int, or numpy.ndarray) – Element or list of elements, represented by their atomic numbers, whose indices should be found in the atoms array.
reverse (bool) – If True, indices of atoms NOT specified in find will be returned.
- Returns
Indices of found elements.
- Return type
numpy.ndarray
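The lookup described above can be sketched with plain NumPy. This is an illustration only, not tesliper's implementation; the name find_atoms_sketch is hypothetical, and the use of numpy.isin with invert is an assumption about one reasonable way to get the documented behavior.

```python
import numpy as np


def find_atoms_sketch(atoms, find, reverse=False):
    """Return indices of `atoms` whose atomic number is (or is not) in `find`."""
    atoms = np.asarray(atoms)
    # np.atleast_1d lets a single atomic number and a list share one code path;
    # invert=True flips the mask, which mirrors reverse=True.
    mask = np.isin(atoms, np.atleast_1d(find), invert=reverse)
    return np.nonzero(mask)[0]


atoms = [6, 1, 1, 8, 1]  # C, H, H, O, H
print(find_atoms_sketch(atoms, 1))                # [1 2 4]
print(find_atoms_sketch(atoms, [6, 8]))           # [0 3]
print(find_atoms_sketch(atoms, 1, reverse=True))  # [0 3]
```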
- tesliper.datawork.geometry.select_atoms(values: Union[Sequence, numpy.ndarray], indices: Union[Sequence[int], numpy.ndarray]) numpy.ndarray[source]
Filter given values to contain values only corresponding to atoms on given indices. Recognizes if given values are a list of values for one or for many conformers, but it must be in shape (A, N) or (C, A, N) respectively.
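A rough sketch of how the two accepted shapes might be handled, assuming that A is the number of atoms, C the number of conformers, and N the number of values per atom (the source does not spell these letters out). The helper select_atoms_sketch is hypothetical and is not the library's code.

```python
import numpy as np


def select_atoms_sketch(values, indices):
    """Pick atoms by index from an (A, N) or a (C, A, N) array."""
    values = np.asarray(values)
    # Assumption: atoms sit on axis 0 for a single conformer
    # and on axis 1 for a set of conformers.
    atom_axis = 0 if values.ndim == 2 else 1
    return np.take(values, indices, axis=atom_axis)


one = np.arange(12).reshape(4, 3)      # 4 atoms, 3 values each
many = np.arange(24).reshape(2, 4, 3)  # 2 conformers of 4 atoms
print(select_atoms_sketch(one, [0, 3]).shape)   # (2, 3)
print(select_atoms_sketch(many, [0, 3]).shape)  # (2, 2, 3)
```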
- tesliper.datawork.geometry.take_atoms(values: Union[Sequence, numpy.ndarray], atoms: Union[Sequence[int], numpy.ndarray], wanted: Union[int, Iterable[int], numpy.ndarray]) numpy.ndarray[source]
Filters given values, returning those corresponding to atoms specified as wanted. Roughly equivalent to:
>>> numpy.take(values, numpy.nonzero(numpy.equal(atoms, wanted))[0], 1)
but returns an empty array if no atom in atoms matches the wanted atom. If wanted is a list of elements, numpy.isin is used instead of numpy.equal.
- Parameters
values (Sequence or numpy.ndarray) – array of values; it should be one-dimensional list of values or n-dimensional array of shape (conformers, values[, coordinates[, other]])
atoms (Sequence of int or numpy.ndarray) – list of atoms in molecule, given as atomic numbers; order should be the same as corresponding values for each conformer
wanted (int or Iterable of int or numpy.ndarray) – atomic number of wanted atom, or a list of those
- Returns
values trimmed to those corresponding to the wanted atoms only; preserves original dimension information
- Return type
numpy.ndarray
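The "roughly equivalent" NumPy expression can be tried directly on a small array. The data below are made up for illustration, and the snippet uses only NumPy calls, not tesliper itself.

```python
import numpy as np

atoms = np.array([6, 1, 1, 8])  # C, H, H, O
# Two conformers, four atoms, three coordinates each: shape (2, 4, 3).
values = np.arange(24, dtype=float).reshape(2, 4, 3)

# Single wanted element: numpy.equal builds the mask.
hydrogens = np.take(values, np.nonzero(np.equal(atoms, 1))[0], 1)
# Several wanted elements: numpy.isin replaces numpy.equal.
heavy = np.take(values, np.nonzero(np.isin(atoms, [6, 8]))[0], 1)
# Negating the mask gives the complementary, drop_atoms-style selection.
no_hydrogens = np.take(values, np.nonzero(~np.equal(atoms, 1))[0], 1)

print(hydrogens.shape)                      # (2, 2, 3)
print(heavy.shape)                          # (2, 2, 3)
print(np.array_equal(heavy, no_hydrogens))  # True
```

The conformer axis is preserved in every case; only the atom axis (axis 1) shrinks.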
- tesliper.datawork.geometry.drop_atoms(values: Union[Sequence, numpy.ndarray], atoms: Union[Iterable[int], numpy.ndarray], discarded: Union[int, Iterable[int], numpy.ndarray]) numpy.ndarray[source]
Filters given values, returning those corresponding to atoms not specified as discarded. Roughly equivalent to:
>>> numpy.take(values, numpy.nonzero(~numpy.equal(atoms, discarded))[0], 1)
If discarded is a list of elements, numpy.isin is used instead of numpy.equal.
- Parameters
values (Sequence or numpy.ndarray) – array of values; it should be one-dimensional list of values or n-dimensional array of shape (conformers, values[, coordinates[, other]])
atoms (Iterable of int or numpy.ndarray) – list of atoms in molecule, given as atomic numbers; order should be the same as corresponding values for each conformer
discarded (int or Iterable of int or numpy.ndarray) – atomic number of discarded atom, or a list of those
- Returns
values trimmed to those corresponding to atoms that were not discarded; preserves original dimension information
- Return type
numpy.ndarray
- tesliper.datawork.geometry.is_triangular(n: int) bool[source]
Checks if number n is triangular.
Notes
If n is the mth triangular number, then n = m*(m+1)/2. Solving for m using the quadratic formula: m = (sqrt(8n+1) - 1) / 2, so n is triangular if and only if 8n+1 is a perfect square.
- Parameters
n (int) – number to check
- Returns
True if number n is triangular, else False
- Return type
bool
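The perfect-square test from the Notes can be written with integer arithmetic only. This is a minimal sketch under the assumption that exact integer square roots (math.isqrt) are preferable to floating-point ones; the name is_triangular_sketch is hypothetical and the library may implement the check differently.

```python
import math


def is_triangular_sketch(n: int) -> bool:
    """Return True if 8n + 1 is a perfect square, i.e. n is triangular."""
    if n < 0:
        return False
    discriminant = 8 * n + 1
    root = math.isqrt(discriminant)  # exact integer square root, rounded down
    return root * root == discriminant


print([n for n in range(1, 30) if is_triangular_sketch(n)])
# [1, 3, 6, 10, 15, 21, 28]
```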
- tesliper.datawork.geometry.get_triangular_base(n: int) int[source]
Find which mth triangular number n is.
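Since the mth triangular number is n = m*(m+1)/2, the base m can be recovered as (sqrt(8n+1) - 1) / 2. The sketch below illustrates that inversion; the helper name is hypothetical, and raising ValueError for a non-triangular n is an assumption, as the source does not say how such input is handled.

```python
import math


def triangular_base_sketch(n: int) -> int:
    """Return m such that n == m * (m + 1) // 2."""
    discriminant = 8 * n + 1
    if discriminant < 0:
        raise ValueError(f"{n} is not a triangular number")
    root = math.isqrt(discriminant)
    if root * root != discriminant:
        # Assumed behavior: reject numbers that are not triangular.
        raise ValueError(f"{n} is not a triangular number")
    return (root - 1) // 2


print(triangular_base_sketch(10))  # 4, because 1 + 2 + 3 + 4 == 10
```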
- tesliper.datawork.geometry.center(a: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]]) Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]][source]
Zero-center all given conformers by subtracting their centroids. Accepts single molecule or list of conformers.
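Centering amounts to subtracting the mean atomic position from every atom. A minimal NumPy sketch, assuming coordinates are stored as (atoms, 3) or (conformers, atoms, 3); center_sketch is a hypothetical name, not the library function.

```python
import numpy as np


def center_sketch(a):
    """Subtract each conformer's centroid from its atomic coordinates."""
    a = np.asarray(a, dtype=float)
    # Atoms lie on the second-to-last axis, so one expression covers
    # a single molecule (A, 3) and a set of conformers (C, A, 3).
    return a - a.mean(axis=-2, keepdims=True)


molecule = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 5.0, 2.0]]
centered = center_sketch(molecule)
print(np.allclose(centered.mean(axis=0), 0.0))  # True
```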
- tesliper.datawork.geometry.kabsch_rotate(a: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]], b: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]]) numpy.ndarray[source]
Minimize RMSD of conformers a and b by rotating molecule a onto b. Expects given representation of conformers to be zero-centered. Both a and b may be a single molecule or a set of conformers.
- Parameters
a ([Sequence of ]Sequence of Sequence of float) – Set of points representing atoms, that will be rotated to best match reference.
b ([Sequence of ]Sequence of Sequence of float) – Set of points representing atoms of the reference molecule.
- Returns
Rotated set of points a.
- Return type
numpy.ndarray
Notes
Uses the Kabsch algorithm, which solves a problem closely related to Wahba’s problem. See: https://en.wikipedia.org/wiki/Kabsch_algorithm and https://en.wikipedia.org/wiki/Wahba%27s_problem
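For orientation, the textbook SVD form of the Kabsch algorithm for a single pair of molecules is sketched below. It is not tesliper's code: the name kabsch_rotate_sketch is hypothetical, only (A, 3) inputs are handled, and both inputs are assumed to be zero-centered already.

```python
import numpy as np


def kabsch_rotate_sketch(a, b):
    """Rotate zero-centered points `a` (A, 3) onto zero-centered `b` (A, 3)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    covariance = a.T @ b  # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(covariance)
    # Flip the last axis if needed, so the result is a proper rotation
    # rather than a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    return a @ rotation


# Reference molecule (already zero-centered) and a copy turned 90 degrees about z.
b = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [-1.0, -2.0, -3.0]])
rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
a = b @ rz.T
print(np.allclose(kabsch_rotate_sketch(a, b), b))  # True
```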
- tesliper.datawork.geometry.calc_rmsd(a: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]], b: Union[Sequence[Sequence[float]], Sequence[Sequence[Sequence[float]]]]) numpy.ndarray[source]
Compute RMSD (root-mean-square deviation) of two conformers (or sets of them).
- Parameters
a ([Sequence of ]Sequence of Sequence of float) – Set of points representing atoms or list thereof.
b ([Sequence of ]Sequence of Sequence of float) – Set of points representing atoms or list thereof.
- Returns
Value of RMSD of two conformers or list of values, if list of conformers given.
- Return type
float or numpy.ndarray
Notes
https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions
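The RMSD of atomic positions is the square root of the mean squared distance between corresponding atoms. A minimal sketch, assuming coordinates of shape (atoms, 3) or (conformers, atoms, 3) and relying on NumPy broadcasting; calc_rmsd_sketch is a hypothetical name.

```python
import numpy as np


def calc_rmsd_sketch(a, b):
    """Root-mean-square deviation of atomic positions between `a` and `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    squared_distance = ((a - b) ** 2).sum(axis=-1)  # one value per atom
    return np.sqrt(squared_distance.mean(axis=-1))  # mean over atoms


reference = np.zeros((2, 3))
shifted = np.array([[3.0, 4.0, 0.0], [3.0, 4.0, 0.0]])  # every atom moved by 5
print(calc_rmsd_sketch(reference, shifted))                       # 5.0
print(calc_rmsd_sketch(reference, np.stack([reference, shifted])))  # [0. 5.]
```

A single pair yields a scalar, while a stack of conformers compared with one reference yields one value per conformer.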
- tesliper.datawork.geometry.fixed_windows(series: Sequence, size: int) numpy.ndarray[source]
Simple, vectorized implementation of basic sliding window. Produces a list of windows of given size from given series.
- Parameters
series (sequence) – Series of data, of which sliding window view is requested.
size (int) – Number of data points in the window. Must be a positive integer.
- Returns
List of indices, corresponding to values in the original array, that form a window
- Return type
numpy.ndarray
- Raises
ValueError – if non-positive integer given as window size
TypeError – if non-integer value given as window size
Notes
Implementation inspired by https://towardsdatascience.com/fast-and-robust-sliding-window-vectorization-with-numpy-3ad950ed62f5
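The vectorized trick is to broadcast a column of window start positions against a row of in-window offsets. The sketch below shows the idea under that assumption; fixed_windows_sketch is a hypothetical name and the library's validation details may differ.

```python
import numpy as np


def fixed_windows_sketch(series, size):
    """Return a 2-D array in which each row holds the indices of one window."""
    if not isinstance(size, int):
        raise TypeError("window size must be an integer")
    if size <= 0:
        raise ValueError("window size must be a positive integer")
    n_windows = len(series) - size + 1
    starts = np.arange(n_windows)[:, None]  # column: first index of each window
    offsets = np.arange(size)[None, :]      # row: position inside a window
    return starts + offsets


print(fixed_windows_sketch([10, 20, 30, 40, 50], 3))
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]
```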
- tesliper.datawork.geometry.stretching_windows(values: Sequence[float], size: Union[int, float], keep_hermits: bool = False, hard_bound: bool = False) Iterator[numpy.ndarray][source]
Implements a sliding window of a variable size, where values in each window are at most size bigger than the lowest value in the given window. Values yielded are np.ndarrays of indices of sorted values that constitute each window.
When a window reaches a border, that is, the end of the values array or a gap between values that is larger than the given size, it is “squeezed” when pressed against the border, producing subsequences of the first view that touches the border. This is useful when one wants to form a window for each value in the original array.
>>> list(stretching_windows([1, 2, 3, 4, 7, 8], 3))
[[0, 1, 2], [1, 2, 3], [2, 3], [4, 5]]
This “soft” right bound may be “hardened” by passing hard_bound=True as a parameter to the function call. A window will then move immediately to the border’s other side.
>>> list(stretching_windows([1, 2, 3, 4, 7, 8], 3, hard_bound=True))
[[0, 1, 2], [1, 2, 3], [4, 5]]
Windows of size 1, called hermits, are ignored by default.
>>> arr = [1, 2, 10, 20, 22]
>>> list(stretching_windows(arr, 5))
[[0, 1], [3, 4]]
If such behavior is not desired, it may be turned off with keep_hermits=True. One must remember that, when a bound is “soft”, the last window is always a hermit.
>>> list(stretching_windows(arr, 5, keep_hermits=True))
[[0, 1], [1], [2], [3, 4], [4]]
>>> list(stretching_windows(arr, 5, keep_hermits=True, hard_bound=True))
[[0, 1], [2], [3, 4]]
- Parameters
values (Sequence of float) – List of values, on which sliding window view is requested.
size (int or float) – Maximum difference of smallest and largest values inside each window.
keep_hermits (bool) – If windows of size one should be yielded (True) or omitted (False). False by default.
hard_bound (bool) – How the window should behave close to borders. With a hard bound (True) it will move to the other side of the border as soon as it is reached. With a soft bound (False) it will “squeeze” when pressed against the border, producing subsequences of the first view that includes the border value. False by default.
- Yields
np.array of int – List of indices, corresponding to sorted values in the original array, that form a window.
- Raises
ValueError – If given size is not a positive number.
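One way to reproduce the documented outputs in pure Python is sketched below. It is a reconstruction from the examples only, not the library's code, and it rests on three assumptions: the input is already sorted in ascending order, a value joins a window when it exceeds the window's lowest value by strictly less than size, and a border is the end of the data or a gap of at least size. The name stretching_windows_sketch is hypothetical, and plain lists are yielded instead of arrays.

```python
def stretching_windows_sketch(values, size, keep_hermits=False, hard_bound=False):
    """Yield lists of indices forming each window; `values` must be sorted."""
    if size <= 0:
        raise ValueError("size must be a positive number")
    n = len(values)
    start = 0
    while start < n:
        # Grow the window while values stay less than `size` above its first value.
        end = start
        while end + 1 < n and values[end + 1] - values[start] < size:
            end += 1
        if keep_hermits or end > start:
            yield list(range(start, end + 1))
        # Assumed border rule: end of data, or a gap of at least `size`.
        at_border = end + 1 == n or values[end + 1] - values[end] >= size
        # A hard bound jumps over the border; a soft bound advances by one.
        start = end + 1 if (hard_bound and at_border) else start + 1


print(list(stretching_windows_sketch([1, 2, 3, 4, 7, 8], 3)))
# [[0, 1, 2], [1, 2, 3], [2, 3], [4, 5]]
print(list(stretching_windows_sketch([1, 2, 3, 4, 7, 8], 3, hard_bound=True)))
# [[0, 1, 2], [1, 2, 3], [4, 5]]
```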
- tesliper.datawork.geometry.pyramid_windows(series: Sequence) Iterator[numpy.ndarray][source]
Produces windows of shrinking sizes, from full sequence to last element only.
This function yields numpy.ndarrays with indices that may be used to index an original sequence (assuming the original sequence is a numpy.ndarray as well). The first window yielded represents the whole series sequence and each consecutive window is reduced by the first element, leaving only the last element in the final window. This allows for easy setup of efficient calculations in a symmetric each-to-each relationship.
>>> series = [3, 6, 3, 5, 7]
>>> for window in pyramid_windows(series):
...     print(window)
[0 1 2 3 4]
[1 2 3 4]
[2 3 4]
[3 4]
[4]
- Parameters
series (sequence) – Sequence of elements, for which windows should be generated.
- Yields
np.ndarray(dtype=int) – Windows as np.ndarray of indices.
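The each-to-each use case can be illustrated as follows: treating the first index of every window as a reference and the rest as its partners visits each unordered pair exactly once. This is a sketch with a hypothetical helper name, not the library's generator.

```python
import numpy as np


def pyramid_windows_sketch(series):
    """Yield index arrays [0..n-1], [1..n-1], ..., [n-1]."""
    n = len(series)
    for start in range(n):
        yield np.arange(start, n)


series = np.array([3, 6, 3, 5, 7])
pairs = []
for window in pyramid_windows_sketch(series):
    reference, others = window[0], window[1:]
    # Pair the reference with every later element; earlier ones were
    # already handled when they served as references themselves.
    pairs.extend((int(reference), int(other)) for other in others)

print(len(pairs))  # 10, i.e. 5 * 4 / 2 unique pairs
```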
- tesliper.datawork.geometry.rmsd_sieve(geometry: Sequence[Sequence[Sequence[float]]], windows: Iterable[Sequence[int]], threshold: float = 1) numpy.ndarray[source]
Compare conformers’ geometry to keep only those that differ at least by a given threshold.
This function calculates how similar conformers are to one another, using an RMSD measure, that is, a root-mean-square deviation of atomic positions, and signals which of the conformers are duplicates, according to a given similarity threshold. The returned array of booleans may be treated as “originality” indicators for each conformer: True means the given conformer has a distinct structure, False means the given conformer is similar to some other conformer marked as “original”.
The measure of conformers’ similarity, the threshold parameter, is a minimum value of RMSD needed to consider two conformers different. In other words, if two conformers give an RMSD value that is lower than threshold, one of them will be marked as similar, producing a False in the output array.
To lower the computational expense, similarity measurement is performed in “chunks”, using a sliding window technique. Windows consist of a portion of conformers from the original data, or more precisely, indices of conformers that should be included in the particular window. The first item from the window is compared to all the others that are in the same window, and if any of them is similar to the reference item, it is marked as a duplicate (not “original”). The process is repeated for each window.
The windows themselves should be provided by the user as the windows parameter. This provides flexibility in the process: you may choose to sacrifice accuracy to lower the necessary computational time or vice versa. You may also choose a different moving window strategy or reject it altogether, and calculate one-to-each similarity in the whole set. Iterables of windows accepted by this function may be generated with one of the dedicated moving window functions: stretching_windows(), fixed_windows(), or pyramid_windows(). Refer to their documentation for more information.
- Parameters
geometry (sequence of sequence of sequence of float) – A list of conformers, where each conformer is represented by a sequence of coordinates in 3-dimensional space. It is assumed that the order of atoms in each conformer’s representation is identical.
windows (iterable of sequence of int) – An iterable of windows, where each window is a list of indices. Comparison of RMSD values will be performed inside each window.
threshold (float) – Minimum RMSD value to consider two compared conformers different.
- Returns
Array of booleans for each conformer: True if the conformer’s structure is “original” and should be kept, False if it is a duplicate of another, “original” structure (at least according to the threshold given), and should be discarded.
- Return type
np.ndarray(dtype=bool)
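The window-by-window procedure can be sketched as below. This is a simplified illustration, not the library's implementation: rmsd_sieve_sketch is a hypothetical name, conformers are assumed to be superimposed already (no centering or rotation is done here), and skipping conformers that were marked as duplicates earlier is an assumption about how repeated windows are handled.

```python
import numpy as np


def rmsd_sieve_sketch(geometry, windows, threshold=1.0):
    """Return a boolean array: True for conformers to keep, False for duplicates."""
    geometry = np.asarray(geometry, dtype=float)  # shape (conformers, atoms, 3)
    keep = np.ones(len(geometry), dtype=bool)
    for window in windows:
        window = np.asarray(window, dtype=int)
        window = window[keep[window]]  # assumed: ignore known duplicates
        if window.size < 2:
            continue
        reference, others = window[0], window[1:]
        diff = geometry[others] - geometry[reference]
        rmsd = np.sqrt((diff ** 2).sum(axis=-1).mean(axis=-1))
        # RMSD lower than the threshold marks a conformer as a duplicate.
        keep[others] = rmsd >= threshold
    return keep


first = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
near_copy = first + 0.01                               # almost identical
distinct = np.array([[0.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
windows = [[0, 1, 2], [1, 2], [2]]                     # each-to-each comparison
print(rmsd_sieve_sketch([first, near_copy, distinct], windows))
# [ True False  True]
```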