Math and Algorithms
Simulation of spectra
To simulate a spectrum, each band’s theoretical signal intensity, derived from quantum
chemical calculations of corresponding optical activity, must be expressed as a
broadened peak, instead of the single scalar value. To simulate peak’s shape one of the
curve fitting functions is used. tesliper
implements two, most commonly used, such
functions: gaussian function1 and lorentzian function2.
For each point on the simulated spectrum’s abscissa, the corresponding signal intensity is calculated by applying the fitting function to all bands of the conformer and summing resulting values.
Gaussian fitting function
- \(\nu\)
Arbitrary point on the \(x\)-axis, for which the signal intensity is calculated.
- \(\nu_i\)
Point on the \(x\)-axis, at which the \(i\)th band occur.
- \(I_i\)
Intensity of the \(i\)th band.
- \(\sigma = \sqrt{2}\omega\)
Standard derivation, in this context interpreted as equal to \(\sqrt{2}\) times \(\omega\).
- \(\omega\)
Half width of the peak at \(\frac{1}{e}\) of its maximum value (HW1OeM), expressed in the \(x\)-axis units.
Lorentzian fitting function
- \(\nu\)
Arbitrary point on the \(x\)-axis, for which the signal intensity is calculated.
- \(\nu_i\)
Point on the \(x\)-axis, at which the \(i\)th band occur.
- \(I_i\)
Intensity of the \(i\)th band.
- \(\gamma\)
Half width of the peak at half of its maximum value (HWHM), expressed in the \(x\)-axis units.
Calculation of intensities
Dipole strength and rotator strength is converted to the theoretical intensity as described by Polavarapu3. Constants used in below equations are as follows.
- \(c = 2.99792458 \times 10^{10}\ \mathrm{cm}\cdot\mathrm{s}^{-1}\)
Speed of light.
- \(h = 6.62606896 \times 10^{-30}\ \mathrm{kg}\cdot\mathrm{cm}^2\cdot\mathrm{s}^{-1}\)
Planck’s constant.
- \(N_A = 6.02214199 \times 10^{23}\ \mathrm{mol}^{-1}\)
Avogadro’s constant.
- \(m_e = 9.10938 \times 10^{-28}\ \mathrm{g}\)
Mass of the electron.
- \(e = 4.803204 \times 10^{-10}\ \mathrm{esu}\)
The charge on the electron.
- 3
Prasad L. Polavarapu (2017), Chiroptical Spectroscopy Fundamentals and Applications, CRC Press
Dipole strength to IR intensities
- \(D_k \ [\times 10^{-40}\ \mathrm{esu}^2\mathrm{cm}^2]\)
Dipole strength of \(k^{\mathrm{th}}\) transition.
- \(\nu_k \ [\mathrm{cm}^{-1}]\)
Frequency of the \(k^{\mathrm{th}}\) transition.
- \(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\)
Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).
Rotator strength to VCD intensities
- \(R_k \ [\times 10^{-44}\ \mathrm{esu}^2\mathrm{cm}^2]\)
Rotator strength of \(k^{\mathrm{th}}\) transition.
- \(\nu_k \ [\mathrm{cm}^{-1}]\)
Frequency of the \(k^{\mathrm{th}}\) transition.
- \(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\)
Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).
Dipole strength to UV intensities
- \(D_k \ [\times 10^{-40}\ \mathrm{esu}^2\mathrm{cm}^2]\)
Dipole strength of \(k^{\mathrm{th}}\) transition.
- \(\lambda_k \ [\mathrm{cm}^{-1}]\)
Wavelength of the \(k^{\mathrm{th}}\) transition.
- \(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\)
Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).
Rotator strength to ECD intensities
- \(R_k \ [\times 10^{-44}\ \mathrm{esu}^2\mathrm{cm}^2]\)
Rotator strength of \(k^{\mathrm{th}}\) transition.
- \(\lambda_k \ [\mathrm{cm}^{-1}]\)
Wavelength of the \(k^{\mathrm{th}}\) transition.
- \(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\)
Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).
Oscillator strength to UV intensities
Conversion from oscillator strength to signal intensity of UV spectrum is calculated as described by Gaussian4.
- \(f_k\)
Oscillator strength of \(k^{\mathrm{th}}\) transition.
- \(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\)
Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).
Raman/ROA intensities
Gaussian-provided Raman and ROA activities are used without any conversion.
Population of conformers
Population of conformers is calculated according to the Boltzmann probability distribution that “gives the probability that a system will be in a certain state as a function of that state’s energy and the temperature of the system.”5 In this context each conformer is considered one of the possible states of the system (a studied molecule).
Firstly, we calculate a Boltzmann factors for each conformer in respect to the most stable conformer (the one of the lowest energy). Boltzmann factor of two states is defined as:
where:
- \(E_a\) and \(E_b\)
energies of states \(a\) and \(b\);
- \(k = 0.0019872041 \: \mathrm{kcal/(mol*K)}\)
Boltzmann constant;
- \(t\)
temperature of the system.
Boltzmann factor represents a ratio of probabilities of the two states being occupied. In other words, it shows how much more likely it is for the molecule to take the form of one conformer over another conformer. Having a ratio of these probabilities for each possible conformer in respect to the most stable conformer, we are able to find the distribution of conformers (probability of taking the form of each conformer):
assuming that \(state_0\) is the state of the lowest energy (the most stable conformer).
RMSD of conformers
RMSD, or root-mean-square deviation of atomic positions, is used as a measure of similarity between two conformers. As its name hints, it is an average distance between atoms in the two studied conformers: the lower the RMSD value, the more similar are conformers in question.
Finding minimized value of RMSD
In a typical output of the quantum chemical calculations software, molecule is represented by a number of points (mapping to particular atoms) in a 3-dimensional space. Usually, orientation and position of the molecule in the coordinate system is arbitrary and simple overlay of the two conformers may not be the same as their optimal overlap. To neglect the effect of conformers’ rotation and shift on the similarity measure, we will look for the common reference frame and optimal alignment of atoms.
Zero-centring atomic coordinates
To find the common reference frame for two conformers we move both to the origin of the coordinate system. This is done by calculating a centroid of a conformer and subtracting it from each point representing an atom. The centroid is given as an arithmetic mean of all the atoms in the conformer:
where:
- \(a_i\), \(a_j\)
atom’s original position in the coordinate system;
- \(a^0_i\)
atom’s centered position in the coordinate system;
- \(n\)
number of atoms in the molecule.
Rotating with Kabsch algorithm
Optimal rotation of one conformer onto another is achieved using a Kabsch algorithm6 (also known as Wahba’s problem7). Interpreting positions of each conformers’ atoms as a matrix, we find the covariance matrix \(H\) of these matrices (\(P\) and \(Q\)):
and then we use the singular value decomposition (SVD)8 routine to get \(U\) and \(V\) unitary matrices.
Having these, we can calculate the optimal rotation matrix as:
where \(d = \mathrm{sign}(\mathrm{det}(VU^\intercal))\) that allows to ensure a right-handed coordinate system.
Note
To allow for calculation of th best rotation between sets of molecules and to
compromise between efficiency and simplicity of implementation, tesliper
uses
Einstein summation convention9 via numpy.einsum()
function. The
implementation is as follows:
def kabsch_rotate(a: MoleculeOrList, b: MoleculeOrList) -> np.ndarray:
"""Minimize RMSD of conformers *a* and *b* by rotating molecule *a* onto *b*.
Expects given representation of conformers to be zero-centered.
Both *a* and *b* may be a single molecule or a set of conformers.
Parameters
----------
a : [Sequence of ]Sequence of Sequence of float
Set of points representing atoms, that will be rotated to best match reference.
b : [Sequence of ]Sequence of Sequence of float
Set of points representing atoms of the reference molecule.
"""
# this approach is probably not the most efficient
# but lets us easily perform a matrix multiplication on stacks of matrices
a, b = np.asanyarray(a), np.asanyarray(b)
# calculate covariance matrix for each stacked set of points
# for each of stacked sets of points, equivalent of:
# >>> cov = a.T @ b
cov = np.einsum("...ji,...jk", a, b)
u, s, vh = np.linalg.svd(cov) # singular value decomposition
# if determinant is negative, swap to ensure right-handed coordinate system
det = np.linalg.det(vh @ u) # works with stacked matrices
# don't introduce new dimension if not necessary
shape = (det.size, 3, 3) if det.size > 1 else (3, 3)
swap = np.zeros(shape)
swap[..., np.arange(2), np.arange(2)] = 1
swap[..., -1, -1] = np.sign(det)
# calculate optimally rotated set/s of points `a`
# for each of stacked sets of points, equivalent of
# >>> rotated = a @ u @ swap @ vh
# where u @ swap @ vh is rotation matrix
return np.einsum("...ij,...jk,...kl,...lm", a, u, swap, vh)
Calculating RMSD of atomic positions
Once conformers are aligned, the value of RMSD10 is calculated simply by finding a distance between each equivalent atoms and averaging their squares and finding the root of this average:
where:
- \(p_i\) and \(q_i\)
positions of \(i\)th equivalent atoms in conformers \(P\) and \(Q\);
- \(n\)
number of atoms in each conformer.
Comparing conformers
To compare conformers as efficiently as possible, the RMSD values are calculated not in the each-to-each scheme, but inside a rather small moving window. The size of this window determines how many calculations will be done for the whole collection.
Moving window mechanism
tesliper
provides three types of moving windows: a fixed
, stretching
, and
pyramid
windows. The strategy you choose will affect
both the performance and the accuracy of the RMSD sieve, as described below.
fixed
The most basic sliding window of a fixed size. Provides the most control over the performance of the sieve, but is the least accurate.
stretching
The default, allows to specify the size of the window in the context of some numeric property, usually the energy of conformers. The size may differ in the sense of the number of conformers in each window, but the difference between maximum and minimum values of said property inside a window will not be bigger than the given size. Provides a best compromise between the performance and the accuracy.
pyramid
The first window will contain the whole collection and each consecutive window will be smaller by one conformer. Allows to perform a each-to-each comparison, but in logarithmic time rather than quadratic time. Best accuracy but worst performance.
Note
The actual windows produced by sliding window functions are iterables of
numpy.ndarray
s of indices (that point to the value in the original
array of conformers).
The sieve
The RMSD sieve function
takes care of zero-centring and
finding the best overlap of the conformers, as described previously. Aside form this, it
works as follows: for each window, provided by one of the moving window functions
described above, it takes the first conformer in the window (reference) and calculates
it’s minimum RMSD value with respect to each of the other conformers in this window. Any
conformers that have lower RMSD value than a given threshold, will be considered
identical to the reference conformer and internally marked as not kept.
The sieve returns an array of boolean value for each conformer: True
if conformer’s
structure is “original” and should be kept, False
if it is a duplicate of other,
“original” structure (at least according to threshold given), and should be discarded.
Spectra transformation
Finding best shift
Optimal offset of two spectra is determined by calculating their cross-correlation11 (understood as in the signal processing context) and finding its maximum value. Index of this max value of the discrete cross-correlation array indicates the position of one spectrum in respect to the other spectrum, in which the overlap of the two is the greatest.
Finding optimal scaling
Optimal scaling factor of spectra is determined by comparing a mean y values of target spectrum and a reference spectrum. Values lower than 1% of maximum absolute y value of each spectrum are ignored.