Math and Algorithms

Simulation of spectra

To simulate a spectrum, each band’s theoretical signal intensity, derived from quantum chemical calculations of corresponding optical activity, must be expressed as a broadened peak, instead of the single scalar value. To simulate peak’s shape one of the curve fitting functions is used. tesliper implements two, most commonly used, such functions: gaussian function1 and lorentzian function2.

For each point on the simulated spectrum’s abscissa, the corresponding signal intensity is calculated by applying the fitting function to all bands of the conformer and summing resulting values.

1: https://mathworld.wolfram.com/GaussianFunction.html
2: https://mathworld.wolfram.com/LorentzianFunction.html

Gaussian fitting function

\[f(\nu) = \frac{1}{\sigma\sqrt{2\pi}}\sum\limits_i I_i e^{ -(\nu_i - \nu)^2 / (2\sigma^2) }\]

\(\nu\): Arbitrary point on the \(x\)-axis, for which the signal intensity is calculated.
\(\nu_i\): Point on the \(x\)-axis, at which the \(i\)^th band occur.
\(I_i\): Intensity of the \(i\)^th band.
\(\sigma = \sqrt{2}\omega\): Standard derivation, in this context interpreted as equal to \(\sqrt{2}\) times \(\omega\).
\(\omega\): Half width of the peak at \(\frac{1}{e}\) of its maximum value (HW1OeM), expressed in the \(x\)-axis units.

Lorentzian fitting function

\[f(\nu) = \frac{\gamma}{\pi}\sum\limits_i\frac{I_i}{(\nu_i - \nu)^2 + \gamma^2}\]

\(\nu\): Arbitrary point on the \(x\)-axis, for which the signal intensity is calculated.
\(\nu_i\): Point on the \(x\)-axis, at which the \(i\)^th band occur.
\(I_i\): Intensity of the \(i\)^th band.
\(\gamma\): Half width of the peak at half of its maximum value (HWHM), expressed in the \(x\)-axis units.

Calculation of intensities

Dipole strength and rotator strength is converted to the theoretical intensity as described by Polavarapu3. Constants used in below equations are as follows.

\(c = 2.99792458 \times 10^{10}\ \mathrm{cm}\cdot\mathrm{s}^{-1}\): Speed of light.
\(h = 6.62606896 \times 10^{-30}\ \mathrm{kg}\cdot\mathrm{cm}^2\cdot\mathrm{s}^{-1}\): Planck’s constant.
\(N_A = 6.02214199 \times 10^{23}\ \mathrm{mol}^{-1}\): Avogadro’s constant.
\(m_e = 9.10938 \times 10^{-28}\ \mathrm{g}\): Mass of the electron.
\(e = 4.803204 \times 10^{-10}\ \mathrm{esu}\): The charge on the electron.

3: Prasad L. Polavarapu (2017), Chiroptical Spectroscopy Fundamentals and Applications, CRC Press

Dipole strength to IR intensities

\[I_k = \frac{100 \cdot 8 \pi^3 N_A}{3 \cdot \ln(10) \cdot hc} D_k \nu_k = 0.010886 \cdot D_k \nu_k\]

\(D_k \ [\times 10^{-40}\ \mathrm{esu}^2\mathrm{cm}^2]\): Dipole strength of \(k^{\mathrm{th}}\) transition.
\(\nu_k \ [\mathrm{cm}^{-1}]\): Frequency of the \(k^{\mathrm{th}}\) transition.
\(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\): Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).

Rotator strength to VCD intensities

\[I_k = 4 \cdot \frac{8 \pi^3 N_A}{3 \cdot \ln(10) \cdot hc} R_k \nu_k = 0.0435441 \cdot R_k \nu_k\]

\(R_k \ [\times 10^{-44}\ \mathrm{esu}^2\mathrm{cm}^2]\): Rotator strength of \(k^{\mathrm{th}}\) transition.
\(\nu_k \ [\mathrm{cm}^{-1}]\): Frequency of the \(k^{\mathrm{th}}\) transition.
\(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\): Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).

Dipole strength to UV intensities

\[I_k = \frac{8 \pi^3 N_A}{3 \cdot \ln(10) \cdot hc} D_k \lambda_k = 0.010886 \cdot D_k \lambda_k\]

\(D_k \ [\times 10^{-40}\ \mathrm{esu}^2\mathrm{cm}^2]\): Dipole strength of \(k^{\mathrm{th}}\) transition.
\(\lambda_k \ [\mathrm{cm}^{-1}]\): Wavelength of the \(k^{\mathrm{th}}\) transition.
\(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\): Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).

Rotator strength to ECD intensities

\[I_k = 4 \cdot \frac{8 \pi^3 N_A}{3 \cdot \ln(10) \cdot hc} R_k \lambda_k = 0.0435441 \cdot R_k \lambda_k\]

\(R_k \ [\times 10^{-44}\ \mathrm{esu}^2\mathrm{cm}^2]\): Rotator strength of \(k^{\mathrm{th}}\) transition.
\(\lambda_k \ [\mathrm{cm}^{-1}]\): Wavelength of the \(k^{\mathrm{th}}\) transition.
\(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\): Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).

Oscillator strength to UV intensities

Conversion from oscillator strength to signal intensity of UV spectrum is calculated as described by Gaussian4.

\[I_k = \frac{e^2 N_A}{10^3 \ln(10) m_e c^2} f_k = 2.315351857 \times 10^8 f_k\]

\(f_k\): Oscillator strength of \(k^{\mathrm{th}}\) transition.
\(I_k \ [\mathrm{L}\,\mathrm{mol}^{-1}\mathrm{cm}^{-1}]\): Theoretical (“zero-width”) peak intensity in terms of the decadic molar absorption coefficient \(\epsilon\).

4: https://gaussian.com/uvvisplot/

Raman/ROA intensities

Gaussian-provided Raman and ROA activities are used without any conversion.

Population of conformers

Population of conformers is calculated according to the Boltzmann probability distribution that “gives the probability that a system will be in a certain state as a function of that state’s energy and the temperature of the system.”5 In this context each conformer is considered one of the possible states of the system (a studied molecule).

Firstly, we calculate a Boltzmann factors for each conformer in respect to the most stable conformer (the one of the lowest energy). Boltzmann factor of two states is defined as:

\[B^a_b = \frac{F(state_a)}{F(state_b)} = e^{(E_b - E_a)/kt}\]

where:

\(E_a\) and \(E_b\): energies of states \(a\) and \(b\);
\(k = 0.0019872041 \: \mathrm{kcal/(mol*K)}\): Boltzmann constant;
\(t\): temperature of the system.

Boltzmann factor represents a ratio of probabilities of the two states being occupied. In other words, it shows how much more likely it is for the molecule to take the form of one conformer over another conformer. Having a ratio of these probabilities for each possible conformer in respect to the most stable conformer, we are able to find the distribution of conformers (probability of taking the form of each conformer):

\[p_i = \frac{B_0^i}{\sum\limits_j^{states}B_0^j}\]

assuming that \(state_0\) is the state of the lowest energy (the most stable conformer).

5: https://en.wikipedia.org/wiki/Boltzmann_distribution

RMSD of conformers

RMSD, or root-mean-square deviation of atomic positions, is used as a measure of similarity between two conformers. As its name hints, it is an average distance between atoms in the two studied conformers: the lower the RMSD value, the more similar are conformers in question.

Finding minimized value of RMSD

In a typical output of the quantum chemical calculations software, molecule is represented by a number of points (mapping to particular atoms) in a 3-dimensional space. Usually, orientation and position of the molecule in the coordinate system is arbitrary and simple overlay of the two conformers may not be the same as their optimal overlap. To neglect the effect of conformers’ rotation and shift on the similarity measure, we will look for the common reference frame and optimal alignment of atoms.

Zero-centring atomic coordinates

To find the common reference frame for two conformers we move both to the origin of the coordinate system. This is done by calculating a centroid of a conformer and subtracting it from each point representing an atom. The centroid is given as an arithmetic mean of all the atoms in the conformer:

\[a^0_i = a_i - \frac{1}{n}\sum\limits_{j=a}^{n}a_j\]

where:

\(a_i\), \(a_j\): atom’s original position in the coordinate system;
\(a^0_i\): atom’s centered position in the coordinate system;
\(n\): number of atoms in the molecule.

Rotating with Kabsch algorithm

Optimal rotation of one conformer onto another is achieved using a Kabsch algorithm6 (also known as Wahba’s problem7). Interpreting positions of each conformers’ atoms as a matrix, we find the covariance matrix \(H\) of these matrices (\(P\) and \(Q\)):

\[H = P^\intercal Q\]

and then we use the singular value decomposition (SVD)8 routine to get \(U\) and \(V\) unitary matrices.

\[H = U \Sigma V ^\intercal\]

Having these, we can calculate the optimal rotation matrix as:

\[\begin{split}R = V \begin{pmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & d\end{pmatrix} U ^\intercal\end{split}\]

where \(d = \mathrm{sign}(\mathrm{det}(VU^\intercal))\) that allows to ensure a right-handed coordinate system.

Note

To allow for calculation of th best rotation between sets of molecules and to compromise between efficiency and simplicity of implementation, tesliper uses Einstein summation convention9 via numpy.einsum() function. The implementation is as follows:

def kabsch_rotate(a: MoleculeOrList, b: MoleculeOrList) -> np.ndarray:
    """Minimize RMSD of conformers *a* and *b* by rotating molecule *a* onto *b*.
    Expects given representation of conformers to be zero-centered.
    Both *a* and *b* may be a single molecule or a set of conformers.

    Parameters
    ----------
    a : [Sequence of ]Sequence of Sequence of float
        Set of points representing atoms, that will be rotated to best match reference.
    b : [Sequence of ]Sequence of Sequence of float
        Set of points representing atoms of the reference molecule.
    """
    # this approach is probably not the most efficient
    # but lets us easily perform a matrix multiplication on stacks of matrices
    a, b = np.asanyarray(a), np.asanyarray(b)
    # calculate covariance matrix for each stacked set of points
    # for each of stacked sets of points, equivalent of:
    # >>> cov = a.T @ b
    cov = np.einsum("...ji,...jk", a, b)
    u, s, vh = np.linalg.svd(cov)  # singular value decomposition
    # if determinant is negative, swap to ensure right-handed coordinate system
    det = np.linalg.det(vh @ u)  # works with stacked matrices
    # don't introduce new dimension if not necessary
    shape = (det.size, 3, 3) if det.size > 1 else (3, 3)
    swap = np.zeros(shape)
    swap[..., np.arange(2), np.arange(2)] = 1
    swap[..., -1, -1] = np.sign(det)
    # calculate optimally rotated set/s of points `a`
    # for each of stacked sets of points, equivalent of
    # >>> rotated = a @ u @ swap @ vh
    # where u @ swap @ vh is rotation matrix
    return np.einsum("...ij,...jk,...kl,...lm", a, u, swap, vh)

6: https://en.wikipedia.org/wiki/Kabsch_algorithm
7: https://en.wikipedia.org/wiki/Wahba%27s_problem
8: https://en.wikipedia.org/wiki/Singular_value_decomposition
9: https://en.wikipedia.org/wiki/Einstein_notation

Calculating RMSD of atomic positions

Once conformers are aligned, the value of RMSD10 is calculated simply by finding a distance between each equivalent atoms and averaging their squares and finding the root of this average:

\[\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum\limits_i^n(p_i - q_i)^2}\]

where:

\(p_i\) and \(q_i\): positions of \(i\)^th equivalent atoms in conformers \(P\) and \(Q\);
\(n\): number of atoms in each conformer.

10: https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions

Comparing conformers

To compare conformers as efficiently as possible, the RMSD values are calculated not in the each-to-each scheme, but inside a rather small moving window. The size of this window determines how many calculations will be done for the whole collection.

Moving window mechanism

tesliper provides three types of moving windows: a fixed, stretching, and pyramid windows. The strategy you choose will affect both the performance and the accuracy of the RMSD sieve, as described below.

fixed: The most basic sliding window of a fixed size. Provides the most control over the performance of the sieve, but is the least accurate.
stretching: The default, allows to specify the size of the window in the context of some numeric property, usually the energy of conformers. The size may differ in the sense of the number of conformers in each window, but the difference between maximum and minimum values of said property inside a window will not be bigger than the given size. Provides a best compromise between the performance and the accuracy.
pyramid: The first window will contain the whole collection and each consecutive window will be smaller by one conformer. Allows to perform a each-to-each comparison, but in logarithmic time rather than quadratic time. Best accuracy but worst performance.

Note

The actual windows produced by sliding window functions are iterables of numpy.ndarrays of indices (that point to the value in the original array of conformers).

The sieve

The RMSD sieve function takes care of zero-centring and finding the best overlap of the conformers, as described previously. Aside form this, it works as follows: for each window, provided by one of the moving window functions described above, it takes the first conformer in the window (reference) and calculates it’s minimum RMSD value with respect to each of the other conformers in this window. Any conformers that have lower RMSD value than a given threshold, will be considered identical to the reference conformer and internally marked as not kept. The sieve returns an array of boolean value for each conformer: True if conformer’s structure is “original” and should be kept, False if it is a duplicate of other, “original” structure (at least according to threshold given), and should be discarded.

Spectra transformation

Finding best shift

Optimal offset of two spectra is determined by calculating their cross-correlation11 (understood as in the signal processing context) and finding its maximum value. Index of this max value of the discrete cross-correlation array indicates the position of one spectrum in respect to the other spectrum, in which the overlap of the two is the greatest.

11: https://en.wikipedia.org/wiki/Cross-correlation

Finding optimal scaling

Optimal scaling factor of spectra is determined by comparing a mean y values of target spectrum and a reference spectrum. Values lower than 1% of maximum absolute y value of each spectrum are ignored.