Incorrect exact mass for RIKEN/PR3* spectra #125

Open
bachi55 opened this issue Apr 24, 2020 · 1 comment

Comments

@bachi55
Contributor

bachi55 commented Apr 24, 2020

Hi,

I ran into an issue with the RIKEN/PR3* spectra: the exact mass appears to be calculated incorrectly. Consider the following example:

PR302491.txt

CH$FORMULA: C27H32O15
CH$EXACT_MASS: 596.538
CH$SMILES: C[C@@H]1O[C@@H](OC[C@H]2O[C@@H](OC3=CC(O)=C4C(=O)C[C@H](OC4=C3)C3=CC(O)=C(O)C=C3)[C@H](O)[C@@H](O)[C@@H]2O)[C@H](O)[C@H](O)[C@H]1O

If I use an online tool to calculate the exact mass from the molecular formula, I get: 596.174125 (diff ~0.36).

I also calculated the exact mass using RDKit directly from the SMILES. I get: 596.1741203239999 (diff ~0.36).

When I check the compound in PubChem (searched by InChIKey), I get: 596.17412.

In fact, the molecular weight in PubChem is very close to the exact mass reported in the spectra file: 596.5 vs. 596.538.
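The arithmetic behind this observation can be reproduced without RDKit. The sketch below is only an illustration: the helper name `formula_mass` is hypothetical, and the two mass tables are rounded standard values hard-coded for C, H and O only. It sums monoisotopic masses and average atomic weights for C27H32O15 and shows that the value in the file matches the average (molar) mass rather than the exact mass.

```python
import re

# Rounded standard values, hard-coded for this example (assumption: C, H, O only).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}  # most abundant isotope
AVERAGE = {"C": 12.011, "H": 1.008, "O": 15.999}               # average atomic weight


def formula_mass(formula, masses):
    """Sum the per-element masses of a simple formula such as 'C27H32O15'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += masses[element] * (int(count) if count else 1)
    return total


formula = "C27H32O15"
exact_mass = formula_mass(formula, MONOISOTOPIC)
molar_mass = formula_mass(formula, AVERAGE)

print(f"monoisotopic (exact) mass: {exact_mass:.4f}")  # 596.1741
print(f"average (molar) mass:      {molar_mass:.3f}")  # 596.538
```

The second number reproduces the CH$EXACT_MASS value of PR302491.txt, which suggests that the field was filled with the molar mass.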

I attached (see below) a Python script that compares the reported exact mass with the one calculated using RDKit. I ran it on the RIKEN spectra files with an absolute tolerance of 0.001. Only the PR3*.txt files seem to be affected.

I believe these files need curation.

Best regards,
Eric

import sys
import os
import glob

from math import isclose

from rdkit.Chem import MolFromSmiles
from rdkit.Chem.rdMolDescriptors import CalcExactMolWt, CalcMolFormula


MF_PATTERN = "CH$FORMULA:"
EXACT_MASS_PATTERN = "CH$EXACT_MASS:"
SMILES_PATTERN = "CH$SMILES:"


if __name__ == "__main__":
    # Directory containing the RIKEN spectra files
    idir = sys.argv[1]

    # Iterate over all ms-files in the directory
    for msfn in sorted(glob.glob(os.path.join(idir, "*.txt"))):
        # Reset per-file fields so values from a previous file are never reused
        mf_file = exact_mass_file = smiles_file = None

        with open(msfn, "r") as msfile:
            # Read information from file: Molecular Formula, Exact Mass and SMILES
            line = msfile.readline().strip()
            while line:
                # Extract molecular formula
                if line.startswith(MF_PATTERN):
                    mf_file = line[(len(MF_PATTERN) + 1):]
                # Extract exact mass
                elif line.startswith(EXACT_MASS_PATTERN):
                    exact_mass_file = float(line[(len(EXACT_MASS_PATTERN) + 1):])
                # Extract SMILES
                elif line.startswith(SMILES_PATTERN):
                    smiles_file = line[(len(SMILES_PATTERN) + 1):]

                line = msfile.readline().strip()

        # Skip records that lack any of the three fields
        if mf_file is None or exact_mass_file is None or smiles_file is None:
            continue

        # We skip molecules that are intrinsically charged, as those might not be correctly handled by rdkit
        if mf_file.endswith("+"):
            continue

        # Calculate Molecular Formula and Exact Mass from the given SMILES and compare
        mol = MolFromSmiles(smiles_file)
        if mol is None:
            # RDKit could not parse the SMILES, so there is nothing to compare
            continue
        mf_smi = CalcMolFormula(mol)
        exact_mass_smi = CalcExactMolWt(mol)

        if mf_smi != mf_file:
            print("%s: MF (ms-file vs. rdkit) '%s' - '%s'" % (os.path.basename(msfn), mf_file, mf_smi))

        if not isclose(exact_mass_file, exact_mass_smi, abs_tol=1e-3):
            print("%s: Exact Mass (ms-file vs. rdkit) %f - %f = %f" % (os.path.basename(msfn), exact_mass_file,
                                                                       exact_mass_smi, exact_mass_file - exact_mass_smi))
@tsufz
Member

tsufz commented Apr 24, 2020

Yes, this is true. The given mass is the molar mass, not the exact mass. @meier-rene, this is an issue we should check using the validator (and fix it?).
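For the possible fix, a minimal sketch of rewriting the CH$EXACT_MASS line of a record might look like the following. The helper name `replace_exact_mass` and the five-decimal output format are assumptions made for illustration; this is not how the validator works. The corrected value would come from a separate calculation, for example RDKit's `CalcExactMolWt` as used in the script above.

```python
import re

# Matches the whole CH$EXACT_MASS line of a record (one line, any value).
EXACT_MASS_LINE = re.compile(r"^CH\$EXACT_MASS:.*$", flags=re.MULTILINE)


def replace_exact_mass(record_text, new_mass, decimals=5):
    """Return record_text with its first CH$EXACT_MASS line set to new_mass.

    Hypothetical helper; the number of decimals is an assumed convention.
    """
    new_line = f"CH$EXACT_MASS: {new_mass:.{decimals}f}"
    # A function replacement keeps new_line literal (no backreference parsing).
    return EXACT_MASS_LINE.sub(lambda _match: new_line, record_text, count=1)


record = "CH$FORMULA: C27H32O15\nCH$EXACT_MASS: 596.538\n"
fixed = replace_exact_mass(record, 596.1741203239999)
print(fixed)
```

All other lines of the record are left untouched, so the change stays easy to review in a diff.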
