Potential SMILES/coordinate discrepancy #8

Open
max-hoffman opened this issue Oct 12, 2018 · 0 comments

Hello,

I was trying to convert the ANI-1 dataset into Parquet format, and I ran into a potential mismatch between the coordinates and the SMILES string of at least one molecule (around 4k conformers).

I wrote a piece of sample code to try to isolate this first issue I ran into (Python 2.7.6 interpreter):

import os
import h5py
from pybel import readstring
import json
import numpy as np
import pandas as pd

ani_path = '.../ani'
shard3 = os.path.join(ani_path, 'ani_gdb_s03.h5')

with h5py.File(shard3, 'r') as f:
    data_dict = f['gdb11_s03/gdb11_s03-11']

    coords     = data_dict['coordinates']
    elements   = data_dict['species']
    energies   = data_dict['energies']
    smi        = ''.join(data_dict['smiles'])

    mol = readstring('smi', smi)
    # pymol_to_json is a helper from my own parser; its definition is not included here
    jmol = json.loads(pymol_to_json(mol))

    # Report the molecule if the SMILES atom count differs from the species count
    if len(jmol['atoms']) != len(elements[:]):
        print "shard: ", shard3
        print "\nmolecule: gdb11_s03/gdb11_s03-11"
        print "\nsmile: ", smi
        print "\nspecies:", elements[:]
        print "\npybel mol:", jmol
        print "\ncoordinates: ", coords.shape

with sample output:

shard:  .../ani_gdb_s03.h5
molecule: gdb11_s03/gdb11_s03-11
smile:  [H]C([H])=NN([H])[H]
species ['O' 'C' 'O' 'H' 'H']
pybel mol {u'atoms': [[1, 0], [6, 0], [1, 0], [7, 0], [7, 0], [1, 0], [1, 0]], u'bonds': [[1, 2, 1], [2, 3, 1], [2, 4, 2], [4, 5, 1], [5, 6, 1], [5, 7, 1]]}
coordinates:  (4320, 5, 3)

Only the file path should need to be filled back in for this to run. I also wrote my own parser instead of using the example code, because I was having trouble getting the iteration to behave consistently, so I may have introduced an unintended error there.

I will filter my Parquet files for similar mismatches and go ahead without them for now. If I have made an obvious mistake, or if this has already been identified, I'd still appreciate feedback.
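
For reference, a minimal sketch of the kind of filter I have in mind. It assumes each record has already been reduced to the atomic numbers from the parsed SMILES (as in the pybel output above) and the species symbols from the HDF5 file. The names `SYMBOLS`, `composition_matches` and `find_mismatches` are hypothetical, and the symbol table only covers H, C, N and O, the elements that occur in ANI-1:

```python
from collections import Counter

# Hypothetical lookup: atomic number -> symbol, limited to the ANI-1 elements.
SYMBOLS = {1: 'H', 6: 'C', 7: 'N', 8: 'O'}


def _as_text(symbol):
    # Species read from HDF5 may come back as bytes; normalise to text.
    return symbol.decode() if isinstance(symbol, bytes) else symbol


def composition_matches(atomic_numbers, species):
    """True if both descriptions contain the same elements with the same counts."""
    from_smiles = Counter(SYMBOLS[z] for z in atomic_numbers)
    from_species = Counter(_as_text(s) for s in species)
    return from_smiles == from_species


def find_mismatches(records):
    """records: iterable of (name, atomic_numbers, species) tuples.

    Returns the names whose SMILES composition disagrees with the species list.
    """
    return [name for name, zs, sp in records
            if not composition_matches(zs, sp)]
```

Applied to the molecule above, `find_mismatches([('gdb11_s03/gdb11_s03-11', [1, 6, 1, 7, 7, 1, 1], ['O', 'C', 'O', 'H', 'H'])])` returns `['gdb11_s03/gdb11_s03-11']`. Comparing element counts rather than only the number of atoms also catches cases where the lengths happen to agree but the elements differ.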

Thanks!
