-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: support alternative atom names within connect_via_residue_names #715
base: main
Are you sure you want to change the base?
fix: support alternative atom names within connect_via_residue_names #715
Conversation
c6acc78
to
72da6ec
Compare
@Croydon-Brixton welcome your thoughts, as I know you've thought a lot about this as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the fix! I answered your questions above in the review comments.
I found a problem with residues with mixed naming standard: For about 50% of all residues in the CCD there are conflicting names between standard and alternative, i.e. the same name is used for different atoms: For example Hence, I would use your current approach: Only map atoms from a residue, if any atom name is not in the std list of atoms. |
Sounds like a plan. I went with the behavior to only try alternative atom names in the event we cannot map all atoms to standard names for efficiency as well — only in very rare instances (and typically only for non-canonicals/ligands) would such a mapping be needed |
9830bee
to
02f4029
Compare
Updated to address the comments; given that we only lookup alternative atom names in the rare event that we can't match all of the atoms to standard names, I thought that we didn't need a separate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, here are a few more minor comments.
I thought that we didn't need a separate
get_std_to_alt_atom_name_map()
This is a good point. I would leave this decision up to you.
std_atom_ids = get_from_ccd( | ||
"chem_comp_atom", | ||
res_names[curr_start_i], | ||
"atom_id" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_from_ccd()
returns a BinaryCIFColumn
which needs to be converted into a NumPy array.
std_atom_ids = get_from_ccd( | |
"chem_comp_atom", | |
res_names[curr_start_i], | |
"atom_id" | |
) | |
std_atom_ids = get_from_ccd( | |
"chem_comp_atom", | |
res_names[curr_start_i], | |
"atom_id" | |
) | |
std_atom_ids = std_atom_ids.as_array() if std_atom_ids is not None else None |
# all atoms using standard names | ||
if (atom_names_in_res is not None and \ | ||
std_atom_ids is not None and \ | ||
not set(atom_names_in_res).issubset(std_atom_ids)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is executed for each residue, so performance is important here. Let's check if something like np.isin(atom_names_in_res, std_atom_ids).all()
is faster, as soon as the benchmarks in the CI have run (tests need to pass first).
|
||
cdef list bonds = [] | ||
cdef int res_i | ||
cdef int i, j | ||
cdef int curr_start_i, next_start_i | ||
cdef np.ndarray atom_names = atoms.atom_name | ||
cdef np.ndarray atom_names_in_res | ||
cdef np.ndarray std_atom_ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changes below this variable cannot be statically typed anymore as the type changes from BinaryCIFColumn
to ndarray
. We are not excessively accessing its methods anyway so there is probably only marginal performance benefit from typing it.
assert (atoms_with_alt_ids.bonds.as_array() == atoms.bonds.as_array()).all() | ||
|
||
# Ensure atom names are the same (we mapped the alternative atom IDs to the standard atom IDs) | ||
assert (atoms.atom_name == atoms_with_alt_ids.atom_name).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not true, anymore, right?
def test_connect_via_residue_names_with_alt_atom_ids(): | ||
""" | ||
Ensure that the residue names are correctly parsed, even when alternative | ||
atom IDs are present. | ||
""" | ||
pdbx_file_with_no_alt_ids = pdbx.CIFFile.read( | ||
join(data_dir("structure"), "6q9t.cif") | ||
) | ||
atoms = pdbx.get_structure(pdbx_file_with_no_alt_ids, model=1, include_bonds=True) | ||
|
||
# ZY9 modified to have alternative atom IDs | ||
pdbx_file_with_alt_ids = pdbx.CIFFile.read( | ||
join(data_dir("structure"), "6q9t_with_alt_ids.cif") | ||
) | ||
atoms_with_alt_ids = pdbx.get_structure( | ||
pdbx_file_with_alt_ids, model=1, include_bonds=True | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
connect_via_residue_names()
is not specific to PDBx files, but can be used on any AtomArray
. Hence, I propose to move such test to test_bonds.py
. This would also remove the need to introduce a new test structure as you could run this test on ZY9
via structure.info.residue("ZY9")
directly, where its atom names are modified before passing it to connect_via_residue_names()
Addresses #684