Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support alternative atom names within connect_via_residue_names #715

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

nscorley
Copy link
Contributor

@nscorley nscorley commented Dec 10, 2024

Addresses #684

@nscorley nscorley force-pushed the fix/alt-atom-names-for-connect-via-residue-names branch from c6acc78 to 72da6ec Compare December 10, 2024 04:48
@nscorley
Copy link
Contributor Author

@Croydon-Brixton welcome your thoughts, as I know you've thought a lot about this as well

@nscorley nscorley marked this pull request as draft December 10, 2024 04:52
Copy link
Member

@padix-key padix-key left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the fix! I answered your questions above in the review comments.

src/biotite/structure/bonds.pyx Outdated Show resolved Hide resolved
src/biotite/structure/bonds.pyx Show resolved Hide resolved
src/biotite/structure/bonds.pyx Outdated Show resolved Hide resolved
src/biotite/structure/bonds.pyx Outdated Show resolved Hide resolved
@padix-key
Copy link
Member

padix-key commented Dec 10, 2024

I found a problem with residues with mixed naming standard: For about 50% of all residues in the CCD there are conflicting names between standard and alternative, i.e. the same name is used for different atoms: For example ZZU swaps O and OXT. This means always mapping from alt to std name would mean, we accidentally identify std as alt names. Thus some bonds would not be found, even if a residue uses std nomenclature.

Hence, I would use your current approach: Only map atoms from a residue, if any atom name is not in the std list of atoms.

@nscorley
Copy link
Contributor Author

I found a problem with residues with mixed naming standard: For about 50% of all residues in the CCD there are conflicting names between standard and alternative, i.e. the same name is used for different atoms: For example ZZU swaps O and OXT. This means always mapping from alt to std name would mean, we accidentally identify std as alt names. Thus some bonds would not be found, even if a residue uses std nomenclature.

Hence, I would use your current approach: Only map atoms from a residue, if any atom name is not in the std list of atoms.

Sounds like a plan. I went with the behavior to only try alternative atom names in the event we cannot map all atoms to standard names for efficiency as well — only in very rare instances (and typically only for non-canonicals/ligands) would such a mapping be needed

@nscorley nscorley force-pushed the fix/alt-atom-names-for-connect-via-residue-names branch from 9830bee to 02f4029 Compare December 15, 2024 01:14
@nscorley
Copy link
Contributor Author

Updated to address the comments; given that we only lookup alternative atom names in the rare event that we can't match all of the atoms to standard names, I thought that we didn't need a separate get_std_to_alt_atom_name_map function within atoms; happy to add if you think that would be cleaner!

@nscorley nscorley marked this pull request as ready for review December 15, 2024 01:17
Copy link
Member

@padix-key padix-key left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, here are a few more minor comments.

I thought that we didn't need a separate get_std_to_alt_atom_name_map()

This is a good point. I would leave this decision up to you.

Comment on lines +1652 to +1656
std_atom_ids = get_from_ccd(
"chem_comp_atom",
res_names[curr_start_i],
"atom_id"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_from_ccd() returns a BinaryCIFColumn which needs to be converted into a NumPy array.

Suggested change
std_atom_ids = get_from_ccd(
"chem_comp_atom",
res_names[curr_start_i],
"atom_id"
)
std_atom_ids = get_from_ccd(
"chem_comp_atom",
res_names[curr_start_i],
"atom_id"
)
std_atom_ids = std_atom_ids.as_array() if std_atom_ids is not None else None

# all atoms using standard names
if (atom_names_in_res is not None and \
std_atom_ids is not None and \
not set(atom_names_in_res).issubset(std_atom_ids)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is executed for each residue, so performance is important here. Let's check if something like np.isin(atom_names_in_res, std_atom_ids).all() is faster, as soon as the benchmarks in the CI have run (tests need to pass first).


cdef list bonds = []
cdef int res_i
cdef int i, j
cdef int curr_start_i, next_start_i
cdef np.ndarray atom_names = atoms.atom_name
cdef np.ndarray atom_names_in_res
cdef np.ndarray std_atom_ids
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changes below this variable cannot be statically typed anymore as the type changes from BinaryCIFColumn to ndarray. We are not excessively accessing its methods anyway so there is probably only marginal performance benefit from typing it.

assert (atoms_with_alt_ids.bonds.as_array() == atoms.bonds.as_array()).all()

# Ensure atom names are the same (we mapped the alternative atom IDs to the standard atom IDs)
assert (atoms.atom_name == atoms_with_alt_ids.atom_name).all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true, anymore, right?

Comment on lines +217 to +233
def test_connect_via_residue_names_with_alt_atom_ids():
"""
Ensure that the residue names are correctly parsed, even when alternative
atom IDs are present.
"""
pdbx_file_with_no_alt_ids = pdbx.CIFFile.read(
join(data_dir("structure"), "6q9t.cif")
)
atoms = pdbx.get_structure(pdbx_file_with_no_alt_ids, model=1, include_bonds=True)

# ZY9 modified to have alternative atom IDs
pdbx_file_with_alt_ids = pdbx.CIFFile.read(
join(data_dir("structure"), "6q9t_with_alt_ids.cif")
)
atoms_with_alt_ids = pdbx.get_structure(
pdbx_file_with_alt_ids, model=1, include_bonds=True
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connect_via_residue_names() is not specific to PDBx files, but can be used on any AtomArray. Hence, I propose to move such test to test_bonds.py. This would also remove the need to introduce a new test structure as you could run this test on ZY9 via structure.info.residue("ZY9") directly, where its atom names are modified before passing it to connect_via_residue_names()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants