fix: support alternative atom names within connect_via_residue_names #715

nscorley · 2024-12-10T04:44:09Z

Addresses #684

nscorley · 2024-12-10T04:48:51Z

@Croydon-Brixton welcome your thoughts, as I know you've thought a lot about this as well

padix-key

Hi, thanks for the fix! I answered your questions above in the review comments.

src/biotite/structure/bonds.pyx

padix-key · 2024-12-10T10:36:03Z

I found a problem with residues with mixed naming standard: For about 50% of all residues in the CCD there are conflicting names between standard and alternative, i.e. the same name is used for different atoms: For example ZZU swaps O and OXT. This means always mapping from alt to std name would mean, we accidentally identify std as alt names. Thus some bonds would not be found, even if a residue uses std nomenclature.

Hence, I would use your current approach: Only map atoms from a residue, if any atom name is not in the std list of atoms.

nscorley · 2024-12-10T19:34:30Z

I found a problem with residues with mixed naming standard: For about 50% of all residues in the CCD there are conflicting names between standard and alternative, i.e. the same name is used for different atoms: For example ZZU swaps O and OXT. This means always mapping from alt to std name would mean, we accidentally identify std as alt names. Thus some bonds would not be found, even if a residue uses std nomenclature.

Hence, I would use your current approach: Only map atoms from a residue, if any atom name is not in the std list of atoms.

Sounds like a plan. I went with the behavior to only try alternative atom names in the event we cannot map all atoms to standard names for efficiency as well — only in very rare instances (and typically only for non-canonicals/ligands) would such a mapping be needed

nscorley · 2024-12-15T01:16:12Z

Updated to address the comments; given that we only lookup alternative atom names in the rare event that we can't match all of the atoms to standard names, I thought that we didn't need a separate get_std_to_alt_atom_name_map function within atoms; happy to add if you think that would be cleaner!

padix-key

Hi, here are a few more minor comments.

I thought that we didn't need a separate get_std_to_alt_atom_name_map()

This is a good point. I would leave this decision up to you.

padix-key · 2024-12-20T12:29:37Z

src/biotite/structure/bonds.pyx

+        std_atom_ids = get_from_ccd(
+            "chem_comp_atom", 
+            res_names[curr_start_i], 
+            "atom_id"
+        ) 


get_from_ccd() returns a BinaryCIFColumn which needs to be converted into a NumPy array.

Suggested change

std_atom_ids = get_from_ccd(

"chem_comp_atom",

res_names[curr_start_i],

"atom_id"

)

std_atom_ids = get_from_ccd(

"chem_comp_atom",

res_names[curr_start_i],

"atom_id"

)

std_atom_ids = std_atom_ids.as_array() if std_atom_ids is not None else None

padix-key · 2024-12-20T12:32:31Z

src/biotite/structure/bonds.pyx

+        # all atoms using standard names
+        if (atom_names_in_res is not None and \
+            std_atom_ids is not None and \
+            not set(atom_names_in_res).issubset(std_atom_ids)):


This line is executed for each residue, so performance is important here. Let's check if something like np.isin(atom_names_in_res, std_atom_ids).all() is faster, as soon as the benchmarks in the CI have run (tests need to pass first).

padix-key · 2024-12-20T12:38:34Z

src/biotite/structure/bonds.pyx


    cdef list bonds = []
    cdef int res_i
    cdef int i, j
    cdef int curr_start_i, next_start_i
    cdef np.ndarray atom_names = atoms.atom_name
    cdef np.ndarray atom_names_in_res
+    cdef np.ndarray std_atom_ids


With the changes below this variable cannot be statically typed anymore as the type changes from BinaryCIFColumn to ndarray. We are not excessively accessing its methods anyway so there is probably only marginal performance benefit from typing it.

padix-key · 2024-12-20T12:41:30Z

tests/structure/io/test_pdbx.py

+    assert (atoms_with_alt_ids.bonds.as_array() == atoms.bonds.as_array()).all()
+
+    # Ensure atom names are the same (we mapped the alternative atom IDs to the standard atom IDs)
+    assert (atoms.atom_name == atoms_with_alt_ids.atom_name).all()


This is not true, anymore, right?

padix-key · 2024-12-20T12:53:23Z

tests/structure/io/test_pdbx.py

+def test_connect_via_residue_names_with_alt_atom_ids():
+    """
+    Ensure that the residue names are correctly parsed, even when alternative
+    atom IDs are present.
+    """
+    pdbx_file_with_no_alt_ids = pdbx.CIFFile.read(
+        join(data_dir("structure"), "6q9t.cif")
+    )
+    atoms = pdbx.get_structure(pdbx_file_with_no_alt_ids, model=1, include_bonds=True)
+
+    # ZY9 modified to have alternative atom IDs
+    pdbx_file_with_alt_ids = pdbx.CIFFile.read(
+        join(data_dir("structure"), "6q9t_with_alt_ids.cif")
+    )
+    atoms_with_alt_ids = pdbx.get_structure(
+        pdbx_file_with_alt_ids, model=1, include_bonds=True
+    )


connect_via_residue_names() is not specific to PDBx files, but can be used on any AtomArray. Hence, I propose to move such test to test_bonds.py. This would also remove the need to introduce a new test structure as you could run this test on ZY9 via structure.info.residue("ZY9") directly, where its atom names are modified before passing it to connect_via_residue_names()

fix: support alternative atom names within connect_via_residue_names

d4b0f01

nscorley force-pushed the fix/alt-atom-names-for-connect-via-residue-names branch from c6acc78 to 72da6ec Compare December 10, 2024 04:48

nscorley marked this pull request as draft December 10, 2024 04:52

padix-key requested changes Dec 10, 2024

View reviewed changes

src/biotite/structure/bonds.pyx Outdated Show resolved Hide resolved

src/biotite/structure/bonds.pyx Show resolved Hide resolved

src/biotite/structure/bonds.pyx Outdated Show resolved Hide resolved

src/biotite/structure/bonds.pyx Outdated Show resolved Hide resolved

chore: remove TODO's

02f4029

nscorley force-pushed the fix/alt-atom-names-for-connect-via-residue-names branch from 9830bee to 02f4029 Compare December 15, 2024 01:14

nscorley marked this pull request as ready for review December 15, 2024 01:17

padix-key requested changes Dec 20, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: support alternative atom names within connect_via_residue_names #715

fix: support alternative atom names within connect_via_residue_names #715

nscorley commented Dec 10, 2024 •

edited

Loading

nscorley commented Dec 10, 2024

padix-key left a comment

padix-key commented Dec 10, 2024 •

edited

Loading

nscorley commented Dec 10, 2024

nscorley commented Dec 15, 2024

padix-key left a comment

padix-key Dec 20, 2024

padix-key Dec 20, 2024

padix-key Dec 20, 2024

padix-key Dec 20, 2024

padix-key Dec 20, 2024

fix: support alternative atom names within connect_via_residue_names #715

Are you sure you want to change the base?

fix: support alternative atom names within connect_via_residue_names #715

Conversation

nscorley commented Dec 10, 2024 • edited Loading

nscorley commented Dec 10, 2024

padix-key left a comment

Choose a reason for hiding this comment

padix-key commented Dec 10, 2024 • edited Loading

nscorley commented Dec 10, 2024

nscorley commented Dec 15, 2024

padix-key left a comment

Choose a reason for hiding this comment

padix-key Dec 20, 2024

Choose a reason for hiding this comment

padix-key Dec 20, 2024

Choose a reason for hiding this comment

padix-key Dec 20, 2024

Choose a reason for hiding this comment

padix-key Dec 20, 2024

Choose a reason for hiding this comment

padix-key Dec 20, 2024

Choose a reason for hiding this comment

nscorley commented Dec 10, 2024 •

edited

Loading

padix-key commented Dec 10, 2024 •

edited

Loading