Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lammps molecule number #1073

Merged
merged 24 commits into from
Feb 3, 2023
Merged

Conversation

jaclark5
Copy link
Contributor

@jaclark5 jaclark5 commented Dec 11, 2022

PR Summary:

Resolves #1072

This PR achieved the following:

  1. Allow keyword arguments to be passed to Compound.to_parmed() from Compound.save()
  2. Make Compound.to_parmed(infer_residues=True) the default
  3. Allow infer_residues to trigger a search for Compounds without box definitions, which is likely a complete compound.
  4. Updated .gitignore to ignore .swp files
  5. Update pep8 compliance in unrelated docstrings in compound.py

Point 3 is achieved with a recursive hidden function so that a system could be built with a combination of assembled boxes and the compounds of each would still be found. An example of hierarchy and output could be:

System.children => [Box] # A final box to be written
Box.children => [Phase1, Phase2] # Two boxes representing two phases to be combined
Phase1.children => [Water, Water, Water,.....] # Water molecules don't have box definition
Phase2.children => [CO2, CO2,...]

residues = [Water, Water, Water,.....] + [CO2, CO2,...]
Additionally, if Phase2 was made from multiple boxes, then the search would continue to each of those tiers to find the compounds.

I've tested this locally with:

import mbuild as mb
ch4 = mb.recipes.Alkane(n=1)
c4H10 = mb.recipes.Alkane(n=4)
ch4.name = "Methane"
c4H10.name = "nButane"
box = mb.Box(lengths=[5, 5, 5])
box_of_methane = mb.fill_box(compound=[ch4,c4H10], n_compounds=[1,2], box=box)
box_of_methane.save(
   "methane_ethane_box_1.lmp", 
   overwrite=True, 
   forcefield_files='utils/OPLSaa_alkanes.xml',
   parmed_kwargs={"infer_residues": True}
)

I'll add tests after I get comments.

PR Checklist


  • Includes appropriate unit test(s)
  • Appropriate docstring(s) are added/updated
  • Code is (approximately) PEP8 compliant
  • Issue(s) raised/addressed?

@daico007
Copy link
Member

Thank you for your PR! I like this new feature/addition but it seems like the logic need to be alter a bit (just looking at the failed unit test, this method failed to detect some the residues info of an existing test case). I will try to play with that method a bit and see what might be the cause of the issue (_pull_hierarchical_attrs())

@jaclark5
Copy link
Contributor Author

Hey, I was thinking about it later, and I chose to pull residue names, but with further thought it would probably be better to use object IDs throughout this method (to_parmed), since multiply layers in the ancestry could have the same name and so name is easily ambiguous.

@daico007
Copy link
Member

That may worth a shot!

@jaclark5
Copy link
Contributor Author

Is there a more appropriate place to put _pull_hierarchical_attrs() since I tried to write it generally?

@daico007
Copy link
Member

I can see that method lives in compound.py. I believe it would be a very versatile method that can be utilized in other conversions!

@jaclark5
Copy link
Contributor Author

The reason the test failed is because the parent Compound doesn't have a _box definition that the compounds were placed into. I changed the function to find compounds using the number of particles and bonds.

@daico007
Copy link
Member

Hi @ jaclark5, do you mind adding me as a collaborator on your repo and give me push access to this branch so I can push the fixes for the failing test? I am also exploring the _pull_hierarchical_attrs method, in term of what attribute should be used as the default, since using box may not always work. (Sorry, I let this sit stale for a bit)

@@ -1147,14 +1156,44 @@ def to_parmed(
--------
parmed.structure.Structure : Details on the ParmEd Structure object
"""

def _pull_hierarchical_attrs(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the recursive approach that you are doing with this method! My only concern is that I am not sure _box == None is a good indicator of to refer a residue, since many of us often convert a compound without a box to parmed and assign the box information later (during the writing out step). One method I think might be useful is Compound.is_independent(), which basically indicate if a Compound has any external bond or not (i.e., if it's a stand-alone molecule), but it will require minor alteration of this method and might make not as versatile as you have it now (i.e., when you traveling down the hierarchy and bump into a Compound that is not independent, the residue should be its parent but not itself).

Also, I think I have tried to solve similar issue before (infer relevant residues/molecule information from the hierarchy) and came up with several different approaches:

  1. Bottom up approach, determine the lowest level Compound that form a full molecule
def _infer_molecules(compound):
    """Bottom up approach to infer residues"""
    connected_subgraph = compound.bond_graph.connected_components()
    residues = IndexedSet()
    for molecule in connected_subgraph:
        if len(molecule) == 1:
            ancestors = [molecule[0]]
        else:
            ancestors = IndexedSet(molecule[0].ancestors())
            for particle in molecule[1:]:
                # This works because particle.ancestors traversed, and hence
                # the lower level will be in the front.
                # The intersection will be left at the end,
                # ancestor of the first particle is used as reference.
                # Hence, this called will return the lowest-level Compound]
                # that is a molecule
                ancestors = ancestors.intersection(
                    IndexedSet(particle.ancestors())
                )
        residues.add(ancestors[0].name)
    return list(residues)
  1. Top down approach, always take the 2nd level Compound (if they are independent) as residue
def _infer_groups(compound):
    """Top down, second level if any is available"""
    residues = IndexedSet()
    if all(not child.children for child in compound.children):
        residues.add(compound.name)
    else:
        residues.update([child.name for child in compound.children])
    return residues

I believe the 1st approach should work in most cases, but the 2nd may be useful if you want better control of the grouping (e.g., a container make up of water, methanol and ethanol and you want to group methanol and ethanol under an alcohol group/Compound and water by itself).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the failing test, it's the test_resname_parmed in test_compound.py, specifically, when we try convert a compound (with not top level box) to parmed, so the infer_hierarchy step stops short at determining the lower level residues. A quick fix for this will be to add a line to define a box for that compound/system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I actually made many changes to this branch already too but haven't gotten back to it either. I'll wrap up my changes and push them soon. I think it would be easier to brain storm at that point.

@codecov
Copy link

codecov bot commented Jan 27, 2023

Codecov Report

Attention: Patch coverage is 95.55556% with 2 lines in your changes missing coverage. Please review.

Project coverage is 90.64%. Comparing base (3d7c93d) to head (5b3b49a).
Report is 101 commits behind head on main.

Files Patch % Lines
mbuild/conversion.py 95.55% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1073      +/-   ##
==========================================
+ Coverage   90.54%   90.64%   +0.09%     
==========================================
  Files          61       61              
  Lines        6190     6231      +41     
==========================================
+ Hits         5605     5648      +43     
+ Misses        585      583       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@daico007 daico007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new pull_residues() method! Only one minor typo, but the rest LGTM! All tests seem to pass also.

mbuild/conversion.py Outdated Show resolved Hide resolved
@daico007
Copy link
Member

Hey @CalCraven, if you have some time today, can you give this PR a review, we may be to merge it today (if not some time this week).

@CalCraven
Copy link
Contributor

Yes, I will get this reviewed today!

@jaclark5
Copy link
Contributor Author

I just updated the tests to complete the code coverage. However, something weird happened with the precommits where I need to update to python 3.10 so that isort can update to 5.12, so that poetry can function properly.

@CalCraven
Copy link
Contributor

Looks like they dropped python 3.7 recently. May have to update your environment to 3.8 so pre-commit can do it's thing.

@jaclark5
Copy link
Contributor Author

I was running 3.8.15. After a search for the error, it seems that I need to update to 3.10. There are many conflicts that Conda had been sorting through for awhile now.

@daico007
Copy link
Member

It sounds like we need to stat adding 3.10 test for mbuild (but that can be done in a different PR, I think we can ignore the pre-commit test here, assuming you already have that checks on your local). Regarding conda installation, we did realize it took a bit long to install the package with just conda and had actually started using mamba to do most of the installation in our lab, so it's probably make sense for use to update the instruction on mbuild readme.

@jaclark5
Copy link
Contributor Author

jaclark5 commented Jan 30, 2023

Is there a way of getting around the precommit locally? This feature isn't letting me commit my most recent changes to the PR to complete the code coverage.

@jaclark5
Copy link
Contributor Author

Actually I was able to get it to run by changing the version of isort in the precommit config to 5.12.0

mbuild/tests/test_compound.py Fixed Show resolved Hide resolved
Copy link
Contributor

@CalCraven CalCraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most important note is the bug related to ring structures. Other minor wording changes in documentation.

mbuild/conversion.py Show resolved Hide resolved
mbuild/conversion.py Show resolved Hide resolved
mbuild/compound.py Outdated Show resolved Hide resolved
mbuild/conversion.py Outdated Show resolved Hide resolved
mbuild/conversion.py Outdated Show resolved Hide resolved
mbuild/tests/test_compound.py Fixed Show resolved Hide resolved
@CalCraven
Copy link
Contributor

I think I have a solution for the ring structures issues noted above. Here would be the changes to pull_residues()

def catalog_bondgraph_type(compound, bond_graph=None):
    """Identify type of subgraph found at this stage of the compound.
    
    Returns
    -------
    str:
        "particle_graph" if compound is at the particle level, 
        "one_graph" if compound is a single molecule piece,
        "multiple_graphs" if compound has multiple molecules
    """
    if not compound.children:
        return "particle_graph"
    elif compound.bond_graph:
        # check at the top level
        multiple_connectionsBool = len(compound.bond_graph.connected_components()) == 1
    else:
        # at a subgraph level
        multiple_connectionsBool = len(bond_graph.subgraph(compound).connected_components()) == 1
    if multiple_connectionsBool:
        return "one_graph" 
    else: 
        return "multiple_graphs"

def pull_residues(
    compound,
    segment_level=0,
    include_base_level=False,
    bond_graph=None
):
    """Pull residues from a Compound object.
    Search class instance for completed compounds based on the number of
    particles and bonds. If for example and peptide chain with three
    individual peptides that were connected and hydrated with thirty water
    molecules, the list will contain 31 residues. However, if the water and
    the individual peptides should be labeled as individual residues, set
    ``segment_level==1`` to receive a list with 33 residues. Depending on
    the method used to assemble the peptides, this procedure may continue to
    set ``segment_level=2`` and breakdown the peptide into functional groups.
    Setting ``include_base_level==True`` will allow this procedure to
    function with coarse-grained systems, while the default behavior will
    end a search at the level before atoms are obtained.

    Parameters
    ----------
    compound : obj
        An instance of :class:`mbuild.compound.Compound`
    segment_level : int, optional, default=0
        Level of full residue architecture to be identified
    include_base_level : bool, optional, default=False
        Set whether a search should continue if the list of children are single particles.
    bond_graph: obj
        An instance of :class:`mbuild.BondGraph`. 
        The top level bondgraph that contains the compound to get residues from.

    Returns
    -------
    residuesList : list
        List of residue ids (str).
    """
    residuesList = []
    no_grandchildrenList = [not child.children for child in compound.children]
    
    compound_graphtype = catalog_bondgraph_type(compound, bond_graph=bond_graph)
    
    if (
        compound_graphtype == "one_graph" and segment_level == 0
    ):  # All atoms are bonded, at chosen tier
        residuesList.append(id(compound))
    elif compound_graphtype == "particle_graph":  # Currently at the particle level
        if include_base_level:
            residuesList.append(id(compound))
        else:
            raise ValueError(
                "Did not expect to reach single segments with argument "
                "`include_base_level=False`. Consider setting this to True, "
                "or check to see if the compound hierarchy is as expected."
            )
    elif (
        all(no_grandchildrenList) and not include_base_level
    ):  # One above base level
        residuesList.append(id(compound))
    else:
        if compound_graphtype == "multiple_graphs":  # Go to the next tier below
            if segment_level > 0:
                segment_level -= 1
            elif segment_level < 0:
                raise ValueError(
                    "`segment_level` must be greater than zero. "
                    "Something went wrong when parsing the hierachy of "
                    "the compound."
                )

        # Check the next tier of compounds.
        for i, child in enumerate(compound.children):
            if no_grandchildrenList[i]:  # single atom particle
                residuesList.append(id(child))
            else:
                residuesList.extend(
                    pull_residues(
                        child,
                        segment_level=segment_level,
                        include_base_level=include_base_level,
                        bond_graph=bond_graph
                    )
                )

    return residuesList


########TESTS#############

compound = mb.load("C1=CC=CC=C1", smiles=True)
assert catalog_bondgraph_type(compound) == "one_graph"

compound = mb.Compound([mb.load("C1=CC=CC=C1", smiles=True), mb.load("C1=CC=CC=C1", smiles=True)])
assert catalog_bondgraph_type(compound) == "multiple_graphs"

compound = mb.Compound([mb.load("C1=CC=CC=C1", smiles=True), mb.load("C1=CC=CC=C1", smiles=True)])
assert catalog_bondgraph_type(compound.children[1], compound.bond_graph) == "one_graph"

compound = mb.Compound([mb.load("C1=CC=CC=C1", smiles=True), mb.load("C1=CC=CC=C1", smiles=True)])
assert catalog_bondgraph_type(compound.children[1][0], compound.bond_graph) == "particle_graph"

compound = mb.load("C1=CC=CC=C1", smiles=True)
assert len(pull_residues(compound, compound.bond_graph)) == 1

compound = mb.Compound([mb.load("C1=CC=CC=C1", smiles=True), mb.load("C1=CC=CC=C1", smiles=True)])
assert len(pull_residues(compound, bond_graph=compound.bond_graph)) == 2

compound = mb.Compound([mb.load("C1=CC=CC=C1", smiles=True), mb.load("C1=CC=CC=C1", smiles=True)])
compound.add_bond([compound.children[0][0], compound.children[1][0]])
assert len(pull_residues(compound, bond_graph=compound.bond_graph)) == 1

compound = mb.load("CCC", smiles=True)
assert len(pull_residues(compound, bond_graph=compound.bond_graph)) == 1

compound = mb.Compound([mb.load("CCC", smiles=True), mb.load("CCC", smiles=True)])
assert len(pull_residues(compound, bond_graph=compound.bond_graph)) == 2

compound = mb.Compound([mb.load("CCC", smiles=True), mb.load("CCC", smiles=True)])
compound.add_bond([compound.children[0][0], compound.children[1][0]])
assert len(pull_residues(compound, bond_graph=compound.bond_graph)) == 1

n_waters = 100
compound = mb.fill_box(mb.load("O", smiles=True), n_waters, mb.Box([10, 10, 10]))
assert len(pull_residues(compound, bond_graph=compound.bond_graph)) == n_waters

If this looks reasonable @jaclark5, I can take these changes and tests and push them directly to your branch

@jaclark5
Copy link
Contributor Author

Hey @CalCraven, that looks good. I did a quick check of your new function with the existing tests and it results in an error since H2O doesn't have a bond graph.

@jaclark5
Copy link
Contributor Author

I added the ability to edit for maintainers, but after you merge your addition I'd like to look at it again.

@jaclark5
Copy link
Contributor Author

jaclark5 commented Feb 1, 2023

It looks good to me! Are there any test for LJ systems? I think it might be good to have a CG system tested.

@CalCraven
Copy link
Contributor

I added a set of tests for loading in coarse grained structures. I think it works okay, it can start to get a little strange when you look to use coarse grained and atomistic structures, or you create an uneven hierarchy with molecules located at multiple different levels. Probably rare use cases, but good to have some defined behavior for them.

@jaclark5
Copy link
Contributor Author

jaclark5 commented Feb 2, 2023

Oh great, I thought of one more test I'd to run this afternoon.

@jaclark5
Copy link
Contributor Author

jaclark5 commented Feb 3, 2023

There's an issue with the catalog_bondgraph_type function. Single particle "atoms" i.e., electrolytes or single bead CG solvent won't show up in that list. For an atomistic system with water, the function will return "multigraph" and then identify those electrolytes. However for a single CG polymer in a solvent of single beads, the Compound reads as "one_graph" because none of the solvent molecules are on the bond_graph list.

@jaclark5
Copy link
Contributor Author

jaclark5 commented Feb 3, 2023

Ok, I just pushed a fix, I think this PR is complete :)

Copy link
Contributor

@CalCraven CalCraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That change looks good to me. I also think this PR is ready to go @daico007

@daico007 daico007 merged commit c15fb1a into mosdef-hub:main Feb 3, 2023
@daico007
Copy link
Member

daico007 commented Feb 3, 2023

Great work! Thank you for your PR @jaclark5!

@jaclark5 jaclark5 deleted the lammps_molecule_number branch February 3, 2023 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot number compounds in write_lammpsdata
3 participants