
sasascore module #862

Open

wants to merge 35 commits into base: main
Conversation

mgiulini
Contributor

@mgiulini mgiulini commented Apr 16, 2024

You are about to submit a new Pull Request. Before continuing make sure you read the contributing guidelines and that you comply with the following criteria:

  • You have stuck to Python. Please talk to us before adding other programming languages to HADDOCK3
  • Your PR is about CNS
  • Your code is well documented: proper docstrings and explanatory comments for those tricky parts
  • You structured the code into small functions as much as possible. You can use classes if there is a (state) purpose
  • Your code follows our coding style
  • You wrote tests for the new code
  • tox tests pass. Run tox command inside the repository folder
  • -test.cfg examples execute without errors. Inside examples/ run python run_tests.py -b
  • PR does not add any dependencies, unless permission granted by the HADDOCK team
  • PR does not break licensing
  • Your PR is about writing documentation for already existing code 🔥
  • Your PR is about writing tests for already existing code :godmode:

Closes #861 by creating a sasascore module, which allows scoring PDB files against existing accessibility information.

As an example, if some glycosylation sites on chain A (say residues 40 and 50) are known to be preserved upon complex formation, a penalty can be added if they are buried in the resulting model. At the same time, if some residues are known to be buried in the complex, we can impose a penalty if they are accessible.

An example application of the module:

[sasascore]
resdic_accessible_A = [40,50]
resdic_buried_B = [22,23,24]

This will create a sasascore.tsv file, analogous to the other scoring tsv files.
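The `resdic_*` parameter names encode both the constraint type and the chain ID. A minimal sketch of how such parameters could be parsed (the function name and the exact splitting logic are illustrative assumptions, not the module's verbatim code):

```python
# Hypothetical parser for resdic_* parameters: the suffix after the last
# underscore is the chain ID, the middle part says whether the residues
# must stay accessible or buried. Not the module's actual implementation.

def parse_resdic_params(params):
    """Split resdic_* parameters into accessible/buried per-chain dicts."""
    accessible, buried = {}, {}
    for key, residues in params.items():
        if not key.startswith("resdic_"):
            continue
        _, kind, chain = key.split("_", 2)  # e.g. "resdic", "accessible", "A"
        target = accessible if kind == "accessible" else buried
        target[chain] = list(residues)
    return accessible, buried

acc, bur = parse_resdic_params(
    {"resdic_accessible_A": [40, 50], "resdic_buried_B": [22, 23, 24]}
)
# acc == {"A": [40, 50]}, bur == {"B": [22, 23, 24]}
```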

structure       original_name   md5     score
cluster_1_model_1.pdb   emref_1.pdb     None    2
cluster_1_model_2.pdb   emref_2.pdb     None    2
cluster_2_model_1.pdb   emref_4.pdb     None    5
cluster_3_model_1.pdb   emref_3.pdb     None    5
cluster_2_model_2.pdb   emref_5.pdb     None    5

Here the score is the number of times the input information has not been satisfied (the lower the better). A file named violations.tsv is also produced, giving a detailed picture of the violations:

structure             bur_A acc_B
cluster_1_model_1.pdb 40 22
cluster_1_model_2.pdb 50 23 
cluster_2_model_1.pdb 40,50 22,23,24
cluster_3_model_1.pdb 40,50 22,23,24  
cluster_2_model_2.pdb 40,50 22,23,24
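The scoring logic described above can be sketched as counting constraint violations against per-residue relative SASA. The function name, the 0.25 cutoff, and the input shapes below are assumptions for illustration only:

```python
# Hedged sketch of the penalty computation: a residue that should be
# accessible but falls below a relative-SASA cutoff counts as a violation,
# and vice versa for residues that should be buried. The 0.25 cutoff is an
# assumed threshold, not necessarily the module's default.

CUTOFF = 0.25  # assumed relative-SASA threshold separating buried/accessible

def count_violations(rel_sasa, accessible, buried, cutoff=CUTOFF):
    """Return (score, buried_violations, accessible_violations)."""
    # residues that should be accessible but are buried in the model
    bur_viols = {r for r in accessible if rel_sasa.get(r, 0.0) < cutoff}
    # residues that should be buried but are accessible in the model
    acc_viols = {r for r in buried if rel_sasa.get(r, 0.0) >= cutoff}
    return len(bur_viols) + len(acc_viols), bur_viols, acc_viols

# Example: residues 40 and 50 should stay accessible, 22 should stay buried
rel_sasa = {40: 0.10, 50: 0.60, 22: 0.40}
score, bur, acc = count_violations(rel_sasa, accessible={40, 50}, buried={22})
# score == 2: residue 40 is buried and residue 22 is accessible
```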

@mgiulini mgiulini self-assigned this Apr 16, 2024
@mgiulini mgiulini added enhancement Enhancing an existing feature of adding a new one community contributions from people outside the haddock team m|accscoring related to accessibility (a posteriori) scoring labels Apr 16, 2024
Member

@amjjbonvin amjjbonvin left a comment


I would suggest renaming this module to sasascore

@mgiulini mgiulini changed the title Accscoring module sasascore module Apr 25, 2024
@mgiulini mgiulini requested a review from VGPReys August 6, 2024 12:13
VGPReys
VGPReys previously approved these changes Aug 6, 2024
Contributor

@VGPReys VGPReys left a comment


Code base: OK

still have to test it

mgiulini and others added 6 commits August 7, 2024 09:40
Co-authored-by: Victor Reys <132575181+VGPReys@users.noreply.github.com>
@mgiulini mgiulini marked this pull request as ready for review August 7, 2024 07:42
@@ -214,10 +223,12 @@ def add_calc_accessibility_arguments(calc_accessibility_subcommand):
        'PNS': 78.11,
    }
}
DEFAULT_PROBE_RADIUS = 1.4


def get_accessibility(
Member


Best to move this inside the sasascore module and call it here instead of the other way around, since these CLIs are marked for deletion.

Contributor Author


I can move it to the restraints library, imo a better place than the module file

Comment on lines +63 to +69
# Combine files
with open(output_fname, 'w') as out_file:
    for core in range(ncores):
        tmp_file = Path(path, f"{key}_" + str(core) + ".tsv")
        with open(tmp_file) as infile:
            out_file.write(infile.read())
        tmp_file.unlink()
Member


This pattern looks very similar to what you had before in the caprieval module, which we changed to produce fewer files. See the implementation I did there: https://github.com/haddocking/haddock3/blob/main/src/haddock/modules/analysis/caprieval/capri.py#L905

You can extract the values directly from the scheduler object, no need to have these intermediate files
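The suggestion above can be sketched in miniature: keep results on the task objects and collect them in memory after execution, instead of having each worker write a per-core file that is merged afterwards. The `Task` class, its attributes, and the sequential loop below are stand-ins for HADDOCK3's actual scheduler machinery, shown only to illustrate the pattern:

```python
# Hedged sketch (not HADDOCK3 code): each task stores its result on itself;
# after the scheduler runs, results are gathered directly from the task
# objects, so no intermediate .tsv files need to be written and merged.

class Task:
    def __init__(self, model):
        self.model = model
        self.result = None

    def run(self):
        # placeholder computation standing in for the real per-model scoring
        self.result = (self.model, len(self.model))

tasks = [Task(m) for m in ("emref_1.pdb", "emref_2.pdb")]
for t in tasks:  # a real scheduler would run these in parallel
    t.run()

rows = [t.result for t in tasks]  # values retrieved in memory, no file merge
```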

Contributor Author


But that holds only for the scheduler, doesn't it? And here there are far fewer intermediate files. OK though, I will add this other call.

Member


Something is producing several files, and they are being merged here, right? The thing producing those files is what should go through the scheduler, and then here, instead of merging the files, you just retrieve the values.

Contributor Author


I am checking this over and over, and I think this induces code duplication everywhere: caprieval now has both the rearrange_output function (when less_io = false) and extract_data_from_capri_class (when less_io = true), both doing the same thing in completely different ways.
Everything would be much easier if fewer output files were generated by each module.

Member


It does, yes, but it's already part of our technological debt because of how people decided this workflow execution would be done. Unfortunately there is nothing we can do at this point except work around it 🤷🏽

But this pattern is in many places already, so just add a comment in the code and add it to this list here: #970

Contributor Author


There is something that can be done, which is simply having far fewer files produced by each analysis module. The code duplication this less_io pattern introduces is not worth the advantage, imo. Probably something to be discussed in a code meeting. I won't add anything to that PR for the time being.

# initialize jobs
sasascore_jobs: list[AccScoreJob] = []

for core in range(ncores):
Member


I'm a bit confused here: why are you doing this per core and not per model?

Contributor Author


We don't want to launch hundreds or thousands of extremely short jobs, but rather concatenate them so that each core deals with a decent number of jobs.
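The batching idea can be sketched as follows; this is an illustrative helper (`chunk_models` is a hypothetical name, not the module's actual function) showing how many short per-model tasks could be grouped into one job per core:

```python
# Illustrative sketch (not HADDOCK3's actual code): split the list of models
# into at most `ncores` contiguous, near-even chunks, so each worker handles
# a batch instead of one extremely short job per model.

def chunk_models(models, ncores):
    """Split `models` into at most `ncores` contiguous, near-even chunks."""
    ncores = min(ncores, len(models)) or 1  # never more chunks than models
    size, extra = divmod(len(models), ncores)
    chunks, start = [], 0
    for core in range(ncores):
        # the first `extra` chunks get one additional model each
        end = start + size + (1 if core < extra else 0)
        chunks.append(models[start:end])
        start = end
    return chunks

print(chunk_models(list(range(10)), 4))  # → [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```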

Member

@rvhonorato rvhonorato Aug 7, 2024


This is already handled by the scheduler:

job_list = split_tasks(sorted_task_list, self.num_processes)

so it's not the responsibility of a scoring module to sort tasks; that's what the library is for. Here you can just loop over the models.

Contributor Author


@rvhonorato I have been thinking about this, and I don't think this way of handling the tasks is a good idea (related also to the caprieval code linked above). If the running mode is not local and less_io is not active (the if clause in the caprieval output), this still amounts to creating a huge number of files (for example 10k in the higher-sampling scenarios), which can be avoided by simply parallelising internally over the number of cores (much lower in general). Additionally, for super-short tasks like this one, this probably makes things faster. What do you think?

Member


I just found out that this pattern appears in many other places in the code, see here: #970

So please add this one to the list there and I can handle them all in another PR.

assert a_viols == {"A": set(), "B": set()}


def test_prettiy_df():
Member

typo here?

Development

Successfully merging this pull request may close these issues: accessibility based scoring (#861).

4 participants