
fix: data generation threading locked #330

Merged
merged 4 commits
Feb 2, 2023
Conversation

@gcroci2 gcroci2 (Collaborator) commented Jan 24, 2023

One exception (AttributeError) wasn't caught and the entire process was blocked. Now we catch all exceptions.
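The fix described here can be sketched as a per-query guard that records failures instead of letting one bad data point abort the whole run. This is a minimal illustration, not the actual deeprankcore API: `process_all`, `process_one`, and `shaky` are hypothetical names.

```python
def process_all(queries, process_one):
    """Run process_one on each query, collecting failures instead of
    letting a single bad data point abort the whole generation run."""
    results, failures = [], []
    for query in queries:
        try:
            results.append(process_one(query))
        except Exception as exc:  # broad on purpose: AttributeError included
            failures.append((query, type(exc).__name__))
    return results, failures

def shaky(query):
    # Hypothetical stand-in for per-query feature generation.
    if query == "BA-bad":
        raise AttributeError("'Chain' object has no attribute '_model'")
    return query.upper()

results, failures = process_all(["ba-1", "BA-bad", "ba-2"], shaky)
print(results)    # ['BA-1', 'BA-2']
print(failures)   # [('BA-bad', 'AttributeError')]
```

The point is that the failing query is logged and skipped, so the remaining data points still reach the HDF5 files.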

@gcroci2 gcroci2 requested a review from DaniBodor January 24, 2023 11:22
@DaniBodor DaniBodor (Collaborator) left a comment
Nice catch!

@gcroci2 gcroci2 (Collaborator, Author) commented Jan 24, 2023

Actually this is still not solved. I think the issue was introduced in PR #274. The problem is that when trying to process ~100k queries (on Snellius), at a certain point the process gets stuck and no more data points are appended to the HDF5 files. No error message is displayed, which makes it highly difficult to isolate the problem.

The error I was getting before modifying the code as in this PR was AttributeError: 'Chain' object has no attribute '_model' (never seen before the edits in PR #274, which is why I thought the issue was introduced there). Now I'm catching all exceptions, but this didn't fix the problem. My guess is that one of the threads used during the processing of the queries gets stuck for some unknown reason, and the script never ends. It stops only when the time allocated on Snellius runs out.

To reproduce: run 3D-Vac/src/3_build_db4/GNN/1_generate_features.py, changing the run_day variable and using the corresponding .sh script in the same location (3D-Vac/src/3_build_db4/GNN/1_generate_features.sh). I am using 96 CPUs per task, only 1 task, and only one node, as I did in the past. Note that with a sample dataset (1k data points) everything works perfectly, and in the past I ran the same script for 140k data points successfully. Either the pdb models have been corrupted for some reason, or the code is no longer handling some kind of exception that was handled in the past. I'd really need your help here @cbaakman.

Model IDs with detected problems: BA-115668, BA-132474, BA-503344, BA-65401. Exception type raised: KeyError. This shouldn't be the problem, though, since we catch the exception and the code should keep running.
The pdb models and the PSSMs used are on Snellius in /projects/0/einf2380/data/pMHCI/features_input_folder/exp_nmers_all_HLA_quantitative, in the pdb/ and pssm/ folders respectively.

@gcroci2 gcroci2 added the priority ("Solve this first") label Jan 24, 2023
@gcroci2 gcroci2 requested a review from cbaakman January 24, 2023 18:02
@cbaakman cbaakman (Collaborator)

The error I was getting before modifying the code as in this PR was AttributeError: 'Chain' object has no attribute '_model'.

In the deeprank code, the 'Chain' class does have an attribute '_model'. So I really don't understand where this error comes from.

I cannot reproduce this error, since I don't have the data file 'BA_pMHCI_human_quantitative_only_eq.csv'. Possibly I need other files to reproduce the error too.
Where can I find these?

@DaniBodor DaniBodor (Collaborator)

Where can I find these?

sent via slack

@cbaakman cbaakman (Collaborator)

sent via slack

Thanks! But it seems I also need pdb files and other files. Right now it prints:

Script running has started ...
Loaded CSV file containing clusters and targets data. Total number of data points is 100315.
0 PDBs found.
Selected 0 PDBs using CSV IDs (intersection).
Aligning clusters and targets data with selected PDBs IDs ...
Clusters for 0 data points loaded.
Targets for 0 data points loaded.
0 MHC PSSMs found.
Selected 0 MHC PSSMs using CSV IDs.
0 peptides PSSMs found.
Selected 0 peptides PSSMs using CSV IDs.
Verifying data consistency...
Adding 0 queries to the query collection ...
Queries ready to be processed.

@DaniBodor DaniBodor (Collaborator)

OK, probably best to wait for @gcroci2 then. She'll be back in the office tomorrow.

@gcroci2 gcroci2 (Collaborator, Author) commented Jan 27, 2023

You need to run the script on Snellius; all the files are already there (they refer to the Snellius /projects/0/einf2380/ folder).

@gcroci2 gcroci2 changed the title from "fix: catching exceptions in process method" to "fix: data generation threading locked" on Feb 1, 2023
@gcroci2 gcroci2 (Collaborator, Author) commented Feb 2, 2023

The issue was a single data point (BA-248433) that was causing the get_surface function from Bio.PDB.ResidueDepth (used in deeprankcore.features.exposure as surface = get_surface(bio_model)) to run indefinitely, likely because of an inner infinite loop.

I've handled it by using the signal module, raising a TimeoutError after 20 seconds if get_surface() doesn't finish. It isn't the cleanest solution, of course, but I am already very happy to have found the origin of this hellish bug that was causing the entire data generation to run forever.
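A minimal sketch of that signal-based guard, assuming a Unix platform and the main thread (SIGALRM has both restrictions). The wrapper name `with_timeout` is illustrative; only the 20-second limit and the get_surface call come from the description above, and the usage line is shown as a comment because the real call needs the deeprankcore data.

```python
import signal

def with_timeout(seconds, func, *args, **kwargs):
    """Run func(*args, **kwargs), raising TimeoutError if it takes
    longer than `seconds`. SIGALRM-based: Unix-only, main thread only."""
    def _handler(signum, frame):
        raise TimeoutError(f"call exceeded {seconds}s")
    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)          # arm the alarm
    try:
        return func(*args, **kwargs)
    finally:
        signal.alarm(0)            # always disarm, even on success
        signal.signal(signal.SIGALRM, previous)

# Hypothetical usage inside the feature module:
# surface = with_timeout(20, get_surface, bio_model)
```

Restoring the previous handler in the `finally` block keeps the guard from leaking alarm state into the rest of the worker.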

@gcroci2 gcroci2 requested a review from cbaakman February 2, 2023 15:22
@gcroci2 gcroci2 merged commit 67a797b into main Feb 2, 2023
@gcroci2 gcroci2 deleted the fix_query_exception branch February 2, 2023 16:18
@gcroci2 gcroci2 removed the priority ("Solve this first") label Apr 3, 2023