
fix: data generation threading locked #330

Merged
merged 4 commits
Feb 2, 2023
Conversation

@gcroci2 gcroci2 (Collaborator) commented Jan 24, 2023

One exception (AttributeError) wasn't caught and the entire process was blocked. Now we catch all exceptions.
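The fix described here can be sketched as a per-query guard that records failures instead of letting one bad data point abort the whole run. This is a minimal illustration, not the actual deeprankcore API: `process_all`, `process_one`, and `shaky` are hypothetical names.

```python
def process_all(queries, process_one):
    """Run process_one on each query, collecting failures instead of
    letting a single bad data point abort the whole generation run."""
    results, failures = [], []
    for query in queries:
        try:
            results.append(process_one(query))
        except Exception as exc:  # broad on purpose: AttributeError included
            failures.append((query, type(exc).__name__))
    return results, failures

def shaky(query):
    # Hypothetical stand-in for per-query feature generation.
    if query == "BA-bad":
        raise AttributeError("'Chain' object has no attribute '_model'")
    return query.upper()

results, failures = process_all(["ba-1", "BA-bad", "ba-2"], shaky)
print(results)    # ['BA-1', 'BA-2']
print(failures)   # [('BA-bad', 'AttributeError')]
```

The point is that the failing query is logged and skipped, so the remaining data points still reach the HDF5 files.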

@gcroci2 gcroci2 requested a review from DaniBodor January 24, 2023 11:22
@DaniBodor DaniBodor (Collaborator) left a comment
Nice catch!

@gcroci2 gcroci2 (Collaborator, Author) commented Jan 24, 2023

Actually this is still not solved. I think the issue was introduced in PR #274. The problem is that when trying to process ~100k queries (on Snellius), at a certain point the process gets stuck and no more data points are appended to the HDF5 files. No error message is displayed, which makes it highly difficult to isolate the problem.

The error I was getting before modifying the code as in this PR was AttributeError: 'Chain' object has no attribute '_model' (never seen before the edits in PR #274, which is why I thought the issue was introduced there). Now I'm catching all exceptions, but this didn't fix the problem. My guess is that one of the threads used during the processing of the queries gets stuck for some unknown reason, and the script never ends. It stops only when the time allocated on Snellius runs out.

To reproduce: run 3D-Vac/src/3_build_db4/GNN/1_generate_features.py, changing the run_day variable and using the corresponding .sh script in the same location (3D-Vac/src/3_build_db4/GNN/1_generate_features.sh). I am using 96 CPUs per task, only 1 task, and only one node, as I did in the past. Note that with a sample dataset (1k data points) everything works perfectly, and in the past I ran the same script for 140k data points successfully. Either the pdb models have been corrupted for some reason, or the code is no longer handling some kind of exception that was handled in the past. I'd really need your help here @cbaakman.

Model IDs with detected problems: BA-115668, BA-132474, BA-503344, BA-65401. Exception type raised: KeyError. This shouldn't be the problem, though, since we catch the exception and the code should keep running.
The pdb models and the PSSMs used are on Snellius in /projects/0/einf2380/data/pMHCI/features_input_folder/exp_nmers_all_HLA_quantitative, in the pdb/ and pssm/ folders respectively.

@gcroci2 gcroci2 added the priority ("Solve this first") label Jan 24, 2023
@gcroci2 gcroci2 requested a review from cbaakman January 24, 2023 18:02
@cbaakman cbaakman (Collaborator)

The error I was getting before modifying the code as in this PR was AttributeError: 'Chain' object has no attribute '_model'.

In the deeprank code, the 'Chain' class does have an attribute '_model'. So I really don't understand where this error comes from.

I cannot reproduce this error, since I don't have the data file 'BA_pMHCI_human_quantitative_only_eq.csv'. Possibly I need other files to reproduce the error too.
Where can I find these?

@DaniBodor DaniBodor (Collaborator)

Where can I find these?

sent via slack

@cbaakman cbaakman (Collaborator)

sent via slack

Thanks! But it seems I also need pdb files and other files. Right now it prints:

Script running has started ...
Loaded CSV file containing clusters and targets data. Total number of data points is 100315.
0 PDBs found.
Selected 0 PDBs using CSV IDs (intersection).
Aligning clusters and targets data with selected PDBs IDs ...
Clusters for 0 data points loaded.
Targets for 0 data points loaded.
0 MHC PSSMs found.
Selected 0 MHC PSSMs using CSV IDs.
0 peptides PSSMs found.
Selected 0 peptides PSSMs using CSV IDs.
Verifying data consistency...
Adding 0 queries to the query collection ...
Queries ready to be processed.

@DaniBodor DaniBodor (Collaborator)

OK, probably best to wait for @gcroci2 then. She'll be back in the office tomorrow.

@gcroci2 gcroci2 (Collaborator, Author) commented Jan 27, 2023

You need to run the script on Snellius; all the files are already there (they refer to the Snellius /projects/0/einf2380/ folder).

@gcroci2 gcroci2 changed the title from "fix: catching exceptions in process method" to "fix: data generation threading locked" on Feb 1, 2023
@gcroci2 gcroci2 (Collaborator, Author) commented Feb 2, 2023

The issue was a single data point (BA-248433) that was causing the get_surface function from Bio.PDB.ResidueDepth (used in deeprankcore.features.exposure as surface = get_surface(bio_model)) to run indefinitely, likely because of an inner infinite loop.

I've handled it by using the signal module, raising a TimeoutError after 20 seconds if get_surface() doesn't finish. It isn't the cleanest solution, of course, but I am already very happy to have found the origin of this hellish bug that was causing the entire data generation to run forever.
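A minimal sketch of that signal-based guard, assuming a Unix platform and the main thread (SIGALRM has both restrictions). The wrapper name `with_timeout` is illustrative; only the 20-second limit and the get_surface call come from the description above, and the usage line is shown as a comment because the real call needs the deeprankcore data.

```python
import signal

def with_timeout(seconds, func, *args, **kwargs):
    """Run func(*args, **kwargs), raising TimeoutError if it takes
    longer than `seconds`. SIGALRM-based: Unix-only, main thread only."""
    def _handler(signum, frame):
        raise TimeoutError(f"call exceeded {seconds}s")
    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)          # arm the alarm
    try:
        return func(*args, **kwargs)
    finally:
        signal.alarm(0)            # always disarm, even on success
        signal.signal(signal.SIGALRM, previous)

# Hypothetical usage inside the feature module:
# surface = with_timeout(20, get_surface, bio_model)
```

Restoring the previous handler in the `finally` block keeps the guard from leaking alarm state into the rest of the worker.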

@gcroci2 gcroci2 requested a review from cbaakman February 2, 2023 15:22
@gcroci2 gcroci2 merged commit 67a797b into main Feb 2, 2023
@gcroci2 gcroci2 deleted the fix_query_exception branch February 2, 2023 16:18
@gcroci2 gcroci2 removed the priority ("Solve this first") label Apr 3, 2023