Speed up data loading process #376

dingquanyu · 2023-12-05T14:48:25Z

MSA files are now parsed in parallel rather than serially.

christinaflo · 2023-12-08T19:57:48Z

openfold/data/tools/parse_msa_files.py

+    parser.add_argument('--alignment_dir', type=str, help='path to alignment dir')
+    args = parser.parse_args()
+    alignment_dir = args.alignment_dir
+    stockholm_files = [i for i in os.listdir(alignment_dir) if (i.endswith('.sto') and ("hmm_output" not in i))]


Can you also exclude "uniprot_hits" here? I changed this recently; that file is only used for MSA pairing.
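
A minimal sketch of the requested filter, assuming the exclusion stays a substring match on the file name as in the diff hunk above. The names `EXCLUDED_SUBSTRINGS`, `select_stockholm_files`, and `list_stockholm_files` are hypothetical and not part of this PR:

```python
import os

# Substrings marking .sto files that should not be parsed here.
# "uniprot_hits" is skipped because it is only used for MSA pairing.
EXCLUDED_SUBSTRINGS = ("hmm_output", "uniprot_hits")


def select_stockholm_files(filenames):
    """Return the .sto names that contain none of the excluded substrings."""
    return [
        name
        for name in filenames
        if name.endswith(".sto")
        and not any(tag in name for tag in EXCLUDED_SUBSTRINGS)
    ]


def list_stockholm_files(alignment_dir):
    """Apply the filter to the contents of an alignment directory."""
    return select_stockholm_files(sorted(os.listdir(alignment_dir)))
```

Keeping the exclusions in one tuple means a further file type can be skipped later without lengthening the list comprehension.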

christinaflo · 2023-12-08T20:00:15Z

openfold/data/data_pipeline.py

-                    continue
-
-                msa_data[f] = msa
+            # Now will split the following steps into multiple processes 


If we already generated the pkl file, then we should check that it exists before re-parsing the MSAs. Or does it get removed somewhere?

Also, is there a reason we couldn't just call a function to do this instead of running the script with subprocess?
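
One way both suggestions could look, as a rough sketch rather than the code in this PR: a plain function that returns the cached pickle when it exists and otherwise parses the files in a thread pool. The function name `load_or_parse_msas`, the `parse_fn` callback, and the cache file name `parsed_msas.pkl` are assumptions made for illustration; the actual pkl name used by the PR is not shown in this thread.

```python
import os
import pickle
from concurrent.futures import ThreadPoolExecutor


def load_or_parse_msas(alignment_dir, parse_fn,
                       cache_name="parsed_msas.pkl", max_workers=4):
    """Return {file name: parsed MSA}, reusing a pickle cache if present.

    parse_fn is a hypothetical callable that takes a file path and
    returns the parsed MSA; cache_name is an assumed file name.
    """
    cache_path = os.path.join(alignment_dir, cache_name)

    # Skip re-parsing entirely when an earlier run already wrote the cache.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)

    names = sorted(n for n in os.listdir(alignment_dir) if n.endswith(".sto"))
    paths = [os.path.join(alignment_dir, n) for n in names]

    # pool.map preserves input order, so results line up with names.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parsed = list(pool.map(parse_fn, paths))

    msa_data = dict(zip(names, parsed))
    with open(cache_path, "wb") as fh:
        pickle.dump(msa_data, fh)
    return msa_data
```

Called directly from the data pipeline, this avoids spawning a separate Python process, and errors raised by `parse_fn` propagate as ordinary exceptions instead of having to be read back from a subprocess.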

Dingquan Yu added 11 commits November 30, 2023 11:51

added timing steps

aec1276

now used asynchronised version in parse_msa_data

6f3e0c0

now using multiprocessing style

2e1941a

now run in a subprocess

c3c627e

fixed errors when running in subprocess

2204bbb

now use ThreadPoolExecutor

4e58a6a

update config.py for the development for now

53c03a6

remove debugging statement

e72e4e6

moved pase_msa_file into tools subfolder

28b9e2b

reverse back to multimer branch version

08bfb1f

remove unnecessary imports and statements

78ecfc6

dingquanyu changed the title ~~Speedup data loading process~~ Speed up data loading process Dec 5, 2023

Merge branch 'multimer' into speedup-dataloader

6f26b0a

christinaflo reviewed Dec 8, 2023

christinaflo merged commit f861ff3 into aqlaboratory:multimer Dec 11, 2023
1 check passed

dingquanyu deleted the speedup-dataloader branch January 19, 2024 13:35
