
memory overflow with large dataset preprocessing #65

Open
danielkaifeng opened this issue Oct 25, 2023 · 4 comments

@danielkaifeng

Dear author, I am trying to train the model with over 10 million datapoints. Even though I set --num-processes to 3 via molecule_generation preprocess data/merged_lib results/merged_lib_full traces/merged_lib_full --pretrained-model-path xxx_best.pkl --num-processes 3, memory usage keeps growing until it overflows.

Is there any approach to reduce memory usage for an extremely large dataset?
Thanks!

@danielkaifeng
Author

I guess the memory overflow is largely due to preprocessing with the pretrained model.
To work around this, would it be reasonable to preprocess the data without the pretrained model, but still use the pretrained model checkpoint to initialize training?

@kmaziarz
Collaborator

It would be surprising if plugging in the pretrained model checkpoint was to blame here (but maybe that is the case, I'm not sure). If you want to use the checkpoint for training initialization, then the atom metadata (e.g. atom type / motif vocabulary) has to be kept in sync; this is why the checkpoint has to be provided during preprocessing.

Two thoughts:

  • Do you actually need to start with a pretrained model if you want to train on 10M samples? For reference, our pretrained model was trained on Guacamol with ~1M samples, so this many samples should be more than enough to just train from scratch. Starting from a pretrained checkpoint was intended more for cases where someone wants to fine-tune on hundreds or thousands of molecules of particular interest.
  • At which point during preprocessing do you get the error? There should be an initial, shorter phase which produces *.jsonl.gz files, and then a longer phase that further processes them. Are you able to get through the first phase (i.e. get those files saved)? If so, it could be a good idea to kill the processing at that point and restart from the same directory; it would then notice the files exist and go straight to the second phase (see the sketch after this list). Separating the phases like this might prevent e.g. some resources not being freed between one phase and the other, which could reduce peak memory usage.
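A minimal sketch of that restart idea, reusing the command from the issue (the paths and checkpoint name are the original placeholders); exactly where the intermediate *.jsonl.gz files land and when to stop the first run are assumptions, so treat this as illustrative rather than a documented workflow:

```python
# Illustrative only: run the same preprocess command twice. The first run is
# stopped manually (e.g. Ctrl-C) once the intermediate *.jsonl.gz files have
# been saved; the second run should notice the existing files and go straight
# to the longer second phase, starting from a fresh process and memory footprint.
import subprocess

cmd = [
    "molecule_generation", "preprocess",
    "data/merged_lib", "results/merged_lib_full", "traces/merged_lib_full",
    "--pretrained-model-path", "xxx_best.pkl",
    "--num-processes", "3",
]

# First pass: interrupt it after the first phase finishes, so a non-zero exit
# code is expected here.
subprocess.run(cmd)

# Second pass: same command, same directories; only the second phase should run.
subprocess.run(cmd, check=True)
```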

@kmaziarz kmaziarz self-assigned this Oct 25, 2023
@kmaziarz kmaziarz added the question Request for help or information label Oct 25, 2023
@danielkaifeng
Author

The error happens during the first stage, while initializing the feature extractors and generating FeaturisedData, before any *.jsonl.gz files are written.
I think there are a few approaches that could solve this:

  1. Skip the pretrained model during preprocessing, as you mentioned; this reduces some of the memory usage.
  2. The overall memory still keeps growing during feature extraction, which I guess might be caused by storing the large list of smiles_datapoints in memory. I will try to split the FeaturisedData into batched *.jsonl.gz files and modify the training dataloader accordingly.
  3. Write the FeaturisedData datapoints to an xxx.h5 file during generation using h5py (sketched below).
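For reference, a rough sketch of what option 3 could look like with h5py, assuming each featurised datapoint can be flattened into a fixed-size float vector; featurise_iter, FEATURE_DIM, and write_features_to_h5 are hypothetical names for illustration, not part of the molecule_generation API:

```python
# Hypothetical sketch: stream featurised datapoints into an extendable HDF5
# dataset so that only a small buffer is ever held in memory.
import h5py
import numpy as np

FEATURE_DIM = 256  # assumed per-datapoint feature size


def write_features_to_h5(featurise_iter, path="features.h5", buffer_size=10_000):
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "features",
            shape=(0, FEATURE_DIM),
            maxshape=(None, FEATURE_DIM),  # growable along the first axis
            chunks=(1024, FEATURE_DIM),
            dtype="float32",
            compression="gzip",
        )
        buffer = []
        for datapoint in featurise_iter:
            buffer.append(np.asarray(datapoint, dtype="float32"))
            if len(buffer) >= buffer_size:
                _flush(dset, buffer)
                buffer.clear()
        if buffer:
            _flush(dset, buffer)


def _flush(dset, buffer):
    # Append the buffered rows to the dataset and let HDF5 write them to disk.
    start = dset.shape[0]
    dset.resize(start + len(buffer), axis=0)
    dset[start:] = np.stack(buffer)
```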

@kmaziarz
Collaborator

kmaziarz commented Nov 9, 2023

The overall memory still keeps growing during feature extraction, which I guess might be caused by storing the large list of smiles_datapoints in memory.

While the SMILES are indeed all read into memory, the processing then proceeds in an online fashion based on iterables. I think that, in principle, the processed samples do not all have to fit in memory, and 10M samples in SMILES form should not take up that much space anyway.

At the point when the code prints out the sizes of the folds and says "beginning featurization", is memory usage already high? That is the point at which all of the SMILES are already in memory. If memory is not high then but continues to grow later, it may be because the parallel processes featurise samples faster than the main process consumes them, leading to more and more samples being queued up.
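To illustrate the queuing effect (a generic sketch, not the actual preprocessing code of this package): with a bounded output queue, worker processes block once max_pending featurised samples are waiting, which caps how much can pile up while the main process catches up.

```python
# Generic producer/consumer sketch with backpressure: featurisation workers can
# only run ahead of the consumer by at most `max_pending` samples.
import multiprocessing as mp


def featurise(smiles):
    # Stand-in for featurising one SMILES string.
    return len(smiles)


def worker(in_queue, out_queue):
    for smiles in iter(in_queue.get, None):  # None is the shutdown sentinel
        out_queue.put(featurise(smiles))     # blocks while the consumer lags behind
    out_queue.put(None)


def run(smiles_list, num_processes=3, max_pending=1_000):
    in_queue = mp.Queue()
    out_queue = mp.Queue(maxsize=max_pending)  # bounded buffer = backpressure

    workers = [
        mp.Process(target=worker, args=(in_queue, out_queue))
        for _ in range(num_processes)
    ]
    for p in workers:
        p.start()

    for smiles in smiles_list:
        in_queue.put(smiles)
    for _ in workers:
        in_queue.put(None)

    finished = 0
    while finished < num_processes:
        item = out_queue.get()
        if item is None:
            finished += 1
        else:
            yield item  # consumption (e.g. writing to disk) happens one sample at a time

    for p in workers:
        p.join()


if __name__ == "__main__":
    for features in run(["CCO", "c1ccccc1", "CC(=O)O"]):
        pass  # real consumer work would happen here
```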
