ERROR in run_hp_search_optuna when trying to use multi-GPU #27487
Comments
Sorry for the delay, will be looking into it over this week!

I'm running into the same issue. Any updates on this, please?

Gentle ping @muellerzr @pacman100

Another ping @muellerzr @pacman100

Running into the same issue, using the latest version of transformers (4.40.1) and Python 3.11.

Having the same issue with my Trainer subclass when doing HPO with DDP and Optuna.

Gentle ping @muellerzr

Same issue here. Also trying to run a hyperparameter search with DDP (accelerate launch) using Trainer and Optuna as the backend; the same error is returned.

Another ping @muellerzr @SunMarc

Hey @sstoia and @tomaarsen and everyone who has this issue! I was able to reproduce the error and fix it with the PR above. Let me know if this works on your side!

Is the library updated? I'm still having the same issue.

Please install the latest version of transformers that was released today and let us know if this fixes it!
System Info
Who can help?
No response
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The problem appears when using the run_hp_search_optuna method from transformers/integrations.py. This method is called when performing a hyperparameter search with the Trainer.hyperparameter_search method. The error obtained is the following:
Traceback (most recent call last):
  File "/mnt/beegfs/sstoia/proyectos/LLM_finetuning_stratified_multiclass_optuna.py", line 266, in <module>
    best_trial = trainer.hyperparameter_search(
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/trainer.py", line 2592, in hyperparameter_search
    best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/integrations.py", line 218, in run_hp_search_optuna
    args = pickle.loads(bytes(args_main_rank))
_pickle.UnpicklingError: pickle data was truncated
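For reference, a call of this shape triggers the code path above. This is a minimal hedged sketch, not the reporter's actual script: the search-space ranges, parameter names, and trial count are illustrative assumptions.

```python
# Hedged sketch of a Trainer + Optuna hyperparameter search. All values
# below (ranges, batch sizes, n_trials) are illustrative assumptions, not
# taken from the original report.

def optuna_hp_space(trial):
    # `trial` is an optuna.Trial; suggest_float / suggest_categorical are
    # its standard sampling methods.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }

# With a transformers.Trainer built from a model_init callable, the search
# is then launched as (requires transformers and optuna installed):
#
#   best_trial = trainer.hyperparameter_search(
#       direction="maximize",
#       backend="optuna",
#       hp_space=optuna_hp_space,
#       n_trials=20,
#   )
#
# Under DDP (accelerate launch), run_hp_search_optuna serializes these
# arguments on the main rank and ships them to the other ranks, which is
# where the UnpicklingError in the traceback is raised.
```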
Expected behavior
It should work, since the same function works fine without multi-GPU. I suspect the problem comes from a parallelization error, as both GPUs may write to the same file.
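The traceback itself points at a slightly different mechanism than shared file writes: the non-main rank unpickles arguments received from the main rank, and the serialized stream it receives is incomplete. The failure mode can be reproduced with the standard library alone; this is a sketch of that mechanism, and the argument values and truncation point are illustrative assumptions.

```python
import pickle

# The hp-search arguments are pickled on the main process and shipped to the
# other ranks. If the receiving side gets fewer bytes than were sent (e.g. a
# fixed-size transfer buffer smaller than the payload), the tail of the
# stream is lost and unpickling fails exactly as in the traceback above.
args = {"learning_rate": 3e-5, "per_device_train_batch_size": 16, "seed": 42}
payload = pickle.dumps(args)

# A complete stream round-trips cleanly.
assert pickle.loads(payload) == args

# An incomplete stream loses the tail of the serialized data.
truncated = payload[: len(payload) // 2]
try:
    pickle.loads(truncated)
except pickle.UnpicklingError as exc:
    print(exc)  # pickle data was truncated
```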