ERROR in run_hp_search_optuna when trying to use multi-GPU #27487

Closed
sstoia opened this issue Nov 14, 2023 · 13 comments · Fixed by #34073

sstoia commented Nov 14, 2023

System Info

  • transformers version: 4.28.1
  • Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.13.1 (False)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The problem appears when using the run_hp_search_optuna method from transformers/integrations.py. This method is called when performing a hyperparameter search with the Trainer.hyperparameter_search method:

best_trial = trainer.hyperparameter_search(
    direction='maximize',
    backend='optuna',
    hp_space=optuna_hp_space,
    n_trials=10,
)
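
For context, optuna_hp_space is a user-defined search-space function. A minimal sketch of a full setup that reaches run_hp_search_optuna could look like the following; the model name, toy dataset, and search ranges are illustrative assumptions, not details taken from this report:

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tiny toy dataset so the sketch is self-contained.
data = Dataset.from_dict({"text": ["good", "bad", "fine", "awful"], "label": [1, 0, 1, 0]})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=16))

def model_init(trial):
    # A fresh model has to be created for every trial.
    return AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def optuna_hp_space(trial):
    # Search space handed to trainer.hyperparameter_search(hp_space=...).
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [2, 4]),
    }

trainer = Trainer(
    model=None,               # the model comes from model_init for each trial
    model_init=model_init,
    args=TrainingArguments(output_dir="hpo_out", num_train_epochs=1),
    train_dataset=data,
    eval_dataset=data,
)

best_trial = trainer.hyperparameter_search(
    direction="maximize", backend="optuna", hp_space=optuna_hp_space, n_trials=2
)

According to the report, this works in a single process; the failure below appears when the same script is launched across multiple GPU processes.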

The error obtained is the following:

Traceback (most recent call last):
  File "/mnt/beegfs/sstoia/proyectos/LLM_finetuning_stratified_multiclass_optuna.py", line 266, in <module>
    best_trial = trainer.hyperparameter_search(
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/trainer.py", line 2592, in hyperparameter_search
    best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/integrations.py", line 218, in run_hp_search_optuna
    args = pickle.loads(bytes(args_main_rank))
_pickle.UnpicklingError: pickle data was truncated

Expected behavior

It should work, as the same function works fine without multi-GPU. I guess the problem comes from a parallelization error, as both GPUs may be writing to the same file.
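
For what it's worth, the "pickle data was truncated" message is also consistent with the non-main ranks reading the main rank's pickled training arguments through a buffer sized for a smaller payload. The snippet below is only a hypothetical illustration of that failure mode, not the library's actual code:

import pickle

# Hypothetical illustration: the receiving rank sizes its buffer from its *own*
# pickled args, while the main rank's pickled args are larger, so the bytes read
# back are a truncated prefix of the real payload.
local_args = {"learning_rate": 5e-5}
main_rank_args = {"learning_rate": 5e-5, "run_name": "trial-0", "extra": list(range(1000))}

buffer_len = len(pickle.dumps(local_args))   # buffer sized on the receiver side
payload = pickle.dumps(main_rank_args)       # what the main rank actually sends
received = payload[:buffer_len]              # "broadcast" into the too-small buffer

try:
    pickle.loads(received)
except Exception as exc:                     # unpickling the truncated bytes fails
    print(type(exc).__name__, exc)           # e.g. an UnpicklingError, as in the traceback above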

@amyeroberts
Collaborator

cc @muellerzr @pacman100

@muellerzr muellerzr self-assigned this Nov 17, 2023
@muellerzr
Contributor

Sorry for the delay, will be looking into it over this week!

@github-actions github-actions bot closed this as completed Jan 1, 2024
@amyeroberts amyeroberts reopened this Jan 2, 2024
@linhdvu14

I'm running into the same issue. Any updates on this please?

@amyeroberts
Collaborator

Gentle ping @muellerzr @pacman100

@amyeroberts
Collaborator

Another ping @muellerzr @pacman100

@NishchalPrasad

Running into the same issue, using the latest version of transformers (4.40.1) and Python 3.11.

@tomaarsen
Member

Having the same issue with my Trainer subclass when doing HPO with DDP and optuna.

@amyeroberts
Collaborator

Gentle ping @muellerzr

@svduplessis

Same issue here. I am also trying to run a hyperparameter search with DDP (accelerate launch) using Trainer and Optuna as the backend.

The following error is returned:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 1054, in <module>
[rank2]:     main()
[rank2]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 956, in main
[rank2]:     best_trial = trainer.hyperparameter_search(
[rank2]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/trainer.py", line 3206, in hyperparameter_search
[rank2]:     best_run = backend_obj.run(self, n_trials, direction, **kwargs)
[rank2]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/hyperparameter_search.py", line 72, in run
[rank2]:     return run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 237, in run_hp_search_optuna
[rank2]:     args = pickle.loads(bytes(args_main_rank))
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: _pickle.UnpicklingError: pickle data was truncated
[rank1]: Traceback (most recent call last):
[rank1]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 1054, in <module>
[rank1]:     main()
[rank1]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 956, in main
[rank1]:     best_trial = trainer.hyperparameter_search(
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/trainer.py", line 3206, in hyperparameter_search
[rank1]:     best_run = backend_obj.run(self, n_trials, direction, **kwargs)
[rank1]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/hyperparameter_search.py", line 72, in run
[rank1]:     return run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 237, in run_hp_search_optuna
[rank1]:     args = pickle.loads(bytes(args_main_rank))
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: _pickle.UnpicklingError: pickle data was truncated

@amyeroberts
Collaborator

Another ping @muellerzr @SunMarc

@SunMarc
Member

SunMarc commented Oct 10, 2024

Hey @sstoia, @tomaarsen, and everyone else who has this issue! I was able to reproduce the error and fix it with the PR above. Let me know if this works on your side!

@aakash0017

Is the library updated? I'm still having the same issue.

@SunMarc
Member

SunMarc commented Oct 24, 2024

Please install the latest version of transformers that was released today and let us know if this is fixed!
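
For anyone verifying the upgrade, a quick sanity check (the exact version carrying the fix is not named in this comment, so this only confirms what is installed):

# upgrade first, e.g. with:  pip install -U transformers
import transformers

print(transformers.__version__)  # make sure the freshly released version is the one being imported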
