ERROR in run_hp_search_optuna when trying to use multi-GPU #27487

Closed
sstoia opened this issue Nov 14, 2023 · 13 comments · Fixed by #34073

sstoia commented Nov 14, 2023

System Info

  • transformers version: 4.28.1
  • Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.13.1 (False)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The problem appears when using the run_hp_search_optuna method from transformers/integrations.py. This method is called when performing a hyperparameter search with the Trainer.hyperparameter_search method:

best_trial = trainer.hyperparameter_search(
    direction='maximize',
    backend='optuna',
    hp_space=optuna_hp_space,
    n_trials=10,
)
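
For context, optuna_hp_space is a user-defined search-space function. A minimal sketch of a full setup that reaches run_hp_search_optuna could look like the following; the model name, toy dataset, and search ranges are illustrative assumptions, not details taken from this report:

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tiny toy dataset so the sketch is self-contained.
data = Dataset.from_dict({"text": ["good", "bad", "fine", "awful"], "label": [1, 0, 1, 0]})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=16))

def model_init(trial):
    # A fresh model has to be created for every trial.
    return AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def optuna_hp_space(trial):
    # Search space handed to trainer.hyperparameter_search(hp_space=...).
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [2, 4]),
    }

trainer = Trainer(
    model=None,               # the model comes from model_init for each trial
    model_init=model_init,
    args=TrainingArguments(output_dir="hpo_out", num_train_epochs=1),
    train_dataset=data,
    eval_dataset=data,
)

best_trial = trainer.hyperparameter_search(
    direction="maximize", backend="optuna", hp_space=optuna_hp_space, n_trials=2
)

According to the report, this works in a single process; the failure below appears when the same script is launched across multiple GPU processes.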

The error obtained is the following:

Traceback (most recent call last):
  File "/mnt/beegfs/sstoia/proyectos/LLM_finetuning_stratified_multiclass_optuna.py", line 266, in <module>
    best_trial = trainer.hyperparameter_search(
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/trainer.py", line 2592, in hyperparameter_search
    best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/integrations.py", line 218, in run_hp_search_optuna
    args = pickle.loads(bytes(args_main_rank))
_pickle.UnpicklingError: pickle data was truncated

Expected behavior

It should work, as the same function works fine without multi-GPU. I guess the problem comes from a parallelization error, as both GPUs may be writing to the same file.
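
For what it's worth, the "pickle data was truncated" message is also consistent with the non-main ranks reading the main rank's pickled training arguments through a buffer sized for a smaller payload. The snippet below is only a hypothetical illustration of that failure mode, not the library's actual code:

import pickle

# Hypothetical illustration: the receiving rank sizes its buffer from its *own*
# pickled args, while the main rank's pickled args are larger, so the bytes read
# back are a truncated prefix of the real payload.
local_args = {"learning_rate": 5e-5}
main_rank_args = {"learning_rate": 5e-5, "run_name": "trial-0", "extra": list(range(1000))}

buffer_len = len(pickle.dumps(local_args))   # buffer sized on the receiver side
payload = pickle.dumps(main_rank_args)       # what the main rank actually sends
received = payload[:buffer_len]              # "broadcast" into the too-small buffer

try:
    pickle.loads(received)
except Exception as exc:                     # unpickling the truncated bytes fails
    print(type(exc).__name__, exc)           # e.g. an UnpicklingError, as in the traceback above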

@amyeroberts
Collaborator

cc @muellerzr @pacman100

@muellerzr muellerzr self-assigned this Nov 17, 2023
@muellerzr
Contributor

Sorry for the delay, will be looking into it over this week!

@github-actions github-actions bot closed this as completed Jan 1, 2024
@amyeroberts amyeroberts reopened this Jan 2, 2024
@linhdvu14

I'm running into the same issue. Any updates on this please?

@amyeroberts
Collaborator

Gentle ping @muellerzr @pacman100

@amyeroberts
Collaborator

Another ping @muellerzr @pacman100

@NishchalPrasad

Running into the same issue, using the latest version of transformers (4.40.1) and Python 3.11.

@tomaarsen
Member

Having the same issue with my Trainer subclass when doing HPO with DDP and optuna.

@amyeroberts
Collaborator

Gentle ping @muellerzr

@svduplessis

Same issue here. I am also trying to run a hyperparameter search with DDP (accelerate launch) using Trainer and Optuna as the backend.

The following error is returned:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 1054, in <module>
[rank2]:     main()
[rank2]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 956, in main
[rank2]:     best_trial = trainer.hyperparameter_search(
[rank2]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/trainer.py", line 3206, in hyperparameter_search
[rank2]:     best_run = backend_obj.run(self, n_trials, direction, **kwargs)
[rank2]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/hyperparameter_search.py", line 72, in run
[rank2]:     return run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 237, in run_hp_search_optuna
[rank2]:     args = pickle.loads(bytes(args_main_rank))
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: _pickle.UnpicklingError: pickle data was truncated
[rank1]: Traceback (most recent call last):
[rank1]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 1054, in <module>
[rank1]:     main()
[rank1]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 956, in main
[rank1]:     best_trial = trainer.hyperparameter_search(
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/trainer.py", line 3206, in hyperparameter_search
[rank1]:     best_run = backend_obj.run(self, n_trials, direction, **kwargs)
[rank1]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/hyperparameter_search.py", line 72, in run
[rank1]:     return run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 237, in run_hp_search_optuna
[rank1]:     args = pickle.loads(bytes(args_main_rank))
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: _pickle.UnpicklingError: pickle data was truncated

@amyeroberts
Collaborator

Another ping @muellerzr @SunMarc

@SunMarc
Member

SunMarc commented Oct 10, 2024

Hey @sstoia, @tomaarsen, and everyone else who has this issue! I was able to reproduce the error and fix it with the PR above. Let me know if this works on your side!

@aakash0017

Is the library updated? I'm still having the same issue.

@SunMarc
Member

SunMarc commented Oct 24, 2024

Please install the latest version of transformers that was released today and let us know if this is fixed!
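
For anyone verifying the upgrade, a quick sanity check (the exact version carrying the fix is not named in this comment, so this only confirms what is installed):

# upgrade first, e.g. with:  pip install -U transformers
import transformers

print(transformers.__version__)  # make sure the freshly released version is the one being imported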
