
Training was successful on a single 4090 GPU, but an error was reported on 3×4090 GPUs. Why? #841

Open
orderer0001 opened this issue May 22, 2024 · 1 comment


@orderer0001

(lmflow_train) root@duxact:/data/projects/lmflow/LMFlow# ./scripts/run_finetune_with_lisa.sh \
  --model_name_or_path /data/guihunmodel8.8B \
  --dataset_path /data/projects/lmflow/case_report_data \
  --output_model_path /data/projects/lmflow/guihun_fintune_model \
  --lisa_activated_layers 1 \
  --lisa_interval_steps 20
[2024-05-22 14:32:20,602] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
Traceback (most recent call last):
  File "/data/projects/lmflow/LMFlow/examples/finetune.py", line 61, in <module>
    main()
  File "/data/projects/lmflow/LMFlow/examples/finetune.py", line 44, in main
    model_args, data_args, pipeline_args = parser.parse_args_into_dataclasses()
  File "/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 135, in __init__
  File "/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/training_args.py", line 1641, in __post_init__
    and (self.device.type == "cpu" and not is_torch_greater_or_equal_than_2_3)
  File "/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/training_args.py", line 2149, in device
    return self._setup_devices
  File "/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/utils/generic.py", line 59, in __get__
    cached = self.fget(obj)
  File "/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/training_args.py", line 2081, in _setup_devices
    self.distributed_state = PartialState(
  File "/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/accelerate/state.py", line 293, in __init__
    raise NotImplementedError(
NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
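As the error message itself suggests, a likely workaround (a sketch, not verified on this setup) is to disable NCCL's P2P and InfiniBand transports before launching the script:

```shell
# Workaround suggested by the accelerate error message: RTX 4000 series
# cards don't support P2P/IB, so disable those NCCL transports explicitly.
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

# Confirm the variables are set before rerunning the training script.
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE NCCL_IB_DISABLE=$NCCL_IB_DISABLE"

# Then rerun, e.g.:
# ./scripts/run_finetune_with_lisa.sh ...
```

Alternatively, per the same message, launching via `accelerate launch` sets these variables automatically on RTX 4000 series GPUs.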

@wheresmyhair
Collaborator

Thanks for your interest in LMFlow! We are currently working on full multi-GPU support for LISA. Please stay tuned for our latest updates, and thanks for your understanding 🙏
