
training_args.parallel_mode param questions #6766

Closed · boyu-zhu opened this issue Jan 27, 2025 · 5 comments

Labels
solved This problem has been already solved

Comments

@boyu-zhu
Reminder

  • I have read the above rules and searched the existing issues.

System Info

I want to use train.py to run the training process, but it keeps failing with the following error message:

File "/mnt/remote-data/zby/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 52, in _training_function
   model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                            ^^^^^^^^^^^^^^^^^^^^
 File "/mnt/remote-data/zby/LLaMA-Factory-main/src/llamafactory/hparams/parser.py", line 222, in get_train_args
   raise ValueError("Please launch distributed training with `llamafactory-cli` or `torchrun`.")
ValueError: Please launch distributed training with `llamafactory-cli` or `torchrun`.

The code causing the error, in parser.py, is:

if training_args.parallel_mode == ParallelMode.NOT_DISTRIBUTED:
    raise ValueError("Please launch distributed training with `llamafactory-cli` or `torchrun`.")

The command I use to run the script is:

python src/train.py \
--stage sft \
--do_train True \
--model_name_or_path /mnt/remote-data/downloads/models/LLM-Research/Llama-3.2-3B-Instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template llama3 \
--flash_attn auto \
--dataset_dir data \
--dataset alpaca_en_demo \
--cutoff_len 2048 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--report_to none \
--output_dir saves/Llama-3.2-3B-Instruct/lora/train_2025-01-26-23-45-45 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all
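
One variant I am considering for debugging (my own assumption, not something from the docs): launching the same script through torchrun with a single process, so that the distributed environment variables (RANK, WORLD_SIZE, LOCAL_RANK) are set:

torchrun --nnodes 1 --nproc_per_node 1 src/train.py \
    --stage sft \
    --do_train True \
    ...  # remaining arguments unchanged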

I have questions about training_args.parallel_mode. I searched the code but couldn't find where it is given a default value, and I also didn't find any docs or issues about how it should be set. Can anyone help?

Reproduction

Put your message here.

Others

No response

@boyu-zhu boyu-zhu added bug Something isn't working pending This problem is yet to be addressed labels Jan 27, 2025
@hiyouga
Owner

hiyouga commented Jan 30, 2025

Please use llamafactory-cli train instead

@hiyouga hiyouga closed this as completed Jan 30, 2025
@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working pending This problem is yet to be addressed labels Jan 30, 2025
@boyu-zhu
Author

boyu-zhu commented Feb 1, 2025

Please use llamafactory-cli train instead

@hiyouga Thanks for the reply! My original intention was to step through and debug the LLaMA-Factory training process, which is why I ran train.py directly instead of using llamafactory-cli train. Also, I'm confused about why ParallelMode.NOT_DISTRIBUTED is related to whether the CLI is used or not.

@hiyouga
Owner

hiyouga commented Feb 1, 2025

Yeah, they are related

@boyu-zhu
Author

boyu-zhu commented Feb 1, 2025

@hiyouga Could you briefly explain why this is the case? I looked over the repository but couldn't find where training_args.parallel_mode is assigned or changed.
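
My rough guess from reading the transformers source (not LLaMA-Factory code, so this may be off) is that the value is computed at runtime from the launch environment, roughly like this:

# Rough, simplified approximation of how the value seems to be resolved (my guess).
import os
import torch

def guess_parallel_mode() -> str:
    if "LOCAL_RANK" in os.environ or int(os.environ.get("WORLD_SIZE", "1")) > 1:
        return "DISTRIBUTED"      # torchrun (and the CLI wrapper) export these variables
    if torch.cuda.device_count() > 1:
        return "NOT_DISTRIBUTED"  # plain `python` with several GPUs -> the check in parser.py raises
    return "NOT_PARALLEL"         # plain `python`, single GPU or CPU

print(guess_parallel_mode())

If that reading is right, it would explain why the launcher matters: torchrun sets those variables, while python src/train.py does not.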

@hiyouga
Owner

hiyouga commented Feb 1, 2025

See FAQs for debug llamafactory #4614
