
Why do we need to take care of num_process in lr_scheduler? #1382

Closed
Hannibal046 opened this issue May 2, 2023 · 10 comments
Hannibal046 commented May 2, 2023

Hi, I am a little bit confused about the behavior of the lr_scheduler, especially this piece of code:

# Otherwise the training dataloader batch size was multiplied by `num_processes`, so we need to do
# num_processes steps per training step
num_processes = AcceleratorState().num_processes
for _ in range(num_processes):
    # Special case when using OneCycle and `drop_last` was not used
    if hasattr(self.scheduler, "total_steps"):
        if self.scheduler._step_count <= self.scheduler.total_steps:
            self.scheduler.step(*args, **kwargs)
    else:
        self.scheduler.step(*args, **kwargs)

In DDP mode, gradient_accumulation_steps and additional GPUs essentially increase the batch size. Suppose the batch_size we pass to the dataloader is 10, we have 8 GPUs, and gradient_accumulation_steps is 4. The effective batch size is then 10 * 8 * 4 = 320, and that, I believe, is when we should take one optimizer step and one lr scheduler step. But the code above would take more than one lr scheduler step.

Briefly speaking, my question is: why does the lr_scheduler need to depend on the number of GPUs when we have already assigned a value to warmup_steps in the user config?
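
To make the numbers concrete, here is a small sketch (the values are just the example above, and the comments reflect my expectation, not necessarily the intended behavior):

# just the example numbers from above, not a general rule
per_device_batch_size = 10
num_gpus = 8
gradient_accumulation_steps = 4

# 320 samples are consumed per optimizer step
effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps  # 320

# my expectation: one lr_scheduler step per optimizer step,
# but the prepared scheduler calls self.scheduler.step() num_processes times per invocation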

@Hannibal046 (Author)
A related problem is discussed here: huggingface/transformers#18436 (comment)

sgugger (Collaborator) commented May 2, 2023

That's because the lr_scheduler has been defined taking into account the unprepared dataloader which has the original batch size.
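
To illustrate with made-up numbers (rough bookkeeping only, not the exact library internals):

# rough bookkeeping with made-up numbers: 80,000 samples, batch size 10, 8 processes
dataset_size = 80_000
per_device_batch_size = 10
num_processes = 8

unprepared_len = dataset_size // per_device_batch_size  # 8000 batches: what the scheduler was sized for
prepared_len = unprepared_len // num_processes           # 1000 batches actually seen per process per epoch

# stepping once per prepared batch would leave the scheduler at step 1000 instead of 8000,
# so the prepared scheduler steps num_processes times per call to end up at 8000 again
assert prepared_len * num_processes == unprepared_len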

@Hannibal046 (Author)
But in my case (2 GPUs), the size of the train_dataloader doesn't change after the prepare method. I believe the DDP sampler wouldn't change the size of the train_dataloader?

BTW, thanks for the prompt response!

@Hannibal046 (Author)
So, if I want to set warmup_steps to 4000 and I want my lr to peak at exactly step 4000 (no matter how many GPUs or what batch size I set), what would you suggest as the best practice?

sgugger (Collaborator) commented May 2, 2023

@Hannibal046 Yes, your dataloader size will be divided by 2 after you have prepared it on 2 GPUs, because your total batch size has been multiplied by 2. If you want your scheduler to peak at 4000 steps independently of the number of GPUs, you shouldn't send it to accelerator.prepare, but then it won't get to 0 at the end of training.
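
In other words (numbers purely for illustration):

# purely illustrative, with warmup_steps = 4000 and 2 processes
num_warmup_steps = 4000
num_processes = 2

# prepared: each lr_scheduler.step() advances the underlying schedule num_processes times,
# so the peak lr is reached after 4000 / 2 = 2000 optimizer steps
steps_to_peak_prepared = num_warmup_steps // num_processes  # 2000

# unprepared: one underlying step per call, so the peak comes after exactly 4000 optimizer steps,
# but the schedule will not have decayed to 0 by the time the (shorter) prepared dataloader is exhausted
steps_to_peak_unprepared = num_warmup_steps                 # 4000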

@Hannibal046 (Author)
Sorry, it was my mistake. The dataloader size does change after the prepare method.

Could you please give some conceptual guidance on how to set warmup_steps in Accelerate? Take the paper Attention Is All You Need as an example: it says the model was trained on 8 GPUs with warmup_steps of 4000. So I should pass 4000*8 to the scheduler function?
(screenshot of the relevant passage from the paper)

from transformers import get_polynomial_decay_schedule_with_warmup

lr_scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=4000 * 8 * cfg.trainer.gradient_accumulation_steps,
    num_training_steps=total_steps,
)

@Hannibal046 (Author)
If that is the case, what if I only have 4 GPUs? Typically, I would keep per_device_train_batch_size unchanged and double my gradient_accumulation_steps to match the total batch size of 8 GPUs. In Accelerate, do I need to change warmup_steps?
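
For example, with the numbers above:

# same effective batch size with half the GPUs and double the accumulation steps
assert 10 * 8 * 4 == 10 * 4 * 8 == 320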

@Hannibal046 (Author)
Hi @sgugger, would you mind sparing some time to check this?

sgugger (Collaborator) commented May 4, 2023

I don't really get what you want me to say. It looks like you want a scheduler that always does the same number of steps regardless of distributed training/gradient accumulation, so maybe don't pass it to accelerator.prepare?

@Hannibal046 (Author)
Hi @sgugger, really sorry for bothering you. I am new to Accelerate, and the prepare method works like a magic black box for handling multiple GPUs (which is what it is intended to be). So the obvious solution of not preparing the lr_scheduler was not so obvious to me.

For anyone who is looking for a fixed number of warmup steps in Accelerate, a simple workaround is:

# create the scheduler as usual, but do NOT pass it to accelerator.prepare
lr_scheduler = get_scheduler(...)

for batch in dataloader:
    with accelerator.accumulate(model):
        loss = compute_loss(batch)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        # only advance the schedule on real (non-skipped) optimizer steps
        if accelerator.sync_gradients and not accelerator.optimizer_step_was_skipped:
            lr_scheduler.step()
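
With this workaround the scheduler advances exactly once per real optimizer step, so (if I understand correctly) the warmup and training step counts should be given in optimizer steps, independent of the number of GPUs. Something like the following, where total_optimizer_steps is just a placeholder:

# sketch only: counts are in optimizer steps, independent of num_processes
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=4000,
    num_training_steps=total_optimizer_steps,  # placeholder for the total number of optimizer steps
)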
