
Why do we need to take care of num_process in lr_scheduler? #1382

Closed
Hannibal046 opened this issue May 2, 2023 · 10 comments
Hannibal046 commented May 2, 2023

Hi, I am a little bit confused about the behavior of the lr_scheduler, especially this piece of code:

# Otherwise the training dataloader batch size was multiplied by `num_processes`, so we need to do
# num_processes steps per training step
num_processes = AcceleratorState().num_processes
for _ in range(num_processes):
    # Special case when using OneCycle and `drop_last` was not used
    if hasattr(self.scheduler, "total_steps"):
        if self.scheduler._step_count <= self.scheduler.total_steps:
            self.scheduler.step(*args, **kwargs)
    else:
        self.scheduler.step(*args, **kwargs)

In DDP mode, gradient_accumulation_steps and additional GPUs essentially increase the batch size. Suppose the batch_size we pass to the dataloader is 10, we have 8 GPUs, and gradient_accumulation_steps is 4. The effective batch size is then 10 * 8 * 4 = 320, and that, I believe, is when we should take one optimizer step and one lr scheduler step. But the code above would take more than one lr scheduler step.

Briefly speaking, my question is: why does the lr_scheduler need to depend on the number of GPUs when we have already assigned a value to warmup_steps in the user config?
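
To make the numbers concrete, here is a small sketch (the values are just the example above, and the comments reflect my expectation, not necessarily the intended behavior):

# just the example numbers from above, not a general rule
per_device_batch_size = 10
num_gpus = 8
gradient_accumulation_steps = 4

# 320 samples are consumed per optimizer step
effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps  # 320

# my expectation: one lr_scheduler step per optimizer step,
# but the prepared scheduler calls self.scheduler.step() num_processes times per invocation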

@Hannibal046 (Author)
A related problem is discussed here: huggingface/transformers#18436 (comment)

sgugger (Collaborator) commented May 2, 2023

That's because the lr_scheduler has been defined taking into account the unprepared dataloader which has the original batch size.
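
To illustrate with made-up numbers (rough bookkeeping only, not the exact library internals):

# rough bookkeeping with made-up numbers: 80,000 samples, batch size 10, 8 processes
dataset_size = 80_000
per_device_batch_size = 10
num_processes = 8

unprepared_len = dataset_size // per_device_batch_size  # 8000 batches: what the scheduler was sized for
prepared_len = unprepared_len // num_processes           # 1000 batches actually seen per process per epoch

# stepping once per prepared batch would leave the scheduler at step 1000 instead of 8000,
# so the prepared scheduler steps num_processes times per call to end up at 8000 again
assert prepared_len * num_processes == unprepared_len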

@Hannibal046 (Author)
But in my case (2 GPUs), the size of the train_dataloader doesn't change after the prepare method. I believe the DDP sampler wouldn't change the size of the train_dataloader?

BTW, thanks for the prompt response!

@Hannibal046 (Author)
So, if I want to set warmup_steps to 4000 and I want my lr to peak at exactly step 4000 (no matter how many GPUs or what batch size I set), what would you suggest as the best practice?

sgugger (Collaborator) commented May 2, 2023

@Hannibal046 Yes, your dataloader size will be divided by 2 after you have prepared it on 2 GPUs, because your total batch size has been multiplied by 2. If you want your scheduler to peak at 4000 steps independently of the number of GPUs, you shouldn't send it to accelerator.prepare, but then it won't get to 0 at the end of training.
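
In other words (numbers purely for illustration):

# purely illustrative, with warmup_steps = 4000 and 2 processes
num_warmup_steps = 4000
num_processes = 2

# prepared: each lr_scheduler.step() advances the underlying schedule num_processes times,
# so the peak lr is reached after 4000 / 2 = 2000 optimizer steps
steps_to_peak_prepared = num_warmup_steps // num_processes  # 2000

# unprepared: one underlying step per call, so the peak comes after exactly 4000 optimizer steps,
# but the schedule will not have decayed to 0 by the time the (shorter) prepared dataloader is exhausted
steps_to_peak_unprepared = num_warmup_steps                 # 4000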

@Hannibal046 (Author)
Sorry, it was my mistake. The dataloader size does change after the prepare method.

Could you please give some conceptual guidance on how to set warmup_steps in Accelerate? Take the paper Attention Is All You Need as an example: it says the model was trained on 8 GPUs with warmup_steps of 4000. So I should pass 4000*8 to the scheduler function?
(screenshot of the relevant passage from the paper)

from transformers import get_polynomial_decay_schedule_with_warmup

lr_scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=4000 * 8 * cfg.trainer.gradient_accumulation_steps,
    num_training_steps=total_steps,
)

@Hannibal046 (Author)
If that is the case, what if I only have 4 GPUs? Typically, I would keep per_device_train_batch_size unchanged and double my gradient_accumulation_steps to match the total batch size of 8 GPUs. In Accelerate, do I need to change warmup_steps?
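
For example, with the numbers above:

# same effective batch size with half the GPUs and double the accumulation steps
assert 10 * 8 * 4 == 10 * 4 * 8 == 320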

@Hannibal046 (Author)
Hi @sgugger, would you mind sparing some time to check this?

sgugger (Collaborator) commented May 4, 2023

I don't really get what you want me to say. It looks like you want a scheduler that always does the same number of steps regardless of distributed training/gradient accumulation, so maybe don't pass it to accelerator.prepare?

@Hannibal046 (Author)
Hi @sgugger, really sorry for bothering you. I am new to Accelerate, and the prepare method works like a magic black box for handling multiple GPUs (which is what it is intended to be). So the obvious solution of not preparing the lr_scheduler was not so obvious to me.

For anyone who is looking for a fixed number of warmup steps in Accelerate, a simple workaround is:

# create the scheduler as usual, but do NOT pass it to accelerator.prepare
lr_scheduler = get_scheduler(...)

for batch in dataloader:
    with accelerator.accumulate(model):
        loss = compute_loss(batch)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        # only advance the schedule on real (non-skipped) optimizer steps
        if accelerator.sync_gradients and not accelerator.optimizer_step_was_skipped:
            lr_scheduler.step()
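
With this workaround the scheduler advances exactly once per real optimizer step, so (if I understand correctly) the warmup and training step counts should be given in optimizer steps, independent of the number of GPUs. Something like the following, where total_optimizer_steps is just a placeholder:

# sketch only: counts are in optimizer steps, independent of num_processes
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=4000,
    num_training_steps=total_optimizer_steps,  # placeholder for the total number of optimizer steps
)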
