-
Notifications
You must be signed in to change notification settings - Fork 440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set drop_last to always True #1761
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1761
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 0b9a3ec with merge base a8a64ec (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the right fix? Personally I find a top-level config field called drop_last
to be pretty unclear. Why not just infer drop_last
in the existing if/else for ConcatDataset
here? https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py#L487-L496
E.g.
if isinstance(cfg_dataset, ListConfig):
# same stuff as before
drop_last = any([single_cfg_dataset.get("drop_last", True) for single_cfg_dataset in cfg_dataset])
else:
# same stuff as before
drop_last = cfg_dataset.get("drop_last", True)
drop_last being in the dataset config does not make sense, as dataset points to a component and all kwargs should be related to that component/builder. drop_last is not in any of our dataset builders. I think having it top level is less confusing in that sense. Frankly, we should have a more configurable dataloader or a section in the config for dataloader args, as this is becoming more customized. |
OK yeah it is a dataloader config more than a dataset config so I do see your point there. Mainly I just do not want us to wind up with a hodgepodge of a bunch of random top-level configs, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than trying to conceive of edge cases that users might not even care about, can we just default drop_last to True, document it, and then if users want to have this be configurable, we can add a config.
Either way it's a one line change. For most cases, the last batch should be dropped.
Agreed |
Context
What is the purpose of this PR? Is it to
Closes #1754. Instead of specifying drop_last with the dataset config (which will break when passed into the dataset builder), just hardcode to True. No configs expose this. We can make it configurable down the line if users request it.
Test plan
CI