Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoresume Validation with Max Duration #3358

Merged
merged 3 commits into from
Jun 4, 2024

Conversation

mvpatel2000
Copy link
Contributor

What does this PR do?

Additional checks based on #3357:

  • Autoresume requires max_duration be specified on init and cannot accept duration in fit. Otherwise, Composer will autoresume from last checkpoint, and then the specified value of max_duration from fit will be incremented to the loaded max duration. In short, fit(duration=...) is not idempotent with autoresume.
  • reset_time in fit does not work with autoresume as it will force a restart in training. Instead, time should be ignored on load, which is used in initial load and not on autoresume load
  • Implicitly prevents multiple calls to fit -- if you cannot increment duration in fit or reset time, it is not possible to call fit twice.

@eracah
Copy link
Contributor

eracah commented Jun 4, 2024

Do we want to keep the incremented functionality? If I load in a 10ba checkpoint with Trainer(..., max_duration='15ba') and I say fit(duration='12ba'), wouldn't I want it to train for 2 batches instead of 5. Wouldn't what I specify in fit override what I did in the constructor?

@mvpatel2000
Copy link
Contributor Author

mvpatel2000 commented Jun 4, 2024

Do we want to keep the incremented functionality? If I load in a 10ba checkpoint with Trainer(..., max_duration='15ba') and I say fit(duration='12ba'), wouldn't I want it to train for 2 batches instead of 5. Wouldn't what I specify in fit override what I did in the constructor?

Iirc the original behavior was modeled after what other training libraries do... but I agree it's perhaps not ideal. We should revisit this in a follow-on PR

@eracah
Copy link
Contributor

eracah commented Jun 4, 2024

Do we want to keep the incremented functionality? If I load in a 10ba checkpoint with Trainer(..., max_duration='15ba') and I say fit(duration='12ba'), wouldn't I want it to train for 2 batches instead of 5. Wouldn't what I specify in fit override what I did in the constructor?

Iirc the original behavior was modeled after what other training libraries do... but I agree it's perhaps not ideal. We should revisit this in a follow-on PR

yeah we should this is very counter-intuitive

@mvpatel2000
Copy link
Contributor Author

Do we want to keep the incremented functionality? If I load in a 10ba checkpoint with Trainer(..., max_duration='15ba') and I say fit(duration='12ba'), wouldn't I want it to train for 2 batches instead of 5. Wouldn't what I specify in fit override what I did in the constructor?

@antoinebrl curious if you have any input given your recent github issue referencing multiple fit calls

@mvpatel2000 mvpatel2000 merged commit f0eae8a into mosaicml:dev Jun 4, 2024
15 checks passed
@mvpatel2000 mvpatel2000 deleted the mvpatel2000/autoresume-check branch June 4, 2024 17:53
@antoinebrl
Copy link
Contributor

I must say this was a very counter intuitive bug to track down. This PR implements some fences which push the user towards the expected behavior but I agree the notion of training duration can be improved.

Regarding the multiple calls to .fit, we currently don't have that use-case. However, I remember reading a LLM paper where they do 2.1 epochs: 2 epochs with a very large dataset, and 0.1 for fine-tuning on a high quality dataset tailored for the task.

mvpatel2000 added a commit to mvpatel2000/composer that referenced this pull request Jun 5, 2024
mvpatel2000 added a commit that referenced this pull request Jun 5, 2024
bigning pushed a commit that referenced this pull request Jun 5, 2024
mvpatel2000 added a commit that referenced this pull request Jun 5, 2024
* Revert "Autoresume Validation with Max Duration (#3358)"

This reverts commit f0eae8a.

* fix docs

* remove
mvpatel2000 added a commit that referenced this pull request Jun 5, 2024
* Revert "Autoresume Validation with Max Duration (#3358)"

This reverts commit f0eae8a.

* bump images to 2.3.1

* bump torchvision
mvpatel2000 added a commit that referenced this pull request Jun 6, 2024
* Revert "Autoresume Validation with Max Duration (#3358)"

This reverts commit f0eae8a.

* pin setuptools
mvpatel2000 added a commit that referenced this pull request Jun 6, 2024
* Revert "Autoresume Validation with Max Duration (#3358)"

This reverts commit f0eae8a.

* add bump

* bump deepspeed

* revert deepspeed bump

* add state dict

---------

Co-authored-by: Saaketh Narayan <saaketh@mosaicml.com>
mvpatel2000 added a commit that referenced this pull request Jul 22, 2024
* Revert "Autoresume Validation with Max Duration (#3358)"

This reverts commit f0eae8a.

* add bump

* bump deepspeed

* revert deepspeed bump

* add state dict

---------

Co-authored-by: Saaketh Narayan <saaketh@mosaicml.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants