Make olmo-core checkpointer more robust on weka #624

Merged: 4 commits, Jun 17, 2024

Member

@epwalsh epwalsh commented Jun 14, 2024

No description provided.

@epwalsh epwalsh requested a review from 2015aroras June 14, 2024 21:03
- local_files_created = save_model_and_optim_state(
-     checkpoint_dir, fsdp_model, optim, save_overwrite=self.cfg.save_overwrite
- )
+ local_files_created = save_model_and_optim_state(checkpoint_dir, fsdp_model, optim)
Collaborator

Why remove save_overwrite?

Member Author

@epwalsh epwalsh Jun 14, 2024

We handle save_overwrite in this checkpointer class already. At this point in the code we're passing a temporary directory that we've guaranteed exists and is empty, so if we pass save_overwrite=True, then olmo-core will delete and recreate that directory, which can cause issues on weka/nfs.
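For readers following along, here is a minimal sketch of the pattern described above, not the actual OLMo or olmo-core code. The names `inner_save` and `save_checkpoint` and the file name `state.bin` are hypothetical stand-ins: the outer function resolves `save_overwrite` once, hands the inner function a fresh, empty temporary directory, and deliberately does not forward `save_overwrite`, so the directory is never deleted and recreated.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def inner_save(target_dir: Path, payload: bytes, save_overwrite: bool = False) -> List[Path]:
    """Hypothetical stand-in for a library-level save function."""
    if save_overwrite and target_dir.exists():
        # Delete-and-recreate: the step the caller wants to avoid on a
        # network filesystem.
        shutil.rmtree(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    out = target_dir / "state.bin"
    out.write_bytes(payload)
    return [out]


def save_checkpoint(final_dir: Path, payload: bytes, save_overwrite: bool = False) -> Path:
    """Hypothetical outer checkpointer that owns the overwrite policy."""
    final_dir = Path(final_dir)
    # Resolve the overwrite policy exactly once, at the outer level.
    if final_dir.exists():
        if not save_overwrite:
            raise FileExistsError(f"{final_dir} already exists")
        shutil.rmtree(final_dir)
    final_dir.parent.mkdir(parents=True, exist_ok=True)

    # A fresh temporary directory is guaranteed to exist and be empty.
    tmp_dir = Path(tempfile.mkdtemp(prefix=final_dir.name + "-tmp", dir=final_dir.parent))

    # Do not forward save_overwrite: the directory is already empty, and
    # forwarding it would make inner_save remove and recreate it.
    inner_save(tmp_dir, payload)

    # Move the finished checkpoint into place.
    tmp_dir.replace(final_dir)
    return final_dir
```

Writing into a temporary directory and renaming at the end also means a partially written checkpoint never appears under the final name.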

@epwalsh epwalsh merged commit 2417b11 into main Jun 17, 2024
10 of 12 checks passed
@epwalsh epwalsh deleted the epwalsh/checkpoint-fix branch June 17, 2024 20:31
2 participants