
fix: only run save full model on main process #838

Merged
merged 1 commit into bghira:main on Aug 21, 2024

Conversation

ErwannMillon
Contributor

I was running into a bug where _save_full_model would break during multi-GPU full fine-tuning. The function was being run by all processes, and

        for item in os.listdir(temporary_dir):
            s = os.path.join(temporary_dir, item)
            d = os.path.join(output_dir, item)
            if os.path.isdir(s):
                shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
            else:
                shutil.copy2(s, d)

        # Remove the temporary directory
        shutil.rmtree(temporary_dir)

would error out because temporary_dir would be deleted by one process while the other processes were still trying to copy files that no longer existed.

Fixed by returning early if not the main process.
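
For reference, a minimal sketch of the pattern (the function name and arguments are illustrative, not the exact hook in this repository; only the placement of the guard mirrors the fix):

    import os
    import shutil

    from accelerate import Accelerator

    def save_full_model(accelerator: Accelerator, temporary_dir: str, output_dir: str) -> None:
        # Every rank reaches this hook, but only the main process copies the
        # staged files and deletes temporary_dir, so no rank races against
        # the cleanup.
        if not accelerator.is_main_process:
            return

        os.makedirs(output_dir, exist_ok=True)
        for item in os.listdir(temporary_dir):
            s = os.path.join(temporary_dir, item)
            d = os.path.join(output_dir, item)
            if os.path.isdir(s):
                shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
            else:
                shutil.copy2(s, d)

        # Remove the temporary directory once the copy has finished.
        shutil.rmtree(temporary_dir)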

@bghira
Owner

bghira commented Aug 21, 2024

this will break DeepSpeed

@ErwannMillon
Contributor Author

Hm, I'm running with DeepSpeed stage 1 and it still seems to work. The return is only in the save hook; the sharded DeepSpeed params still get saved on each GPU, if I'm not mistaken.

@ErwannMillon
Contributor Author

Alternatively, maybe wrapping only:

        for item in os.listdir(temporary_dir):
            s = os.path.join(temporary_dir, item)
            d = os.path.join(output_dir, item)
            if os.path.isdir(s):
                shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
            else:
                shutil.copy2(s, d)

        # Remove the temporary directory
        shutil.rmtree(temporary_dir)
        

in the if self.accelerator.is_main_process check?
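
i.e. something like this (sketch only; names as in the snippet above):

        # Gate only the copy and cleanup on the main process; the rest of the
        # hook still runs on every rank.
        if self.accelerator.is_main_process:
            for item in os.listdir(temporary_dir):
                s = os.path.join(temporary_dir, item)
                d = os.path.join(output_dir, item)
                if os.path.isdir(s):
                    shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
                else:
                    shutil.copy2(s, d)

            # Remove the temporary directory
            shutil.rmtree(temporary_dir)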

@bghira
Owner

bghira commented Aug 21, 2024

ah, ok.

@bghira merged commit d627c92 into bghira:main on Aug 21, 2024
@bghira
Owner

bghira commented Aug 21, 2024

oh, but multi-node training will probably want a local copy of the training state on each node, right?

@ErwannMillon
Contributor Author

ErwannMillon commented Aug 21, 2024

Then maybe is_local_main_process? Just pushed a commit with this change.
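
That is, the early return becomes (sketch; same Accelerate accelerator object as above):

        # One writer per node: local rank 0 on each node does the copy and the
        # cleanup, so every node keeps its own local copy of the saved state.
        if not self.accelerator.is_local_main_process:
            return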
