
fix: only run save full model on main process #838

Merged
merged 1 commit into bghira:main on Aug 21, 2024

Conversation

ErwannMillon
Contributor

I was running into a bug where _save_full_model would break during multi-GPU full fine-tuning. The function was being run by all processes, and

        for item in os.listdir(temporary_dir):
            s = os.path.join(temporary_dir, item)
            d = os.path.join(output_dir, item)
            if os.path.isdir(s):
                shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
            else:
                shutil.copy2(s, d)

        # Remove the temporary directory
        shutil.rmtree(temporary_dir)

would error out because temporary_dir would be deleted by one process while the other processes were still trying to copy files that no longer existed.

Fixed by returning early if not the main process.
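
For reference, a minimal sketch of the pattern (the function name and arguments are illustrative, not the exact hook in this repository; only the placement of the guard mirrors the fix):

    import os
    import shutil

    from accelerate import Accelerator

    def save_full_model(accelerator: Accelerator, temporary_dir: str, output_dir: str) -> None:
        # Every rank reaches this hook, but only the main process copies the
        # staged files and deletes temporary_dir, so no rank races against
        # the cleanup.
        if not accelerator.is_main_process:
            return

        os.makedirs(output_dir, exist_ok=True)
        for item in os.listdir(temporary_dir):
            s = os.path.join(temporary_dir, item)
            d = os.path.join(output_dir, item)
            if os.path.isdir(s):
                shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
            else:
                shutil.copy2(s, d)

        # Remove the temporary directory once the copy has finished.
        shutil.rmtree(temporary_dir)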

@bghira
Owner

bghira commented Aug 21, 2024

this will break DeepSpeed

@ErwannMillon
Contributor Author

Hm, I'm running with DeepSpeed stage 1 and it still seems to work. The return is only in the save hook; the sharded DeepSpeed params still get saved on each GPU, if I'm not mistaken.

@ErwannMillon
Contributor Author

Alternatively, maybe wrapping only:

        for item in os.listdir(temporary_dir):
            s = os.path.join(temporary_dir, item)
            d = os.path.join(output_dir, item)
            if os.path.isdir(s):
                shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
            else:
                shutil.copy2(s, d)

        # Remove the temporary directory
        shutil.rmtree(temporary_dir)
        

in the if self.accelerator.is_main_process check?
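
i.e. something like this (sketch only; names as in the snippet above):

        # Gate only the copy and cleanup on the main process; the rest of the
        # hook still runs on every rank.
        if self.accelerator.is_main_process:
            for item in os.listdir(temporary_dir):
                s = os.path.join(temporary_dir, item)
                d = os.path.join(output_dir, item)
                if os.path.isdir(s):
                    shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
                else:
                    shutil.copy2(s, d)

            # Remove the temporary directory
            shutil.rmtree(temporary_dir)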

@bghira
Owner

bghira commented Aug 21, 2024

ah, ok.

@bghira merged commit d627c92 into bghira:main on Aug 21, 2024
@bghira
Owner

bghira commented Aug 21, 2024

oh, but multi-node training will probably want a local copy of the training state on each node, right?

@ErwannMillon
Contributor Author

ErwannMillon commented Aug 21, 2024

Then maybe is_local_main_process? Just pushed a commit with this change.
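
That is, the early return becomes (sketch; same Accelerate accelerator object as above):

        # One writer per node: local rank 0 on each node does the copy and the
        # cleanup, so every node keeps its own local copy of the saved state.
        if not self.accelerator.is_local_main_process:
            return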
