
Unable to save checkpoints per epoch on a single GPU #646

Closed
aenaliph opened this issue Aug 27, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@aenaliph

I am able to fine-tune the Llama 3.1 base model on a custom dataset in the Alpaca format by following the quickstart PEFT fine-tuning notebook. However, I am unable to save the model after each epoch. I have installed llama-recipes from source.

Below are my train config and the error trace. Any pointers? Could it be related to this fix?
#629

I am on torch 2.4 and using an NVIDIA GeForce RTX 4090.

train_config = TRAIN_CONFIG()
train_config.model_name = "/home/aen/models/llama3_1/hf-llama-3.1-8B"
train_config.num_epochs = 3
train_config.run_validation = True
train_config.gradient_accumulation_steps = 4
train_config.batch_size_training = 2
train_config.lr = 1e-4
train_config.use_fast_kernels = True
train_config.use_fp16 = True
train_config.context_length = 1024 if torch.cuda.get_device_properties(0).total_memory < 16e9 else 2048 # T4 16GB or A10 24GB
train_config.batching_strategy = "packing"
train_config.output_dir = "/home/aen/models/llama3_1/hf-llama-3.1-8B/finetuned_alpaca"
train_config.use_wandb = True
train_config.enable_fsdp = False
train_config.save_model = True ## Setting this to True will throw an error if the run_validation is True
train_config.val_batch_size = 2



AttributeError                            Traceback (most recent call last)
Cell In[9], line 15
     12 scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
     14 # Start the training process
---> 15 results = train(
     16     model,
     17     train_dataloader,
     18     eval_dataloader,
     19     tokenizer,
     20     optimizer,
     21     scheduler,
     22     train_config.gradient_accumulation_steps,
     23     train_config,
     24     None,
     25     None,
     26     None,
     27     wandb_run=run,
     28 )
     29 wandb.finish()
     31 # Save the model and tokenizer

File ~/tools/llama-recipes/src/llama_recipes/utils/train_utils.py:246, in train(model, train_dataloader, eval_dataloader, tokenizer, optimizer, lr_scheduler, gradient_accumulation_steps, train_config, fsdp_config, local_rank, rank, wandb_run)
    243         print(f"PEFT modules are saved in {train_config.output_dir} directory")
    245 else:
--> 246     if not train_config.use_peft and fsdp_config.checkpoint_type == StateDictType.FULL_STATE_DICT:
    248         save_model_checkpoint(
    249             model, optimizer, rank, train_config, epoch=epoch
    250         )
    251     elif not train_config.use_peft and fsdp_config.checkpoint_type == StateDictType.SHARDED_STATE_DICT:

AttributeError: 'NoneType' object has no attribute 'checkpoint_type'

At the end of each epoch I would like the validation dataset to be run and the checkpoint to be saved.
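For context, the AttributeError above happens because train() dereferences fsdp_config.checkpoint_type even though fsdp_config is None when FSDP is disabled. The toy stand-in below is NOT the actual llama-recipes code; it only illustrates the missing None guard that the eventual patch needs to add:

```python
from types import SimpleNamespace

def save_checkpoint(train_config, fsdp_config):
    """Toy stand-in for the save logic in train_utils.py (NOT the real code)."""
    if train_config.use_peft:
        return "peft"  # PEFT adapters are saved via save_pretrained
    # Without this None check, a single-GPU run (enable_fsdp=False passes
    # fsdp_config=None) crashes on fsdp_config.checkpoint_type.
    if fsdp_config is not None and fsdp_config.checkpoint_type == "FULL_STATE_DICT":
        return "full"
    return "skipped"

# Single-GPU, non-PEFT case: no crash, saving is skipped gracefully
print(save_checkpoint(SimpleNamespace(use_peft=False), None))  # → skipped
```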

@mreso
Contributor

mreso commented Aug 27, 2024

Hi @aenaliph,
it seems like there are actually two issues here:

  1. We're assuming FSDP is always used for full-weights fine-tuning.
  2. PEFT is not activated in the PEFT quickstart notebook.

I'll create patches for both issues ASAP, but to get you unblocked you can already add train_config.use_peft = True to enable PEFT training, which should save the checkpoint successfully. If you are actually trying to do full-weights fine-tuning, you'll need to wait for the patch. Will try to patch this by EOD.
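The suggested workaround amounts to one extra line in the notebook's config cell. A minimal sketch, assuming the TRAIN_CONFIG import and field names used by the quickstart notebook:

```python
from llama_recipes.configs import train_config as TRAIN_CONFIG  # as in the quickstart notebook

train_config = TRAIN_CONFIG()
train_config.enable_fsdp = False   # single GPU, no FSDP
train_config.save_model = True
train_config.run_validation = True
train_config.use_peft = True       # workaround: route saving through the PEFT branch of train()
```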

@mreso mreso self-assigned this Aug 27, 2024
@mreso mreso added the bug Something isn't working label Aug 27, 2024
@aenaliph
Author

Thanks for the pointers @mreso. I am indeed training with PEFT. Following the quickstart notebook, the PEFT config is passed to the model in Step 4: Prepare model for PEFT.

I did try earlier with train_config.use_peft = True, but then I ran into the following error.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 15
     12 scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
     14 # Start the training process
---> 15 results = train(
     16     model,
     17     train_dataloader,
     18     eval_dataloader,
     19     tokenizer,
     20     optimizer,
     21     scheduler,
     22     train_config.gradient_accumulation_steps,
     23     train_config,
     24     None,
     25     None,
     26     None,
     27     wandb_run=run,
     28 )
     29 wandb.finish()
     31 # Save the model and tokenizer

File ~/tools/llama-recipes/src/llama_recipes/utils/train_utils.py:238, in train(model, train_dataloader, eval_dataloader, tokenizer, optimizer, lr_scheduler, gradient_accumulation_steps, train_config, fsdp_config, local_rank, rank, wandb_run)
    236 else:
    237     print(f"we are about to save the PEFT modules")
--> 238 save_peft_checkpoint(model, train_config.output_dir)
    239 if train_config.enable_fsdp:
    240     if rank==0:

File ~/tools/llama-recipes/src/llama_recipes/model_checkpointing/checkpoint_handler.py:275, in save_peft_checkpoint(model, model_path)
    271 """save_pretrained peft model"""
    273 options = StateDictOptions(full_state_dict=True, cpu_offload=True)
--> 275 state_dict = get_model_state_dict(model, options=options)
    276 model.save_pretrained(model_path, state_dict=state_dict)

File ~/anaconda3/envs/llm/lib/python3.9/site-packages/torch/distributed/checkpoint/state_dict.py:976, in get_model_state_dict(model, submodules, options)
    968 with _gc_context():
    969     info = _verify_options(
    970         model,
    971         tuple(),
   (...)
    974         options=options,
    975     )
--> 976     model_state_dict = _get_model_state_dict(model, info)
    977     _verify_state_dict(model_state_dict, {}, info)
    978     return model_state_dict

File ~/anaconda3/envs/llm/lib/python3.9/site-packages/torch/distributed/checkpoint/state_dict.py:469, in _get_model_state_dict(model, info)
    466     state_dict = _state_dict_fn(model, "state_dict")()
    468 for key in list(state_dict.keys()):
--> 469     fqns = _get_fqns(model, key)
    470     assert len(fqns) == 1, (key, fqns)
    471     fqn = next(iter(fqns))

File ~/anaconda3/envs/llm/lib/python3.9/site-packages/torch/distributed/checkpoint/state_dict.py:222, in _get_fqns(model, name, skip_ddp_prefix, skip_compiler_prefix)
    220                 raise RuntimeError("Expect `_extra_state` to be the last obj name")
    221         else:
--> 222             curr_obj = getattr(curr_obj, curr_obj_name)
    224 return {".".join(fqn_obj_names).replace(_CHECKPOINT_PREFIX, "")}

File ~/anaconda3/envs/llm/lib/python3.9/site-packages/torch/nn/modules/module.py:1729, in Module.__getattr__(self, name)
   1727     if name in modules:
   1728         return modules[name]
-> 1729 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'Linear8bitLt' object has no attribute 'SCB'

@mreso mreso mentioned this issue Aug 28, 2024
@mreso
Contributor

mreso commented Aug 28, 2024

@aenaliph I was able to reproduce the error in the notebook and have prepared a fix for that as well. To get you unblocked, you can set train_config.run_validation back to False; the per-epoch checkpoint saving will then be skipped, but the model will still be saved successfully in Step 6.
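Until the notebook fix lands, the interim settings boil down to (field names as in the quickstart config; a sketch, not a definitive recipe):

```python
train_config.run_validation = False  # skip per-epoch eval, which triggers the broken checkpoint path
train_config.save_model = True       # the final save in Step 6 still works
```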

@aenaliph
Author

@mreso Yes, I am able to save the model as in step 6, just not with validation on. I will await the fix. Thanks for looking into it.

@mreso
Contributor

mreso commented Aug 30, 2024

Closing this as #650 got merged. Feel free to reopen if the issues persist with the fixes.

@mreso mreso closed this as completed Aug 30, 2024