
Unable to save checkpoints per epoch on a single GPU #646

Closed
aenaliph opened this issue Aug 27, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@aenaliph

I am able to fine-tune the Llama 3.1 base model on a custom dataset in the Alpaca format by following the quickstart PEFT fine-tuning notebook. However, I am unable to save the model after each epoch. I have installed llama-recipes from source.

Below are my train config and the error trace. Any pointers? Could it be related to this fix?
#629

I am on torch 2.4 and using an NVIDIA GeForce RTX 4090.

train_config = TRAIN_CONFIG()
train_config.model_name = "/home/aen/models/llama3_1/hf-llama-3.1-8B"
train_config.num_epochs = 3
train_config.run_validation = True
train_config.gradient_accumulation_steps = 4
train_config.batch_size_training = 2
train_config.lr = 1e-4
train_config.use_fast_kernels = True
train_config.use_fp16 = True
train_config.context_length = 1024 if torch.cuda.get_device_properties(0).total_memory < 16e9 else 2048 # T4 16GB or A10 24GB
train_config.batching_strategy = "packing"
train_config.output_dir = "/home/aen/models/llama3_1/hf-llama-3.1-8B/finetuned_alpaca"
train_config.use_wandb = True
train_config.enable_fsdp = False
train_config.save_model = True ## Setting this to True will throw an error if the run_validation is True
train_config.val_batch_size = 2



AttributeError                            Traceback (most recent call last)
Cell In[9], line 15
     12 scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
     14 # Start the training process
---> 15 results = train(
     16     model,
     17     train_dataloader,
     18     eval_dataloader,
     19     tokenizer,
     20     optimizer,
     21     scheduler,
     22     train_config.gradient_accumulation_steps,
     23     train_config,
     24     None,
     25     None,
     26     None,
     27     wandb_run=run,
     28 )
     29 wandb.finish()
     31 # Save the model and tokenizer

File ~/tools/llama-recipes/src/llama_recipes/utils/train_utils.py:246, in train(model, train_dataloader, eval_dataloader, tokenizer, optimizer, lr_scheduler, gradient_accumulation_steps, train_config, fsdp_config, local_rank, rank, wandb_run)
    243         print(f"PEFT modules are saved in {train_config.output_dir} directory")
    245 else:
--> 246     if not train_config.use_peft and fsdp_config.checkpoint_type == StateDictType.FULL_STATE_DICT:
    248         save_model_checkpoint(
    249             model, optimizer, rank, train_config, epoch=epoch
    250         )
    251     elif not train_config.use_peft and fsdp_config.checkpoint_type == StateDictType.SHARDED_STATE_DICT:

AttributeError: 'NoneType' object has no attribute 'checkpoint_type'

At the end of each epoch I would like the validation dataset to be run and the checkpoint to be saved.
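For context, the AttributeError above happens because train() dereferences fsdp_config.checkpoint_type even though fsdp_config is None when FSDP is disabled. The toy stand-in below is NOT the actual llama-recipes code; it only illustrates the missing None guard that the eventual patch needs to add:

```python
from types import SimpleNamespace

def save_checkpoint(train_config, fsdp_config):
    """Toy stand-in for the save logic in train_utils.py (NOT the real code)."""
    if train_config.use_peft:
        return "peft"  # PEFT adapters are saved via save_pretrained
    # Without this None check, a single-GPU run (enable_fsdp=False passes
    # fsdp_config=None) crashes on fsdp_config.checkpoint_type.
    if fsdp_config is not None and fsdp_config.checkpoint_type == "FULL_STATE_DICT":
        return "full"
    return "skipped"

# Single-GPU, non-PEFT case: no crash, saving is skipped gracefully
print(save_checkpoint(SimpleNamespace(use_peft=False), None))  # → skipped
```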

@mreso
Contributor

mreso commented Aug 27, 2024

Hi @aenaliph,
it seems like there are actually two issues here:

  1. We're assuming FSDP is always used for full-weights fine-tuning.
  2. PEFT is not activated in the PEFT quickstart notebook.

I'll create patches for both issues ASAP, but to get you unblocked you can already add train_config.use_peft = True to enable PEFT training, which should save the checkpoint successfully. If you are actually trying to do full-weights fine-tuning, you'll need to wait for the patch. Will try to patch this by EOD.
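The suggested workaround amounts to one extra line in the notebook's config cell. A minimal sketch, assuming the TRAIN_CONFIG import and field names used by the quickstart notebook:

```python
from llama_recipes.configs import train_config as TRAIN_CONFIG  # as in the quickstart notebook

train_config = TRAIN_CONFIG()
train_config.enable_fsdp = False   # single GPU, no FSDP
train_config.save_model = True
train_config.run_validation = True
train_config.use_peft = True       # workaround: route saving through the PEFT branch of train()
```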

@mreso mreso self-assigned this Aug 27, 2024
@mreso mreso added the bug Something isn't working label Aug 27, 2024
@aenaliph
Author

Thanks for the pointers @mreso. I am indeed training with PEFT. Following the quickstart notebook, the PEFT config is passed to the model in Step 4: Prepare model for PEFT.

I did try earlier with train_config.use_peft = True, but then I ran into the following error.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 15
     12 scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
     14 # Start the training process
---> 15 results = train(
     16     model,
     17     train_dataloader,
     18     eval_dataloader,
     19     tokenizer,
     20     optimizer,
     21     scheduler,
     22     train_config.gradient_accumulation_steps,
     23     train_config,
     24     None,
     25     None,
     26     None,
     27     wandb_run=run,
     28 )
     29 wandb.finish()
     31 # Save the model and tokenizer

File ~/tools/llama-recipes/src/llama_recipes/utils/train_utils.py:238, in train(model, train_dataloader, eval_dataloader, tokenizer, optimizer, lr_scheduler, gradient_accumulation_steps, train_config, fsdp_config, local_rank, rank, wandb_run)
    236 else:
    237     print(f"we are about to save the PEFT modules")
--> 238 save_peft_checkpoint(model, train_config.output_dir)
    239 if train_config.enable_fsdp:
    240     if rank==0:

File ~/tools/llama-recipes/src/llama_recipes/model_checkpointing/checkpoint_handler.py:275, in save_peft_checkpoint(model, model_path)
    271 """save_pretrained peft model"""
    273 options = StateDictOptions(full_state_dict=True, cpu_offload=True)
--> 275 state_dict = get_model_state_dict(model, options=options)
    276 model.save_pretrained(model_path, state_dict=state_dict)

File ~/anaconda3/envs/llm/lib/python3.9/site-packages/torch/distributed/checkpoint/state_dict.py:976, in get_model_state_dict(model, submodules, options)
    968 with _gc_context():
    969     info = _verify_options(
    970         model,
    971         tuple(),
   (...)
    974         options=options,
    975     )
--> 976     model_state_dict = _get_model_state_dict(model, info)
    977     _verify_state_dict(model_state_dict, {}, info)
    978     return model_state_dict

File ~/anaconda3/envs/llm/lib/python3.9/site-packages/torch/distributed/checkpoint/state_dict.py:469, in _get_model_state_dict(model, info)
    466     state_dict = _state_dict_fn(model, "state_dict")()
    468 for key in list(state_dict.keys()):
--> 469     fqns = _get_fqns(model, key)
    470     assert len(fqns) == 1, (key, fqns)
    471     fqn = next(iter(fqns))

File ~/anaconda3/envs/llm/lib/python3.9/site-packages/torch/distributed/checkpoint/state_dict.py:222, in _get_fqns(model, name, skip_ddp_prefix, skip_compiler_prefix)
    220                 raise RuntimeError("Expect `_extra_state` to be the last obj name")
    221         else:
--> 222             curr_obj = getattr(curr_obj, curr_obj_name)
    224 return {".".join(fqn_obj_names).replace(_CHECKPOINT_PREFIX, "")}

File ~/anaconda3/envs/llm/lib/python3.9/site-packages/torch/nn/modules/module.py:1729, in Module.__getattr__(self, name)
   1727     if name in modules:
   1728         return modules[name]
-> 1729 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'Linear8bitLt' object has no attribute 'SCB'

@mreso mreso mentioned this issue Aug 28, 2024
@mreso
Contributor

mreso commented Aug 28, 2024

@aenaliph I was able to reproduce the error in the notebook and have prepared a fix for that as well. To get you unblocked, you can set train_config.run_validation back to False; the per-epoch checkpoint saving will then be skipped, but the model will still be saved successfully in Step 6.
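Until the notebook fix lands, the interim settings boil down to (field names as in the quickstart config; a sketch, not a definitive recipe):

```python
train_config.run_validation = False  # skip per-epoch eval, which triggers the broken checkpoint path
train_config.save_model = True       # the final save in Step 6 still works
```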

@aenaliph
Author

@mreso Yes, I am able to save the model as in step 6, just not with validation on. I will await the fix. Thanks for looking into it.

@mreso
Contributor

mreso commented Aug 30, 2024

Closing this as #650 got merged. Feel free to reopen if the issues persist with the fixes.

@mreso mreso closed this as completed Aug 30, 2024