Fail to save last checkpoint #1613

Closed
7 of 8 tasks
Nero10578 opened this issue May 13, 2024 · 7 comments · Fixed by #1615
Labels
bug Something isn't working

Comments

@Nero10578
Contributor

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Expected behavior is to save the last checkpoint like the previous intermediate checkpoints. It has now failed to save the final checkpoint multiple times. I am running this on Ubuntu under WSL2 on Windows 11.

Current behaviour

At the end of a training run, it will not save the last checkpoint.

{'loss': 0.3905, 'grad_norm': 0.240234375, 'learning_rate': 2.0071391760856373e-10, 'epoch': 2.0}
{'loss': 0.3887, 'grad_norm': 0.259765625, 'learning_rate': 0.0, 'epoch': 2.0}
{'train_runtime': 23410.1952, 'train_samples_per_second': 16.764, 'train_steps_per_second': 0.067, 'train_loss': 0.09891741436697231, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████| 1578/1578 [6:30:10<00:00, 14.84s/it]
[2024-05-13 14:47:28,053] [INFO] [axolotl.train.log:61] [PID:123985] [RANK:0] Training Completed!!! Saving pre-trained model to ./qlora-out
/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/peft/utils/save_and_load.py:154: UserWarning: Could not find a config file in /home/owen/models/Meta-Llama-3-8B-Instruct - will assume that the vocabulary was not modified.

Nothing appears to go wrong in the output, as shown above.

Steps to reproduce

Just run any training run; both the SFT and DPO runs I've tried failed to save the last checkpoint. I'm not sure if there is something wrong in my training config yaml or if it's a bug in Axolotl.

I've tried both enabling and disabling wandb, since that caused a similar issue a few months ago, but this time it made no difference.

Config yaml

base_model: /home/owen/models/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
  
train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: true
strict: false
sequence_len: 2048
bf16: true
fp16: false
tf32: false
flash_attention: true

# Data
datasets:
  - path: /home/owen/datasets/no-robots-sharegpt.jsonl
    type: sharegpt
    conversation: llama-3
  - path: /home/owen/datasets/fixed-dolphin201-sharegpt2.jsonl
    type: sharegpt
    conversation: llama-3
  - path: /home/owen/datasets/cleaned-WizardLM_alpaca_evol_instruct_70k.jsonl
    type: sharegpt
    conversation: llama-3
 
warmup_steps: 10
dataset_prepared_path: ./last_run_prepared

# Iterations
num_epochs: 2
saves_per_epoch: 2

# Evaluation
val_set_size: 0.01
eval_table_size:
eval_table_max_new_tokens:
eval_sample_packing: false
evals_per_epoch: 4

# LoRA
output_dir: ./qlora-out
adapter: qlora
lora_model_dir:
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
save_safetensors: true

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 32
micro_batch_size: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
  
# wandb
wandb_mode: disabled

# Optimizer
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

# Misc
early_stopping_patience:
auto_resume_from_checkpoints: true
logging_steps: 1
debug:
deepspeed:
weight_decay: 0.1
special_tokens:
  eos_token: "<|eot_id|>"
  pad_token: "<|end_of_text|>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11.9

axolotl branch-commit

2147cf6

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@Nero10578 added the bug label May 13, 2024
@winglian
Collaborator

hi @Nero10578 , can you verify for me what checkpoint step numbers it did save on as well as the total number of steps in the training? thanks!

@Nero10578
Contributor Author

hi @Nero10578 , can you verify for me what checkpoint step numbers it did save on as well as the total number of steps in the training? thanks!

It saved on these checkpoints:
Screenshot 2024-05-13 231731

There should be 1578 steps in total, as can be seen here:

{'train_runtime': 23470.402, 'train_samples_per_second': 16.721, 'train_steps_per_second': 0.067, 'train_loss': 0.09891688005930269, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████| 1578/1578 [6:31:06<00:00, 14.87s/it]
[2024-05-13 22:54:29,715] [INFO] [axolotl.train.log:61] [PID:802] [RANK:0] Training Completed!!! Saving pre-trained model to ./qlora-out
wandb:
wandb: Run history:
wandb:               eval/loss ▁█
wandb:            eval/runtime ▁█
wandb: eval/samples_per_second █▁
wandb:   eval/steps_per_second █▁
wandb:             train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
wandb:       train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:         train/grad_norm ▁▅▃█▃▃▄▂▂▃▃▂▃▆▄▃▂▅▃▃▃▂▄█▅▄▄▃▁▂▄▂▄▃▃▃▃▄▂▂
wandb:     train/learning_rate ██▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb:              train/loss ▇▃▆▅▄▄▅▄▅▄▂▅▄▆▃▅▄▅▅▅▅▃▄▄█▃▄▃▃▃▃▅▁▄▄▆▇▄▆▄
wandb:
wandb: Run summary:
wandb:                eval/loss 0.56355
wandb:             eval/runtime 563.5022
wandb:  eval/samples_per_second 3.519
wandb:    eval/steps_per_second 1.76
wandb:               total_flos 1.01735945815311e+19
wandb:              train/epoch 2.0
wandb:        train/global_step 1578
wandb:          train/grad_norm 0.26172
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.3887
wandb:               train_loss 0.09892
wandb:            train_runtime 23470.402
wandb: train_samples_per_second 16.721
wandb:   train_steps_per_second 0.067

I've tried running it again to see if it was a fluke, and no, it's still failing to save at the end. I've also tried with a super short test dataset and that saves fine. Is there something wrong with my set number of steps? It says this when resuming:

[2024-05-13 23:20:40,426] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:100043] [RANK:0] packing_efficiency_estimate: 0.93 total_num_tokens per device: 97218015
[2024-05-13 23:20:40,583] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:100043] [RANK:0] packing_efficiency_estimate: 0.93 total_num_tokens per device: 97218015
Warning: The training argument 'eval_steps' value (0.125) does not match the trainer state 'eval_steps' value (198). This argument will be overridden by the one found in trainer_state.json within the checkpoint directory.
Warning: The training argument 'save_steps' value (0.25) does not match the trainer state 'save_steps' value (395). This argument will be overridden by the one found in trainer_state.json within the checkpoint directory.
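
(A minimal sketch, assuming a standard HF Trainer checkpoint layout: the warnings above say the fractional eval_steps/save_steps are overridden by the absolute values stored in the checkpoint's trainer_state.json, so one way to check which intervals a resumed run will actually use is to read that file directly. The checkpoint path below is hypothetical.)

```python
# Sketch only: read the absolute intervals a resumed run will use.
# The checkpoint directory name is hypothetical -- point it at your latest one.
import json
from pathlib import Path

ckpt = Path("./qlora-out/checkpoint-1185")  # hypothetical latest checkpoint
state = json.loads((ckpt / "trainer_state.json").read_text())

# These stored values override the fractional eval_steps/save_steps
# from the training arguments when resuming, per the warnings above.
print("eval_steps:", state.get("eval_steps"))    # e.g. 198
print("save_steps:", state.get("save_steps"))    # e.g. 395
print("global_step:", state.get("global_step"))
```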

@winglian
Collaborator

it looks like the reason it didn't save the last step is that it is saving every 395 steps, so that means the next step it would save at is 1580, but your last step is 1578. Let me see if there is a good way to work around that.

@winglian
Collaborator

@Nero10578 Fixed in #1615

@Nero10578
Contributor Author

it looks like the reason it didn't save the last step is that it is saving every 395 steps, so that means the next step it would save at is 1580, but your last step is 1578. Let me see if there is a good way to work around that.

Awesome fix! Thank you for all your work on this! So essentially this was just a problem with how the save step interval worked out? That explains why it only happens sometimes.

@winglian
Collaborator

winglian commented May 14, 2024

yeah, what's happening is you have 1578 steps, and it saves 4 times, so 1578 * 0.25 = 394.5. Since it uses ceiling, it saves every 395 steps, and misses the last step. It might be worth raising an upstream issue with HF transformers for this to use math.floor instead.

Screenshot 2024-05-14 at 9 35 24 AM
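
(Spelling out that arithmetic, as a minimal sketch of the interval resolution described above rather than the actual transformers or Axolotl code:)

```python
# Sketch of how a fractional save_steps of 0.25 plays out over 1578 steps.
import math

max_steps = 1578     # total optimizer steps in this run
save_ratio = 0.25    # saves_per_epoch: 2 over num_epochs: 2 -> save every quarter

interval_ceil = math.ceil(max_steps * save_ratio)    # 395
interval_floor = math.floor(max_steps * save_ratio)  # 394

saves_ceil = list(range(interval_ceil, max_steps + 1, interval_ceil))
saves_floor = list(range(interval_floor, max_steps + 1, interval_floor))

print(saves_ceil)   # [395, 790, 1185] -- the 4th save would fall on step 1580, past the end
print(saves_floor)  # [394, 788, 1182, 1576] -- all 4 saves land inside the run
```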

@Nero10578
Contributor Author

yeah, what's happening is you have 1578 steps, and it saves 4 times, so 1578 * 0.25 = 394.5. Since it uses ceiling, it saves every 395 steps, and misses the last step. It might be worth raising an upstream issue with HF transformers for this to use math.floor instead.

Screenshot 2024-05-14 at 9 35 24 AM

Ah I see okay. Thanks for explaining that.
