Fail to save last checkpoint #1613

Closed
7 of 8 tasks
Nero10578 opened this issue May 13, 2024 · 7 comments · Fixed by #1615
Labels
bug Something isn't working

Comments

@Nero10578
Contributor

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Expected behavior is to save the last checkpoint like the previous intermediate checkpoints. It has now failed to save the final checkpoint multiple times. I am running this on Ubuntu under WSL2 on Windows 11.

Current behaviour

At the end of a training run, it will not save the last checkpoint.

{'loss': 0.3905, 'grad_norm': 0.240234375, 'learning_rate': 2.0071391760856373e-10, 'epoch': 2.0}
{'loss': 0.3887, 'grad_norm': 0.259765625, 'learning_rate': 0.0, 'epoch': 2.0}
{'train_runtime': 23410.1952, 'train_samples_per_second': 16.764, 'train_steps_per_second': 0.067, 'train_loss': 0.09891741436697231, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████| 1578/1578 [6:30:10<00:00, 14.84s/it]
[2024-05-13 14:47:28,053] [INFO] [axolotl.train.log:61] [PID:123985] [RANK:0] Training Completed!!! Saving pre-trained model to ./qlora-out
/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/peft/utils/save_and_load.py:154: UserWarning: Could not find a config file in /home/owen/models/Meta-Llama-3-8B-Instruct - will assume that the vocabulary was not modified.

Nothing appears to go wrong in the output, as shown above.

Steps to reproduce

Just run any training run; both the SFT and DPO runs I've tried failed to save the last checkpoint. I'm not sure if there is something wrong in my training config yaml or if it's a bug in Axolotl.

I've tried both enabling and disabling wandb, since that caused a similar issue a few months ago, but this time it made no difference.

Config yaml

base_model: /home/owen/models/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
  
train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: true
strict: false
sequence_len: 2048
bf16: true
fp16: false
tf32: false
flash_attention: true

# Data
datasets:
  - path: /home/owen/datasets/no-robots-sharegpt.jsonl
    type: sharegpt
    conversation: llama-3
  - path: /home/owen/datasets/fixed-dolphin201-sharegpt2.jsonl
    type: sharegpt
    conversation: llama-3
  - path: /home/owen/datasets/cleaned-WizardLM_alpaca_evol_instruct_70k.jsonl
    type: sharegpt
    conversation: llama-3
 
warmup_steps: 10
dataset_prepared_path: ./last_run_prepared

# Iterations
num_epochs: 2
saves_per_epoch: 2

# Evaluation
val_set_size: 0.01
eval_table_size:
eval_table_max_new_tokens:
eval_sample_packing: false
evals_per_epoch: 4

# LoRA
output_dir: ./qlora-out
adapter: qlora
lora_model_dir:
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
save_safetensors: true

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 32
micro_batch_size: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
  
# wandb
wandb_mode: disabled

# Optimizer
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

# Misc
early_stopping_patience:
auto_resume_from_checkpoints: true
logging_steps: 1
debug:
deepspeed:
weight_decay: 0.1
special_tokens:
  eos_token: "<|eot_id|>"
  pad_token: "<|end_of_text|>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11.9

axolotl branch-commit

2147cf6

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@Nero10578 added the bug label May 13, 2024
@winglian
Collaborator

hi @Nero10578 , can you verify for me what checkpoint step numbers it did save on as well as the total number of steps in the training? thanks!

@Nero10578
Contributor Author

hi @Nero10578 , can you verify for me what checkpoint step numbers it did save on as well as the total number of steps in the training? thanks!

It saved on these checkpoints:
Screenshot 2024-05-13 231731

There should be 1578 steps in total, as can be seen here:

{'train_runtime': 23470.402, 'train_samples_per_second': 16.721, 'train_steps_per_second': 0.067, 'train_loss': 0.09891688005930269, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████| 1578/1578 [6:31:06<00:00, 14.87s/it]
[2024-05-13 22:54:29,715] [INFO] [axolotl.train.log:61] [PID:802] [RANK:0] Training Completed!!! Saving pre-trained model to ./qlora-out
wandb:
wandb: Run history:
wandb:               eval/loss ▁█
wandb:            eval/runtime ▁█
wandb: eval/samples_per_second █▁
wandb:   eval/steps_per_second █▁
wandb:             train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
wandb:       train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:         train/grad_norm ▁▅▃█▃▃▄▂▂▃▃▂▃▆▄▃▂▅▃▃▃▂▄█▅▄▄▃▁▂▄▂▄▃▃▃▃▄▂▂
wandb:     train/learning_rate ██▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb:              train/loss ▇▃▆▅▄▄▅▄▅▄▂▅▄▆▃▅▄▅▅▅▅▃▄▄█▃▄▃▃▃▃▅▁▄▄▆▇▄▆▄
wandb:
wandb: Run summary:
wandb:                eval/loss 0.56355
wandb:             eval/runtime 563.5022
wandb:  eval/samples_per_second 3.519
wandb:    eval/steps_per_second 1.76
wandb:               total_flos 1.01735945815311e+19
wandb:              train/epoch 2.0
wandb:        train/global_step 1578
wandb:          train/grad_norm 0.26172
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.3887
wandb:               train_loss 0.09892
wandb:            train_runtime 23470.402
wandb: train_samples_per_second 16.721
wandb:   train_steps_per_second 0.067

I've tried running it again to see if it was a fluke, and no, it's still failing to save at the end. I've also tried with a super short test dataset and that saves fine. Is there something wrong with my set number of steps? It says this when resuming:

[2024-05-13 23:20:40,426] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:100043] [RANK:0] packing_efficiency_estimate: 0.93 total_num_tokens per device: 97218015
[2024-05-13 23:20:40,583] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:100043] [RANK:0] packing_efficiency_estimate: 0.93 total_num_tokens per device: 97218015
Warning: The training argument 'eval_steps' value (0.125) does not match the trainer state 'eval_steps' value (198). This argument will be overridden by the one found in trainer_state.json within the checkpoint directory.
Warning: The training argument 'save_steps' value (0.25) does not match the trainer state 'save_steps' value (395). This argument will be overridden by the one found in trainer_state.json within the checkpoint directory.
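
(A minimal sketch, assuming a standard HF Trainer checkpoint layout: the warnings above say the fractional eval_steps/save_steps are overridden by the absolute values stored in the checkpoint's trainer_state.json, so one way to check which intervals a resumed run will actually use is to read that file directly. The checkpoint path below is hypothetical.)

```python
# Sketch only: read the absolute intervals a resumed run will use.
# The checkpoint directory name is hypothetical -- point it at your latest one.
import json
from pathlib import Path

ckpt = Path("./qlora-out/checkpoint-1185")  # hypothetical latest checkpoint
state = json.loads((ckpt / "trainer_state.json").read_text())

# These stored values override the fractional eval_steps/save_steps
# from the training arguments when resuming, per the warnings above.
print("eval_steps:", state.get("eval_steps"))    # e.g. 198
print("save_steps:", state.get("save_steps"))    # e.g. 395
print("global_step:", state.get("global_step"))
```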

@winglian
Collaborator

it looks like the reason it didn't save the last step is that it is saving every 395 steps, so that means the next step it would save at is 1580, but your last step is 1578. Let me see if there is a good way to work around that.

@winglian
Collaborator

@Nero10578 Fixed in #1615

@Nero10578
Contributor Author

it looks like the reason it didn't save the last step is that it is saving every 395 steps, so that means the next step it would save at is 1580, but your last step is 1578. Let me see if there is a good way to work around that.

Awesome fix! Thank you for all your work on this! So essentially this was just a problem with how the save step interval worked out? That explains why it only happens sometimes.

@winglian
Collaborator

winglian commented May 14, 2024

yeah, what's happening is you have 1578 steps, and it saves 4 times, so 1578 * 0.25 = 394.5. Since it uses ceiling, it saves every 395 steps, and misses the last step. It might be worth raising an upstream issue with HF transformers for this to use math.floor instead.

Screenshot 2024-05-14 at 9 35 24 AM
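
(Spelling out that arithmetic, as a minimal sketch of the interval resolution described above rather than the actual transformers or Axolotl code:)

```python
# Sketch of how a fractional save_steps of 0.25 plays out over 1578 steps.
import math

max_steps = 1578     # total optimizer steps in this run
save_ratio = 0.25    # saves_per_epoch: 2 over num_epochs: 2 -> save every quarter

interval_ceil = math.ceil(max_steps * save_ratio)    # 395
interval_floor = math.floor(max_steps * save_ratio)  # 394

saves_ceil = list(range(interval_ceil, max_steps + 1, interval_ceil))
saves_floor = list(range(interval_floor, max_steps + 1, interval_floor))

print(saves_ceil)   # [395, 790, 1185] -- the 4th save would fall on step 1580, past the end
print(saves_floor)  # [394, 788, 1182, 1576] -- all 4 saves land inside the run
```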

@Nero10578
Contributor Author

yeah, what's happening is you have 1578 steps, and it saves 4 times, so 1578 * 0.25 = 394.5. Since it uses ceiling, it saves every 395 steps, and misses the last step. It might be worth raising an upstream issue with HF transformers for this to use math.floor instead.

Screenshot 2024-05-14 at 9 35 24 AM

Ah I see okay. Thanks for explaining that.
