
ORPO seems broken with micro_batch_size or eval_batch_size > 1 #1489

Closed
xzuyn opened this issue Apr 7, 2024 · 1 comment · Fixed by #1551
Labels
bug Something isn't working

Comments

xzuyn (Contributor) commented Apr 7, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

It should run without an error, as it does when you have micro_batch_size and eval_batch_size set to 1.

Current behaviour

It returns two errors:

ValueError: expected sequence of length 406 at dim 1 (got 75)

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (rejected_input_ids in this case) have excessive nesting (inputs type list where type int is expected).

Traceback (most recent call last):
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 759, in convert_to_tensors
    tensor = as_tensor(value)
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 721, in as_tensor
    return torch.tensor(value)
ValueError: expected sequence of length 406 at dim 1 (got 75)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/media/xzuyn/NVMe/LClones/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/media/xzuyn/NVMe/LClones/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/media/xzuyn/NVMe/LClones/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/media/xzuyn/NVMe/LClones/axolotl/src/axolotl/train.py", line 160, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 451, in __iter__
    current_batch = next(dataloader_iter)
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/media/xzuyn/NVMe/LClones/axolotl/src/axolotl/monkeypatch/data/batch_dataset_fetcher.py", line 32, in fetch
    return self.collate_fn(data)
  File "/media/xzuyn/NVMe/LClones/axolotl/src/axolotl/utils/collators.py", line 106, in __call__
    features = self.tokenizer.pad(
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3369, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 224, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/media/xzuyn/NVMe/LClones/axolotl/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 775, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`rejected_input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
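
For what it's worth, the second error looks reproducible outside of axolotl. A minimal sketch (not the collator code itself), assuming the collator hands tokenizer.pad per-example lists keyed like the error above: pad only pads the standard model-input keys, so an extra key such as rejected_input_ids stays ragged, and converting the batch to tensors fails the same way. With batch size 1 every key trivially has a single length, which would explain why that case works.

```python
from transformers import BatchEncoding

# Two "examples" whose standard keys are padded to a common length but whose
# rejected_input_ids are not; tensor conversion then fails like the traceback above.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6]],          # already padded to the same length
    "rejected_input_ids": [[7, 8, 9, 10], [11]],  # left ragged by tokenizer.pad
}
try:
    BatchEncoding(batch, tensor_type="pt")
except ValueError as err:
    print(err)  # Unable to create tensor ... (`rejected_input_ids` in this case) ...
```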

Steps to reproduce

Run the YAML provided, which has a micro_batch_size and eval_batch_size of 2.

I tested:

micro_batch_size: 1 & eval_batch_size: 1 - Works
micro_batch_size: 2 & eval_batch_size: 2 - Errors
micro_batch_size: 2 & eval_batch_size: 1 - Errors
micro_batch_size: 1 & eval_batch_size: 2 - Errors

Config yaml

wandb_project: MV02-7B
wandb_entity:
wandb_watch:
wandb_name: ORPO-QLoRA-run_1-Test-1
wandb_log_model:

output_dir: ./MV02-Test-1-run_1-ORPO-7B-QLoRA
resume_from_checkpoint:
save_steps: 10
saves_per_epoch:
save_safetensors: true
save_total_limit: 5
hub_model_id:
hub_strategy:

base_model: alpindale/Mistral-7B-v0.2-hf
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
is_llama_derived_model: false
is_mistral_derived_model: true
is_falcon_derived_model: false
is_qwen_derived_model: false

bf16: true
fp16: false
tf32: false

load_in_8bit: false
load_in_4bit: true
strict: false

sequence_len: 4096
s2_attention: false
sample_packing: false
pad_to_sequence_len: false
train_on_inputs: false
group_by_length: false

adapter: qlora
lora_model_dir:
lora_r: 64
lora_alpha: 64
lora_dropout: 0.125
lora_fan_in_fan_out:
lora_target_linear:
save_embedding_layers:
peft_layers_to_transform:
peft_use_dora:
peft_use_rslora: true
peft_layer_replication:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_modules_to_save:

unfrozen_parameters:

rl: orpo
orpo_alpha: 0.1
remove_unused_columns: false
chat_template: chatml
datasets:
  - path: argilla/ultrafeedback-binarized-preferences-cleaned
    type: orpo.chat_template
val_set_size: 0.01
eval_sample_packing: false
evaluation_strategy: steps
eval_steps: 10
evals_per_epoch:
test_datasets:
dataset_prepared_path: ./Test-1-seed42
push_dataset_to_hub:
hf_use_auth_token:
shuffle_merged_datasets: true

num_epochs: 1
gradient_accumulation_steps: 8
micro_batch_size: 2
eval_batch_size: 2
warmup_steps: 0
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.00001
loraplus_lr_ratio: 8
loraplus_lr_embedding:
cosine_min_lr_ratio:
weight_decay: 0.01
max_grad_norm: 1.0
logging_steps: 1

gradient_checkpointing: true
early_stopping_patience: false
local_rank:
xformers_attention: false
flash_attention: false
sdp_attention: true

loss_watchdog_threshold: 100.0
loss_watchdog_patience: 3

debug: true
seed: 42
deepspeed:
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

No response
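
For reference only, a rough sketch of the kind of collator change that might avoid the mismatch: pad every chosen/rejected key in the batch to its own maximum length before building tensors, rather than relying on tokenizer.pad. The helper name and key handling below are assumptions on my part; the actual fix landed in #1551 and may look quite different.

```python
import torch

def pad_orpo_features(features, pad_token_id, label_pad_token_id=-100):
    """Hypothetical helper: pad each key (input_ids, rejected_input_ids,
    *_attention_mask, *_labels, ...) to the longest sequence for that key
    in the batch, so batched tensors line up when micro_batch_size > 1."""
    batch = {}
    for key in features[0]:
        seqs = [list(f[key]) for f in features]
        max_len = max(len(s) for s in seqs)
        if key.endswith("labels"):
            pad_value = label_pad_token_id   # padding positions ignored by the loss
        elif key.endswith("attention_mask"):
            pad_value = 0                    # padding positions masked out
        else:
            pad_value = pad_token_id         # *_input_ids
        batch[key] = torch.tensor(
            [s + [pad_value] * (max_len - len(s)) for s in seqs]
        )
    return batch
```

For example, two features with input_ids of lengths 2 and 1 and rejected_input_ids of lengths 1 and 3 come out as 2x2 and 2x3 tensors instead of raising.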

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10.12

axolotl branch-commit

main/bda48f0

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
xzuyn added the bug label Apr 7, 2024

LeeWonc commented Apr 8, 2024

Same issue....
