"triu_tril_cuda_template" not implemented for 'BFloat16' #1532

Closed
1 task done
ashmalvayani opened this issue Apr 17, 2024 · 3 comments
Closed
1 task done

"triu_tril_cuda_template" not implemented for 'BFloat16' #1532

ashmalvayani opened this issue Apr 17, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@ashmalvayani
Copy link

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

### Expected Behavior

I am trying to fine-tune CohereForAI/c4ai-command-r-v01 with the axolotl framework; the YAML config file is included under "Config yaml" below. Training should start normally. Instead, I am getting the following error:

```
Traceback (most recent call last):
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/beegfs/fahad.khan/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/beegfs/fahad.khan/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/mnt/beegfs/fahad.khan/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/mnt/beegfs/fahad.khan/axolotl/src/axolotl/train.py", line 163, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3036, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/beegfs/fahad.khan/axolotl/src/axolotl/core/trainer_builder.py", line 485, in compute_loss
    return super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 3059, in compute_loss
    outputs = model(**inputs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/peft/peft_model.py", line 1129, in forward
    return self.base_model(
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/transformers/models/cohere/modeling_cohere.py", line 1099, in forward
    outputs = self.model(
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/transformers/models/cohere/modeling_cohere.py", line 889, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
  File "/home/ashmal.vayani/anaconda3/envs/axolotl/lib/python3.10/site-packages/transformers/models/cohere/modeling_cohere.py", line 975, in _update_causal_mask
    causal_mask = torch.triu(causal_mask, diagonal=1)
RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'
```

### Current behaviour

Transformers version: 4.39.3
Torch version: 2.0.1+cu117
accelerate version: 0.28.0

### Steps to reproduce

Change the YAML file as below and run the train command with the lora.yaml config file (e.g. `accelerate launch -m axolotl.cli.train lora.yaml`).

### Config yaml

```yaml
base_model: CohereForAI/c4ai-command-r-v01
trust_remote_code: true

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
    - path: Data_Clean3.json
      ds_type: json
      type: alpaca
dataset_prepared_path: last_run_prepared/cohere-command/3308b18091e3a983103cbeb4cceb82d0
val_set_size: 0.0
output_dir: ./outputs/c4ai_lora

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

adapter: lora
lora_model_dir:
sample_packing: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16: 
tf32: false

gradient_checkpointing: false  # don't use with fsdp_activation_checkpointing
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch:
saves_per_epoch: 1
debug:
weight_decay: 0.0
deepspeed: 

special_tokens:
  bos_token: "<BOS_TOKEN>"
  eos_token: "<|END_OF_TURN_TOKEN|>"
  pad_token: "<PAD>"
```

### Possible solution

_No response_

### Which Operating Systems are you using?

- [X] Linux
- [ ] macOS
- [ ] Windows

### Python Version

3.10

### axolotl branch-commit

main

### Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
@ashmalvayani added the bug label on Apr 17, 2024
@NanoCode012 (Collaborator)

Hm, I believe this should be raised upstream with transformers, as the issue is in their modeling code.

Here is a similar issue: huggingface/diffusers#3453

The code would need to be updated to avoid torch.triu, since that version of PyTorch does not support it for bf16 on CUDA.
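For illustration, a workaround along those lines could look like the sketch below (hypothetical, not the actual transformers patch): build the mask in a dtype the CUDA triu kernel does support, then cast to bf16. `seq_len` here is just the `sequence_len` from the config above.

```python
import torch

seq_len = 2048  # sequence_len from the config above

# Hypothetical sketch: compute the causal mask in float32, where the CUDA
# triu kernel is implemented on torch 2.0.x, then cast to bfloat16.
min_value = torch.finfo(torch.bfloat16).min
causal_mask = torch.full((seq_len, seq_len), min_value, dtype=torch.float32, device="cuda")
causal_mask = torch.triu(causal_mask, diagonal=1).to(torch.bfloat16)
```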

@NanoCode012 (Collaborator) commented Apr 18, 2024

Found a solution and it turns out to be your comment :) huggingface/transformers#30304 (comment)


Axolotl currently requires torch>2.1, I believe.
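For anyone hitting this, a quick environment check, sketched under the assumption (from the linked comment) that 2.1 is the first release with the bf16 CUDA triu/tril kernels:

```python
import torch
from packaging import version

# Assumption: torch>=2.1 ships bf16 CUDA triu/tril kernels,
# per the linked transformers comment.
if version.parse(torch.__version__.split("+")[0]) < version.parse("2.1"):
    raise RuntimeError(
        f"torch {torch.__version__} lacks bf16 triu on CUDA; upgrade to >=2.1"
    )
```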


Should this be closed then?

@ashmalvayani (Author)

Axolotl works well even with torch<2.1 in 8-bit, but it causes problems with 4-bit. For now, though, I think it should be fine. Closing this issue. Thanks.
