RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long #34573

Closed
qmin2 opened this issue Nov 2, 2024 · 4 comments


qmin2 commented Nov 2, 2024

System Info

transformers == 4.45
torch == 2.4.1+cu118
accelerate == 1.0.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from time import perf_counter

from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, LlamaForCausalLM

# args, config, optimizer and lr_scheduler are defined earlier in the script (omitted here).
dataset = load_dataset("pg19")
dataloader = {
    split: DataLoader(dataset[split], batch_size=args.batch_size, shuffle=(split == 'train'),
                      pin_memory=True)
    for split in ['train', 'validation', 'test']}

accelerator = Accelerator()
device = accelerator.device
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # or e.g. tokenizer.pad_token = tokenizer.eos_token
model = LlamaForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.bfloat16).to(device)
model.resize_token_embeddings(len(tokenizer))
train_dataloader, eval_dataloader, model, optimizer, lr_scheduler = accelerator.prepare(
    dataloader["train"], dataloader["validation"], model, optimizer, lr_scheduler
)

for epoch in range(1, args.num_epochs + 1):
    start_time = perf_counter()

    model.train()
    train_loss = 0

    for idx, batch in enumerate(tqdm(train_dataloader, disable=args.disable_tqdm)):
        # Tokenize on the fly, padding to the longest sequence in the batch.
        inputs = tokenizer(batch['text'], padding="longest", truncation=True, max_length=2200,
                           return_tensors='pt', return_token_type_ids=False).to(device)

        # Use the input ids as labels and mask padded positions out of the loss.
        inputs['labels'] = inputs['input_ids'].clone()
        label_mask = inputs['attention_mask'].bool()
        inputs['labels'][~label_mask] = -100

        loss = model(**inputs).loss

        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

Expected behavior

I'm using PyTorch 2.4.1+cu118 and transformers 4.45, training with a batch size of 2 on 2 NVIDIA A100-80GB GPUs. Whenever a batch contained padding, the attention_mask in LlamaSdpaAttention was activated (i.e. it was not None at this step):

causal_mask = attention_mask
if attention_mask is not None:
    causal_mask = causal_mask[:, :, :, : key_states.shape[-2]]

After the torch.nn.functional.scaled_dot_product_attention operation, I encountered the following error at this line:

accelerator.backward(loss)

RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long
For now, I’ve resolved this by skipping batches that include padding, but I would like to understand the root cause and potential solutions for this issue.
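
Roughly, the padding-skip workaround looks like this (simplified sketch; only the extra check differs from the training loop in the Reproduction section above):

for idx, batch in enumerate(tqdm(train_dataloader, disable=args.disable_tqdm)):
    inputs = tokenizer(batch['text'], padding="longest", truncation=True, max_length=2200,
                       return_tensors='pt', return_token_type_ids=False).to(device)

    # Skip any batch that required padding, i.e. whose attention mask contains zeros.
    # This avoids the error but discards data, so it is only a stopgap.
    if (inputs['attention_mask'] == 0).any():
        continue

    # ... the rest of the training step is unchanged ...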

qmin2 added the bug label Nov 2, 2024
Rocketknight1 (Member) commented

cc @muellerzr @SunMarc !

SunMarc (Member) commented Nov 4, 2024

Hey @qmin2, can you share your accelerate config? I've seen other posts reporting the same issue you are facing, so it might be relevant.

qmin2 (Author) commented Nov 10, 2024

Sorry for the late reply.

This is my accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/qmin2/3rd_semester_research/mixed_tokens/ds_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

And this is my DeepSpeed config (ds_config.json):

{
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-5,
            "weight_decay": 1e-5,
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "scheduler": {
        "type": "WarmupCosineLR",
        "params":{
            "total_num_steps" : 7500,
            "warmup_min_ratio" : 0.1
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
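
For completeness, the script itself only calls Accelerator() and relies on accelerate launch with the YAML above to pick these settings up. If I understand accelerate's API correctly, the rough programmatic equivalent (a sketch only, not my actual code) would be:

from accelerate import Accelerator, DeepSpeedPlugin

# Sketch: point the DeepSpeedPlugin at the same ds_config.json the YAML references.
ds_plugin = DeepSpeedPlugin(
    hf_ds_config="/home/qmin2/3rd_semester_research/mixed_tokens/ds_config.json",
    zero3_init_flag=False,
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)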

I also encountered another, similar issue.
I am passing a custom 4D attention mask to LlamaForCausalLM as input. The model (Llama 3.1) is loaded in bfloat16, and I run into an issue with scaled_dot_product_attention at the following line:

attn_output = torch.nn.functional.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=causal_mask,
    dropout_p=self.attention_dropout if self.training else 0.0,
    is_causal=is_causal,
)

The error message I get is a dtype mismatch between query_states and attention_bias. To resolve this, I converted my custom attention_mask to bfloat16 to match the Llama 3.1 model's dtype. After making this change, the previous error disappears, but a new issue arises during the backward pass with accelerator.backward(loss):

RuntimeError: linalg.vector_norm: Expected a floating point or complex tensor as input. Got Long

I suspect this issue is related to the activation of the causal_mask in LlamaSdpaAttention: the same error occurs whenever padding is present in the input and the causal mask is therefore not None.
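
For reference, this is roughly the kind of conversion I mean, as a standalone sketch (the shapes and the masked_fill / finfo.min pattern here are illustrative, not my actual code):

import torch
import torch.nn.functional as F

# Turn a boolean 4D mask of shape (batch, 1, q_len, kv_len) into an additive
# bias in the model's compute dtype so SDPA accepts it without a dtype mismatch.
bsz, n_heads, q_len, kv_len, head_dim = 2, 4, 8, 8, 16
dtype = torch.bfloat16

bool_mask = torch.tril(torch.ones(q_len, kv_len, dtype=torch.bool))    # example causal pattern
bool_mask = bool_mask[None, None, :, :].expand(bsz, 1, q_len, kv_len)  # broadcast to 4D

additive_mask = torch.zeros(bsz, 1, q_len, kv_len, dtype=dtype)
additive_mask = additive_mask.masked_fill(~bool_mask, torch.finfo(dtype).min)

q = torch.randn(bsz, n_heads, q_len, head_dim, dtype=dtype)
k = torch.randn(bsz, n_heads, kv_len, head_dim, dtype=dtype)
v = torch.randn(bsz, n_heads, kv_len, head_dim, dtype=dtype)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=additive_mask)  # no dtype mismatch here

In my actual run this makes the forward pass go through; the Long-tensor error then only appears in accelerator.backward(loss).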

github-actions bot commented Dec 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
