Fine-tuning Qwen2-1.5B-Instruct: loss is always 0 #279

Open
frederichen01 opened this issue Jul 3, 2024 · 7 comments

Comments

@frederichen01

As the title says: how can this be resolved?

@frederichen01
Author

I'm using LoRA, with lora_rank: 32.

@MikuAndRabbit

You can try setting "bf16": true in the training-argument JSON file so that training runs in bfloat16. Here is an example training-argument file:

{
    "output_dir": "",
    "model_name_or_path": "",
    "train_file": "",
    "template_name": "qwen",
    "num_train_epochs": 2,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2e-4,
    "max_seq_length": 1024,
    "logging_steps": 20,
    "save_steps": 100,
    "save_total_limit": 1,
    "lr_scheduler_type": "constant_with_warmup",
    "warmup_steps": 100,
    "lora_rank": 32,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "use_unsloth": false,

    "gradient_checkpointing": true,
    "disable_tqdm": false,
    "optim": "paged_adamw_32bit",
    "seed": 42,
    "bf16": true,
    "report_to": "tensorboard",
    "dataloader_num_workers": 0,
    "save_strategy": "steps",
    "weight_decay": 0,
    "max_grad_norm": 0.3,
    "remove_unused_columns": false
}
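If it helps, here is a small sketch (not part of this repository; the helper name is made up) for confirming that the GPU actually supports bfloat16 before setting "bf16": true. On GPUs without bf16 support (pre-Ampere), falling back to fp32 is the safer choice, since per this thread fp16 is what produces the zero loss:

import torch

# Hypothetical helper, not from this repo: choose a training dtype based on
# what the current GPU supports.
def pick_train_dtype() -> str:
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return "bf16"  # Ampere (A100 / RTX 30xx) and newer
    return "fp32"      # avoid fp16, which is what leads to loss = 0 here

print(pick_train_dtype())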

@frederichen01
Author

(Quoting @MikuAndRabbit's bf16 suggestion and example training-argument file above.)

With that change training does work now, but at inference time I get: RuntimeError: probability tensor contains either inf, nan or element < 0

@MikuAndRabbit

(Quoting @frederichen01's reply above: training works after enabling bf16, but inference then fails with the RuntimeError.)

According to the author's explanation in #272, change torch_dtype = torch.float16 in ./component/utils.py to torch_dtype = torch.float32 or torch_dtype = torch.bfloat16.

The exact place to modify is in component/utils.py.
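For reference, here is a minimal sketch of what that dtype change amounts to when the model is loaded; the real code lives in component/utils.py and looks different, and the checkpoint path below is only an example:

import torch
from transformers import AutoModelForCausalLM

# Sketch only: the key point is torch_dtype, which should be bfloat16
# (or float32) rather than float16 for Qwen2 inference.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B-Instruct",   # example path; use your fine-tuned/merged checkpoint
    torch_dtype=torch.bfloat16,   # was torch.float16; torch.float32 also works
    trust_remote_code=True,
    device_map="auto",            # optional; requires accelerate
)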

@frederichen01
Author

I tried changing it to both float32 and bfloat16, and it still doesn't work.

@frederichen01
Author

(Quoting the exchange above.)

I tried both float32 and bfloat16; it still doesn't work.

@Liuchunyangboy

It seems Qwen2 does not support fp16 training.
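That would match the symptoms above: float16 has a much smaller dynamic range (max around 65504) than bfloat16 or float32, so logits or activations can overflow to inf, the softmax then contains nan, and the sampling step rejects the probability tensor with the RuntimeError quoted earlier. A toy illustration, not code from this repo:

import torch

# Toy illustration: 70000 overflows to inf in float16, while bfloat16 shares
# float32's exponent range and keeps it finite.
logits = torch.tensor([70000.0, 1.0, -1.0])

probs_fp16 = torch.softmax(logits.to(torch.float16), dim=-1)
probs_bf16 = torch.softmax(logits.to(torch.bfloat16), dim=-1)

print(probs_fp16)  # contains nan, because the overflowed inf poisons the softmax
print(probs_bf16)  # finite, roughly [1, 0, 0]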
