Fine-tuning Qwen2-1.5B-Instruct: loss is always 0 #279

Open
frederichen01 opened this issue Jul 3, 2024 · 7 comments

Comments

@frederichen01

As the title says: how can this be resolved?

@frederichen01
Author

I'm using LoRA, with lora_rank: 32.

@MikuAndRabbit

You can try setting "bf16": true in the training-argument JSON file so that training runs in bfloat16. Here is an example training-argument file:

{
    "output_dir": "",
    "model_name_or_path": "",
    "train_file": "",
    "template_name": "qwen",
    "num_train_epochs": 2,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2e-4,
    "max_seq_length": 1024,
    "logging_steps": 20,
    "save_steps": 100,
    "save_total_limit": 1,
    "lr_scheduler_type": "constant_with_warmup",
    "warmup_steps": 100,
    "lora_rank": 32,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "use_unsloth": false,

    "gradient_checkpointing": true,
    "disable_tqdm": false,
    "optim": "paged_adamw_32bit",
    "seed": 42,
    "bf16": true,
    "report_to": "tensorboard",
    "dataloader_num_workers": 0,
    "save_strategy": "steps",
    "weight_decay": 0,
    "max_grad_norm": 0.3,
    "remove_unused_columns": false
}
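If it helps, here is a small sketch (not part of this repository; the helper name is made up) for confirming that the GPU actually supports bfloat16 before setting "bf16": true. On GPUs without bf16 support (pre-Ampere), falling back to fp32 is the safer choice, since per this thread fp16 is what produces the zero loss:

import torch

# Hypothetical helper, not from this repo: choose a training dtype based on
# what the current GPU supports.
def pick_train_dtype() -> str:
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return "bf16"  # Ampere (A100 / RTX 30xx) and newer
    return "fp32"      # avoid fp16, which is what leads to loss = 0 here

print(pick_train_dtype())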

@frederichen01
Author

(Quoting @MikuAndRabbit's bf16 suggestion and example training-argument file above.)

With that change training does work now, but at inference time I get: RuntimeError: probability tensor contains either inf, nan or element < 0

@MikuAndRabbit

(Quoting @frederichen01's reply above: training works after enabling bf16, but inference then fails with the RuntimeError.)

According to the author's explanation in #272, change torch_dtype = torch.float16 in ./component/utils.py to torch_dtype = torch.float32 or torch_dtype = torch.bfloat16.

The exact place to modify is in component/utils.py.
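For reference, here is a minimal sketch of what that dtype change amounts to when the model is loaded; the real code lives in component/utils.py and looks different, and the checkpoint path below is only an example:

import torch
from transformers import AutoModelForCausalLM

# Sketch only: the key point is torch_dtype, which should be bfloat16
# (or float32) rather than float16 for Qwen2 inference.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B-Instruct",   # example path; use your fine-tuned/merged checkpoint
    torch_dtype=torch.bfloat16,   # was torch.float16; torch.float32 also works
    trust_remote_code=True,
    device_map="auto",            # optional; requires accelerate
)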

@frederichen01
Author

I tried changing it to both float32 and bfloat16, and it still doesn't work.

@frederichen01
Author

(Quoting the exchange above.)

I tried both float32 and bfloat16; it still doesn't work.

@Liuchunyangboy

It seems Qwen2 does not support fp16 training.
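That would match the symptoms above: float16 has a much smaller dynamic range (max around 65504) than bfloat16 or float32, so logits or activations can overflow to inf, the softmax then contains nan, and the sampling step rejects the probability tensor with the RuntimeError quoted earlier. A toy illustration, not code from this repo:

import torch

# Toy illustration: 70000 overflows to inf in float16, while bfloat16 shares
# float32's exponent range and keeps it finite.
logits = torch.tensor([70000.0, 1.0, -1.0])

probs_fp16 = torch.softmax(logits.to(torch.float16), dim=-1)
probs_bf16 = torch.softmax(logits.to(torch.bfloat16), dim=-1)

print(probs_fp16)  # contains nan, because the overflowed inf poisons the softmax
print(probs_bf16)  # finite, roughly [1, 0, 0]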
