Automatically killed with no error message at all #6687

Open · 1 task done
duyu09 opened this issue Jan 17, 2025 · 1 comment

duyu09 commented Jan 17, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

llamafactory-cli env

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.5.0a0+b465a5843b.nv24.09 (GPU)
  • Transformers version: 4.45.2
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 2080 Ti

Reproduction

  • The command executed:
FORCE_TORCHRUN=0 CUDA_VISIBLE_DEVICES=1 llamafactory-cli train ../qwen_pretrain.yaml
  • The configuration in qwen_pretrain.yaml:
### model
model_name_or_path: /home/s-duy20/qwen
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: pretrain
cutoff_len: 500
max_samples: 127
overwrite_cache: true
# preprocessing_num_workers: 1

### output
output_dir: /home/s-duy20/saves/qwen/lora/pretrain
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
# ddp_timeout: 180000000
lora_rank: 7

### eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
  • The latter half of the output log:
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2470] 2025-01-17 11:01:21,138 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-01-17 11:01:21] llamafactory.data.loader:157 >> Loading dataset pretrain.json...
Generating train split: 0 examples [00:00, ? examples/s]Killed
  • Machine configuration:
    About 100 GB of RAM free (ample), roughly 12 GB of GPU memory, and plenty of disk space.

As the log shows, there is no error message at all; the process was simply Killed. What is causing this, and how can it be fixed?
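
Note: a bare "Killed" with no Python traceback usually means the process received SIGKILL from outside Python, most often from the kernel OOM killer. A minimal check, assuming the kernel log is readable from this environment (it may not be inside a container):

# Look for OOM-killer activity around the time of the crash
dmesg -T | grep -iE "out of memory|killed process"
# or, on hosts with systemd:
journalctl -k --since "1 hour ago" | grep -iE "out of memory|oom"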

Others

No response

@duyu09 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Jan 17, 2025
@hiyouga hiyouga self-assigned this Jan 17, 2025
@tristanwqy

Keep an eye on the memory usage; it can only be a memory problem.
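
A minimal way to do that, assuming a second shell on the same machine, is to sample memory while re-running the training command and watch whether free RAM collapses during the "Generating train split" step:

# Sample system memory every 2 seconds while the training command runs
watch -n 2 free -h
# or append samples to a log file (file name is arbitrary) for later inspection
while true; do date; free -h; sleep 2; done >> mem_usage.log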
