Automatically killed with no error message at all #6687

Open · 1 task done
duyu09 opened this issue Jan 17, 2025 · 1 comment

duyu09 commented Jan 17, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

llamafactory-cli env

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.5.0a0+b465a5843b.nv24.09 (GPU)
  • Transformers version: 4.45.2
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 2080 Ti

Reproduction

  • The command executed:
FORCE_TORCHRUN=0 CUDA_VISIBLE_DEVICES=1 llamafactory-cli train ../qwen_pretrain.yaml
  • The configuration in qwen_pretrain.yaml:
### model
model_name_or_path: /home/s-duy20/qwen
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: pretrain
cutoff_len: 500
max_samples: 127
overwrite_cache: true
# preprocessing_num_workers: 1

### output
output_dir: /home/s-duy20/saves/qwen/lora/pretrain
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
# ddp_timeout: 180000000
lora_rank: 7

### eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
  • The latter half of the output log:
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2204] 2025-01-17 11:01:20,810 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2470] 2025-01-17 11:01:21,138 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-01-17 11:01:21] llamafactory.data.loader:157 >> Loading dataset pretrain.json...
Generating train split: 0 examples [00:00, ? examples/s]Killed
  • Machine configuration:
    About 100 GB of RAM free (ample), roughly 12 GB of GPU memory, and plenty of disk space.

As the log shows, there is no error message at all; the process was simply Killed. What is causing this, and how can it be fixed?
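
Note: a bare "Killed" with no Python traceback usually means the process received SIGKILL from outside Python, most often from the kernel OOM killer. A minimal check, assuming the kernel log is readable from this environment (it may not be inside a container):

# Look for OOM-killer activity around the time of the crash
dmesg -T | grep -iE "out of memory|killed process"
# or, on hosts with systemd:
journalctl -k --since "1 hour ago" | grep -iE "out of memory|oom"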

Others

No response

@duyu09 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Jan 17, 2025
@hiyouga hiyouga self-assigned this Jan 17, 2025
@tristanwqy

Keep an eye on the memory usage; it can only be a memory problem.
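
A minimal way to do that, assuming a second shell on the same machine, is to sample memory while re-running the training command and watch whether free RAM collapses during the "Generating train split" step:

# Sample system memory every 2 seconds while the training command runs
watch -n 2 free -h
# or append samples to a log file (file name is arbitrary) for later inspection
while true; do date; free -h; sleep 2; done >> mem_usage.log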
