train-text-from-scratch and finetune nan loss on iter=2 #3940

Closed
abb128 opened this issue Nov 4, 2023 · 2 comments · Fixed by #3974
Labels
bug Something isn't working

Comments

abb128 commented Nov 4, 2023

I was trying out the finetune example with my model, but the loss kept going to nan. I eventually tried train-text-from-scratch, following the instructions in its README, and the loss goes to nan there as well. I've reproduced this on two machines.

root@c5a10438d69e:/workspace/llama.cpp# ./train-text-from-scratch         --vocab-model ./models/ggml-vocab-llama.gguf         --ctx 64 --embd 256 --head 8 --layer 16         --checkpoint-in  chk-shakespeare-256x16-LATEST.gguf         --checkpoint-out chk-shakespeare-256x16-ITERATION.gguf         --model-out ggml-shakespeare-256x16-f32-ITERATION.gguf         --train-data "shakespeare.txt"         -t 6 -b 16 --seed 1 --adam-iter 256         --no-checkpointing
main: seed: 1
llama_model_loader: loaded meta data with 17 key-value pairs and 0 tensors from ./models/ggml-vocab-llama.gguf (version GGUF V3 (latest))
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       tokenizer.ggml.model str     
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  12:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  14:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32     
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 0.00 B
llm_load_print_meta: model size       = 0.00 MiB (-nan BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
main: init model
print_params: n_vocab: 32000
print_params: n_ctx:   64
print_params: n_embd:  256
print_params: n_head:  8
print_params: n_ff:    768
print_params: n_layer: 16
print_params: n_rot:   32
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: model_size = 240304416 bytes (229.2 MB)
main: opt_size  = 360288432 bytes (343.6 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
main: compute_size = 701759840 bytes (669.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: total number of samples: 27520
main: number of training tokens: 27584
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter=     0 sample=1/27520 sched=0.000000 loss=0.000000 |->
train_opt_callback: iter=     1 sample=17/27520 sched=0.010000 loss=10.373524 dt=00:00:03 eta=00:15:01 |->
train_opt_callback: iter=     2 sample=33/27520 sched=0.020000 loss=nan dt=00:00:03 eta=00:14:19 |>
train_opt_callback: iter=     3 sample=49/27520 sched=0.030000 loss=nan dt=00:00:03 eta=00:15:01 |>
^C
root@c5a10438d69e:/workspace/llama.cpp# ^C
root@c5a10438d69e:/workspace/llama.cpp# git log | head -1
commit d9b33fe95bd257b36c84ee5769cc048230067d6f
root@c5a10438d69e:/workspace/llama.cpp# lscpu | egrep "AMD|Flags"
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC Processor
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt nrip_save
Virtualization:                  AMD-V
root@c5a10438d69e:/workspace/llama.cpp# uname -a
Linux c5a10438d69e 5.4.0-139-generic #156-Ubuntu SMP Fri Jan 20 17:27:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@c5a10438d69e:/workspace/llama.cpp# g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@c5a10438d69e:/workspace/llama.cpp# make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
root@c5a10438d69e:/workspace/llama.cpp# 

I've bisected this: 898aeca is the first bad commit. Reverting to the previous commit, c43c2da, train-text-from-scratch and finetune appear to work fine (the loss does not go to nan).
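
For anyone who wants to retrace this, a bisection along the following lines reproduces the result (a sketch only; <last-known-good> is a placeholder for any older commit where the loss stays finite, and the repro step is the train-text-from-scratch command at the top of this issue):

git bisect start
git bisect bad HEAD                  # loss goes to nan here
git bisect good <last-known-good>    # loss stays finite here
# at each commit git checks out: rebuild and rerun the repro command
make clean && make train-text-from-scratch
git bisect good                      # or: git bisect bad, depending on the result
# repeat until git prints "<sha> is the first bad commit"
git bisect reset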

@randaller

Same here: finetune loss goes to nan on iter=2 on a Windows build, with and without cuBLAS, AVX2.

maxxk commented Nov 4, 2023

I see the same behavior on Haswell (AVX2), followed by a segmentation fault.

I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   unknown
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -Ofast -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -mavx -mavx2 -mssse3 -mfma -g
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mavx -mavx2 -mssse3 -mfma -g -Ofast -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mavx -mavx2 -mssse3 -mfma -g -O3  -Wno-pedantic -Xcompiler "-Ofast -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS:
I CC:        gcc (GCC) 12.2.0
I CXX:       g++ (GCC) 12.2.0

The default build without AVX2 initially seemed to work fine (though 4 times slower), but I was wrong: it crashes too.
The M1 build works fine.

GDB on the segmentation fault (I don't really know where to look):

main: work_size = 512240 bytes (0.5 MB)
train_opt_callback: iter=     0 sample=1/446045 sched=0.000000 loss=0.000000 |->
[New Thread 0x7fff7257c6c0 (LWP 774976)]
[New Thread 0x7fff71d7b6c0 (LWP 774977)]
[New Thread 0x7fff7157a6c0 (LWP 774978)]
[Thread 0x7fff7157a6c0 (LWP 774978) exited]
[Thread 0x7fff71d7b6c0 (LWP 774977) exited]
[Thread 0x7fff7257c6c0 (LWP 774976) exited]
train_opt_callback: iter=     1 sample=17/446045 sched=0.010000 loss=10.373476 dt=00:00:05 eta=16:02:24 |->
[New Thread 0x7fff7157a6c0 (LWP 774981)]
[New Thread 0x7fff71d7b6c0 (LWP 774982)]
[New Thread 0x7fff7257c6c0 (LWP 774983)]
[Thread 0x7fff7257c6c0 (LWP 774983) exited]
[Thread 0x7fff71d7b6c0 (LWP 774982) exited]
[Thread 0x7fff7157a6c0 (LWP 774981) exited]
main: total training time: 00:00:11
save_checkpoint_file: saving to checkpoint-1.gguf
save_checkpoint_file: saving to checkpoint-LATEST.gguf
save_llama_model_file: saving to ggml-checkpoint-f32.bin
save_llama_model_file: saving to ggml-checkpoint-f32.bin

Thread 1 "train-text-from" received signal SIGSEGV, Segmentation fault.
ggml_backend_buffer_free (buffer=0x603662) at ggml-backend.c:37
37              buffer->iface.free_buffer(buffer);
(gdb) p buffer
$1 = (ggml_backend_buffer_t) 0x603662
(gdb) p buffer->iface
$2 = {free_buffer = 0x50000000000000, get_base = 0x3f23000000000000, get_alloc_size = 0x9888000000000060, init_tensor = 0x398000007fffffff, free_tensor = 0x36c0000000000060}
(gdb) p buffer->iface.free_buffer
$3 = (void (*)(ggml_backend_buffer_t)) 0x50000000000000
(gdb) x/i 0x50000000000000
   0x50000000000000:    Cannot access memory at address 0x50000000000000

The sanitizers report two errors:

main: input_size = 262152224 bytes (250.0 MB)
ggml-alloc.c:212:32: runtime error: pointer index expression with base 0x000017001240 overflowed to 0xffffffffffffffff
main: compute_size = 1370751584 bytes (1307.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: total number of samples: 446045
main: number of training tokens: 446173
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 512240 bytes (0.5 MB)
train_opt_callback: iter=     0 sample=1/446045 sched=0.000000 loss=0.000000 |>
train_opt_callback: iter=     1 sample=17/446045 sched=0.010000 loss=10.373759 dt=00:00:49 eta=5d 16:25:50 |->
main: total training time: 00:01:35
save_checkpoint_file: saving to checkpoint-1.gguf
save_checkpoint_file: saving to checkpoint-LATEST.gguf
save_llama_model_file: saving to ggml-checkpoint-f32.bin
save_llama_model_file: saving to ggml-checkpoint-f32.bin
AddressSanitizer:DEADLYSIGNAL
=================================================================
==781429==ERROR: AddressSanitizer: SEGV on unknown address 0x7f06e145d808 (pc 0x000000d31a29 bp 0x7ffebac51620 sp 0x7ffebac51610 T0)
==781429==The signal is caused by a READ memory access.
    #0 0xd31a29 in ggml_allocr_free /TRASH/llama/llama.cpp/ggml-alloc.c:330
    #1 0x45f339 in main examples/train-text-from-scratch/train-text-from-scratch.cpp:1300
    #2 0x7f06e2e3dacd in __libc_start_call_main (/nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libc.so.6+0x23acd)
    #3 0x7f06e2e3db88 in __libc_start_main_alias_2 (/nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libc.so.6+0x23b88)
    #4 0x46bb94 in _start (/TRASH/llama/llama.cpp/train-text-from-scratch+0x46bb94)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /TRASH/llama/llama.cpp/ggml-alloc.c:330 in ggml_allocr_free
==781429==ABORTING
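
For reference, a sanitizer-instrumented build like the one above can be produced roughly as follows (a sketch that assumes the LLAMA_SANITIZE_ADDRESS / LLAMA_SANITIZE_UNDEFINED switches in llama.cpp's Makefile; the training command is the same one as in the first comment):

make clean
make LLAMA_SANITIZE_ADDRESS=1 LLAMA_SANITIZE_UNDEFINED=1 train-text-from-scratch
./train-text-from-scratch ...       # same arguments as in the first comment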

@ggerganov ggerganov added bug Something isn't working and removed bug-unconfirmed labels Nov 5, 2023
@cebtenzzre cebtenzzre linked a pull request Nov 6, 2023 that will close this issue
@cebtenzzre cebtenzzre removed their assignment Nov 7, 2023