train-text-from-scratch and finetune nan loss on iter=2 #3940

Closed
abb128 opened this issue Nov 4, 2023 · 2 comments · Fixed by #3974
Labels
bug Something isn't working

Comments

abb128 commented Nov 4, 2023

I was trying out the finetune example with my model, but the loss kept going to nan. I eventually tried train-text-from-scratch, following the instructions in its README, and the loss goes to nan there as well. I've reproduced this on two machines.

root@c5a10438d69e:/workspace/llama.cpp# ./train-text-from-scratch         --vocab-model ./models/ggml-vocab-llama.gguf         --ctx 64 --embd 256 --head 8 --layer 16         --checkpoint-in  chk-shakespeare-256x16-LATEST.gguf         --checkpoint-out chk-shakespeare-256x16-ITERATION.gguf         --model-out ggml-shakespeare-256x16-f32-ITERATION.gguf         --train-data "shakespeare.txt"         -t 6 -b 16 --seed 1 --adam-iter 256         --no-checkpointing
main: seed: 1
llama_model_loader: loaded meta data with 17 key-value pairs and 0 tensors from ./models/ggml-vocab-llama.gguf (version GGUF V3 (latest))
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       tokenizer.ggml.model str     
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  12:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  14:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32     
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 0.00 B
llm_load_print_meta: model size       = 0.00 MiB (-nan BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
main: init model
print_params: n_vocab: 32000
print_params: n_ctx:   64
print_params: n_embd:  256
print_params: n_head:  8
print_params: n_ff:    768
print_params: n_layer: 16
print_params: n_rot:   32
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: model_size = 240304416 bytes (229.2 MB)
main: opt_size  = 360288432 bytes (343.6 MB)
main: opt iter 0
main: input_size = 131076128 bytes (125.0 MB)
main: compute_size = 701759840 bytes (669.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: total number of samples: 27520
main: number of training tokens: 27584
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter=     0 sample=1/27520 sched=0.000000 loss=0.000000 |->
train_opt_callback: iter=     1 sample=17/27520 sched=0.010000 loss=10.373524 dt=00:00:03 eta=00:15:01 |->
train_opt_callback: iter=     2 sample=33/27520 sched=0.020000 loss=nan dt=00:00:03 eta=00:14:19 |>
train_opt_callback: iter=     3 sample=49/27520 sched=0.030000 loss=nan dt=00:00:03 eta=00:15:01 |>
^C
root@c5a10438d69e:/workspace/llama.cpp# ^C
root@c5a10438d69e:/workspace/llama.cpp# git log | head -1
commit d9b33fe95bd257b36c84ee5769cc048230067d6f
root@c5a10438d69e:/workspace/llama.cpp# lscpu | egrep "AMD|Flags"
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC Processor
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt nrip_save
Virtualization:                  AMD-V
root@c5a10438d69e:/workspace/llama.cpp# uname -a
Linux c5a10438d69e 5.4.0-139-generic #156-Ubuntu SMP Fri Jan 20 17:27:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@c5a10438d69e:/workspace/llama.cpp# g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@c5a10438d69e:/workspace/llama.cpp# make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
root@c5a10438d69e:/workspace/llama.cpp# 

I've bisected this: 898aeca is the first bad commit. Reverting to the previous commit, c43c2da, train-text-from-scratch and finetune appear to work fine (the loss does not go to nan).
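
For anyone who wants to retrace this, a bisection along the following lines reproduces the result (a sketch only; <last-known-good> is a placeholder for any older commit where the loss stays finite, and the repro step is the train-text-from-scratch command at the top of this issue):

git bisect start
git bisect bad HEAD                  # loss goes to nan here
git bisect good <last-known-good>    # loss stays finite here
# at each commit git checks out: rebuild and rerun the repro command
make clean && make train-text-from-scratch
git bisect good                      # or: git bisect bad, depending on the result
# repeat until git prints "<sha> is the first bad commit"
git bisect reset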

@randaller

Same here: finetune loss goes to nan on iter=2 on a Windows build, with and without cuBLAS, AVX2.

maxxk commented Nov 4, 2023

I see the same behavior on Haswell (AVX2), followed by a segmentation fault.

I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   unknown
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -Ofast -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -mavx -mavx2 -mssse3 -mfma -g
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mavx -mavx2 -mssse3 -mfma -g -Ofast -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mavx -mavx2 -mssse3 -mfma -g -O3  -Wno-pedantic -Xcompiler "-Ofast -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS:
I CC:        gcc (GCC) 12.2.0
I CXX:       g++ (GCC) 12.2.0

The default build without AVX2 initially seemed to work fine (though 4 times slower), but I was wrong: it crashes too.
The M1 build works fine.

GDB on the segmentation fault (I don't really know where to look):

main: work_size = 512240 bytes (0.5 MB)
train_opt_callback: iter=     0 sample=1/446045 sched=0.000000 loss=0.000000 |->
[New Thread 0x7fff7257c6c0 (LWP 774976)]
[New Thread 0x7fff71d7b6c0 (LWP 774977)]
[New Thread 0x7fff7157a6c0 (LWP 774978)]
[Thread 0x7fff7157a6c0 (LWP 774978) exited]
[Thread 0x7fff71d7b6c0 (LWP 774977) exited]
[Thread 0x7fff7257c6c0 (LWP 774976) exited]
train_opt_callback: iter=     1 sample=17/446045 sched=0.010000 loss=10.373476 dt=00:00:05 eta=16:02:24 |->
[New Thread 0x7fff7157a6c0 (LWP 774981)]
[New Thread 0x7fff71d7b6c0 (LWP 774982)]
[New Thread 0x7fff7257c6c0 (LWP 774983)]
[Thread 0x7fff7257c6c0 (LWP 774983) exited]
[Thread 0x7fff71d7b6c0 (LWP 774982) exited]
[Thread 0x7fff7157a6c0 (LWP 774981) exited]
main: total training time: 00:00:11
save_checkpoint_file: saving to checkpoint-1.gguf
save_checkpoint_file: saving to checkpoint-LATEST.gguf
save_llama_model_file: saving to ggml-checkpoint-f32.bin
save_llama_model_file: saving to ggml-checkpoint-f32.bin

Thread 1 "train-text-from" received signal SIGSEGV, Segmentation fault.
ggml_backend_buffer_free (buffer=0x603662) at ggml-backend.c:37
37              buffer->iface.free_buffer(buffer);
(gdb) p buffer
$1 = (ggml_backend_buffer_t) 0x603662
(gdb) p buffer->iface
$2 = {free_buffer = 0x50000000000000, get_base = 0x3f23000000000000, get_alloc_size = 0x9888000000000060, init_tensor = 0x398000007fffffff, free_tensor = 0x36c0000000000060}
(gdb) p buffer->iface.free_buffer
$3 = (void (*)(ggml_backend_buffer_t)) 0x50000000000000
(gdb) x/i 0x50000000000000
   0x50000000000000:    Cannot access memory at address 0x50000000000000

The sanitizers report two errors:

main: input_size = 262152224 bytes (250.0 MB)
ggml-alloc.c:212:32: runtime error: pointer index expression with base 0x000017001240 overflowed to 0xffffffffffffffff
main: compute_size = 1370751584 bytes (1307.3 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: total number of samples: 446045
main: number of training tokens: 446173
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 512240 bytes (0.5 MB)
train_opt_callback: iter=     0 sample=1/446045 sched=0.000000 loss=0.000000 |>
train_opt_callback: iter=     1 sample=17/446045 sched=0.010000 loss=10.373759 dt=00:00:49 eta=5d 16:25:50 |->
main: total training time: 00:01:35
save_checkpoint_file: saving to checkpoint-1.gguf
save_checkpoint_file: saving to checkpoint-LATEST.gguf
save_llama_model_file: saving to ggml-checkpoint-f32.bin
save_llama_model_file: saving to ggml-checkpoint-f32.bin
AddressSanitizer:DEADLYSIGNAL
=================================================================
==781429==ERROR: AddressSanitizer: SEGV on unknown address 0x7f06e145d808 (pc 0x000000d31a29 bp 0x7ffebac51620 sp 0x7ffebac51610 T0)
==781429==The signal is caused by a READ memory access.
    #0 0xd31a29 in ggml_allocr_free /TRASH/llama/llama.cpp/ggml-alloc.c:330
    #1 0x45f339 in main examples/train-text-from-scratch/train-text-from-scratch.cpp:1300
    #2 0x7f06e2e3dacd in __libc_start_call_main (/nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libc.so.6+0x23acd)
    #3 0x7f06e2e3db88 in __libc_start_main_alias_2 (/nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libc.so.6+0x23b88)
    #4 0x46bb94 in _start (/TRASH/llama/llama.cpp/train-text-from-scratch+0x46bb94)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /TRASH/llama/llama.cpp/ggml-alloc.c:330 in ggml_allocr_free
==781429==ABORTING
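
For reference, a sanitizer-instrumented build like the one above can be produced roughly as follows (a sketch that assumes the LLAMA_SANITIZE_ADDRESS / LLAMA_SANITIZE_UNDEFINED switches in llama.cpp's Makefile; the training command is the same one as in the first comment):

make clean
make LLAMA_SANITIZE_ADDRESS=1 LLAMA_SANITIZE_UNDEFINED=1 train-text-from-scratch
./train-text-from-scratch ...       # same arguments as in the first comment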

@ggerganov ggerganov added bug Something isn't working and removed bug-unconfirmed labels Nov 5, 2023
@cebtenzzre cebtenzzre linked a pull request Nov 6, 2023 that will close this issue
@cebtenzzre cebtenzzre removed their assignment Nov 7, 2023