Model doesn't want to load #1128

Open
musikowskipawel opened this issue Sep 18, 2024 · 2 comments

Comments

@musikowskipawel

I have an i5-6400, 16 GB RAM, and an RTX 3060 Ti 8 GB.

I'm trying to load the model:
LLaMA2-13B-Tiefighter.Q4_K_S.gguf

I launch koboldcpp.exe (exactly this version) and set GPU Layers to 41 as recommended (with -1 the issue is exactly the same, though).

Then I click Launch and select my model. Some text appears, but no change in VRAM, memory, or CPU usage is observed.

The log:


***
Welcome to KoboldCpp - Version 1.74
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend...

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(benchmark=None, blasbatchsize=512, blasthreads=2, chatcompletionsadapter=None, config=None, contextsize=4096, debugmode=0, flashattention=False, forceversion=0, foreground=False, gpulayers=41, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=True, lora=None, mmproj=None, model='', model_param='C:/Users/Paul/Desktop/text-generation-webui-main/models/LLaMA2-13B-Tiefighter.Q4_K_S.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdthreads=2, sdvae='', sdvaeauto=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=2, unpack='', useclblast=None, usecublas=['normal', '0', 'mmq'], usemlock=False, usevulkan=None, whispermodel='')
==========
Loading model: C:\Users\Paul\Desktop\text-generation-webui-main\models\LLaMA2-13B-Tiefighter.Q4_K_S.gguf

The reported GGUF Arch is: llama
Arch Category: 0

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from C:\Users\Paul\Desktop\text-generation-webui-main\models\LLaMA2-13B-Tiefighter.Q4_K_S.gguf
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.90 GiB (4.56 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
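
For reference, the GUI settings described above correspond roughly to the command line below. The flag names and values are read off the Namespace dump in the log (model_param, gpulayers=41, usecublas=['normal', '0', 'mmq'], contextsize=4096, threads=2, launch=True); treat it as a sketch of an equivalent invocation, not necessarily the exact call the launcher makes:

koboldcpp.exe --model "C:\Users\Paul\Desktop\text-generation-webui-main\models\LLaMA2-13B-Tiefighter.Q4_K_S.gguf" --gpulayers 41 --usecublas normal 0 mmq --contextsize 4096 --threads 2 --launch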

@LostRuins
Owner

Hi, the terminal could have been accidentally paused. When this happens, please try closing and re-opening KoboldCpp.

@henk717

henk717 commented Sep 21, 2024

When this happens, your driver is compiling the support required to run KoboldCpp; this is normal, especially on the CUDA 11 version of KoboldCpp. koboldcpp_cu12.exe is faster at this, but contrary to what people advise, the best solution is actually to leave it open and let it do its thing. It will get unstuck once the driver has compiled everything it needs to run the CUDA 11 version on top of CUDA 12.
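
If you leave it running, one way to confirm that the model did eventually load is to query the local API on the port shown in the log (5001). The /api/v1/model endpoint is assumed here from the standard KoboldAI API that KoboldCpp exposes; it should return the loaded model name once loading has finished:

curl http://localhost:5001/api/v1/model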
