Multi GPU CUDA - 8x performance degradation when splitting tensors -> let's split by layer as an option #4055

Closed
cmp-nct opened this issue Nov 13, 2023 · 17 comments · Fixed by #4766
Labels
enhancement New feature or request

Comments

@cmp-nct
Contributor

cmp-nct commented Nov 13, 2023

Problem:
I am aware everyone has different results. In my case I am running llama.cpp on a 4090 primary and a 3090 secondary, so both are quite capable cards for LLMs.
I am getting around an 800% slowdown when using both cards with the same model and settings (basically regardless of which model I tried): batch processing speed can drop from 2400 t/s to 200-300 t/s (8-10 times slower than on a single GPU).
This happens as soon as any tiny fraction of the processing (via -ts) is shifted to the second card.

I assume it is a synchronization problem in the CUDA loops. I also assume the issue does not affect every combination of GPUs, especially when one GPU is significantly slower.

Suggestion:
My suggestion is to add a parameter like -layer-split: when it is used, the tensors are not split up; instead, whole layers are distributed across the cards (using -ls instead of -ts).
This means all calculations can run without synchronization, each layer on a single GPU at that GPU's highest possible performance.

Caveat:
In theory, tensor splitting should boost performance, since both cards can process parts of the same tensor simultaneously, so it is the better solution. But that is currently so far from reality that the suggested layer split should significantly boost processing speed.
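
To illustrate the idea, here is a minimal sketch of how such a layer split could be assigned; the -ls flag and the split_layers helper are hypothetical, not existing llama.cpp code:

```cpp
// Hypothetical sketch: assign whole layers to GPUs in proportion to a
// user-supplied split (e.g. -ls 60,40), instead of slicing each tensor.
#include <cstdio>
#include <vector>

// Returns, for each layer, the index of the GPU that owns it entirely.
std::vector<int> split_layers(int n_layers, const std::vector<float> & ls) {
    float total = 0.0f;
    for (float f : ls) total += f;

    std::vector<int> owner(n_layers);
    int   layer = 0;
    float acc   = 0.0f;
    for (size_t gpu = 0; gpu < ls.size(); ++gpu) {
        acc += ls[gpu];
        // Cumulative share of layers this GPU is responsible for.
        int end = (int)(n_layers * acc / total + 0.5f);
        for (; layer < end && layer < n_layers; ++layer) {
            owner[layer] = (int)gpu;
        }
    }
    for (; layer < n_layers; ++layer) owner[layer] = (int)ls.size() - 1;
    return owner;
}

int main() {
    // 32 layers split 60/40 between a 4090 and a 3090.
    auto owner = split_layers(32, {60.0f, 40.0f});
    for (int i = 0; i < 32; ++i) printf("layer %2d -> GPU %d\n", i, owner[i]);
}
```

Each layer's tensors then live entirely on one device, so activations only cross the bus at GPU boundaries instead of inside every layer.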

@JohannesGaessler what do you think?

@cmp-nct cmp-nct added the enhancement New feature or request label Nov 13, 2023
@JohannesGaessler
Collaborator

JohannesGaessler commented Nov 13, 2023

See the conversation starting at #3776 (comment). I am aware of the parallelization scheme where the model is split into blocks of layers instead of splitting each layer into slices. As I said before: I have no intention of implementing it. Multi-GPU only really makes sense for running something like 70b, and for that purpose I think the best buys are either multiple P40s or multiple RTX 3090s. For multiple P40s the current scheme works better, while for multiple RTX 3090s NVLink is available, which should also result in low parallelization overhead. Synchronization overhead may also vary by OS: if you use Windows, for example, peer access between devices is only available via NVLink, so the performance for multiple GPUs working on small batches should be worse.
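
For contrast, here is a simplified sketch of the row-wise splitting scheme described above (an illustration of the idea, not the actual ggml-cuda code): each weight matrix is sliced across devices according to the -ts fractions, so every matrix multiplication runs partly on each GPU and the partial results must be brought back together every layer:

```cpp
// Simplified sketch of row-wise tensor splitting (not the real
// ggml-cuda implementation): each GPU gets a contiguous slice of the
// rows of every weight matrix, proportional to the -ts fractions.
#include <cstdio>

int main() {
    const int   n_rows   = 4096;            // rows of one weight matrix
    const float split[2] = {0.6f, 0.4f};    // e.g. -ts 60,40, normalized

    float acc = 0.0f;
    for (int gpu = 0; gpu < 2; ++gpu) {
        int row_lo = (int)(acc * n_rows);
        acc += split[gpu];
        int row_hi = (int)(acc * n_rows);
        // Each matmul computes rows [row_lo, row_hi) on this device;
        // afterwards the slices have to be synchronized into one
        // result tensor, which is where the per-layer communication
        // overhead comes from.
        printf("GPU %d: rows %d..%d\n", gpu, row_lo, row_hi);
    }
}
```

With small batches, that per-layer gather cost can easily dominate, which would explain slowdowns like the one reported above.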

This means all calculations can run without synchronization, each layer on a single GPU at that GPU's highest possible performance.

No, for $N$ identical GPUs serving one request, the maximum theoretical GPU utilization using that scheme is $\frac{1}{N}$, because the GPUs have to wait for each other. The only way to achieve 100% GPU utilization would be to serve multiple concurrent requests in parallel (for example by serving multiple users).
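
To make the bound concrete: with $N = 2$ identical GPUs each holding half of the layers, the second GPU can only start once the first has finished its block, so for a single request each device is busy for only half of the total time $T$ per token:

$$\text{utilization} = \frac{t_\text{busy}}{T} = \frac{T/2}{T} = \frac{1}{2} = \frac{1}{N}$$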

Also: see #3814 and check whether that PR has unintentionally resulted in a performance regression for your hardware setup.

@cmp-nct
Contributor Author

cmp-nct commented Nov 13, 2023

That's a pity. NVLink was deprecated for consumer GPUs in 2022 and is not likely to come back.
I don't think relying on used 3090 GPUs is a viable approach for the future; they are cheap now but will become scarce.

I am aware of the theory, but in practice we see an 800-1000% slowdown with the current implementation of tensor split.
Modern larger models all need a ton of VRAM, which makes llama.cpp useless for them aside from testing purposes; Python solutions are currently much better. For single-GPU use, llama.cpp is quite on par with Python-based inference.

The best outcome would be to fix the synchronization problem; splitting by layers would be a simple solution until synchronization works better.

@jezzarax
Copy link

jezzarax commented Nov 14, 2023

From what I see, there might also be an issue with the produced values being inconsistent between single- and multi-GPU setups.
I have a 2x A100 PCIe machine. Aside from the difference in performance (0.15 ms/token on a single GPU vs 0.52 ms/token on multi-GPU), I'm getting significantly different perplexity results for the same model and dataset: 8.8942 on a single GPU vs 6.4202 on multi-GPU. Logs below.
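
For reference, the perplexity printed by the perplexity example is the standard exponential of the mean negative log-likelihood over the evaluated tokens (this is the textbook definition; the exact chunking details are llama.cpp-specific):

$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{<t})\right)$$

Lower is better, but the point here is not that multi-GPU "improved" the score: two runs of the same FP16 model on the same data should agree closely, so a gap of 8.89 vs 6.42 suggests the multi-GPU path is computing something different.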

Single-GPU:

root@hostname:/opt/datasets/llama.cpp# CUDA_VISIBLE_DEVICES=0 ./perplexity -ngl 100 -m ../gguf-models/Yi-6B/yi-6b.4k.vanilla.f16.gguf -f ../datasets/wikitext-2-raw/wiki.test.raw
main: build = 1515 (36eed0c)
main: built with cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 for x86_64-linux-gnu
main: seed  = 1699970672
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../gguf-models/Yi-6B/yi-6b.4k.vanilla.f16.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 64000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    4:            blk.0.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    8:              blk.0.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   10:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   11:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   13:            blk.1.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   14:              blk.1.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   17:              blk.1.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   18:              blk.1.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   19:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   20:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   21:           blk.10.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   22:           blk.10.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   23:             blk.10.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   24:             blk.10.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   25:        blk.10.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   26:             blk.10.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   27:             blk.10.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   28:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   29:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   30:           blk.11.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   31:           blk.11.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   32:             blk.11.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   33:             blk.11.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   34:        blk.11.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   35:             blk.11.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   36:             blk.11.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   37:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   38:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   39:           blk.12.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   40:           blk.12.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   41:             blk.12.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   42:             blk.12.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   43:        blk.12.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   44:             blk.12.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   45:             blk.12.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   46:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   47:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   48:           blk.13.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   49:           blk.13.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   50:             blk.13.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   51:             blk.13.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   52:        blk.13.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   53:             blk.13.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   54:             blk.13.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   55:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   56:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   57:           blk.14.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   58:           blk.14.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   59:             blk.14.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   60:             blk.14.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   61:        blk.14.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   62:             blk.14.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   63:             blk.14.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   64:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   65:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   66:           blk.15.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   67:           blk.15.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   68:             blk.15.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   69:             blk.15.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   70:        blk.15.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   71:             blk.15.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   72:             blk.15.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   73:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   74:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   75:           blk.16.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   76:           blk.16.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   77:             blk.16.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   78:             blk.16.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   79:        blk.16.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   80:             blk.16.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   81:             blk.16.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   82:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   83:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   84:           blk.17.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   85:           blk.17.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   86:             blk.17.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   87:             blk.17.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   88:        blk.17.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   89:             blk.17.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   90:             blk.17.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   91:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   92:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   93:           blk.18.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   94:           blk.18.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   95:             blk.18.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   96:             blk.18.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   97:        blk.18.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   98:             blk.18.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   99:             blk.18.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  100:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  101:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  102:           blk.19.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  103:           blk.19.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  104:             blk.19.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  105:             blk.19.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  106:        blk.19.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  107:             blk.19.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  108:             blk.19.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  109:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  110:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  111:            blk.2.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  112:            blk.2.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  113:              blk.2.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  114:              blk.2.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  115:         blk.2.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  116:              blk.2.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  117:              blk.2.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  118:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  119:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  120:           blk.20.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  121:           blk.20.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  122:             blk.20.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  123:             blk.20.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  124:        blk.20.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  125:             blk.20.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  126:             blk.20.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  127:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  128:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  129:           blk.21.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  130:           blk.21.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  131:             blk.21.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  132:             blk.21.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  133:        blk.21.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  134:             blk.21.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  135:             blk.21.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  136:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  137:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  138:           blk.22.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  139:           blk.22.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  140:             blk.22.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  141:             blk.22.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  142:        blk.22.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  143:             blk.22.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  144:             blk.22.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  145:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  146:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  147:           blk.23.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  148:           blk.23.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  149:             blk.23.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  150:             blk.23.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  151:        blk.23.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  152:             blk.23.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  153:             blk.23.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  154:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  155:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  156:           blk.24.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  157:           blk.24.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  158:             blk.24.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  159:             blk.24.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  160:        blk.24.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  161:             blk.24.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  162:             blk.24.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  163:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  164:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  165:           blk.25.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  166:           blk.25.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  167:             blk.25.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  168:             blk.25.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  169:        blk.25.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  170:             blk.25.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  171:             blk.25.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  172:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  173:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  174:           blk.26.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  175:           blk.26.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  176:             blk.26.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  177:             blk.26.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  178:        blk.26.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  179:             blk.26.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  180:             blk.26.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  181:             blk.27.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  182:        blk.27.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  183:             blk.27.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  184:             blk.27.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  185:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  186:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  187:            blk.3.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  188:            blk.3.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  189:              blk.3.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  190:              blk.3.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  191:         blk.3.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  192:              blk.3.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  193:              blk.3.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  194:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  195:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  196:            blk.4.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  197:            blk.4.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  198:              blk.4.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  199:              blk.4.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  200:         blk.4.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  201:              blk.4.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  202:              blk.4.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  203:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  204:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  205:            blk.5.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  206:            blk.5.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  207:              blk.5.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  208:              blk.5.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  209:         blk.5.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  210:              blk.5.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  211:              blk.5.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  212:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  213:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  214:            blk.6.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  215:            blk.6.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  216:              blk.6.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  217:              blk.6.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  218:         blk.6.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  219:              blk.6.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  220:              blk.6.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  221:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  222:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  223:            blk.7.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  224:            blk.7.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  225:              blk.7.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  226:              blk.7.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  227:         blk.7.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  228:              blk.7.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  229:              blk.7.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  230:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  231:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  232:            blk.8.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  233:            blk.8.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  234:              blk.8.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  235:              blk.8.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  236:         blk.8.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  237:              blk.8.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  238:              blk.8.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  239:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  240:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  241:            blk.9.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  242:            blk.9.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  243:              blk.9.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  244:              blk.9.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  245:         blk.9.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  246:              blk.9.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  247:              blk.9.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  248:                    output.weight f16      [  4096, 64000,     1,     1 ]
llama_model_loader: - tensor  249:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  251:           blk.27.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  253:             blk.27.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  254:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  255:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  256:           blk.28.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  259:             blk.28.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  260:        blk.28.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  261:             blk.28.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  262:             blk.28.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  263:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  264:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  265:           blk.29.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  268:             blk.29.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  269:        blk.29.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  270:             blk.29.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  271:             blk.29.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  272:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  273:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  274:           blk.30.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  277:             blk.30.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  278:        blk.30.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  279:             blk.30.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  280:             blk.30.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  281:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  282:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  283:           blk.31.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  286:             blk.31.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  287:        blk.31.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  290:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 64000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 6.06 B
llm_load_print_meta: model size       = 11.29 GiB (16.00 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<|startoftext|>'
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 315 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  500.11 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 11061.02 MB
..............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 32.00 MB
llama_new_context_with_model: kv self size  =   32.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 134.56 MB
llama_new_context_with_model: VRAM scratch buffer: 133.00 MB
llama_new_context_with_model: total VRAM used: 11226.02 MB (model: 11061.02 MB, context: 165.00 MB)

system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 844.664 ms
perplexity: calculating perplexity over 656 chunks, batch_size=512
perplexity: 0.12 seconds per pass - ETA 1.35 minutes
[1]9.9801,[2]8.4253,[3]8.4032,[4]8.9842,[5]8.6767,[6]8.1543,[7]8.1036,[8]8.7699, ... [652]8.8956,[653]8.8992,[654]8.9113,[655]8.8920,[656]8.8942,
Final estimate: PPL = 8.8942 +/- 0.05851

llama_print_timings:        load time =     958.13 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   51543.97 ms / 335872 tokens (    0.15 ms per token,  6516.22 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   66956.00 ms

Multi-GPU:

root@hostname:/opt/datasets/llama.cpp# ./perplexity -ngl 100 -m ../gguf-models/Yi-6B/yi-6b.4k.vanilla.f16.gguf -f ../datasets/wikitext-2-raw/wiki.test.raw
main: build = 1515 (36eed0c)
main: built with cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 for x86_64-linux-gnu
main: seed  = 1699970757
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0
  Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../gguf-models/Yi-6B/yi-6b.4k.vanilla.f16.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 64000,     1,     1 ]
...
llama_model_loader: - tensor   93:           blk.18.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   94:           blk.18.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   95:             blk.18.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   96:             blk.18.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor   97:        blk.18.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   98:             blk.18.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   99:             blk.18.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  100:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  101:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  102:           blk.19.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  103:           blk.19.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  104:             blk.19.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  105:             blk.19.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  106:        blk.19.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  107:             blk.19.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  108:             blk.19.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  109:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  110:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  111:            blk.2.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  112:            blk.2.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  113:              blk.2.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  114:              blk.2.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  115:         blk.2.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  116:              blk.2.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  117:              blk.2.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  118:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  119:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  120:           blk.20.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  121:           blk.20.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  122:             blk.20.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  123:             blk.20.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  124:        blk.20.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  125:             blk.20.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  126:             blk.20.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  127:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  128:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  129:           blk.21.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  130:           blk.21.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  131:             blk.21.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  132:             blk.21.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  133:        blk.21.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  134:             blk.21.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  135:             blk.21.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  136:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  137:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  138:           blk.22.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  139:           blk.22.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  140:             blk.22.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  141:             blk.22.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  142:        blk.22.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  143:             blk.22.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  144:             blk.22.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  145:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  146:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  147:           blk.23.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  148:           blk.23.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  149:             blk.23.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  150:             blk.23.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  151:        blk.23.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  152:             blk.23.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  153:             blk.23.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  154:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  155:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  156:           blk.24.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  157:           blk.24.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  158:             blk.24.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  159:             blk.24.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  160:        blk.24.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  161:             blk.24.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  162:             blk.24.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  163:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  164:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  165:           blk.25.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  166:           blk.25.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  167:             blk.25.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  168:             blk.25.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  169:        blk.25.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  170:             blk.25.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  171:             blk.25.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  172:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  173:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  174:           blk.26.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  175:           blk.26.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  176:             blk.26.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  177:             blk.26.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  178:        blk.26.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  179:             blk.26.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  180:             blk.26.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  181:             blk.27.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  182:        blk.27.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  183:             blk.27.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  184:             blk.27.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  185:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  186:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  187:            blk.3.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  188:            blk.3.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  189:              blk.3.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  190:              blk.3.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  191:         blk.3.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  192:              blk.3.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  193:              blk.3.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  194:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  195:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  196:            blk.4.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  197:            blk.4.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  198:              blk.4.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  199:              blk.4.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  200:         blk.4.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  201:              blk.4.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  202:              blk.4.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  203:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  204:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  205:            blk.5.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  206:            blk.5.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  207:              blk.5.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  208:              blk.5.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  209:         blk.5.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  210:              blk.5.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  211:              blk.5.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  212:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  213:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  214:            blk.6.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  215:            blk.6.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  216:              blk.6.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  217:              blk.6.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  218:         blk.6.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  219:              blk.6.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  220:              blk.6.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  221:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  222:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  223:            blk.7.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  224:            blk.7.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  225:              blk.7.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  226:              blk.7.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  227:         blk.7.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  228:              blk.7.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  229:              blk.7.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  230:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  231:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  232:            blk.8.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  233:            blk.8.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  234:              blk.8.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  235:              blk.8.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  236:         blk.8.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  237:              blk.8.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  238:              blk.8.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  239:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  240:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  241:            blk.9.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  242:            blk.9.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  243:              blk.9.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  244:              blk.9.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  245:         blk.9.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  246:              blk.9.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  247:              blk.9.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  248:                    output.weight f16      [  4096, 64000,     1,     1 ]
llama_model_loader: - tensor  249:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  251:           blk.27.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  253:             blk.27.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  254:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  255:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  256:           blk.28.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  259:             blk.28.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  260:        blk.28.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  261:             blk.28.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  262:             blk.28.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  263:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  264:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  265:           blk.29.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  268:             blk.29.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  269:        blk.29.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  270:             blk.29.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  271:             blk.29.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  272:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  273:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  274:           blk.30.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  277:             blk.30.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  278:        blk.30.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  279:             blk.30.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  280:             blk.30.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  281:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  282:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  283:           blk.31.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  286:             blk.31.attn_k.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  287:        blk.31.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.attn_v.weight f16      [  4096,   512,     1,     1 ]
llama_model_loader: - tensor  290:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 64000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 6.06 B
llm_load_print_meta: model size       = 11.29 GiB (16.00 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<|startoftext|>'
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 315 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A100 80GB PCIe) as main device
llm_load_tensors: mem required  =  500.11 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 11061.02 MB
..............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 32.00 MB
llama_new_context_with_model: kv self size  =   32.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 134.56 MB
llama_new_context_with_model: VRAM scratch buffer: 133.00 MB
llama_new_context_with_model: total VRAM used: 11226.02 MB (model: 11061.02 MB, context: 165.00 MB)

system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 886.483 ms
perplexity: calculating perplexity over 656 chunks, batch_size=512
perplexity: 0.39 seconds per pass - ETA 4.28 minutes
[1]4.1101,[2]5.2892,[3]6.1509,[4]7.0355,[5]7.1662,[6]6.9472,[7]7.0714,[8]7.0884,[9]7.4015,[10]7.6309,[11]7.8408,[12]7.7778,[13]7.7804,[14]7.9562,[15]8.1729,[16]7.8097,[17]7.5996,[18]7.6182,[19]7.2514,[20]7.2082,[21]6.9958,[22]6.8169,[23]6.7855,
[24]6.6950,[25]6.6803,[26]6.4759,[27]6.2788,[28]6.1777,[29]6.0674,[30]5.9095,[31]5.8489,[32]5.9136,[33]5.8641,[34]5.9280,[35]5.9055,[36]5.9408,[37]5.9753,[38]6.0378,[39]6.1048,[40]6.1871,[41]6.2686,[42]6.3316,[43]6.2526,[44]6.2948,[45]6.2628,
[46]6.2376,[47]6.2734,[48]6.2135,[49]6.1710,[50]6.1878,[51]6.1781,[52]6.2557,[53]6.2825,[54]6.3145,[55]6.3138,[56]6.3030,[57]6.2741,[58]6.3294,[59]6.3551,[60]6.3781,[61]6.3648,[62]6.4143,[63]6.3933,[64]6.3806,[65]6.4179,[66]6.4024,[67]6.4145,
[68]6.4448,[69]6.4702,[70]6.5005,[71]6.5442,[72]6.5965,[73]6.6497,[74]6.6664,[75]6.6733,[76]6.7030,[77]6.7083,[78]6.6984,[79]6.7087,[80]6.7516,[81]6.7619,[82]6.7532,[83]6.7325,[84]6.7156,[85]6.6654,[86]6.6655,[87]6.6630,[88]6.6174,[89]6.6185,[90]6.5958,
[91]6.6158,[92]6.6224,[93]6.6454,[94]6.6394,[95]6.6494,[96]6.6732,[97]6.6932,[98]6.6944,[99]6.7011,[100]6.6968,[101]6.7506,[102]6.7448,[103]6.7304,[104]6.7453,[105]6.7355,[106]6.7462,[107]6.7438,[108]6.7300,[109]6.7585,[110]6.7531,[111]6.7723,
[112]6.8087,[113]6.8304,[114]6.8252,[115]6.8333,[116]6.8150,[117]6.8025,[118]6.8006,[119]6.8153,[120]6.8386,[121]6.8801,[122]6.8808,[123]6.9221,[124]6.9332,[125]6.9434,[126]6.9390,[127]6.9703,[128]6.9785,[129]6.9733,[130]6.9491,[131]6.9340,[132]6.9298,
[133]6.9194,[134]6.9193,[135]6.9038,[136]6.8990,[137]6.8738,[138]6.8554,[139]6.8383,[140]6.8277,[141]6.8178,[142]6.8044,[143]6.7938,[144]6.7708,[145]6.7448,[146]6.7366,[147]6.7255,[148]6.7360,[149]6.7351,[150]6.7309,[151]6.7318,[152]6.7272,
[153]6.7113,[154]6.6923,[155]6.6828,[156]6.6863,[157]6.6859,[158]6.7147,[159]6.7131,[160]6.7146,[161]6.7473,[162]6.7450,[163]6.7198,[164]6.6988,[165]6.6694,[166]6.6421,[167]6.6039,[168]6.5580,[169]6.5267,[170]6.5047,[171]6.4969,[172]6.4693,[173]6.4607,
[174]6.4414,[175]6.4094,[176]6.3876,[177]6.3668,[178]6.3372,[179]6.3108,[180]6.2985,[181]6.2880,[182]6.2668,[183]6.2653,[184]6.2643,[185]6.2556,[186]6.2966,[187]6.2983,[188]6.2959,[189]6.3008,[190]6.3101,[191]6.3234,[192]6.3438,[193]6.3646,
[194]6.3793,[195]6.4147,[196]6.4300,[197]6.4462,[198]6.4819,[199]6.5143,[200]6.5152,[201]6.5093,[202]6.5256,[203]6.5465,[204]6.5545,[205]6.5578,[206]6.5589,[207]6.5576,[208]6.5616,[209]6.5733,[210]6.5680,[211]6.5715,[212]6.5784,[213]6.5959,
[214]6.6033,[215]6.5958,[216]6.6089,[217]6.6276,[218]6.6391,[219]6.6417,[220]6.6433,[221]6.6341,[222]6.6286,[223]6.6133,[224]6.6029,[225]6.5872,[226]6.6104,[227]6.6193,[228]6.6306,[229]6.6366,[230]6.6276,[231]6.6343,[232]6.6213,[233]6.5947,[234]6.5972,
[235]6.5902,[236]6.5870,[237]6.5874,[238]6.5821,[239]6.5700,[240]6.5586,[241]6.5590,[242]6.5522,[243]6.5434,[244]6.5294,[245]6.5134,[246]6.4890,[247]6.4806,[248]6.4713,[249]6.4594,[250]6.4508,[251]6.4432,[252]6.4284,[253]6.4173,[254]6.4155,
[255]6.4029,[256]6.3859,[257]6.3723,[258]6.3648,[259]6.3571,[260]6.3505,[261]6.3453,[262]6.3388,[263]6.3297,[264]6.3136,[265]6.3139,[266]6.3103,[267]6.3045,[268]6.3116,[269]6.3106,[270]6.3062,[271]6.3056,[272]6.3107,[273]6.3241,[274]6.3267,[275]6.3374,
[276]6.3392,[277]6.3449,[278]6.3587,[279]6.3645,[280]6.3657,[281]6.3754,[282]6.3807,[283]6.3958,[284]6.4059,[285]6.4113,[286]6.4311,[287]6.4346,[288]6.4412,[289]6.4300,[290]6.4126,[291]6.3999,[292]6.3875,[293]6.3751,[294]6.3760,[295]6.3779,
[296]6.3780,[297]6.3785,[298]6.3818,[299]6.3756,[300]6.3674,[301]6.3667,[302]6.3613,[303]6.3534,[304]6.3464,[305]6.3427,[306]6.3327,[307]6.3292,[308]6.3250,[309]6.3140,[310]6.3081,[311]6.3008,[312]6.2995,[313]6.2909,[314]6.2901,[315]6.2754,[316]6.2728,
[317]6.2545,[318]6.2340,[319]6.2435,[320]6.2480,[321]6.2475,[322]6.2464,[323]6.2393,[324]6.2400,[325]6.2480,[326]6.2547,[327]6.2541,[328]6.2560,[329]6.2598,[330]6.2668,[331]6.2781,[332]6.2802,[333]6.2851,[334]6.2738,[335]6.2731,[336]6.2725,
[337]6.2736,[338]6.2705,[339]6.2673,[340]6.2615,[341]6.2674,[342]6.2704,[343]6.2705,[344]6.2693,[345]6.2699,[346]6.2674,[347]6.2729,[348]6.2775,[349]6.2769,[350]6.2791,[351]6.2822,[352]6.2807,[353]6.2733,[354]6.2761,[355]6.2826,[356]6.2887,[357]6.2870,
[358]6.3022,[359]6.3051,[360]6.3048,[361]6.2994,[362]6.3105,[363]6.3234,[364]6.3365,[365]6.3392,[366]6.3373,[367]6.3409,[368]6.3325,[369]6.3381,[370]6.3363,[371]6.3272,[372]6.3303,[373]6.3343,[374]6.3306,[375]6.3276,[376]6.3362,[377]6.3279,
[378]6.3270,[379]6.3249,[380]6.3169,[381]6.3105,[382]6.3040,[383]6.2952,[384]6.2993,[385]6.2957,[386]6.2977,[387]6.2955,[388]6.2893,[389]6.2838,[390]6.2789,[391]6.2707,[392]6.2648,[393]6.2593,[394]6.2602,[395]6.2546,[396]6.2515,[397]6.2572,
[398]6.2628,[399]6.2676,[400]6.2679,[401]6.2630,[402]6.2666,[403]6.2665,[404]6.2682,[405]6.2652,[406]6.2641,[407]6.2667,[408]6.2724,[409]6.2806,[410]6.2945,[411]6.3031,[412]6.3205,[413]6.3312,[414]6.3410,[415]6.3496,[416]6.3576,[417]6.3696,[418]6.3715,
[419]6.3765,[420]6.3869,[421]6.3974,[422]6.3991,[423]6.4060,[424]6.4163,[425]6.4269,[426]6.4325,[427]6.4331,[428]6.4420,[429]6.4453,[430]6.4552,[431]6.4696,[432]6.4705,[433]6.4662,[434]6.4554,[435]6.4547,[436]6.4547,[437]6.4656,[438]6.4755,
[439]6.4715,[440]6.4655,[441]6.4618,[442]6.4599,[443]6.4634,[444]6.4672,[445]6.4670,[446]6.4710,[447]6.4772,[448]6.4834,[449]6.4884,[450]6.4852,[451]6.4805,[452]6.4696,[453]6.4695,[454]6.4623,[455]6.4680,[456]6.4723,[457]6.4756,[458]6.4755,[459]6.4747,
[460]6.4848,[461]6.4824,[462]6.4826,[463]6.4868,[464]6.4836,[465]6.4803,[466]6.4739,[467]6.4769,[468]6.4769,[469]6.4825,[470]6.4826,[471]6.4778,[472]6.4775,[473]6.4689,[474]6.4723,[475]6.4690,[476]6.4680,[477]6.4601,[478]6.4594,[479]6.4702,
[480]6.4764,[481]6.4797,[482]6.4821,[483]6.4803,[484]6.4809,[485]6.4819,[486]6.4727,[487]6.4726,[488]6.4718,[489]6.4646,[490]6.4655,[491]6.4638,[492]6.4603,[493]6.4577,[494]6.4546,[495]6.4543,[496]6.4515,[497]6.4495,[498]6.4497,[499]6.4450,
[500]6.4349,[501]6.4280,[502]6.4283,[503]6.4278,[504]6.4188,[505]6.4183,[506]6.4181,[507]6.4125,[508]6.4078,[509]6.4064,[510]6.4072,[511]6.4114,[512]6.4142,[513]6.4165,[514]6.4214,[515]6.4173,[516]6.4159,[517]6.4163,[518]6.4166,[519]6.4213,
[520]6.4240,[521]6.4245,[522]6.4289,[523]6.4309,[524]6.4348,[525]6.4401,[526]6.4405,[527]6.4437,[528]6.4374,[529]6.4358,[530]6.4318,[531]6.4281,[532]6.4302,[533]6.4356,[534]6.4351,[535]6.4312,[536]6.4306,[537]6.4291,[538]6.4318,[539]6.4403,[540]6.4442,
[541]6.4414,[542]6.4420,[543]6.4468,[544]6.4493,[545]6.4475,[546]6.4474,[547]6.4433,[548]6.4398,[549]6.4392,[550]6.4374,[551]6.4325,[552]6.4317,[553]6.4241,[554]6.4213,[555]6.4170,[556]6.4147,[557]6.4102,[558]6.4068,[559]6.4085,[560]6.4044,
[561]6.4057,[562]6.4062,[563]6.4012,[564]6.4071,[565]6.4116,[566]6.4106,[567]6.4085,[568]6.4077,[569]6.4052,[570]6.4057,[571]6.4025,[572]6.4045,[573]6.4037,[574]6.4059,[575]6.4015,[576]6.3986,[577]6.3970,[578]6.3987,[579]6.3999,[580]6.4001,[581]6.3977,
[582]6.3979,[583]6.3954,[584]6.3944,[585]6.3903,[586]6.3875,[587]6.3866,[588]6.3845,[589]6.3850,[590]6.3923,[591]6.3908,[592]6.3904,[593]6.3854,[594]6.3849,[595]6.3834,[596]6.3868,[597]6.3859,[598]6.3861,[599]6.3860,[600]6.3897,[601]6.3901,
[602]6.3904,[603]6.3919,[604]6.3940,[605]6.3974,[606]6.4035,[607]6.4010,[608]6.3915,[609]6.3894,[610]6.3935,[611]6.3932,[612]6.3976,[613]6.3951,[614]6.3904,[615]6.3847,[616]6.3878,[617]6.3840,[618]6.3769,[619]6.3719,[620]6.3650,[621]6.3550,[622]6.3525,
[623]6.3553,[624]6.3588,[625]6.3610,[626]6.3626,[627]6.3658,[628]6.3675,[629]6.3726,[630]6.3788,[631]6.3829,[632]6.3884,[633]6.3919,[634]6.3983,[635]6.3989,[636]6.4018,[637]6.3980,[638]6.3976,[639]6.3947,[640]6.3952,[641]6.3969,[642]6.3995,
[643]6.4027,[644]6.4049,[645]6.4046,[646]6.4050,[647]6.4064,[648]6.4123,[649]6.4151,[650]6.4161,[651]6.4192,[652]6.4248,[653]6.4304,[654]6.4353,[655]6.4248,[656]6.4202,
Final estimate: PPL = 6.4202 +/- 0.03892

llama_print_timings:        load time =    1082.88 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  174588.07 ms / 335872 tokens (    0.52 ms per token,  1923.80 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  190700.33 ms

Edited by JG: use <details> when dumping logs into a conversation; this is probably an entirely different issue anyway.

@KerfuffleV2
Collaborator

@jezzarax There's something really, really weird going on here. According to your logs you get 8+ ppl for single GPU and ~6.4 for multi-GPU, which is a gigantic difference. Also, multi-GPU is the "weird" scenario, yet apparently the more typical setup is the one giving the unexpected result. I'm very skeptical that 8+ is correct; the 6.4 sounds much more reasonable.

I don't know anything about multi-GPU so I can't help diagnose the actual problem.

@cmp-nct
Contributor Author

cmp-nct commented Nov 14, 2023

@jezzarax There's something really, really weird going on here. According to your logs you get 8+ ppl for single GPU and ~6.4 for multi-GPU, which is a gigantic difference. Also, multi-GPU is the "weird" scenario, yet apparently the more typical setup is the one giving the unexpected result. I'm very skeptical that 8+ is correct; the 6.4 sounds much more reasonable.

I don't know anything about multi-GPU so I can't help diagnose the actual problem.

I also assume something weird is happening in addition to the performance problem.

  1. This could be related to tensor shapes. When doing ggllm.cpp I had a few fixes/changes in how tensors are split; originally, some operations could silently produce zero tensors without an error. You could try a different -ts to see whether perplexity reacts to it. If it reacts, you'd know it's a tensor-shape issue (and can file a dedicated bug). See the example after this list.

  2. Also, the codebase in ggllm.cpp (an optimized ggml/CUDA fork of an older version) did not suffer from the same performance degradation; it was maybe 1.5 times slower in multi-GPU than in single GPU (still bad, but not 5-10 times slower).
    There have been a lot of changes in how synchronization is done and how CUDA runs. Single-GPU speed improved, but multi-GPU speed dropped.

  3. I recall analyzing how broadcasted CUDA operations worked: each tensor calculation would involve thousands of loops until finished, and under tensor split every one of those loops made a GPU synchronization call.
    I'm sure that could be improved by tackling the operation differently.
    The simple solution I suggested (layer split) would replace those tens of thousands of synchronizations with one memory copy at the end of each device's block of layers, though I don't know what the performance end result would be.
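
As a concrete form of the -ts test from point 1 (a sketch only; the model path, dataset file, and split ratios below are placeholders, not values from this thread), compare perplexity across a few splits; if the number moves with the split, the tensor slicing itself is suspect:

./perplexity -m models/model-f16.gguf -f wiki.test.raw -ngl 99 -ts 1,0   # everything on GPU 0
./perplexity -m models/model-f16.gguf -f wiki.test.raw -ngl 99 -ts 1,1   # even split across both cards
./perplexity -m models/model-f16.gguf -f wiki.test.raw -ngl 99 -ts 3,1   # lopsided split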

I think that, given the high-quality state of llama.cpp, and with new models like Llama 2 70B and Falcon 180B open for our use, it is quite important to get multi-GPU working better and close the performance gap to Python.

@KerfuffleV2
Collaborator

The case where they got the unexpected result was for single GPU, as far as I could see. That's what makes it so weird.

@JohannesGaessler
Collaborator

Also, the codebase in ggllm.cpp (an optimized ggml/CUDA fork of an older version) did not suffer from the same performance degradation; it was maybe 1.5 times slower in multi-GPU than in single GPU (still bad, but not 5-10 times slower).
There have been a lot of changes in how synchronization is done and how CUDA runs. Single-GPU speed improved, but multi-GPU speed dropped.

As I said before:

Also: see #3814 and check whether that PR has unintentionally resulted in a performance regression for your hardware setup.

@ggerganov
Owner

Regarding multi-GPU:

  • Pipeline parallelism will be supported, likely when we integrate the backend interface (see the sketch after this list)
  • Tensor parallelism seems to have unreasonably poor performance atm. My last experiments were here: Running on an A100 node #3359. The main issue is that we don't have convenient access to multi-GPU hardware for development. Renting in the cloud is an option that I would likely explore in the future
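
To make the layer-split (pipeline) idea concrete, here is a minimal CUDA sketch, under stated assumptions: run_layer is a hypothetical stand-in for one transformer block's forward pass (it is not a llama.cpp or ggml API), dev_hidden holds one activation buffer per device, and error checking is omitted. The point is that cross-device traffic shrinks to a single hidden-state copy per device boundary, rather than a synchronization inside every broadcasted operation:

#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-in for one transformer block's forward pass on the
// currently selected device; a real implementation would launch kernels here.
static void run_layer(int layer, float * hidden, size_t hidden_bytes) {
    (void) layer; (void) hidden; (void) hidden_bytes;
}

// Assign contiguous blocks of layers to GPUs; the only inter-GPU traffic is
// one activation copy at each device boundary.
void forward_layer_split(float * dev_hidden[], int n_layers, int n_gpus, size_t hidden_bytes) {
    const int layers_per_gpu = (n_layers + n_gpus - 1) / n_gpus;
    for (int l = 0; l < n_layers; ++l) {
        const int gpu = l / layers_per_gpu;
        cudaSetDevice(gpu);
        run_layer(l, dev_hidden[gpu], hidden_bytes);
        if ((l + 1) % layers_per_gpu == 0 && gpu + 1 < n_gpus) {
            // hand the hidden state to the next device and move on
            cudaMemcpyPeer(dev_hidden[gpu + 1], gpu + 1, dev_hidden[gpu], gpu, hidden_bytes);
        }
    }
}

Whether this actually beats the per-op synchronization of tensor split in practice is exactly what the thread is debating; the sketch only shows why the synchronization count drops.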

Regarding the ppl differences:

We need to understand what is going on there.

  • @jezzarax could you do a CPU-only run for a few iterations to see if it matches either one of the GPU runs?
  • Could someone else run a single-GPU ppl run for this model and post the results?

@jezzarax

Regarding the ppl differences:

We need to understand what is going on there.

  • @jezzarax could you do a CPU-only run for a few iterations to see if it matches either one of the GPU runs?
  • Could someone else run a single-GPU ppl run for this model and post the results?

I can do both; I've got access to a 1x node as well.

Would -ngl 0 work as a CPU-only run, or would it be better to rebuild from scratch without cuBLAS?

@KerfuffleV2
Collaborator

@jezzarax

would it be better to rebuild from scratch without cuBLAS?

You'd need to build without GPU support; prompt processing (which is all perplexity does) still uses the GPU even without any layers offloaded.
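
For reference, a CPU-only rebuild with the Makefile of that era looked roughly like this (an assumption, not an exact recipe; the model filename is a placeholder and the flags may differ for CMake builds):

make clean
make                                   # no LLAMA_CUBLAS=1, so no CUDA code path is compiled in
./perplexity -m models/yi-6b-f16.gguf -f wiki.test.raw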

@cmp-nct
Contributor Author

cmp-nct commented Nov 14, 2023

export CUDA_VISIBLE_DEVICES="-1"      # bash/zsh: note, no spaces around "="
$env:CUDA_VISIBLE_DEVICES = "-1"      # PowerShell

That should enumerate zero devices to the CUDA backend, so nothing can be initialized on or sent to a GPU.
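
If a cuBLAS build does degrade gracefully with zero visible devices (worth verifying, given the rebuild caveat above), the variable can also be set inline for a one-off run in POSIX shells; the paths are placeholders:

CUDA_VISIBLE_DEVICES=-1 ./perplexity -m models/yi-6b-f16.gguf -f wiki.test.raw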

@ggerganov
Owner

Likely a bug was introduced in 4760e7c

@jezzarax

I made multiple runs over two commits and two quantisation levels: one commit from two-ish weeks ago and one from yesterday. It looks like there's something strange about f16; the q8 results seem more consistent.

| GPUs | model | quantization | commit | perplexity | runtime (ms) |
| --- | --- | --- | --- | --- | --- |
| 0 | yi-6b | f16 | 2756c4f | 8.855 | 1300173.65 |
| 1 | yi-6b | f16 | 2756c4f | 8.8942 | |
| 2 | yi-6b | f16 | 2756c4f | 6.4202 | 206444.33 |
| CPU only build | yi-6b | f16 | 2756c4f | 6.4308 | 4693429.04 |
| 0 | yi-6b | q8 | 2756c4f | 7.509 | 693382.52 |
| 1 | yi-6b | q8 | 2756c4f | 7.508 | 92870.73 |
| 2 | yi-6b | q8 | 2756c4f | 7.5214 | 191602.34 |
| 1 | yi-6b | f16 | 6bb4908 | 8.8942 | 73072.7 |
| 2 | yi-6b | f16 | 6bb4908 | 6.4202 | 189718.74 |
| CPU only build | yi-6b | f16 | 6bb4908 | 6.4308 | 4738153.19 |
| 0 | yi-6b | q8 | 6bb4908 | 7.5152 | 4091137.73 |
| 1 | yi-6b | q8 | 6bb4908 | 7.508 | 94022.87 |
| 2 | yi-6b | q8 | 6bb4908 | 7.5215 | 186745.06 |
| CPU only build | yi-6b | q8 | 6bb4908 | 7.5152 | 4089037.75 |

I'm not able to run f16 on bigger models with the current version of the code for now, due to #3930 (comment)

If there are any other tests I can run on the multi-A100 setup, I'm happy to contribute.

@Ph0rk0z

Ph0rk0z commented Nov 18, 2023

I am not running batch, but I obtain performance comparable to exllama on 3090s and the best multi-GPU P40 speeds.

It certainly beats transformers with accelerate or autogptq. I reach speeds similar to Metal for large models like Falcon with 2 or 3 P40s and 2x3090.

I know that pipeline-style approaches were tried with llama_inference_offload in the original GPTQ version. They did speed things up past the normal 2 or 3 t/s that would come from using accelerate, but nowhere near to this.

This is all using the MMQ kernels, though. The new batch kernels did not improve speeds, even on Ampere. Could the eventual Vulkan backend be faster than cuBLAS?

I am just really confused how people could call multi-GPU in llama.cpp "bad" compared to all the other options. The only time I get slowdowns is prompt processing, and I'm not aware of how to use the kv_cache token swapping that koboldcpp does, or whether it exists here.

@cmp-nct
Contributor Author

cmp-nct commented Nov 18, 2023

I am not running batch, but I obtain performance comparable to exllama on 3090s and the best multi-GPU P40 speeds.

It certainly beats transformers with accelerate or autogptq. I reach speeds similar to Metal for large models like Falcon with 2 or 3 P40s and 2x3090.

I know that pipeline-style approaches were tried with llama_inference_offload in the original GPTQ version. They did speed things up past the normal 2 or 3 t/s that would come from using accelerate, but nowhere near to this.

This is all using the MMQ kernels, though. The new batch kernels did not improve speeds, even on Ampere. Could the eventual Vulkan backend be faster than cuBLAS?

I am just really confused how people could call multi-GPU in llama.cpp "bad" compared to all the other options. The only time I get slowdowns is prompt processing, and I'm not aware of how to use the kv_cache token swapping that koboldcpp does, or whether it exists here.

When 2400 tokens/second drops to 300 tokens/second on the same model, despite using twice the processing hardware, we have a problem that needs solving. That's almost an order of magnitude of performance lost by adding a second card.
That was the reason I raised the topic: inference speed on multi-GPU is far too slow when using fast GPUs.

I didn't intend to trigger emotions when I used the term "bad" in my later comment, just to point at the problem.

@Ph0rk0z

Ph0rk0z commented Nov 19, 2023

It's not emotion. It's just my experience with it. Splitting a model over multiple GPUs will always lower performance compared to a single GPU with contiguous memory. Have you tried any other inference engines that don't drop so badly, and what was the ratio for 1 card vs 2?

@jezzarax

It's not only about the performance drop. The numbers differ between single- and multi-GPU runs; please check the table I posted above. Producing correct results is crucial.
