Huge performance degradation using latest branch on Intel Core Ultra 7 155H #8328

Closed

aahouzi opened this issue Jul 5, 2024 · 15 comments

Labels: bug-unconfirmed, medium severity, stale

aahouzi (Contributor) commented Jul 5, 2024

Type of issue

  • I conducted some benchmarks on an Intel Core Ultra 7 155H about 3 months ago using release b2568, and these are the results I obtained for llama-2-7B-Q4_0.gguf:
system_info: n_threads = 18 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1


 Building a website can be done in 10 simple steps:\nStep 1: Choosing a Web Hosting Company\n\nStep 2: Creating an Account\n\nStep 3: Choose a Domain\n\nStep 4: Design Your Website\n\nStep 5: Register a Domain Name\n\nStep 6: Set Up a Website Template\n\nStep 7: Link Your Domain to the Website\n\nStep 8: Use a Content Management System\n\nStep 9: Set Up Your Website's Email Addresses\n\nStep 10: Check the Website's SEO and Security\n\nStep 11: Link to Other Social Media Platforms\n\nStep 12: Add SEO to Your Website\n\nStep 13: Add More Content to Your Website\n\nStep 14: Start Publishing Regularly\n\nStep 15: Get Feedback and Make Improvements\n\n\n
Whether you’re a beginner or an expert, you can build a website using a simple tool like Wix. Wix makes it easy to design and publish your website, regardless of your technical expertise.
llama_print_timings:        load time =    1416.75 ms
llama_print_timings:      sample time =       6.75 ms /   256 runs   (    0.03 ms per token, 37897.85 tokens per second)
llama_print_timings: prompt eval time =     857.20 ms /    19 tokens (   45.12 ms per token,    22.17 tokens per second)
llama_print_timings:        eval time =   22132.23 ms /   255 runs   (   86.79 ms per token,    11.52 tokens per second)
llama_print_timings:       total time =   23052.86 ms /   274 tokens
Log end
  • Using the latest branch, I observe a drop in next-token generation throughput of about 2.4 tok/s:
system_info: n_threads = 18 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 256, n_keep = 1


 Building a website can be done in 10 simple steps:
Step 1: Select your website type
Step 2: Choose a domain name
Step 4: Build your site
Step 5: Connect your site to your domain
Step 6: Install a content management system (CMS)
Step 7: Optimize the website
Step 8: Promote the website
Step 10: Maintain the site
How long does it take to create a website from scratch?
Can you learn to code in 10 days?
How can I build a website in 7 days?
Can I learn how to code in 10 days?
How long does it take to learn HTML and CSS?
How long does it take to learn JavaScript?
A website is a collection of web pages and associated files that are hosted on a server. The pages are typically written in HTML (hypertext markup language) and linked to each other by hypertext links.
Websites can be either static or dynamic. A static website consists of a single page with no interactive components, while a dynamic website can be updated and changed without the need for a web developer.
There are many different types of websites, but the most common are:
-Personal websites: These are typically created
llama_print_timings:        load time =    2072.39 ms
llama_print_timings:      sample time =      10.37 ms /   256 runs   (    0.04 ms per token, 24686.60 tokens per second)
llama_print_timings: prompt eval time =     876.59 ms /    19 tokens (   46.14 ms per token,    21.67 tokens per second)
llama_print_timings:        eval time =   27753.15 ms /   255 runs   (  108.84 ms per token,     9.19 tokens per second)
llama_print_timings:       total time =   28721.61 ms /   274 tokens
Log end
  • Is performance monitored across different hardware when new code changes are introduced? It's great to get better performance, but not at the cost of degrading it on other ranges of hardware...

Name and Version

./llama-cli.exe release b3317 vs ./main.exe release b2568

What operating system are you seeing the problem on?

Windows 11

Relevant log output

See issue description

aahouzi added the bug-unconfirmed and medium severity labels Jul 5, 2024
ggerganov (Owner) commented:

This CPU has only 6 performance cores - how is the speed using -t 6? Use llama-bench for more reliable stats
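
For reference, a llama-bench invocation along these lines would pin the test to 6 threads (the model path is illustrative; -t sets the thread count, -n the number of generated tokens for the tg test, and -r the number of repetitions):

.\llama-bench.exe -m llama-2-7b.Q4_0.gguf -t 6 -n 256 -r 5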

aahouzi (Contributor, Author) commented Jul 5, 2024

@ggerganov sure, here are the results with llama-bench:

| model         |     size | params | backend | threads | test  |         t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ----------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       6 | tg256 | 6.44 ± 0.84 |

build: be55134 (2568)

| model         |     size | params | backend | threads | test  |         t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ----------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       6 | tg256 | 4.36 ± 0.45 |

build: 51d2ebad (3303)

fairydreaming (Collaborator) commented:

@aahouzi With the older release you have a context length of 512, and with the latest release you have a context length of 4096. At some point llama.cpp started using the longest possible context length by default, so simply set it to a lower value and see if this restores performance.
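
As a sketch (the model path is a placeholder and the prompt is the one from the original run; -c / --ctx-size sets the context length in llama-cli, so something like the following should reproduce the old 512-token setting):

.\llama-cli.exe -m llama-2-7b.Q4_0.gguf -c 512 -t 6 -n 256 -p "Building a website can be done in 10 simple steps:"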

aahouzi (Contributor, Author) commented Jul 5, 2024

@fairydreaming Thanks for the suggestion. I tried with a context length of 512, but unfortunately this doesn't improve performance. I think the issue might be deeper than that, any guess?

fairydreaming (Collaborator) commented:

@aahouzi Can you attach main.log from both releases? I'd like to see if there are any other differences.

fairydreaming (Collaborator) commented:

@aahouzi By the way, I tried the older release you mentioned and the current master on my machine. With the older release I got:

llama_print_timings:        load time =     444.17 ms
llama_print_timings:      sample time =       4.60 ms /    96 runs   (    0.05 ms per token, 20874.10 tokens per second)
llama_print_timings: prompt eval time =    8149.21 ms /   103 tokens (   79.12 ms per token,    12.64 tokens per second)
llama_print_timings:        eval time =   24476.30 ms /    95 runs   (  257.65 ms per token,     3.88 tokens per second)
llama_print_timings:       total time =   32708.52 ms /   198 tokens

With the current one:

llama_print_timings:        load time =     344.67 ms
llama_print_timings:      sample time =       4.89 ms /    96 runs   (    0.05 ms per token, 19639.93 tokens per second)
llama_print_timings: prompt eval time =    4405.51 ms /   104 tokens (   42.36 ms per token,    23.61 tokens per second)
llama_print_timings:        eval time =   20069.47 ms /    95 runs   (  211.26 ms per token,     4.73 tokens per second)
llama_print_timings:       total time =   24549.47 ms /   199 tokens

This is LLaMA-3 70B Q8_0 with a 512 context on an Epyc 9374F; the current master is clearly faster, at least on my workstation.

aahouzi (Contributor, Author) commented Jul 5, 2024

> @aahouzi By the way I tried the older release you mentioned on my machine and the current master. With the older release I got:
>
> llama_print_timings:        load time =     444.17 ms
> llama_print_timings:      sample time =       4.60 ms /    96 runs   (    0.05 ms per token, 20874.10 tokens per second)
> llama_print_timings: prompt eval time =    8149.21 ms /   103 tokens (   79.12 ms per token,    12.64 tokens per second)
> llama_print_timings:        eval time =   24476.30 ms /    95 runs   (  257.65 ms per token,     3.88 tokens per second)
> llama_print_timings:       total time =   32708.52 ms /   198 tokens
>
> With the current one:
>
> llama_print_timings:        load time =     344.67 ms
> llama_print_timings:      sample time =       4.89 ms /    96 runs   (    0.05 ms per token, 19639.93 tokens per second)
> llama_print_timings: prompt eval time =    4405.51 ms /   104 tokens (   42.36 ms per token,    23.61 tokens per second)
> llama_print_timings:        eval time =   20069.47 ms /    95 runs   (  211.26 ms per token,     4.73 tokens per second)
> llama_print_timings:       total time =   24549.47 ms /   199 tokens
>
> This is LLaMA-3 70B Q8_0 with 512 context on Epyc 9374F, the current master is clearly faster - at least on my workstation.

Again, your tests are on a CPU workstation, so it's different from Core Ultra in many ways ^^'

> @aahouzi Can you attach main.log from both releases? I'd like to see if there are any other differences.

Log start
main: build = 3303 (51d2ebad)
main: built with MSVC 19.40.33811.0 for x64
main: seed  = 0
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 18 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1


 Building a website can be done in 10 simple steps:
Step 1: Select your website type
Step 2: Choose a domain name
Step 4: Build your site
Step 5: Connect your site to your domain
Step 6: Install a content management system (CMS)
Step 7: Optimize the website
Step 8: Promote the website
Step 10: Maintain the site
How long does it take to create a website from scratch?
Can you learn to code in 10 days?
How can I build a website in 7 days?
Can I learn how to code in 10 days?
How long does it take to learn HTML and CSS?
How long does it take to learn JavaScript?
A website is a collection of web pages and associated files that are hosted on a server. The pages are typically written in HTML (hypertext markup language) and linked to each other by hypertext links.
Websites can be either static or dynamic. A static website consists of a single page with no interactive components, while a dynamic website can be updated and changed without the need for a web developer.
There are many different types of websites, but the most common are:
-Personal websites: These are typically created
llama_print_timings:        load time =    1882.35 ms
llama_print_timings:      sample time =      11.29 ms /   256 runs   (    0.04 ms per token, 22678.95 tokens per second)
llama_print_timings: prompt eval time =    1455.39 ms /    19 tokens (   76.60 ms per token,    13.05 tokens per second)
llama_print_timings:        eval time =   32702.99 ms /   255 runs   (  128.25 ms per token,     7.80 tokens per second)
llama_print_timings:       total time =   34253.19 ms /   274 tokens
Log end
Log start
main: build = 2568 (be55134a)
main: built with MSVC 19.40.33811.0 for x64
main: seed  = 1720194315
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ..\..\llama.cpp\llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph nodes  = 1062
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 18 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1


 Building a website can be done in 10 simple steps:\nStep 1: Choose Your Domain\nStep 2: Find a Web Host\nStep 3: Register Your Domain\nStep 4: Decide What You Want to Do with WordPress\nStep 5: Install WordPress on Your Server\nStep 6: Customize the Theme\nStep 7: Add and Customize a Plugin\nStep 8: Customize Your Plugin\nStep 9: Customize the Look and Feel\nStep 10: Make It Look Good\n"
\end{code}

Comment: What is your question?

Comment: The question is: How do I get to Step 10?

Comment: @BillAlanSmith - I'm not sure what you're asking, but [this](http://www.1and1.com/blog/make-your-website/web-design/10-steps-to-making-a-website-and-blog-for-beginners-with-wordpress-430.html) is a very simple and concise guide to creating your own website.

Comment: I'm looking for a good tutorial or guide that
llama_print_timings:        load time =    1395.34 ms
llama_print_timings:      sample time =       7.39 ms /   256 runs   (    0.03 ms per token, 34660.17 tokens per second)
llama_print_timings: prompt eval time =     834.47 ms /    19 tokens (   43.92 ms per token,    22.77 tokens per second)
llama_print_timings:        eval time =   21893.34 ms /   255 runs   (   85.86 ms per token,    11.65 tokens per second)
llama_print_timings:       total time =   22798.42 ms /   274 tokens
Log end

fairydreaming (Collaborator) commented:

@aahouzi I don't see any obvious problems, so I guess the only thing left to do is to use bisection to find the release that introduced the performance degradation for your system. Alternatively, you can try testing the releases that introduced changes in sgemm. Possible candidates are, for example, b2715 and b2816.
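
A rough bisection workflow, assuming a source checkout of llama.cpp (release numbers such as b2568 correspond to git tags; the exact build commands and binary paths vary across this range of releases):

git bisect start
git bisect bad b3303      # slow build
git bisect good b2568     # fast build
# at each step: rebuild, rerun the same llama-bench test as above,
# then mark the checked-out commit accordingly
git bisect good           # or: git bisect bad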

aahouzi (Contributor, Author) commented Jul 8, 2024

@fairydreaming After running a bisection between b2568 and b3303, it seems the regression was introduced in b2715, more specifically in the commit resulting from #6796.

@jart Tagging you here since this was your PR, in case you have any suggestions to remove the performance penalty shown below :)

| model         | build |     size | params | backend | threads | test  |         t/s |
| ------------- | ----- | -------: | -----: | ------- | ------: | ----- | ----------: |
| llama 7B Q4_0 | b2568 | 3.56 GiB | 6.74 B | CPU     |       6 | tg256 | 6.44 ± 0.84 |
| llama 7B Q4_0 | b3303 | 3.56 GiB | 6.74 B | CPU     |       6 | tg256 | 4.36 ± 0.45 |

@ggerganov I was wondering: when new changes are introduced in the gemm functions, for instance, does your CI measure performance on multiple client hardware configurations or only on Mac?

ggerganov (Owner) commented:

We measure the performance manually - there is no CI for that

zhangts20 commented:

@aahouzi Hello, can you share some performance comparison between CPU and SYCL on Intel Core Ultra 7 155H?

NeoZhangJianyu (Collaborator) commented:

@aahouzi Let me check it.

NeoZhangJianyu (Collaborator) commented:

@aahouzi
I tested on an Intel Core Ultra 7 155H with Windows 11.
I ran the command examples\sycl\win-run-llama2.bat

Old version b2568:
8.7 - 9.05 tokens per second

Latest version:
9.63 tokens per second

There are several PRs that noticeably increase performance on Arc 770.
They help on MTL too.

I know the MTL GPU driver impacts performance more.
Could you confirm it?

aahouzi (Contributor, Author) commented Aug 5, 2024

@NeoZhangJianyu Please read the details of the issue; this is unrelated to what I'm describing. The problem is CPU performance degradation, not GPU. See the numbers I shared as well.

github-actions bot added the stale label Sep 5, 2024

github-actions bot commented:

This issue was closed because it has been inactive for 14 days since being marked as stale.
