Huge performance degradation using latest branch on Intel Core Ultra 7 155H #8328

Closed

aahouzi opened this issue Jul 5, 2024 · 15 comments

Labels: bug-unconfirmed, medium severity, stale

aahouzi (Contributor) commented Jul 5, 2024

Type of issue

  • I conducted some benchmarks on an Intel Core Ultra 7 155H about 3 months ago using release b2568, and these are the results I obtained for llama-2-7B-Q4_0.gguf:
system_info: n_threads = 18 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1


 Building a website can be done in 10 simple steps:\nStep 1: Choosing a Web Hosting Company\n\nStep 2: Creating an Account\n\nStep 3: Choose a Domain\n\nStep 4: Design Your Website\n\nStep 5: Register a Domain Name\n\nStep 6: Set Up a Website Template\n\nStep 7: Link Your Domain to the Website\n\nStep 8: Use a Content Management System\n\nStep 9: Set Up Your Website's Email Addresses\n\nStep 10: Check the Website's SEO and Security\n\nStep 11: Link to Other Social Media Platforms\n\nStep 12: Add SEO to Your Website\n\nStep 13: Add More Content to Your Website\n\nStep 14: Start Publishing Regularly\n\nStep 15: Get Feedback and Make Improvements\n\n\n
Whether you’re a beginner or an expert, you can build a website using a simple tool like Wix. Wix makes it easy to design and publish your website, regardless of your technical expertise.
llama_print_timings:        load time =    1416.75 ms
llama_print_timings:      sample time =       6.75 ms /   256 runs   (    0.03 ms per token, 37897.85 tokens per second)
llama_print_timings: prompt eval time =     857.20 ms /    19 tokens (   45.12 ms per token,    22.17 tokens per second)
llama_print_timings:        eval time =   22132.23 ms /   255 runs   (   86.79 ms per token,    11.52 tokens per second)
llama_print_timings:       total time =   23052.86 ms /   274 tokens
Log end
  • Using the latest branch, I observe a drop in next-token generation throughput of about 2.4 tok/s:
system_info: n_threads = 18 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 256, n_keep = 1


 Building a website can be done in 10 simple steps:
Step 1: Select your website type
Step 2: Choose a domain name
Step 4: Build your site
Step 5: Connect your site to your domain
Step 6: Install a content management system (CMS)
Step 7: Optimize the website
Step 8: Promote the website
Step 10: Maintain the site
How long does it take to create a website from scratch?
Can you learn to code in 10 days?
How can I build a website in 7 days?
Can I learn how to code in 10 days?
How long does it take to learn HTML and CSS?
How long does it take to learn JavaScript?
A website is a collection of web pages and associated files that are hosted on a server. The pages are typically written in HTML (hypertext markup language) and linked to each other by hypertext links.
Websites can be either static or dynamic. A static website consists of a single page with no interactive components, while a dynamic website can be updated and changed without the need for a web developer.
There are many different types of websites, but the most common are:
-Personal websites: These are typically created
llama_print_timings:        load time =    2072.39 ms
llama_print_timings:      sample time =      10.37 ms /   256 runs   (    0.04 ms per token, 24686.60 tokens per second)
llama_print_timings: prompt eval time =     876.59 ms /    19 tokens (   46.14 ms per token,    21.67 tokens per second)
llama_print_timings:        eval time =   27753.15 ms /   255 runs   (  108.84 ms per token,     9.19 tokens per second)
llama_print_timings:       total time =   28721.61 ms /   274 tokens
Log end
  • Is performance monitored across different hardware when new code changes are introduced? It's great to get better performance, but not at the cost of degrading it on other ranges of hardware...

Name and Version

./llama-cli.exe release b3317 vs ./main.exe release b2568

What operating system are you seeing the problem on?

Windows 11

Relevant log output

See issue description

aahouzi added the bug-unconfirmed and medium severity labels Jul 5, 2024
ggerganov (Owner) commented:

This CPU has only 6 performance cores - how is the speed using -t 6? Use llama-bench for more reliable stats
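
For reference, a llama-bench invocation along these lines would pin the test to 6 threads (the model path is illustrative; -t sets the thread count, -n the number of generated tokens for the tg test, and -r the number of repetitions):

.\llama-bench.exe -m llama-2-7b.Q4_0.gguf -t 6 -n 256 -r 5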

aahouzi (Contributor, Author) commented Jul 5, 2024

@ggerganov sure, here are the results with llama-bench:

| model         |     size | params | backend | threads | test  |         t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ----------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       6 | tg256 | 6.44 ± 0.84 |

build: be55134 (2568)

| model         |     size | params | backend | threads | test  |         t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ----------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       6 | tg256 | 4.36 ± 0.45 |

build: 51d2ebad (3303)

fairydreaming (Collaborator) commented:

@aahouzi With the older release you have a context length of 512, and with the latest release you have a context length of 4096. At some point llama.cpp started using the longest possible context length by default, so simply set it to a lower value and see if this restores performance.
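
As a sketch (the model path is a placeholder and the prompt is the one from the original run; -c / --ctx-size sets the context length in llama-cli, so something like the following should reproduce the old 512-token setting):

.\llama-cli.exe -m llama-2-7b.Q4_0.gguf -c 512 -t 6 -n 256 -p "Building a website can be done in 10 simple steps:"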

aahouzi (Contributor, Author) commented Jul 5, 2024

@fairydreaming Thanks for the suggestion. I tried with a context length of 512, but unfortunately this doesn't improve performance. I think the issue might be deeper than that, any guess?

fairydreaming (Collaborator) commented:

@aahouzi Can you attach main.log from both releases? I'd like to see if there are any other differences.

fairydreaming (Collaborator) commented:

@aahouzi By the way, I tried the older release you mentioned and the current master on my machine. With the older release I got:

llama_print_timings:        load time =     444.17 ms
llama_print_timings:      sample time =       4.60 ms /    96 runs   (    0.05 ms per token, 20874.10 tokens per second)
llama_print_timings: prompt eval time =    8149.21 ms /   103 tokens (   79.12 ms per token,    12.64 tokens per second)
llama_print_timings:        eval time =   24476.30 ms /    95 runs   (  257.65 ms per token,     3.88 tokens per second)
llama_print_timings:       total time =   32708.52 ms /   198 tokens

With the current one:

llama_print_timings:        load time =     344.67 ms
llama_print_timings:      sample time =       4.89 ms /    96 runs   (    0.05 ms per token, 19639.93 tokens per second)
llama_print_timings: prompt eval time =    4405.51 ms /   104 tokens (   42.36 ms per token,    23.61 tokens per second)
llama_print_timings:        eval time =   20069.47 ms /    95 runs   (  211.26 ms per token,     4.73 tokens per second)
llama_print_timings:       total time =   24549.47 ms /   199 tokens

This is LLaMA-3 70B Q8_0 with a 512 context on an Epyc 9374F; the current master is clearly faster, at least on my workstation.

aahouzi (Contributor, Author) commented Jul 5, 2024

> @aahouzi By the way I tried the older release you mentioned on my machine and the current master. With the older release I got:
>
> llama_print_timings:        load time =     444.17 ms
> llama_print_timings:      sample time =       4.60 ms /    96 runs   (    0.05 ms per token, 20874.10 tokens per second)
> llama_print_timings: prompt eval time =    8149.21 ms /   103 tokens (   79.12 ms per token,    12.64 tokens per second)
> llama_print_timings:        eval time =   24476.30 ms /    95 runs   (  257.65 ms per token,     3.88 tokens per second)
> llama_print_timings:       total time =   32708.52 ms /   198 tokens
>
> With the current one:
>
> llama_print_timings:        load time =     344.67 ms
> llama_print_timings:      sample time =       4.89 ms /    96 runs   (    0.05 ms per token, 19639.93 tokens per second)
> llama_print_timings: prompt eval time =    4405.51 ms /   104 tokens (   42.36 ms per token,    23.61 tokens per second)
> llama_print_timings:        eval time =   20069.47 ms /    95 runs   (  211.26 ms per token,     4.73 tokens per second)
> llama_print_timings:       total time =   24549.47 ms /   199 tokens
>
> This is LLaMA-3 70B Q8_0 with 512 context on Epyc 9374F, the current master is clearly faster - at least on my workstation.

Again, your tests are on a CPU workstation, so it's different from Core Ultra in many ways ^^'

> @aahouzi Can you attach main.log from both releases? I'd like to see if there are any other differences.

Log start
main: build = 3303 (51d2ebad)
main: built with MSVC 19.40.33811.0 for x64
main: seed  = 0
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 18 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1


 Building a website can be done in 10 simple steps:
Step 1: Select your website type
Step 2: Choose a domain name
Step 4: Build your site
Step 5: Connect your site to your domain
Step 6: Install a content management system (CMS)
Step 7: Optimize the website
Step 8: Promote the website
Step 10: Maintain the site
How long does it take to create a website from scratch?
Can you learn to code in 10 days?
How can I build a website in 7 days?
Can I learn how to code in 10 days?
How long does it take to learn HTML and CSS?
How long does it take to learn JavaScript?
A website is a collection of web pages and associated files that are hosted on a server. The pages are typically written in HTML (hypertext markup language) and linked to each other by hypertext links.
Websites can be either static or dynamic. A static website consists of a single page with no interactive components, while a dynamic website can be updated and changed without the need for a web developer.
There are many different types of websites, but the most common are:
-Personal websites: These are typically created
llama_print_timings:        load time =    1882.35 ms
llama_print_timings:      sample time =      11.29 ms /   256 runs   (    0.04 ms per token, 22678.95 tokens per second)
llama_print_timings: prompt eval time =    1455.39 ms /    19 tokens (   76.60 ms per token,    13.05 tokens per second)
llama_print_timings:        eval time =   32702.99 ms /   255 runs   (  128.25 ms per token,     7.80 tokens per second)
llama_print_timings:       total time =   34253.19 ms /   274 tokens
Log end
Log start
main: build = 2568 (be55134a)
main: built with MSVC 19.40.33811.0 for x64
main: seed  = 1720194315
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ..\..\llama.cpp\llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph nodes  = 1062
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 18 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1


 Building a website can be done in 10 simple steps:\nStep 1: Choose Your Domain\nStep 2: Find a Web Host\nStep 3: Register Your Domain\nStep 4: Decide What You Want to Do with WordPress\nStep 5: Install WordPress on Your Server\nStep 6: Customize the Theme\nStep 7: Add and Customize a Plugin\nStep 8: Customize Your Plugin\nStep 9: Customize the Look and Feel\nStep 10: Make It Look Good\n"
\end{code}

Comment: What is your question?

Comment: The question is: How do I get to Step 10?

Comment: @BillAlanSmith - I'm not sure what you're asking, but [this](http://www.1and1.com/blog/make-your-website/web-design/10-steps-to-making-a-website-and-blog-for-beginners-with-wordpress-430.html) is a very simple and concise guide to creating your own website.

Comment: I'm looking for a good tutorial or guide that
llama_print_timings:        load time =    1395.34 ms
llama_print_timings:      sample time =       7.39 ms /   256 runs   (    0.03 ms per token, 34660.17 tokens per second)
llama_print_timings: prompt eval time =     834.47 ms /    19 tokens (   43.92 ms per token,    22.77 tokens per second)
llama_print_timings:        eval time =   21893.34 ms /   255 runs   (   85.86 ms per token,    11.65 tokens per second)
llama_print_timings:       total time =   22798.42 ms /   274 tokens
Log end

fairydreaming (Collaborator) commented:

@aahouzi I don't see any obvious problems, so I guess the only thing left to do is to use bisection to find the release that introduced the performance degradation for your system. Alternatively, you can try testing the releases that introduced changes in sgemm. Possible candidates are, for example, b2715 and b2816.
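
A rough bisection workflow, assuming a source checkout of llama.cpp (release numbers such as b2568 correspond to git tags; the exact build commands and binary paths vary across this range of releases):

git bisect start
git bisect bad b3303      # slow build
git bisect good b2568     # fast build
# at each step: rebuild, rerun the same llama-bench test as above,
# then mark the checked-out commit accordingly
git bisect good           # or: git bisect bad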

aahouzi (Contributor, Author) commented Jul 8, 2024

@fairydreaming After running a bisection between b2568 and b3303, it seems the regression was introduced in b2715, more specifically in the commit resulting from #6796.

@jart Tagging you here since this was your PR, in case you have any suggestions to remove the performance penalty shown below :)

| model         | build |     size | params | backend | threads | test  |         t/s |
| ------------- | ----- | -------: | -----: | ------- | ------: | ----- | ----------: |
| llama 7B Q4_0 | b2568 | 3.56 GiB | 6.74 B | CPU     |       6 | tg256 | 6.44 ± 0.84 |
| llama 7B Q4_0 | b3303 | 3.56 GiB | 6.74 B | CPU     |       6 | tg256 | 4.36 ± 0.45 |

@ggerganov I was wondering: when new changes are introduced in the gemm functions, for instance, does your CI measure performance on multiple client hardware configurations or only on Mac?

ggerganov (Owner) commented:

We measure the performance manually - there is no CI for that

zhangts20 commented:

@aahouzi Hello, can you share some performance comparison between CPU and SYCL on Intel Core Ultra 7 155H?

NeoZhangJianyu (Collaborator) commented:

@aahouzi Let me check it.

NeoZhangJianyu (Collaborator) commented:

@aahouzi
I tested on an Intel Core Ultra 7 155H with Windows 11.
I ran the command examples\sycl\win-run-llama2.bat

Old version b2568:
8.7 - 9.05 tokens per second

Latest version:
9.63 tokens per second

There are several PRs that noticeably increase performance on Arc 770.
They help on MTL too.

I know the MTL GPU driver impacts performance more.
Could you confirm it?

aahouzi (Contributor, Author) commented Aug 5, 2024

@NeoZhangJianyu Please read the details of the issue; this is unrelated to what I'm describing. The problem is CPU performance degradation, not GPU. See the numbers I shared as well.

github-actions bot added the stale label Sep 5, 2024

github-actions bot commented:

This issue was closed because it has been inactive for 14 days since being marked as stale.
