
Refactor babyllama example to use llama2.c as a submodule #2911

Merged: 1 commit into master, Jan 27, 2024

Conversation

@mreso (Collaborator) commented on Jan 27, 2024

Description

This PR adds llama2.c as a git submodule instead of keeping a modified copy of its code. This makes it possible to

  1. pick up future changes to the upstream repository
  2. integrate llama2.c forks (e.g. llama2.so) more easily

A rough sketch of the corresponding submodule workflow follows this list.
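
For reference, the general git workflow for this kind of change looks roughly like the sketch below. The submodule path `cpp/third-party/llama2.c` is an illustrative assumption, not taken from the diff; check `.gitmodules` in this PR for the actual location.

```bash
# Add llama2.c as a submodule (the path below is an assumption for
# illustration; see .gitmodules in this PR for the real one).
git submodule add https://github.com/karpathy/llama2.c cpp/third-party/llama2.c

# Fresh clones need to pull the submodule as well:
git clone --recurse-submodules https://github.com/pytorch/serve.git
# or, in an existing checkout:
git submodule update --init --recursive
```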

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Refactor

Feature/Issue validation/testing

Please describe the unit or integration tests that you ran to verify your changes and summarize the relevant results. Provide instructions so that the tests can be reproduced.
Please also list any relevant details of your test configuration.

  • Test A: C++ unit test suite
torchserve_cpp build is complete. To run the unit tests:
./_build/test/torchserve_cpp_test
(see the filter example after the log below for re-running a single test)
Running main() from /home/ubuntu/serve/cpp/_build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 46 tests from 11 test suites.
[----------] Global test environment set-up.
[----------] 1 test from BackendIntegTest
[ RUN      ] BackendIntegTest.TestOTFProtocolAndHandler
I0127 00:29:54.328921 400042 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:79.581624|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706315394,reqi
I0127 00:29:54.329008 400042 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:79.581624|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706315394,reqi
[       OK ] BackendIntegTest.TestOTFProtocolAndHandler (106 ms)
[----------] 1 test from BackendIntegTest (106 ms total)

[----------] 8 tests from OTFMessageTest
[ RUN      ] OTFMessageTest.TestRetieveCmd
[       OK ] OTFMessageTest.TestRetieveCmd (0 ms)
[ RUN      ] OTFMessageTest.TestEncodeLoadModelResponse
[       OK ] OTFMessageTest.TestEncodeLoadModelResponse (0 ms)
[ RUN      ] OTFMessageTest.TestUTF8EncodeLoadModelResponse
[       OK ] OTFMessageTest.TestUTF8EncodeLoadModelResponse (0 ms)
[ RUN      ] OTFMessageTest.TestRetrieveMsgLoadGpu
[       OK ] OTFMessageTest.TestRetrieveMsgLoadGpu (0 ms)
[ RUN      ] OTFMessageTest.TestRetrieveMsgLoadNoGpu
[       OK ] OTFMessageTest.TestRetrieveMsgLoadNoGpu (0 ms)
[ RUN      ] OTFMessageTest.TestEncodeSuccessInferenceResponse
[       OK ] OTFMessageTest.TestEncodeSuccessInferenceResponse (0 ms)
[ RUN      ] OTFMessageTest.TestEncodeFailureInferenceResponse
E0127 00:29:54.331107 400042 otf_message_test.cc:157] result_size: 120
[       OK ] OTFMessageTest.TestEncodeFailureInferenceResponse (0 ms)
[ RUN      ] OTFMessageTest.TestRetrieveInferenceMsg
[       OK ] OTFMessageTest.TestRetrieveInferenceMsg (0 ms)
[----------] 8 tests from OTFMessageTest (0 ms total)

[----------] 8 tests from ModelPredictTest
[ RUN      ] ModelPredictTest.TestLoadPredictBabyLlamaHandler
Total number of tokens generated: 332
Achieved tok per sec: 176.034
Generated String:  Hello my name is
The little girl, who was three years old, was playing in the garden. She saw a big, red tomato and wanted to pick it. She reached out her hand and grabbed it.
Suddenly, a voice said, "Hey! That's my tomato!"
The little girl looked up and saw a big, angry bird. She was scared and started to cry.
The bird said, "That tomato is mine! I'm going to eat it!"
The little girl was very scared and ran away. She was so sad that she didn't get to eat the tomato.
The bird flew away and the little girl was left alone in the garden. She was very sad and cried all the way home.
<s>

Generated String:  Hello my name is
The little girl, who was three years old, was playing in the garden. She saw a big, red tomato and wanted to pick it. She reached out her hand and grabbed it.
Suddenly, a voice said, "Hey! That's my tomato!"
The little girl looked up and saw a big, angry bird. She was scared and started to cry.
The bird said, "That tomato is mine! I'm going to eat it!"
The little girl was very scared and ran away. She was so sad that she didn't get to eat the tomato.
The bird flew away and the little girl was left alone in the garden. She was very sad and cried all the way home.
<s>

I0127 00:29:56.239158 400042 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:1902.926161|#ModelName:babyllama,Level:Model|#hostname:ip-172-31-55-226,1706315396,llm_ts_0,llm_ts_1
I0127 00:29:56.239181 400042 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:1902.926161|#ModelName:babyllama,Level:Model|#hostname:ip-172-31-55-226,1706315396,llm_ts_0,llm_ts_1
[       OK ] ModelPredictTest.TestLoadPredictBabyLlamaHandler (1909 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictLlmHandler
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from test/resources/examples/llamacpp/llamacpp_handler/llama-2-7b-chat.Q5_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 8
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 4.33 GiB (5.52 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4435.49 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph splits (measure): 1
E0127 00:29:56.505548 400042 llamacpp_handler.cc:16] Context initialized successfully

llama_print_timings:        load time =     394.93 ms
llama_print_timings:      sample time =       2.38 ms /    64 runs   (    0.04 ms per token, 26845.64 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   10504.53 ms /    64 runs   (  164.13 ms per token,     6.09 tokens per second)
llama_print_timings:       total time =   10793.23 ms /    65 tokens
I0127 00:30:07.068119 400042 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:10664.231547|#ModelName:llamacpp,Level:Model|#hostname:ip-172-31-55-226,1706315407,llm_ts_0,llm_ts_1
I0127 00:30:07.068143 400042 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:10664.231547|#ModelName:llamacpp,Level:Model|#hostname:ip-172-31-55-226,1706315407,llm_ts_0,llm_ts_1
[       OK ] ModelPredictTest.TestLoadPredictLlmHandler (10906 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictBaseHandler
I0127 00:30:07.163043 400042 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:5.382311|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706315407,mnist_ts_0,mnist_ts_1
I0127 00:30:07.163074 400042 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:5.382311|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706315407,mnist_ts_0,mnist_ts_1
[       OK ] ModelPredictTest.TestLoadPredictBaseHandler (15 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictMnistHandler
I0127 00:30:07.178177 400042 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:4.789629|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706315407,mnist_ts_0,mnist_ts_1
I0127 00:30:07.178213 400042 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:4.789629|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706315407,mnist_ts_0,mnist_ts_1
[       OK ] ModelPredictTest.TestLoadPredictMnistHandler (15 ms)
[ RUN      ] ModelPredictTest.TestBackendInitWrongModelDir
E0127 00:30:07.178795 400042 model_archive.cc:53] Failed to init Manifest from: test/resources/examples/mnist/MAR-INF/MANIFEST.json
[       OK ] ModelPredictTest.TestBackendInitWrongModelDir (0 ms)
[ RUN      ] ModelPredictTest.TestBackendInitWrongHandler
[       OK ] ModelPredictTest.TestBackendInitWrongHandler (0 ms)
[ RUN      ] ModelPredictTest.TestLoadModelFailure
E0127 00:30:07.184425 400042 torch_scripted_handler.cc:22] loading the model: mnist_scripted_v2, device id: -1, error: open file failed because of errno 2 on fopen: , file path: test/resources/examples/mnist/wrong_model/mnist_script.pt
[       OK ] ModelPredictTest.TestLoadModelFailure (5 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictMnistHandlerFailure
E0127 00:30:07.198074 400042 base_handler.cc:154] Failed to load tensor for request id: mnist_ts_0, c10 error: PytorchStreamReader failed reading zip archive: failed finding central directory
E0127 00:30:07.202507 400042 base_handler.cc:154] Failed to load tensor for request id: mnist_ts_1, c10 error: PytorchStreamReader failed reading zip archive: failed finding central directory
E0127 00:30:07.204546 400042 base_handler.cc:51] Failed to handle this batch after: Preprocessing
[       OK ] ModelPredictTest.TestLoadPredictMnistHandlerFailure (20 ms)
[----------] 8 tests from ModelPredictTest (12873 ms total)

[----------] 1 test from DLLoaderTest
[ RUN      ] DLLoaderTest.TestGetInstance
[       OK ] DLLoaderTest.TestGetInstance (0 ms)
[----------] 1 test from DLLoaderTest (0 ms total)

[----------] 3 tests from LoggingTest
[ RUN      ] LoggingTest.TestIncorrectLogInitialization
[       OK ] LoggingTest.TestIncorrectLogInitialization (0 ms)
[ RUN      ] LoggingTest.TestJSONConfigLogInitialization
[       OK ] LoggingTest.TestJSONConfigLogInitialization (0 ms)
[ RUN      ] LoggingTest.TestFileLogInitialization
[       OK ] LoggingTest.TestFileLogInitialization (0 ms)
[----------] 3 tests from LoggingTest (0 ms total)

[----------] 6 tests from TSLogMetricTest
[ RUN      ] TSLogMetricTest.TestCounterMetric
[       OK ] TSLogMetricTest.TestCounterMetric (1 ms)
[ RUN      ] TSLogMetricTest.TestGaugeMetric
[       OK ] TSLogMetricTest.TestGaugeMetric (1 ms)
[ RUN      ] TSLogMetricTest.TestHistogramMetric
[       OK ] TSLogMetricTest.TestHistogramMetric (1 ms)
[ RUN      ] TSLogMetricTest.TestTSLogMetricEmitWithRequestId
[       OK ] TSLogMetricTest.TestTSLogMetricEmitWithRequestId (1 ms)
[ RUN      ] TSLogMetricTest.TestTSLogMetricEmitWithoutRequestId
[       OK ] TSLogMetricTest.TestTSLogMetricEmitWithoutRequestId (1 ms)
[ RUN      ] TSLogMetricTest.TestTSLogMetricEmitWithIncorrectDimensionData
[       OK ] TSLogMetricTest.TestTSLogMetricEmitWithIncorrectDimensionData (0 ms)
[----------] 6 tests from TSLogMetricTest (7 ms total)

[----------] 2 tests from TSLogMetricsCacheTest
[ RUN      ] TSLogMetricsCacheTest.TestInitialize
[       OK ] TSLogMetricsCacheTest.TestInitialize (3 ms)
[ RUN      ] TSLogMetricsCacheTest.TestGetMetric
I0127 00:30:07.217655 400042 log_metric.cc:89] [METRICS]GaugeTsMetricExample.Count:1.5|#model_name:model_name,host_name:host_name|#hostname:ip-172-31-55-226,1706315407
[       OK ] TSLogMetricsCacheTest.TestGetMetric (1 ms)
[----------] 2 tests from TSLogMetricsCacheTest (4 ms total)

[----------] 3 tests from RegistryTest
[ RUN      ] RegistryTest.TestValidConfigFile
[       OK ] RegistryTest.TestValidConfigFile (1 ms)
[ RUN      ] RegistryTest.TestInvalidConfigFile
[       OK ] RegistryTest.TestInvalidConfigFile (0 ms)
[ RUN      ] RegistryTest.TestReInitialize
[       OK ] RegistryTest.TestReInitialize (2 ms)
[----------] 3 tests from RegistryTest (3 ms total)

[----------] 3 tests from UnitsTest
[ RUN      ] UnitsTest.TestGetExistingUnitMapping
[       OK ] UnitsTest.TestGetExistingUnitMapping (0 ms)
[ RUN      ] UnitsTest.TestGetNonExistentUnitMapping
[       OK ] UnitsTest.TestGetNonExistentUnitMapping (0 ms)
[ RUN      ] UnitsTest.TestGetEmptyUnitMapping
[       OK ] UnitsTest.TestGetEmptyUnitMapping (0 ms)
[----------] 3 tests from UnitsTest (0 ms total)

[----------] 10 tests from YAMLConfigTest
[ RUN      ] YAMLConfigTest.TestLoadValidConfigFrontendContext
[       OK ] YAMLConfigTest.TestLoadValidConfigFrontendContext (1 ms)
[ RUN      ] YAMLConfigTest.TestLoadValidConfigBackendContext
[       OK ] YAMLConfigTest.TestLoadValidConfigBackendContext (1 ms)
[ RUN      ] YAMLConfigTest.TestLoadMinimalValidConfig
[       OK ] YAMLConfigTest.TestLoadMinimalValidConfig (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateDimension
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateDimension (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithEmptyDimension
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithEmptyDimension (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithUndefinedDimension
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithUndefinedDimension (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateMetricDimension
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateMetricDimension (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithMissingMetricName
E0127 00:30:07.226278 400042 yaml_config.cc:203] Configuration for a metric must consist of "name", "unit" and "dimensions"
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithMissingMetricName (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithEmptyMetricName
E0127 00:30:07.226656 400042 yaml_config.cc:215] Configuration for a metric must consist of a non-empty "name"
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithEmptyMetricName (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateMetricName
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateMetricName (0 ms)
[----------] 10 tests from YAMLConfigTest (5 ms total)

[----------] 1 test from ManifestTest
[ RUN      ] ManifestTest.TestInitialize
[       OK ] ManifestTest.TestInitialize (0 ms)
[----------] 1 test from ManifestTest (0 ms total)

[----------] Global test environment tear-down
[==========] 46 tests from 11 test suites ran. (13003 ms total)
[  PASSED  ] 46 tests.
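
If only the refactored babyllama handler needs to be re-checked, gtest's standard test filter can narrow the run to that single case. This is a usage sketch; the binary path and test name are taken from the log above.

```bash
# Run only the babyllama handler test from the suite above.
./_build/test/torchserve_cpp_test \
  --gtest_filter=ModelPredictTest.TestLoadPredictBabyLlamaHandler
```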

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@mreso requested review from chauhang and lxning on January 27, 2024 01:19
@mreso added the c++ label on Jan 27, 2024
@mreso marked this pull request as ready for review on January 27, 2024 01:19
@mreso added this pull request to the merge queue on Jan 27, 2024
Merged via the queue into master with commit 4b69459 on Jan 27, 2024
13 checks passed
@chauhang added this to the v0.10.0 milestone on Feb 27, 2024