
AOT inductor example for cpp backend #2913

Merged: 14 commits merged into master from feature/cpp_aot_inductor_example on Feb 8, 2024

Conversation

@mreso (Collaborator) commented Jan 30, 2024

Description

This PR adds an aot_inductor example for the cpp backend.

Fixes #(issue)
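
For orientation, the sketch below shows the kind of AOT Inductor export step that produces the shared library a C++ runtime can load. It is an illustrative example only, not the script shipped in this PR: the toy model, the output path, and the use of the torch._export.aot_compile API (as available around PyTorch 2.2) are assumptions.

# Illustrative sketch only (not the PR's compile script): export a toy model
# with AOT Inductor so the resulting .so can be loaded without Python.
import torch


class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x @ x.T)


model = TinyModel().eval()
example_inputs = (torch.randn(4, 4),)

with torch.no_grad():
    # Traces the model and compiles it into a shared library; the output
    # path option is optional, a temporary path is returned otherwise.
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        options={"aot_inductor.output_path": "model.so"},
    )

print(f"AOT Inductor artifact written to: {so_path}")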

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the unit or integration tests you ran to verify your changes and summarize the results. Provide instructions so the tests can be reproduced, and list any relevant details of your test configuration.

  • _build/test/torchserve_cpp_test
Running main() from /home/ubuntu/serve/cpp/_build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 47 tests from 11 test suites.
[----------] Global test environment set-up.
[----------] 1 test from BackendIntegTest
[ RUN      ] BackendIntegTest.TestOTFProtocolAndHandler
I0130 05:56:24.971075 841185 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:65.773289|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706594184,reqi
I0130 05:56:24.971156 841185 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:65.773289|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706594184,reqi
[       OK ] BackendIntegTest.TestOTFProtocolAndHandler (93 ms)
[----------] 1 test from BackendIntegTest (93 ms total)

[----------] 8 tests from OTFMessageTest
[ RUN      ] OTFMessageTest.TestRetieveCmd
[       OK ] OTFMessageTest.TestRetieveCmd (0 ms)
[ RUN      ] OTFMessageTest.TestEncodeLoadModelResponse
[       OK ] OTFMessageTest.TestEncodeLoadModelResponse (0 ms)
[ RUN      ] OTFMessageTest.TestUTF8EncodeLoadModelResponse
[       OK ] OTFMessageTest.TestUTF8EncodeLoadModelResponse (0 ms)
[ RUN      ] OTFMessageTest.TestRetrieveMsgLoadGpu
[       OK ] OTFMessageTest.TestRetrieveMsgLoadGpu (0 ms)
[ RUN      ] OTFMessageTest.TestRetrieveMsgLoadNoGpu
[       OK ] OTFMessageTest.TestRetrieveMsgLoadNoGpu (0 ms)
[ RUN      ] OTFMessageTest.TestEncodeSuccessInferenceResponse
[       OK ] OTFMessageTest.TestEncodeSuccessInferenceResponse (0 ms)
[ RUN      ] OTFMessageTest.TestEncodeFailureInferenceResponse
E0130 05:56:24.973362 841185 otf_message_test.cc:157] result_size: 120
[       OK ] OTFMessageTest.TestEncodeFailureInferenceResponse (0 ms)
[ RUN      ] OTFMessageTest.TestRetrieveInferenceMsg
[       OK ] OTFMessageTest.TestRetrieveInferenceMsg (0 ms)
[----------] 8 tests from OTFMessageTest (0 ms total)

[----------] 9 tests from ModelPredictTest
[ RUN      ] ModelPredictTest.TestLoadPredictBabyLlamaHandler
Total number of tokens generated: 332
Achieved tok per sec: 181.223
Generated String:  Hello my name is
The little girl, who was three years old, was playing in the garden. She saw a big, red tomato and wanted to pick it. She reached out her hand and grabbed it.
Suddenly, a voice said, "Hey! That's my tomato!"
The little girl looked up and saw a big, angry bird. She was scared and started to cry.
The bird said, "That tomato is mine! I'm going to eat it!"
The little girl was very scared and ran away. She was so sad that she didn't get to eat the tomato.
The bird flew away and the little girl was left alone in the garden. She was very sad and cried all the way home.
<s>

Generated String:  Hello my name is
The little girl, who was three years old, was playing in the garden. She saw a big, red tomato and wanted to pick it. She reached out her hand and grabbed it.
Suddenly, a voice said, "Hey! That's my tomato!"
The little girl looked up and saw a big, angry bird. She was scared and started to cry.
The bird said, "That tomato is mine! I'm going to eat it!"
The little girl was very scared and ran away. She was so sad that she didn't get to eat the tomato.
The bird flew away and the little girl was left alone in the garden. She was very sad and cried all the way home.
<s>

I0130 05:56:26.827557 841185 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:1849.236237|#ModelName:babyllama,Level:Model|#hostname:ip-172-31-55-226,1706594186,llm_ts_0,llm_ts_1
I0130 05:56:26.827583 841185 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:1849.236237|#ModelName:babyllama,Level:Model|#hostname:ip-172-31-55-226,1706594186,llm_ts_0,llm_ts_1
[       OK ] ModelPredictTest.TestLoadPredictBabyLlamaHandler (1857 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictAotInductorLlamaHandler
I0130 05:56:28.228975 841185 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:1393.753587|#ModelName:llama,Level:Model|#hostname:ip-172-31-55-226,1706594188,llm_ts_0,llm_ts_1
I0130 05:56:28.229018 841185 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:1393.753587|#ModelName:llama,Level:Model|#hostname:ip-172-31-55-226,1706594188,llm_ts_0,llm_ts_1
[       OK ] ModelPredictTest.TestLoadPredictAotInductorLlamaHandler (1401 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictLlmHandler
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from test/resources/examples/llamacpp/llamacpp_handler/llama-2-7b-chat.Q5_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 8
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 4.33 GiB (5.52 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4435.49 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph splits (measure): 1
E0130 05:56:28.481768 841185 llamacpp_handler.cc:16] Context initialized successfully

llama_print_timings:        load time =     377.89 ms
llama_print_timings:      sample time =       2.35 ms /    64 runs   (    0.04 ms per token, 27234.04 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   10247.66 ms /    64 runs   (  160.12 ms per token,     6.25 tokens per second)
llama_print_timings:       total time =   10522.45 ms /    65 tokens
I0130 05:56:38.788802 841185 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:10406.741794|#ModelName:llamacpp,Level:Model|#hostname:ip-172-31-55-226,1706594198,llm_ts_0,llm_ts_1
I0130 05:56:38.788825 841185 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:10406.741794|#ModelName:llamacpp,Level:Model|#hostname:ip-172-31-55-226,1706594198,llm_ts_0,llm_ts_1
[       OK ] ModelPredictTest.TestLoadPredictLlmHandler (10625 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictBaseHandler
I0130 05:56:38.872547 841185 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:5.725905|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706594198,mnist_ts_0,mnist_ts_1
I0130 05:56:38.872582 841185 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:5.725905|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706594198,mnist_ts_0,mnist_ts_1
[       OK ] ModelPredictTest.TestLoadPredictBaseHandler (16 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictMnistHandler
I0130 05:56:38.888532 841185 log_metric.cc:92] [METRICS]HandlerTime.Milliseconds:5.044438|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706594198,mnist_ts_0,mnist_ts_1
I0130 05:56:38.888563 841185 log_metric.cc:92] [METRICS]PredictionTime.Milliseconds:5.044438|#ModelName:mnist_scripted_v2,Level:Model|#hostname:ip-172-31-55-226,1706594198,mnist_ts_0,mnist_ts_1
[       OK ] ModelPredictTest.TestLoadPredictMnistHandler (16 ms)
[ RUN      ] ModelPredictTest.TestBackendInitWrongModelDir
E0130 05:56:38.889216 841185 model_archive.cc:53] Failed to init Manifest from: test/resources/examples/mnist/MAR-INF/MANIFEST.json
[       OK ] ModelPredictTest.TestBackendInitWrongModelDir (0 ms)
[ RUN      ] ModelPredictTest.TestBackendInitWrongHandler
[       OK ] ModelPredictTest.TestBackendInitWrongHandler (0 ms)
[ RUN      ] ModelPredictTest.TestLoadModelFailure
E0130 05:56:38.895336 841185 torch_scripted_handler.cc:22] loading the model: mnist_scripted_v2, device id: -1, error: open file failed because of errno 2 on fopen: No such file or directory, file path: test/resources/examples/mnist/wrong_model/mnist_script.pt
[       OK ] ModelPredictTest.TestLoadModelFailure (5 ms)
[ RUN      ] ModelPredictTest.TestLoadPredictMnistHandlerFailure
E0130 05:56:38.908894 841185 base_handler.cc:154] Failed to load tensor for request id: mnist_ts_0, c10 error: PytorchStreamReader failed reading zip archive: failed finding central directory
E0130 05:56:38.913367 841185 base_handler.cc:154] Failed to load tensor for request id: mnist_ts_1, c10 error: PytorchStreamReader failed reading zip archive: failed finding central directory
E0130 05:56:38.915435 841185 base_handler.cc:51] Failed to handle this batch after: Preprocessing
[       OK ] ModelPredictTest.TestLoadPredictMnistHandlerFailure (20 ms)
[----------] 9 tests from ModelPredictTest (13942 ms total)

[----------] 1 test from DLLoaderTest
[ RUN      ] DLLoaderTest.TestGetInstance
[       OK ] DLLoaderTest.TestGetInstance (0 ms)
[----------] 1 test from DLLoaderTest (0 ms total)

[----------] 3 tests from LoggingTest
[ RUN      ] LoggingTest.TestIncorrectLogInitialization
[       OK ] LoggingTest.TestIncorrectLogInitialization (0 ms)
[ RUN      ] LoggingTest.TestJSONConfigLogInitialization
[       OK ] LoggingTest.TestJSONConfigLogInitialization (0 ms)
[ RUN      ] LoggingTest.TestFileLogInitialization
[       OK ] LoggingTest.TestFileLogInitialization (0 ms)
[----------] 3 tests from LoggingTest (0 ms total)

[----------] 6 tests from TSLogMetricTest
[ RUN      ] TSLogMetricTest.TestCounterMetric
[       OK ] TSLogMetricTest.TestCounterMetric (1 ms)
[ RUN      ] TSLogMetricTest.TestGaugeMetric
[       OK ] TSLogMetricTest.TestGaugeMetric (1 ms)
[ RUN      ] TSLogMetricTest.TestHistogramMetric
[       OK ] TSLogMetricTest.TestHistogramMetric (1 ms)
[ RUN      ] TSLogMetricTest.TestTSLogMetricEmitWithRequestId
[       OK ] TSLogMetricTest.TestTSLogMetricEmitWithRequestId (1 ms)
[ RUN      ] TSLogMetricTest.TestTSLogMetricEmitWithoutRequestId
[       OK ] TSLogMetricTest.TestTSLogMetricEmitWithoutRequestId (1 ms)
[ RUN      ] TSLogMetricTest.TestTSLogMetricEmitWithIncorrectDimensionData
[       OK ] TSLogMetricTest.TestTSLogMetricEmitWithIncorrectDimensionData (0 ms)
[----------] 6 tests from TSLogMetricTest (7 ms total)

[----------] 2 tests from TSLogMetricsCacheTest
[ RUN      ] TSLogMetricsCacheTest.TestInitialize
[       OK ] TSLogMetricsCacheTest.TestInitialize (3 ms)
[ RUN      ] TSLogMetricsCacheTest.TestGetMetric
I0130 05:56:38.928321 841185 log_metric.cc:89] [METRICS]GaugeTsMetricExample.Count:1.5|#model_name:model_name,host_name:host_name|#hostname:ip-172-31-55-226,1706594198
[       OK ] TSLogMetricsCacheTest.TestGetMetric (1 ms)
[----------] 2 tests from TSLogMetricsCacheTest (4 ms total)

[----------] 3 tests from RegistryTest
[ RUN      ] RegistryTest.TestValidConfigFile
[       OK ] RegistryTest.TestValidConfigFile (1 ms)
[ RUN      ] RegistryTest.TestInvalidConfigFile
[       OK ] RegistryTest.TestInvalidConfigFile (0 ms)
[ RUN      ] RegistryTest.TestReInitialize
[       OK ] RegistryTest.TestReInitialize (1 ms)
[----------] 3 tests from RegistryTest (3 ms total)

[----------] 3 tests from UnitsTest
[ RUN      ] UnitsTest.TestGetExistingUnitMapping
[       OK ] UnitsTest.TestGetExistingUnitMapping (0 ms)
[ RUN      ] UnitsTest.TestGetNonExistentUnitMapping
[       OK ] UnitsTest.TestGetNonExistentUnitMapping (0 ms)
[ RUN      ] UnitsTest.TestGetEmptyUnitMapping
[       OK ] UnitsTest.TestGetEmptyUnitMapping (0 ms)
[----------] 3 tests from UnitsTest (0 ms total)

[----------] 10 tests from YAMLConfigTest
[ RUN      ] YAMLConfigTest.TestLoadValidConfigFrontendContext
[       OK ] YAMLConfigTest.TestLoadValidConfigFrontendContext (1 ms)
[ RUN      ] YAMLConfigTest.TestLoadValidConfigBackendContext
[       OK ] YAMLConfigTest.TestLoadValidConfigBackendContext (1 ms)
[ RUN      ] YAMLConfigTest.TestLoadMinimalValidConfig
[       OK ] YAMLConfigTest.TestLoadMinimalValidConfig (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateDimension
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateDimension (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithEmptyDimension
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithEmptyDimension (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithUndefinedDimension
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithUndefinedDimension (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateMetricDimension
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateMetricDimension (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithMissingMetricName
E0130 05:56:38.936670 841185 yaml_config.cc:203] Configuration for a metric must consist of "name", "unit" and "dimensions"
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithMissingMetricName (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithEmptyMetricName
E0130 05:56:38.937023 841185 yaml_config.cc:215] Configuration for a metric must consist of a non-empty "name"
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithEmptyMetricName (0 ms)
[ RUN      ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateMetricName
[       OK ] YAMLConfigTest.TestLoadInvalidConfigWithDuplicateMetricName (0 ms)
[----------] 10 tests from YAMLConfigTest (5 ms total)

[----------] 1 test from ManifestTest
[ RUN      ] ManifestTest.TestInitialize
[       OK ] ManifestTest.TestInitialize (0 ms)
[----------] 1 test from ManifestTest (0 ms total)

[----------] Global test environment tear-down
[==========] 47 tests from 11 test suites ran. (14058 ms total)
[  PASSED  ] 47 tests.

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@mreso mreso marked this pull request as ready for review January 31, 2024 18:07
@mreso mreso requested review from chauhang and lxning January 31, 2024 18:07
@mreso mreso added the c++ label Jan 31, 2024
Review thread on the model export script (diff context):

return model, gptconf


if __name__ == "__main__":
Collaborator commented on the snippet above:

It seems that this script can be used to support a list of models. Can we move it to a utils dir?

mreso (Collaborator, Author) replied:

Kind of; the file needs to be adjusted for different models and inputs (see constraints). We can see later whether we can leverage a common structure. For now this is specific to this kind of model.
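
To illustrate why such a script ends up model-specific, here is a hypothetical skeleton of a per-model export script. The names (build_model, GPTConfig), the stand-in module, and the input shapes are assumptions that only echo the diff context above; this is not the file in this PR.

# Hypothetical skeleton of a per-model AOT Inductor export script.
# Each model needs its own construction code and example inputs, which is
# why this file is adjusted per model rather than shared as a generic util.
import torch


class GPTConfig:
    # Placeholder config; a real script would derive this from a checkpoint.
    vocab_size = 32000
    block_size = 256


def build_model():
    config = GPTConfig()
    # Stand-in module; the real script builds the actual network here.
    model = torch.nn.Embedding(config.vocab_size, 64).eval()
    return model, config


if __name__ == "__main__":
    model, gptconf = build_model()
    # Example inputs must match what the serving handler will feed at runtime.
    example_inputs = (
        torch.randint(0, gptconf.vocab_size, (1, gptconf.block_size)),
    )
    with torch.no_grad():
        torch._export.aot_compile(
            model,
            example_inputs,
            options={"aot_inductor.output_path": "model.so"},
        )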

@mreso (Collaborator, Author) commented Feb 6, 2024

@lxning Thanks for reviewing the changes! I addressed your comments. We can see if we can reutilize the compile.py script somehow when we tackle the next AOT example. I'll create an issue to track the mnist preprocessing step.

@lxning lxning added this pull request to the merge queue Feb 8, 2024
Merged via the queue into master with commit f71d875 Feb 8, 2024
13 checks passed
@mreso mreso deleted the feature/cpp_aot_inductor_example branch February 8, 2024 23:13
@chauhang chauhang added this to the v0.10.0 milestone Feb 27, 2024