
[TensorRT] Support Multiple EP Context #23294

Open · wants to merge 31 commits into base: main
Conversation

jingyanwangms (Contributor) commented Jan 8, 2025

Description

  • Use CreateEpContextModel from graph_partitioner.cc to save the model with EP context nodes. Multiple EP context nodes in a model are now supported.
  • Updated the merging of EP-context-related options from the session options and the TensorRT options.
  • Updated and added unit tests.

Supported scenarios:

  • Save/run a static single EP context node using the engine cache
  • Save/run a static single EP context node with embedded EP context info
  • Save/run static multiple EP context nodes using the engine cache
  • Save/run static multiple EP context nodes with embedded EP context info
  • Save/run dynamic multiple EP context nodes using the engine cache
  • Save/run dynamic multiple EP context nodes with embedded EP context info

Unsupported scenarios:

  • Subsequent runs with an embedded dynamic-input EP context node where the dynamic input dimensions have changed.
    This does not work because the TensorRT engine might be updated at run time due to an input size change, but ORT has no callback mechanism to invoke CreateEpContextModel and update the embedded EP context. Supporting this would require significant changes to the existing infrastructure.

Motivation and Context

@jywu-msft jywu-msft requested a review from chilo-ms January 10, 2025 17:06
chilo-ms (Contributor) commented Jan 14, 2025

You should modify tensorrt_execution_provider.cc lines 3853 to 3856:

      // dump ep context model
      if (dump_ep_context_model_ && ep_context_embed_mode_) {
        UpdateCtxNodeModelEngineContext(model_proto_.get(), reinterpret_cast<char*>(serialized_engine->data()), serialized_engine->size());
        DumpCtxModel(model_proto_.get(), ctx_model_path_);
      }

The code above handles the case where the graph has dynamic-shape input(s) and the engine is updated during inference.
The old TRT EP behavior updates the engine binary embedded in the EP Context node and dumps the EP Context model to disk.
In this PR, which supports EP Context models for partitioning, it is the graph partitioner that dumps the model to disk, but we still need to think about how to handle this special case for the TRT EP. Otherwise, the new TRT EP might not work for existing apps that rely on dynamic-shape input with ep_context_embed_mode set to 1.

jingyanwangms (Contributor, Author) commented Jan 22, 2025

> You should modify tensorrt_execution_provider.cc line # 3853 to 3856 […]

I added a warning in the if (dump_ep_context_model_ && ep_context_embed_mode_) case to prompt the user to regenerate the EP context model. Handling this case would require changes to the overall EP context design. We have confirmed this is a lower-priority use case.

github-actions bot left a comment: You can commit the suggested changes from lintrunner.

jingyanwangms and others added 5 commits January 30, 2025 21:52
…r.cc

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
github-actions bot left a comment: You can commit the suggested changes from lintrunner.

Comment on lines 591 to 593
ASSERT_TRUE(status.IsOK());
// run inference
// TRT engine will be created and cached
// TRT profile will be created and cached only for dynamic input shape
// Data in profile,
// X: 1, 3, 3, 2, 2, 2
// Y: 1, 3, 3, 2, 2, 2
// Z: 1, 3, 3, 2, 2, 2
RunSession(session_object3, run_options, feeds, output_names, expected_dims_mul_m, expected_values_mul_m);

// Test engine cache path:
Suggested change (drop the redundant inference run and its comments):

ASSERT_TRUE(status.IsOK());
// Test engine cache path:
github-actions bot left a comment: You can commit the suggested changes from lintrunner.

Comment on lines 211 to 213


std::vector<char> ReadFileFromDisk(const PathString& path) {
Suggested change (remove the extra blank lines):

std::vector<char> ReadFileFromDisk(const PathString& path) {
Comment on lines 477 to 479
std::vector<int> dims = {1, 3, 2};

remove(ctx_model_path.c_str()); // remove the context model file generated by previous test
Suggested change (remove the blank line between the two statements):

std::vector<int> dims = {1, 3, 2};
remove(ctx_model_path.c_str()); // remove the context model file generated by previous test
github-actions bot left a comment: You can commit the suggested changes from lintrunner.

Comment on lines 776 to 778
std::vector<int64_t> expected_dims_mul_m = {3, 6};
std::vector<int64_t> expected_values_mul_m = { 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 14, 14, 16, 16, 18, 0, 1 };

Suggested change (lintrunner brace spacing):

std::vector<int64_t> expected_dims_mul_m = {3, 6};
std::vector<int64_t> expected_values_mul_m = {1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 14, 14, 16, 16, 18, 0, 1};
github-actions bot left a comment: You can commit the suggested changes from lintrunner.

Comment on lines 22 to +24
bool GraphHasCtxNode(const GraphViewer& graph_viewer) {
for (int i = 0; i < graph_viewer.MaxNodeIndex(); ++i) {
auto node = graph_viewer.GetNode(i);
for (auto node_index: graph_viewer.GetNodesInTopologicalOrder()) {
auto node = graph_viewer.GetNode(node_index);
Suggested change (formatting: space before the colon in the range-for):

bool GraphHasCtxNode(const GraphViewer& graph_viewer) {
  for (auto node_index : graph_viewer.GetNodesInTopologicalOrder()) {
    auto node = graph_viewer.GetNode(node_index);
Comment on lines +376 to +378
const auto& subgraph_node_list = graph_viewer.GetNodesInTopologicalOrder();
assert(subgraph_node_list.size() == 1); // There should only be 1 node in filtered graph
const auto node = graph_viewer.GetNode(subgraph_node_list[0]);
Suggested change (whitespace only):

const auto& subgraph_node_list = graph_viewer.GetNodesInTopologicalOrder();
assert(subgraph_node_list.size() == 1);  // There should only be 1 node in filtered graph
const auto node = graph_viewer.GetNode(subgraph_node_list[0]);
Comment on lines 1734 to 1772
The VERBOSE provider-options logging block is duplicated in this range; the suggested change keeps a single copy:

LOGS_DEFAULT(VERBOSE) << "[TensorRT EP] TensorRT provider options: "
                      << "device_id: " << device_id_
                      << ", trt_max_partition_iterations: " << max_partition_iterations_
                      << ", trt_min_subgraph_size: " << min_subgraph_size_
                      << ", trt_max_workspace_size: " << max_workspace_size_
                      << ", trt_fp16_enable: " << fp16_enable_
                      << ", trt_int8_enable: " << int8_enable_
                      << ", trt_int8_calibration_cache_name: " << int8_calibration_cache_name_
                      << ", int8_calibration_cache_available: " << int8_calibration_cache_available_
                      << ", trt_int8_use_native_tensorrt_calibration_table: " << int8_use_native_tensorrt_calibration_table_
                      << ", trt_dla_enable: " << dla_enable_
                      << ", trt_dla_core: " << dla_core_
                      << ", trt_dump_subgraphs: " << dump_subgraphs_
                      << ", trt_engine_cache_enable: " << engine_cache_enable_
                      << ", trt_weight_stripped_engine_enable: " << weight_stripped_engine_enable_
                      << ", trt_onnx_model_folder_path: " << onnx_model_folder_path_
                      << ", trt_cache_path: " << cache_path_
                      << ", trt_global_cache_path: " << global_cache_path_
                      << ", trt_engine_decryption_enable: " << engine_decryption_enable_
                      << ", trt_engine_decryption_lib_path: " << engine_decryption_lib_path_
                      << ", trt_force_sequential_engine_build: " << force_sequential_engine_build_
                      << ", trt_context_memory_sharing_enable: " << context_memory_sharing_enable_
                      << ", trt_layer_norm_fp32_fallback: " << layer_norm_fp32_fallback_
                      << ", trt_build_heuristics_enable: " << build_heuristics_enable_
                      << ", trt_sparsity_enable: " << sparsity_enable_
                      << ", trt_builder_optimization_level: " << builder_optimization_level_
                      << ", trt_auxiliary_streams: " << auxiliary_streams_
                      << ", trt_tactic_sources: " << tactic_sources_
                      << ", trt_profile_min_shapes: " << profile_min_shapes
                      << ", trt_profile_max_shapes: " << profile_max_shapes
                      << ", trt_profile_opt_shapes: " << profile_opt_shapes
                      << ", trt_cuda_graph_enable: " << cuda_graph_enable_
                      << ", trt_dump_ep_context_model: " << dump_ep_context_model_
                      << ", trt_ep_context_file_path: " << ep_context_file_path_
                      << ", trt_ep_context_embed_mode: " << ep_context_embed_mode_
                      << ", trt_cache_prefix: " << cache_prefix_
                      << ", trt_engine_hw_compatible: " << engine_hw_compatible_
                      << ", trt_onnx_model_bytestream_size_: " << onnx_model_bytestream_size_;
}

Comment on lines 3201 to 3203
// Generate file name for dumping ep context model
if (dump_ep_context_model_ && ctx_model_path_.empty()) {
ctx_model_path_ = GetCtxModelPath(ep_context_file_path_, model_path_);
}


if (!has_dynamic_shape) {
Suggested change (collapse the double blank line before the dynamic-shape check):

// Generate file name for dumping ep context model
if (dump_ep_context_model_ && ctx_model_path_.empty()) {
  ctx_model_path_ = GetCtxModelPath(ep_context_file_path_, model_path_);
}

if (!has_dynamic_shape) {
Comment on lines +352 to 354
bool is_single_node_epcontext_graph = false;

std::unordered_set<std::string> control_flow_op_set_ = {"If", "Loop", "Scan"};
Suggested change (remove the blank line between the declarations):

bool is_single_node_epcontext_graph = false;
std::unordered_set<std::string> control_flow_op_set_ = {"If", "Loop", "Scan"};