Optimize CPU time spent in inference path (continued) #695
base: ovep-develop
Conversation
@TejalKhade28 can you check if the TDR issue you fixed is still "fixed" with these changes? I have a different implementation of the inference request queue (now renamed to pool), but I think it's ok if we don't "delete" the inference requests(?). If not, then I can re-add something similar to what you had before. @ankitm3k would you mind taking a quick look at the merge result to see if anything stands out as being an issue? FYI @sfatimar
Pull Request Overview
This PR introduces further optimizations for the inference path to reduce CPU time. Key changes include:
- Unification of input/output bindings across devices.
- Moving infer request binding caching into a dedicated wrapper.
- Switching from asynchronous inference (async/wait) to direct infer() calls and revising the inference request pooling strategy.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
File | Description |
---|---|
onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc | Updated test Run invocation to disable the OpenVINO provider temporarily. |
onnxruntime/core/providers/openvino/ov_interface.h | Added helper structs (ParameterShape, ParameterInfo) and caching logic for infer requests. |
onnxruntime/core/providers/openvino/ov_interface.cc | Refactored exception handling using the new OvExceptionBoundary template. |
onnxruntime/core/providers/openvino/openvino_provider_factory.cc | Revised dynamic shapes logic for target devices. |
onnxruntime/core/providers/openvino/ibackend.h | Updated the Infer method signature to be const. |
onnxruntime/core/providers/openvino/backends/basic_backend.{h,cc} | Replaced the InferRequestsQueue with a new InferRequestPool and refactored inference binding and output copy logic. |
onnxruntime/core/providers/openvino/backend_utils.{h,cc} | Updated utility functions for output tensor creation. |
onnxruntime/core/providers/openvino/backend_manager.{h,cc} | Added mutex protection for backend map modifications and streamlined dynamic backend selection. |
Comments suppressed due to low confidence (3)
onnxruntime/core/providers/openvino/ibackend.h:17
- Marking Infer as const may be problematic if implementations modify internal state (e.g., caching). Please verify that no mutable state changes occur within Infer implementations.
virtual void Infer(OrtKernelContext* context) const = 0;
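For reference, a minimal sketch of the usual way this concern is resolved: a const Infer() can still legitimately update internal caches if those members are declared mutable and guarded by a mutex. All names below are illustrative, not taken from the PR.

```cpp
#include <mutex>

// Illustrative sketch only (ExampleBackend, cache_mutex_, infer_count_ are not
// from the PR): a const Infer() may still update internal caches when those
// members are declared mutable and guarded by a mutex.
class ExampleBackend {
 public:
  void Infer(/* OrtKernelContext* context */) const {
    std::lock_guard<std::mutex> lock(cache_mutex_);
    ++infer_count_;  // mutation is allowed despite const: the member is mutable
  }

 private:
  mutable std::mutex cache_mutex_;
  mutable int infer_count_ = 0;
};
```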
onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc:4102
- [nitpick] Consider adding an inline comment or reference explaining why the OpenVINO execution provider is disabled temporarily in this test to improve clarity for future maintainers.
test.Run(
OpTester::ExpectResult::kExpectSuccess,
"",
{kOpenVINOExecutionProvider} // OpenVINO: Disabled temporarily
);
onnxruntime/core/providers/openvino/backend_manager.cc:547
- [nitpick] Consider consolidating the mutex lock usage for both the lookup and insertion into backend_map_ to reduce potential contention and improve clarity.
backend_map_.insert({key, dynamic_backend});
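A hypothetical sketch of the consolidation being suggested, with made-up names (BackendManagerSketch, GetOrCreateBackend, backend_map_mutex_): one lock spans both the lookup and the insertion, so the check-then-insert is a single critical section.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative only, not the PR's actual backend_manager code.
struct BackendSketch {};

class BackendManagerSketch {
 public:
  std::shared_ptr<BackendSketch> GetOrCreateBackend(const std::string& key) {
    std::lock_guard<std::mutex> lock(backend_map_mutex_);
    auto it = backend_map_.find(key);
    if (it != backend_map_.end()) {
      return it->second;  // reuse the existing backend
    }
    auto dynamic_backend = std::make_shared<BackendSketch>();  // stand-in for real backend creation
    backend_map_.insert({key, dynamic_backend});
    return dynamic_backend;
  }

 private:
  std::mutex backend_map_mutex_;
  std::unordered_map<std::string, std::shared_ptr<BackendSketch>> backend_map_;
};
```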
Can we base async/sync inference on OpenVINO config parameters for async vs. sync, so that other workloads are not impacted?
What is the validation plan for this PR, given that it is a major change?
The multithreading issue needs to be tested. Please check parallel thread execution from ORT with -c 2.
@ericcraw can you please send me a document of the unit testing you have done with this?
The compute function always waits for async requests to finish. Using a sync inference request is equivalent, but avoids the threading overhead that async inference requests require in OV. We should definitely not add more OpenVINO configurations.
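To make the comparison concrete, here is a minimal sketch of the two call patterns under discussion (generic function names, not the EP's code); both block the caller, the second simply skips OV's async machinery.

```cpp
#include <openvino/openvino.hpp>

// Illustrative comparison of the two call patterns discussed above.
void RunAsyncThenWait(ov::InferRequest& request) {
  request.start_async();  // goes through OV's async/threading machinery
  request.wait();         // the caller blocks until completion anyway
}

void RunSync(ov::InferRequest& request) {
  request.infer();        // same result for this usage, without the async overhead
}
```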
Sanity accuracy testing via etests for CPU + NPU + GPU on LNL with basic models. Note: the previous implementation for multiple inference requests (i.e. multiple threads calling into the Compute function) is not thread safe. The NPU implementation touches a map that is owned by the basic_backend; multiple threads could modify it in parallel. The problem is fixed in this PR.
@ericcraw
test.Run(
OpTester::ExpectResult::kExpectSuccess,
"",
{kOpenVINOExecutionProvider} // OpenVINO: Disabled temporarily
@ericcraw
Can you please comment on why this test needs to be disabled?
There appears to be an OV CPU bug (at least in 2025.1) for this op when the incoming model is a single instance of this op configured as a no-op. The bug: when using the OV inference request's set_tensor for the output tensor, the values in the output after inference are incorrect. Using get_tensor to fetch the output tensor gives correct output. An OV bug will be filed.
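For clarity, a rough sketch of the two output-binding patterns being contrasted. The code is illustrative rather than the EP's; "reduced" is the output name from the affected test, and user_buffer is a hypothetical caller-owned float buffer of matching size.

```cpp
#include <openvino/openvino.hpp>

// Pattern that exposed the suspected OV CPU bug: pre-bind a user-owned output
// tensor via set_tensor, then read user_buffer after inference.
void RunWithSetOutputTensor(ov::InferRequest& request, float* user_buffer) {
  ov::Tensor user_output(ov::element::f32, ov::Shape{1, 2, 2}, user_buffer);
  request.set_tensor("reduced", user_output);
  request.infer();
}

// Pattern that produced correct results: let OV own the output and fetch it
// with get_tensor after inference.
ov::Tensor RunWithGetOutputTensor(ov::InferRequest& request) {
  request.infer();
  return request.get_tensor("reduced");
}
```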
@ericcraw can you please send us the OV bug ID?
@@ -4099,7 +4099,11 @@ TEST(ReductionOpTest, ReduceSum_noop_axes_input_initializer_opset_18) {
3.0f, 4.0f});
test.AddInput<int64_t>("axes", {0}, {}, true);
test.AddOutput<float>("reduced", {1, 2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
test.Run();
test.Run(
Are we no longer handling int64?
auto tensor = context.GetInput(input_info.onnx_index);
auto input_shape = ParameterShape(tensor.GetTensorTypeAndShapeInfo().GetShape());

infer_request->SetTensorShapeOverride(input_info, input_shape, const_cast<void*>(tensor.GetTensorRawData()));
Previously, our logic utilized memcpy to transfer data to the GPU, which lacked a shared memory region. The current implementation, however, appears to reference the ORT memory buffer directly. Could you elaborate on the mechanism that enables this direct memory access?
We're using the OV inference request API to set the input tensors to a user-controlled OV tensor. In this case we create an OV tensor that wraps the ORT tensor. It's up to the respective OV device plugin to performantly implement how it makes the user tensor accessible to the underlying device. In all likelihood, today there will be some sort of copy from this user tensor to device memory within the plugin (hopefully via GPU DMA, but that's up to the plugin implementation).
We used a similar mechanism for dynamic shapes before, but it was less optimal. Instead of wrapping the ORT tensor with an OV tensor, we created a new OV tensor with fresh CPU memory and copied the ORT tensor into it, effectively creating an unnecessary extra copy.
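A simplified sketch of the two approaches described above, with illustrative function names and an assumed f32 input; the real EP code handles element types and shapes generically.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <openvino/openvino.hpp>

// New approach: wrap the ORT-owned buffer directly; no host-side copy here.
void BindByWrapping(ov::InferRequest& request, const std::string& input_name,
                    const ov::Shape& shape, void* ort_data) {
  ov::Tensor wrapper(ov::element::f32, shape, ort_data);  // no allocation, no copy
  request.set_tensor(input_name, wrapper);
}

// Previous dynamic-shape approach: allocate a fresh OV tensor and memcpy the
// ORT data into it, paying an extra copy.
void BindByCopying(ov::InferRequest& request, const std::string& input_name,
                   const ov::Shape& shape, const void* ort_data, std::size_t byte_size) {
  ov::Tensor fresh(ov::element::f32, shape);
  std::memcpy(fresh.data(), ort_data, byte_size);
  request.set_tensor(input_name, fresh);
}
```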
namespace onnxruntime {
namespace openvino_ep {

static const std::string log_tag = "[OpenVINO-EP] ";
template <typename Func, typename... Args>
@ericcraw Can you please omit usage of the C++20 standard here, as the build fails on Ubuntu due to an unsupported compiler issue? I am currently using gcc version 13.1.0 (Ubuntu 13.1.0-8ubuntu1~22.04) and observe the failure shown in the attached screenshot.

We already require C++20 support to compile the EP; it's not new. Perhaps an older version of libstdc++ is somehow getting used that doesn't have a format implementation?
Build command used on Ubuntu:
./build.sh --config RelWithDebInfo --use_openvino CPU --build_shared_lib --build_wheel --parallel --skip_tests --cmake_extra_defines CMAKE_CXX_STANDARD=20
Sure.
I'm not super familiar with gcc toolchain management, but I see that gcc 13.1 should support this, and I assume its corresponding standard library (libstdc++) is automatically installed along with it, assuming it was installed from apt-get. However, since you are getting a compilation error, I suspect that the library actually being used during compilation is an older version.
You might want to try reconfiguring CMake to ensure it's using the version of the gcc toolchain you expect.
Used Copilot to generate C++17 code. Please check if this can replace the existing C++20 implementation.
Verified with C++17 and gcc 10: https://godbolt.org/z/71Ezz4j1z
template <typename Func, typename... Args>
#include <sstream>
#include <string>
#include <utility>
const std::string log_tag = "[OpenVINO-EP] ";
// Helper to concatenate arguments into a string (C++17)
template <typename... Args>
std::string StringFormat(const Args&... args) {
  std::ostringstream oss;
  (oss << ... << args);
  return oss.str();
}
template <typename Func, typename... Args>
inline auto OvExceptionBoundary(Func&& func, const char* msg, Args&&... args) {
  try {
    return func();
  } catch (const ov::Exception& e) {
    ORT_THROW(log_tag + StringFormat(msg, std::forward<Args>(args)...) + ": " + std::string(e.what()));
  } catch (...) {
    ORT_THROW(log_tag + StringFormat(msg, std::forward<Args>(args)...));
  }
}
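If it helps to evaluate the suggestion, a hypothetical call site is sketched below (core, model, and device_name are placeholders, not the EP's code). Note that this C++17 StringFormat concatenates its arguments rather than substituting std::format-style {} placeholders, so existing messages written with placeholders would need adjusting.

```cpp
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

// Hypothetical usage of the suggested helper above.
ov::CompiledModel CompileWithBoundary(ov::Core& core,
                                      const std::shared_ptr<ov::Model>& model,
                                      const std::string& device_name) {
  return OvExceptionBoundary([&]() {
    return core.compile_model(model, device_name);
  }, "Exception while compiling model for device ", device_name);
}
```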
If CMake specifies C++20, it should be supported, period! Please fix the build systems. Otherwise, change CMake to require C++17; then this discussion is valid.
@@ -27,8 +27,9 @@

namespace onnxruntime {
namespace openvino_ep {
static constexpr std::string log_tag = "[OpenVINO-EP] ";
See above: constexpr std::string requires a C++20 standard library implementation.
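If supporting older standard libraries matters, one possible C++17-friendly alternative (an assumption, not what the PR does) is std::string_view, which has constexpr constructors since C++17 and needs no allocation.

```cpp
#include <string_view>

// C++17-compatible alternative to constexpr std::string. Call sites that
// concatenate with std::string would need std::string(log_tag) + ... instead
// of log_tag + ....
static constexpr std::string_view log_tag = "[OpenVINO-EP] ";
```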
If OVEP does not build on the supported PTL Linux versions, it will gate the PTL QS release. The PTL Linux QS branch-off is WW26, so we have time, but we do need to resolve this. PTL supports Ubuntu 24, but we must support Ubuntu 22 too, so we have to resolve this before the next ORT release. Additionally, the ManyLinux Python package must build.
Where are those built? Where are the CI results? Why is there no developer feedback apart from comments on the PR?
Additional changes to further reduce CPU time in the inference path.
Unify input/output bindings across devices.
Move infer request binding caching to infer request wrapper.
Use infer request's infer() instead of async/wait.
Modify the infer request queue to allow an unbounded number of inference requests, to better follow ORT expectations w.r.t. threading (a rough sketch of the idea follows).
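A minimal sketch of the unbounded-pool idea described in the last bullet: hand out an idle request if one is available, otherwise create a new one, so the number of concurrent ORT Compute() callers is never capped. The class and member names are illustrative, not the PR's actual InferRequestPool implementation.

```cpp
#include <mutex>
#include <utility>
#include <vector>
#include <openvino/openvino.hpp>

// Illustrative pool: grows on demand, reuses released requests.
class InferRequestPoolSketch {
 public:
  explicit InferRequestPoolSketch(ov::CompiledModel& compiled) : compiled_(compiled) {}

  ov::InferRequest Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!idle_.empty()) {
      ov::InferRequest request = std::move(idle_.back());
      idle_.pop_back();
      return request;
    }
    return compiled_.create_infer_request();  // grow on demand, no upper bound
  }

  void Release(ov::InferRequest&& request) {
    std::lock_guard<std::mutex> lock(mutex_);
    idle_.push_back(std::move(request));
  }

 private:
  ov::CompiledModel& compiled_;
  std::mutex mutex_;
  std::vector<ov::InferRequest> idle_;
};
```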