Optimize CPU time spent in inference path (continued) #695


Open
wants to merge 18 commits into base: ovep-develop

Conversation

@ericcraw commented May 23, 2025

Additional changes to further reduce CPU time in the inference path:

  • Unify input/output bindings across devices.
  • Move infer request binding caching into the infer request wrapper.
  • Use the infer request's synchronous infer() instead of start_async()/wait() (see the sketch after this list).
  • Modify the infer request queue to allow an unbounded number of inference requests, to better follow ORT expectations w.r.t. threading.
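A minimal sketch of the async-to-sync change (illustrative only, not the PR's actual code), assuming an ov::InferRequest whose input/output tensors are already bound:

#include <openvino/openvino.hpp>

void RunInference(ov::InferRequest& infer_request) {
  // Previous pattern: hand the request to an OpenVINO worker thread, then block on it.
  //   infer_request.start_async();
  //   infer_request.wait();

  // New pattern: run synchronously on the calling thread. Because the compute
  // function always waited for the async request anyway, this is functionally
  // equivalent but avoids the extra threading overhead inside OV.
  infer_request.infer();
}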

@ericcraw ericcraw changed the title use infer instead of start async/wait Optimize CPU time spent in inference path (continued) May 27, 2025
@ericcraw ericcraw force-pushed the inference_opt_continued branch from 9959011 to 1756787 Compare May 27, 2025 22:50
@ericcraw ericcraw marked this pull request as ready for review May 27, 2025 22:52
@MayureshV1 MayureshV1 requested a review from Copilot May 29, 2025 17:17
Copilot: This comment was marked as outdated.

@ericcraw (Author) commented Jun 5, 2025

@TejalKhade28 can you check if the TDR issue you fixed is still "fixed" with these changes? I have a different implementation of the inference request queue (now renamed to pool) but I think it's ok if we don't "delete" the inference requests(?). If not, then I can re-add something similar to what you had before.

@ankitm3k would you mind taking a quick look at the merge result to see if anything stands out as being an issue?

FYI @sfatimar

@ankitm3k ankitm3k requested a review from Copilot June 6, 2025 05:31
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces further optimizations for the inference path to reduce CPU time. Key changes include:

  • Unification of input/output bindings across devices.
  • Moving infer request binding caching into a dedicated wrapper.
  • Switching from asynchronous inference (async/wait) to direct infer() calls and revising the inference request pooling strategy.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Summary per file:

  • onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc: Updated test Run invocation to disable the OpenVINO provider temporarily.
  • onnxruntime/core/providers/openvino/ov_interface.h: Added helper structs (ParameterShape, ParameterInfo) and caching logic for infer requests.
  • onnxruntime/core/providers/openvino/ov_interface.cc: Refactored exception handling using the new OvExceptionBoundary template.
  • onnxruntime/core/providers/openvino/openvino_provider_factory.cc: Revised dynamic shapes logic for target devices.
  • onnxruntime/core/providers/openvino/ibackend.h: Updated the Infer method signature to be const.
  • onnxruntime/core/providers/openvino/backends/basic_backend.{h,cc}: Replaced the InferRequestsQueue with a new InferRequestPool and refactored inference binding and output copy logic.
  • onnxruntime/core/providers/openvino/backend_utils.{h,cc}: Updated utility functions for output tensor creation.
  • onnxruntime/core/providers/openvino/backend_manager.{h,cc}: Added mutex protection for backend map modifications and streamlined dynamic backend selection.
Comments suppressed due to low confidence (3)

onnxruntime/core/providers/openvino/ibackend.h:17

  • Marking Infer as const may be problematic if implementations modify internal state (e.g., caching). Please verify that no mutable state changes occur within Infer implementations.
virtual void Infer(OrtKernelContext* context) const = 0;

onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc:4102

  • [nitpick] Consider adding an inline comment or reference explaining why the OpenVINO execution provider is disabled temporarily in this test to improve clarity for future maintainers.
test.Run(
      OpTester::ExpectResult::kExpectSuccess,
      "",
      {kOpenVINOExecutionProvider} // OpenVINO: Disabled temporarily
    );

onnxruntime/core/providers/openvino/backend_manager.cc:547

  • [nitpick] Consider consolidating the mutex lock usage for both the lookup and insertion into backend_map_ to reduce potential contention and improve clarity.
backend_map_.insert({key, dynamic_backend});
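For illustration only, one way to put the lookup and the insertion under a single lock; the helper, mutex, and factory names here are hypothetical and not part of this PR (note that this also serializes backend creation, which may or may not be desirable):

#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct IBackend;  // stand-in for the EP's backend interface

// Hypothetical helper: look up the key and, if missing, create and insert the
// backend while still holding the same lock.
std::shared_ptr<IBackend> GetOrCreateBackend(
    std::unordered_map<std::string, std::shared_ptr<IBackend>>& backend_map,
    std::mutex& map_mutex,
    const std::string& key,
    const std::function<std::shared_ptr<IBackend>()>& create_backend) {
  std::lock_guard<std::mutex> lock(map_mutex);
  auto it = backend_map.find(key);
  if (it == backend_map.end()) {
    it = backend_map.emplace(key, create_backend()).first;
  }
  return it->second;
}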

@sfatimar

Can we base async vs. sync inference on OpenVINO config parameters for async/sync, so that other workloads are not impacted?

@sfatimar

What is the validation plan for this PR, as it is a major change?

@sfatimar

The multithreading issue needs to be tested. Please check parallel thread execution from ORT with -c 2.

@sfatimar

@ericcraw can you please send me a document of the unit testing you have done with this?

@ericcraw (Author)

Can we base async vs. sync inference on OpenVINO config parameters for async/sync, so that other workloads are not impacted?

The compute function always waits for async requests to finish. Using a sync inference request is equivalent, but it avoids the threading overhead that async inference requests require in OV. We should definitely not add more OpenVINO configurations.

@ericcraw can you please send me a document of the unit testing you have done with this?

  • Sanity accuracy testing via etests for CPU + NPU + GPU on LNL with basic models.
  • Multi-inference with onnx perf test.
  • ORT unit tests using OV CPU.

Note: the previous implementation for multiple inference requests (i.e. multiple threads calling into the Compute function) is not thread safe. The NPU implementation touches a map that is owned by the basic_backend; multiple threads could modify it in parallel. The problem is fixed in this PR.
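For illustration only, a minimal sketch of a mutex-guarded request pool of the kind described above; the class and member names are hypothetical, not the PR's actual InferRequestPool:

#include <mutex>
#include <queue>
#include <utility>
#include <openvino/openvino.hpp>

// Hypothetical pool: hands an ov::InferRequest to each calling thread and grows
// on demand, so any number of ORT threads can call Compute() safely.
class SimpleInferRequestPool {
 public:
  explicit SimpleInferRequestPool(ov::CompiledModel& compiled_model)
      : compiled_model_(compiled_model) {}

  ov::InferRequest Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (idle_.empty()) {
      // Unbounded: create a new request instead of blocking the caller.
      return compiled_model_.create_infer_request();
    }
    ov::InferRequest request = std::move(idle_.front());
    idle_.pop();
    return request;
  }

  void Release(ov::InferRequest request) {
    std::lock_guard<std::mutex> lock(mutex_);
    idle_.push(std::move(request));
  }

 private:
  ov::CompiledModel& compiled_model_;  // assumed to outlive the pool
  std::mutex mutex_;
  std::queue<ov::InferRequest> idle_;
};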

@vthaniel commented Jun 11, 2025

@ericcraw
The TDR issue was tested with this PR. The issue is fixed in this PR as well.

test.Run(
OpTester::ExpectResult::kExpectSuccess,
"",
{kOpenVINOExecutionProvider} // OpenVINO: Disabled temporarily


@ericcraw
Can you please comment on why this test needs to be disabled?

@ericcraw (Author)

There appears to be an OV CPU bug (at least in 2025.1) for this op when the incoming model is a single instance of this op configured as a no-op. The bug: when using the OV inference request's set_tensor for the output tensor, the values in the output after inference are incorrect. Using get_tensor to get the output tensor produces correct output. An OV bug will be filed.
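A rough repro sketch of the suspected behavior, assuming a single-op no-op model compiled for CPU; the model path is a placeholder and the input setup is omitted for brevity (both requests would be fed identical inputs):

#include <string>
#include <openvino/openvino.hpp>

// Compare the output when the caller pre-sets the output tensor (set_tensor)
// versus letting the plugin provide it (get_tensor).
void CompareOutputPaths(const std::string& model_path) {
  ov::Core core;
  auto compiled = core.compile_model(model_path, "CPU");

  // Path 1: caller-provided output tensor via set_tensor().
  auto request_set = compiled.create_infer_request();
  ov::Tensor user_output(compiled.output().get_element_type(),
                         compiled.output().get_shape());
  request_set.set_tensor(compiled.output(), user_output);
  request_set.infer();
  // Observed (on the no-op single-op model): values in user_output were incorrect.

  // Path 2: plugin-provided output tensor via get_tensor().
  auto request_get = compiled.create_infer_request();
  request_get.infer();
  ov::Tensor plugin_output = request_get.get_tensor(compiled.output());
  // Observed: values here were correct.
}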


@ericcraw can you please send us the OV Bug ID.

@@ -4099,7 +4099,11 @@ TEST(ReductionOpTest, ReduceSum_noop_axes_input_initializer_opset_18) {
3.0f, 4.0f});
test.AddInput<int64_t>("axes", {0}, {}, true);
test.AddOutput<float>("reduced", {1, 2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
test.Run();
test.Run(
@sfatimar Jun 11, 2025

Are we no longer handling int64?

auto tensor = context.GetInput(input_info.onnx_index);
auto input_shape = ParameterShape(tensor.GetTensorTypeAndShapeInfo().GetShape());

infer_request->SetTensorShapeOverride(input_info, input_shape, const_cast<void*>(tensor.GetTensorRawData()));


Previously, our logic utilized memcpy to transfer data to the GPU, which lacked a shared memory region. The current implementation, however, appears to reference the ORT memory buffer directly. Could you elaborate on the mechanism that enables this direct memory access?

@ericcraw (Author)

We're using the OV inference request API to set the input tensors to a user-controlled OV tensor. In this case we create an OV tensor that wraps the ORT tensor. It's up to the respective OV device plugin to implement, performantly, how it makes the user tensor accessible to the underlying device. In all likelihood, today there will be some sort of copy from this user tensor to device memory within the plugin (hopefully via GPU DMA, but that's up to the plugin implementation).

We used a similar mechanism for dynamic shapes before, but it was less optimal: instead of wrapping the ORT tensor with an OV tensor, we created a new OV tensor with fresh CPU memory and copied the ORT tensor into it, effectively creating an unnecessary extra copy.
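For illustration, a minimal sketch of wrapping an externally owned host buffer (such as an ORT tensor's data pointer) in an ov::Tensor; the function and parameter names are illustrative, and the element type is assumed to be f32:

#include <string>
#include <openvino/openvino.hpp>

// Wrap an externally owned buffer without copying. OpenVINO does not take
// ownership; the caller must keep ort_data alive until inference completes.
void BindOrtBuffer(ov::InferRequest& infer_request,
                   const std::string& input_name,
                   const ov::Shape& shape,
                   void* ort_data) {
  ov::Tensor wrapped(ov::element::f32, shape, ort_data);  // zero-copy wrapper (f32 assumed)
  infer_request.set_tensor(input_name, wrapped);
  // The device plugin decides how this user tensor reaches device memory
  // (today typically an internal copy, ideally via DMA on GPU).
}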

namespace onnxruntime {
namespace openvino_ep {

static const std::string log_tag = "[OpenVINO-EP] ";
template <typename Func, typename... Args>
@ankitm3k Jun 12, 2025

@ericcraw Can you please avoid using the C++20 standard here, as the build fails on Ubuntu due to an unsupported compiler? I am currently using gcc version 13.1.0 (Ubuntu 13.1.0-8ubuntu1~22.04) and observe the failure below:

[screenshot of build error]

@ericcraw (Author)

We already require C++20 support to compile the EP; it's not new. Perhaps an older version of libstdc++ that doesn't have a format implementation is somehow getting used?
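Assuming the missing piece is std::format (an assumption, not something stated in the PR), one way to sanity-check which standard library is actually in play is the feature-test macro for <format>:

#include <version>

#if !defined(__cpp_lib_format)
#error "This standard library does not provide std::format; check which toolchain/libstdc++ CMake is actually using."
#endif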


Build command used on Ubuntu:
./build.sh --config RelWithDebInfo --use_openvino CPU --build_shared_lib --build_wheel --parallel --skip_tests --cmake_extra_defines CMAKE_CXX_STANDARD=20

@ericcraw (Author)

Sure.

I'm not super familiar with gcc toolchain management, but I see that gcc 13.1 should support this, and I assume its corresponding standard library (libstdc++) is automatically installed along with it, assuming it was installed via apt-get. However, since you are getting a compilation error, I suspect that the library actually being used during compilation is an older version.

You might want to try reconfiguring CMake to make sure it's using the version of the gcc toolchain you expect.


Used Copilot to generate C++17 code. Please check whether this can replace the existing C++20 implementation.

Verified with C++17 and gcc 10. https://godbolt.org/z/71Ezz4j1z

Suggested change (replacing the single line "template <typename Func, typename... Args>"):

#include <sstream>
#include <string>
#include <utility>

const std::string log_tag = "[OpenVINO-EP] ";

// Helper to concatenate arguments into a string (C++17)
template <typename... Args>
std::string StringFormat(const Args&... args) {
  std::ostringstream oss;
  (oss << ... << args);
  return oss.str();
}

template <typename Func, typename... Args>
inline auto OvExceptionBoundary(Func&& func, const char* msg, Args&&... args) {
  try {
    return func();
  } catch (const ov::Exception& e) {
    ORT_THROW(log_tag + StringFormat(msg, std::forward<Args>(args)...) + ": " + std::string(e.what()));
  } catch (...) {
    ORT_THROW(log_tag + StringFormat(msg, std::forward<Args>(args)...));
  }
}

@jnagi-intel Jun 13, 2025

If cmake specifies C++20, it should be supported, period. Please fix the build systems. Otherwise, change cmake to specify C++17; then this discussion is valid.

@@ -27,8 +27,9 @@

namespace onnxruntime {
namespace openvino_ep {
static constexpr std::string log_tag = "[OpenVINO-EP] ";
@ankitm3k Jun 12, 2025

Build error here also, due to the compiler:
[screenshot of build error]

Suggestion - static const std::string log_tag = "[OpenVINO-EP] ";

@ericcraw (Author)

See above: constexpr std::string requires a C++20 standard library implementation.

@sfatimar commented Jun 13, 2025

The supported toolchains for PTL Linux are listed below. If OVEP does not build, it will gate the PTL QS release. The PTL Linux QS branch-off is WW26, so we have some time, but we do need to resolve this. PTL supports Ubuntu 24, but we must support Ubuntu 22 too, so we have to resolve this before the next ORT release. Additionally, the ManyLinux Python package must build.

  • gcc 13.3.0 on Ubuntu 24
  • gcc 11.4.0 on Ubuntu 22
  • gcc 12.2.0 on Debian 12
  • Clang 18.0.0 on ChromeOS
  • Clang 18.0.3 on AluminiumOS

@ericcraw (Author)
(quoting the supported PTL Linux toolchain list above)

Where are those built? Where are the CI results? Why is there no developer feedback apart from comments on the PR?
