Optimize CPU time spent in inference path (continued) #695
base: ovep-develop
Conversation
@TejalKhade28 can you check if the TDR issue you fixed is still "fixed" with these changes? I have a different implementation of the inference request queue (now renamed to pool), but I think it's ok if we don't "delete" the inference requests(?). If not, then I can re-add something similar to what you had before. @ankitm3k would you mind taking a quick look at the merge result to see if anything stands out as being an issue? FYI @sfatimar
Pull Request Overview
This PR introduces further optimizations for the inference path to reduce CPU time. Key changes include:
- Unification of input/output bindings across devices.
- Moving infer request binding caching into a dedicated wrapper.
- Switching from asynchronous inference (async/wait) to direct infer() calls and revising the inference request pooling strategy.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
File | Description |
---|---|
onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc | Updated test Run invocation to disable the OpenVINO provider temporarily. |
onnxruntime/core/providers/openvino/ov_interface.h | Added helper structs (ParameterShape, ParameterInfo) and caching logic for infer requests. |
onnxruntime/core/providers/openvino/ov_interface.cc | Refactored exception handling using the new OvExceptionBoundary template. |
onnxruntime/core/providers/openvino/openvino_provider_factory.cc | Revised dynamic shapes logic for target devices. |
onnxruntime/core/providers/openvino/ibackend.h | Updated the Infer method signature to be const. |
onnxruntime/core/providers/openvino/backends/basic_backend.{h,cc} | Replaced the InferRequestsQueue with a new InferRequestPool and refactored inference binding and output copy logic. |
onnxruntime/core/providers/openvino/backend_utils.{h,cc} | Updated utility functions for output tensor creation. |
onnxruntime/core/providers/openvino/backend_manager.{h,cc} | Added mutex protection for backend map modifications and streamlined dynamic backend selection. |
Comments suppressed due to low confidence (3)
onnxruntime/core/providers/openvino/ibackend.h:17
- Marking Infer as const may be problematic if implementations modify internal state (e.g., caching). Please verify that no mutable state changes occur within Infer implementations.
virtual void Infer(OrtKernelContext* context) const = 0;
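For reference, a minimal sketch of the usual way this concern is resolved: a const Infer() can still legitimately update internal caches if those members are declared mutable and guarded by a mutex. All names below are illustrative, not taken from the PR.

```cpp
#include <mutex>

// Illustrative sketch only (ExampleBackend, cache_mutex_, infer_count_ are not
// from the PR): a const Infer() may still update internal caches when those
// members are declared mutable and guarded by a mutex.
class ExampleBackend {
 public:
  void Infer(/* OrtKernelContext* context */) const {
    std::lock_guard<std::mutex> lock(cache_mutex_);
    ++infer_count_;  // mutation is allowed despite const: the member is mutable
  }

 private:
  mutable std::mutex cache_mutex_;
  mutable int infer_count_ = 0;
};
```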
onnxruntime/test/providers/cpu/reduction/reduction_ops_test.cc:4102
- [nitpick] Consider adding an inline comment or reference explaining why the OpenVINO execution provider is disabled temporarily in this test to improve clarity for future maintainers.
test.Run(
OpTester::ExpectResult::kExpectSuccess,
"",
{kOpenVINOExecutionProvider} // OpenVINO: Disabled temporarily
);
onnxruntime/core/providers/openvino/backend_manager.cc:547
- [nitpick] Consider consolidating the mutex lock usage for both the lookup and insertion into backend_map_ to reduce potential contention and improve clarity.
backend_map_.insert({key, dynamic_backend});
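A hypothetical sketch of the consolidation being suggested, with made-up names (BackendManagerSketch, GetOrCreateBackend, backend_map_mutex_): one lock spans both the lookup and the insertion, so the check-then-insert is a single critical section.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative only, not the PR's actual backend_manager code.
struct BackendSketch {};

class BackendManagerSketch {
 public:
  std::shared_ptr<BackendSketch> GetOrCreateBackend(const std::string& key) {
    std::lock_guard<std::mutex> lock(backend_map_mutex_);
    auto it = backend_map_.find(key);
    if (it != backend_map_.end()) {
      return it->second;  // reuse the existing backend
    }
    auto dynamic_backend = std::make_shared<BackendSketch>();  // stand-in for real backend creation
    backend_map_.insert({key, dynamic_backend});
    return dynamic_backend;
  }

 private:
  std::mutex backend_map_mutex_;
  std::unordered_map<std::string, std::shared_ptr<BackendSketch>> backend_map_;
};
```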
Can we base async/sync inference on OpenVINO config parameters for async vs. sync, so that other workloads are not impacted?
What is the validation plan for this PR, given that it is a major change?
The multithreading issue needs to be tested. Please check parallel thread execution from ORT with -c 2.
@ericcraw can you please send me a document of the unit testing you have done with this?
The compute function always waits for async requests to finish. Using a sync inference request is equivalent, but avoids the threading overhead that async inference requests require in OV. We should definitely not add more OpenVINO configurations.
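To make the comparison concrete, here is a minimal sketch of the two call patterns under discussion (generic function names, not the EP's code); both block the caller, the second simply skips OV's async machinery.

```cpp
#include <openvino/openvino.hpp>

// Illustrative comparison of the two call patterns discussed above.
void RunAsyncThenWait(ov::InferRequest& request) {
  request.start_async();  // goes through OV's async/threading machinery
  request.wait();         // the caller blocks until completion anyway
}

void RunSync(ov::InferRequest& request) {
  request.infer();        // same result for this usage, without the async overhead
}
```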
Sanity accuracy testing via etests for CPU + NPU + GPU on LNL with basic models. Note: the previous implementation for multiple inference requests (i.e. multiple threads calling into the Compute function) is not thread safe. The NPU implementation touches a map that is owned by the basic_backend; multiple threads could modify it in parallel. The problem is fixed in this PR.
@ericcraw
test.Run(
OpTester::ExpectResult::kExpectSuccess,
"",
{kOpenVINOExecutionProvider} // OpenVINO: Disabled temporarily
@ericcraw
Can you please comment on why this test needs to be disabled?
There appears to be an OV CPU bug (at least in 2025.1) for this op when the incoming model is a single instance of this op configured as a no-op. The bug: when using the OV inference request's set_tensor for the output tensor, the values in the output after inference are incorrect. Using get_tensor to fetch the output tensor gives correct output. An OV bug will be filed.
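For clarity, a rough sketch of the two output-binding patterns being contrasted. The code is illustrative rather than the EP's; "reduced" is the output name from the affected test, and user_buffer is a hypothetical caller-owned float buffer of matching size.

```cpp
#include <openvino/openvino.hpp>

// Pattern that exposed the suspected OV CPU bug: pre-bind a user-owned output
// tensor via set_tensor, then read user_buffer after inference.
void RunWithSetOutputTensor(ov::InferRequest& request, float* user_buffer) {
  ov::Tensor user_output(ov::element::f32, ov::Shape{1, 2, 2}, user_buffer);
  request.set_tensor("reduced", user_output);
  request.infer();
}

// Pattern that produced correct results: let OV own the output and fetch it
// with get_tensor after inference.
ov::Tensor RunWithGetOutputTensor(ov::InferRequest& request) {
  request.infer();
  return request.get_tensor("reduced");
}
```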
@ericcraw can you please send us the OV bug ID?
@@ -4099,7 +4099,11 @@ TEST(ReductionOpTest, ReduceSum_noop_axes_input_initializer_opset_18) {
3.0f, 4.0f});
test.AddInput<int64_t>("axes", {0}, {}, true);
test.AddOutput<float>("reduced", {1, 2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
test.Run();
test.Run(
Are we no longer handling int64?
auto tensor = context.GetInput(input_info.onnx_index);
auto input_shape = ParameterShape(tensor.GetTensorTypeAndShapeInfo().GetShape());

infer_request->SetTensorShapeOverride(input_info, input_shape, const_cast<void*>(tensor.GetTensorRawData()));
Previously, our logic utilized memcpy to transfer data to the GPU, which lacked a shared memory region. The current implementation, however, appears to reference the ORT memory buffer directly. Could you elaborate on the mechanism that enables this direct memory access?
We're using the OV inference request API to set the input tensors to a user-controlled OV tensor. In this case we create an OV tensor that wraps the ORT tensor. It's up to the respective OV device plugin to performantly implement how it makes the user tensor accessible to the underlying device. In all likelihood, today there will be some sort of copy from this user tensor to device memory within the plugin (hopefully via GPU DMA, but that's up to the plugin implementation).
We used a similar mechanism for dynamic shapes before, but it was less optimal. Instead of wrapping the ORT tensor with an OV tensor, we created a new OV tensor with fresh CPU memory and copied the ORT tensor into it, effectively creating an unnecessary extra copy.
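A simplified sketch of the two approaches described above, with illustrative function names and an assumed f32 input; the real EP code handles element types and shapes generically.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <openvino/openvino.hpp>

// New approach: wrap the ORT-owned buffer directly; no host-side copy here.
void BindByWrapping(ov::InferRequest& request, const std::string& input_name,
                    const ov::Shape& shape, void* ort_data) {
  ov::Tensor wrapper(ov::element::f32, shape, ort_data);  // no allocation, no copy
  request.set_tensor(input_name, wrapper);
}

// Previous dynamic-shape approach: allocate a fresh OV tensor and memcpy the
// ORT data into it, paying an extra copy.
void BindByCopying(ov::InferRequest& request, const std::string& input_name,
                   const ov::Shape& shape, const void* ort_data, std::size_t byte_size) {
  ov::Tensor fresh(ov::element::f32, shape);
  std::memcpy(fresh.data(), ort_data, byte_size);
  request.set_tensor(input_name, fresh);
}
```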
namespace onnxruntime {
namespace openvino_ep {

static const std::string log_tag = "[OpenVINO-EP] ";
template <typename Func, typename... Args>
@ericcraw Can you please omit usage of the C++20 standard here, as the build fails on Ubuntu due to an unsupported compiler issue? I am currently using gcc version 13.1.0 (Ubuntu 13.1.0-8ubuntu1~22.04) and observe the failure shown in the attached screenshot.

We already require C++20 support to compile the EP; it's not new. Perhaps an older version of libstdc++ is somehow getting used that doesn't have a format implementation?
Build command used on Ubuntu:
./build.sh --config RelWithDebInfo --use_openvino CPU --build_shared_lib --build_wheel --parallel --skip_tests --cmake_extra_defines CMAKE_CXX_STANDARD=20
Sure.
I'm not super familiar with gcc toolchain management, but I see that gcc 13.1 should support this, and I assume its corresponding standard library (libstdc++) is automatically installed along with it, assuming it was installed from apt-get. However, since you are getting a compilation error, I suspect that the library actually being used during compilation is an older version.
You might want to try reconfiguring CMake to ensure it's using the version of the gcc toolchain you expect.
Used Copilot to generate C++17 code. Please check if this can replace the existing C++20 implementation.
Verified with C++17 and gcc 10: https://godbolt.org/z/71Ezz4j1z
template <typename Func, typename... Args>
#include <sstream>
#include <string>
#include <utility>
const std::string log_tag = "[OpenVINO-EP] ";
// Helper to concatenate arguments into a string (C++17)
template <typename... Args>
std::string StringFormat(const Args&... args) {
  std::ostringstream oss;
  (oss << ... << args);
  return oss.str();
}
template <typename Func, typename... Args>
inline auto OvExceptionBoundary(Func&& func, const char* msg, Args&&... args) {
  try {
    return func();
  } catch (const ov::Exception& e) {
    ORT_THROW(log_tag + StringFormat(msg, std::forward<Args>(args)...) + ": " + std::string(e.what()));
  } catch (...) {
    ORT_THROW(log_tag + StringFormat(msg, std::forward<Args>(args)...));
  }
}
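If it helps to evaluate the suggestion, a hypothetical call site is sketched below (core, model, and device_name are placeholders, not the EP's code). Note that this C++17 StringFormat concatenates its arguments rather than substituting std::format-style {} placeholders, so existing messages written with placeholders would need adjusting.

```cpp
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

// Hypothetical usage of the suggested helper above.
ov::CompiledModel CompileWithBoundary(ov::Core& core,
                                      const std::shared_ptr<ov::Model>& model,
                                      const std::string& device_name) {
  return OvExceptionBoundary([&]() {
    return core.compile_model(model, device_name);
  }, "Exception while compiling model for device ", device_name);
}
```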
If CMake specifies C++20, it should be supported, period! Please fix the build systems. Otherwise, change CMake to require C++17; then this discussion is valid.
@@ -27,8 +27,9 @@

namespace onnxruntime {
namespace openvino_ep {
static constexpr std::string log_tag = "[OpenVINO-EP] ";
See above: constexpr std::string requires a C++20 standard library implementation.
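If supporting older standard libraries matters, one possible C++17-friendly alternative (an assumption, not what the PR does) is std::string_view, which has constexpr constructors since C++17 and needs no allocation.

```cpp
#include <string_view>

// C++17-compatible alternative to constexpr std::string. Call sites that
// concatenate with std::string would need std::string(log_tag) + ... instead
// of log_tag + ....
static constexpr std::string_view log_tag = "[OpenVINO-EP] ";
```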
If OVEP does not build on the supported PTL Linux versions, it will gate the PTL QS release. The PTL Linux QS branch-off is WW26, so we have time, but we do need to resolve this. PTL supports Ubuntu 24, but we must support Ubuntu 22 too, so we have to resolve this before the next ORT release. Additionally, the ManyLinux Python package must build.
Where are those built? Where are the CI results? Why is there no developer feedback apart from comments on the PR?
Additional changes to further reduce CPU time in the inference path.
Unify input/output bindings across devices.
Move infer request binding caching to infer request wrapper.
Use infer request's infer() instead of async/wait.
Modify the infer request queue to allow an unbounded number of inference requests, to better follow ORT expectations w.r.t. threading (a rough sketch of the idea follows).
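A minimal sketch of the unbounded-pool idea described in the last bullet: hand out an idle request if one is available, otherwise create a new one, so the number of concurrent ORT Compute() callers is never capped. The class and member names are illustrative, not the PR's actual InferRequestPool implementation.

```cpp
#include <mutex>
#include <utility>
#include <vector>
#include <openvino/openvino.hpp>

// Illustrative pool: grows on demand, reuses released requests.
class InferRequestPoolSketch {
 public:
  explicit InferRequestPoolSketch(ov::CompiledModel& compiled) : compiled_(compiled) {}

  ov::InferRequest Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!idle_.empty()) {
      ov::InferRequest request = std::move(idle_.back());
      idle_.pop_back();
      return request;
    }
    return compiled_.create_infer_request();  // grow on demand, no upper bound
  }

  void Release(ov::InferRequest&& request) {
    std::lock_guard<std::mutex> lock(mutex_);
    idle_.push_back(std::move(request));
  }

 private:
  ov::CompiledModel& compiled_;
  std::mutex mutex_;
  std::vector<ov::InferRequest> idle_;
};
```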