
feature: onedal verbose profiler #3155


Draft · wants to merge 34 commits into base: main

Conversation

Alexandr-Solovev
Contributor

@Alexandr-Solovev Alexandr-Solovev commented Apr 2, 2025

Summary

This PR introduces internal support for logging, tracing, and analysis in oneDAL. The functionality is conditionally enabled using the ONEDAL_VERBOSE environment variable.

Changes Introduced

  • Added logger, tracer, and analyzer components to the oneDAL internal infrastructure.
  • These tools provide structured tracing of algorithm and kernel execution.
  • Logging behavior is controlled via the ONEDAL_VERBOSE environment variable:
    • ONEDAL_VERBOSE=0 or empty: Logging is disabled (default).
    • ONEDAL_VERBOSE=1: Logger enabled.
    • ONEDAL_VERBOSE=2: Tracer enabled.
    • ONEDAL_VERBOSE=3: Analyzer enabled.
    • ONEDAL_VERBOSE=4: Analyzer, logger, and tracer enabled.
    • ONEDAL_VERBOSE=5: Analyzer, logger, and tracer enabled, plus service functions.
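For illustration, the level table above could be sketched as a simple mapping function. The struct and function names here are hypothetical, not the PR's actual internals:

```cpp
#include <cassert>

// Hypothetical sketch of the ONEDAL_VERBOSE level table above;
// names are illustrative only.
struct VerboseConfig
{
    bool logger;
    bool tracer;
    bool analyzer;
    bool service_functions;
};

VerboseConfig config_from_level(int level)
{
    switch (level)
    {
        case 1: return { true, false, false, false };   // logger only
        case 2: return { false, true, false, false };   // tracer only
        case 3: return { false, false, true, false };   // analyzer only
        case 4: return { true, true, true, false };     // everything but service functions
        case 5: return { true, true, true, true };      // everything
        default: return { false, false, false, false }; // 0/empty: disabled
    }
}
```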

Purpose

These additions are intended to:

  • Help developers debug and profile algorithms and kernels.
  • Analyze performance bottlenecks and nested execution paths.
  • Provide structured insights into internal flow without affecting public APIs.

Additional Notes

  • Logging is zero-cost when disabled (no runtime overhead).
  • All functionality is internal and transparent to end users.
  • Future improvements may extend the analyzer with visualization or export support.

The PR should start as a draft, then move to the ready-for-review state after CI passes and all applicable checkboxes are checked.
This approach ensures that reviewers don't spend extra time asking for routine requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a docs-only PR doesn't require the performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with the measured data, if a performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev Alexandr-Solovev added the dpc++ Issue/PR related to DPC++ functionality label Apr 8, 2025
@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev
Contributor Author

/intelci: run

@david-cortes-intel
Contributor

Thanks for looking into this. A couple comments:

  • Please add documentation about this variable.
  • It doesn't work if the environment variable is set after the library is imported. If this is hard to change, then please document this aspect too.
  • I cannot find any information about execution times in the outputs. For example, this is what I see from linear regression with verbosity level 3:
    auto daal::algorithms::linear_model::normal_equations::training::internal::ThreadingTask<double, daal::avx512>::update(long long, long long, const NumericTable &, const NumericTable &)::(anonymous class)::operator()() const [algorithmFPType = double, cpu = daal::avx512]
    Profiler task_name: computeUpdate.syrkX-----------------------------------------------------------------------------
    auto daal::algorithms::linear_model::normal_equations::training::internal::ThreadingTask<double, daal::avx512>::update(long long, long long, const NumericTable &, const NumericTable &)::(anonymous class)::operator()() const [algorithmFPType = double, cpu = daal::avx512]
    Profiler task_name: computeUpdate.gemm1X
    
  • I'm not sure if this is coming from this PR, but it looks like linear regression is printing a lot more than it should. For example, it is generating one print for each subproblem that calls syrk/gemm, whereas last time I looked at it it had only one call to kernel profiling before the loop that makes multiple of those calls.
  • Given the amount of prints, perhaps this could use pure C++ streams not synchronized with C. Info: https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio
  • Looks like there's quite a bit of commented out code that needs removal.
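The sync_with_stdio suggestion above can be sketched as a one-time startup call; the helper name here is illustrative, not code from the PR:

```cpp
#include <iostream>

// Minimal sketch of the suggestion above: decouple C++ streams from C
// stdio once at startup so high-volume logging avoids per-call syncing.
// Returns the previous synchronization state (true on the first call,
// since streams start out synchronized).
bool disable_stdio_sync()
{
    // After this call, std::cout may buffer independently of printf;
    // prefer '\n' over std::endl to avoid a flush per line.
    return std::ios_base::sync_with_stdio(false);
}
```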

@david-cortes-intel
Contributor

@Alexandr-Solovev I'm still seeing the same kinds of prints without timings after the last updates.

@Alexandr-Solovev
Contributor Author

@david-cortes-intel Thanks for the comments! With the latest commit, when running in Jupyter, times should be available with the tracer (ONEDAL_VERBOSE=2, 4, or 5); you can use 4 or 5 to enable everything.

  • I addressed "Given the amount of prints, perhaps this could use pure C++ streams not synchronized with C" and "It doesn't work if the environment variable is set after the library is imported. If this is hard to change, then please document this aspect too."

  • About "I'm not sure if this is coming from this PR, but it looks like linear regression is printing a lot more than it should. For example, it is generating one print for each subproblem that calls syrk/gemm, whereas last time I looked at it it had only one call to kernel profiling before the loop that makes multiple of those calls." I'm not sure; I kept the same structure, but I can take a look if it's important.

@icfaust
Contributor

icfaust commented Apr 17, 2025

/intelci: run

@david-cortes-intel
Contributor

david-cortes-intel commented Apr 17, 2025

@Alexandr-Solovev Thanks, I can see the times now. But it again looks like it generates one log per each batch of linear regression:

Status ThreadingTask<algorithmFPType, cpu>::update(DAAL_INT startRow, DAAL_INT nRows, const NumericTable & xTable, const NumericTable & yTable)

.. whereas it should be generating one entry for the whole kernel:

@Vika-F Any comments here?

@Alexandr-Solovev
Contributor Author

@david-cortes-intel I checked, and it's a threading task (ThreadingTask<algorithmFPType, cpu>::update).
I assume each thread prints its own information. @Vika-F
I can try to add a new type of task for threading. @david-cortes-intel Can you limit the number of threads and check for changes?

@david-cortes-intel
Contributor

@david-cortes-intel I checked and its a threading task(ThreadingTask<algorithmFPType, cpu>::update) I assume that each thread prints own information. @Vika-F I can try to add new type of tasks for threading. @david-cortes-intel Can you limit the number of threads and check changes?

I get slightly fewer prints when limiting threads, but it still looks like it prints a lot more than it should.

@Alexandr-Solovev
Contributor Author

Alexandr-Solovev commented Apr 17, 2025

@david-cortes-intel I checked and its a threading task(ThreadingTask<algorithmFPType, cpu>::update) I assume that each thread prints own information. @Vika-F I can try to add new type of tasks for threading. @david-cortes-intel Can you limit the number of threads and check changes?

I get slightly fewer prints when limiting threads, but it still looks like it prints a lot more than it should.

It should print just once per PROFILER_TASK call, so maybe something is called in a loop.

@david-cortes-intel
Contributor

I'm not sure what exactly is happening there.

For our use case, it ideally shouldn't print anything about those per-thread tasks, just about the larger task where they are called in a loop.

But still, the amount of prints doesn't seem to correspond with the number of calls.

Currently, it divides the data into batches of 256 rows, so for example if I pass 5k rows, that should amount to 20 batches, but it makes 81 prints instead:

import os
os.environ["ONEDAL_VERBOSE"] = "4"
import numpy as np
from sklearnex.linear_model import LinearRegression
rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(5000,40))
y = rng.standard_normal(size=X.shape[0])
model = LinearRegression().fit(X, y)

Using a single thread reduces those to 42, but that's still more than expected.

@Alexandr-Solovev
Contributor Author

@david-cortes-intel Can you launch it as an example (not through Jupyter) and check the algorithm tree to see what the nested calls are?

@david-cortes-intel
Contributor

david-cortes-intel commented Apr 17, 2025

@Alexandr-Solovev Attached is the full log with the analyzer and prints when running it through the cmake examples, without Jupyter, on the same data. I see 80 prints, which is just one less than before.
log_linreg.txt

int newval = 0;
if (verbose_str)
{
newval = std::atoi(verbose_str);
Contributor

I think atoi() has security issues, as it does not perform error checks.
It would be better to replace it with strtol.
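A minimal sketch of that suggestion, using strtol with explicit error detection instead of atoi's silent fallback to 0 (the helper name is hypothetical, not code from the PR):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hedged sketch: parse an int with strtol and report failure explicitly,
// instead of atoi() silently returning 0 on invalid input.
bool parse_int_checked(const char* s, int& out)
{
    if (s == nullptr || *s == '\0') return false;
    errno = 0;
    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (errno == ERANGE) return false;          // overflow or underflow
    if (end == s || *end != '\0') return false; // no digits, or trailing junk
    out = static_cast<int>(v);
    return true;
}
```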

* 4 enabled with logger tracer and analyzer
* 5 enabled with logger tracer and analyzer with service functions
*/
int * daal_verbose_mode()
Contributor

It's not obvious why it is returned by pointer here. Can you please explain?

{
static const bool logger_value = [] {
int value = *daal_verbose_mode();
return value == 1 || value == 4 || value == 5;
Contributor

Please define constants that represent the values 1, 4, 5 here and in other similar places.
It could be done in the top of the file, near daal_verbose_val declaration. I'd also convert some values into bit flags in order to simplify the logic of the checks.
For example:

constexpr unsigned int DAAL_VERBOSE_ENABLED           = (1U << 0);
constexpr unsigned int DAAL_VERBOSE_TRACER_ENABLED    = (1U << 1);
constexpr unsigned int DAAL_VERBOSE_LOGGER_ENABLED    = (1U << 2);
constexpr unsigned int DAAL_VERBOSE_ANALYSER_ENABLED  = (1U << 3);
constexpr unsigned int DAAL_VERBOSE_SERVICE_FUNC_ENABLED  = (1U << 4);
constexpr unsigned int DAAL_VERBOSE_ALL_ENABLED  = (DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_TRACER_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED);
constexpr unsigned int DAAL_VERBOSE_SERVICE_ENABLED = (DAAL_VERBOSE_ALL_ENABLED | DAAL_VERBOSE_SERVICE_FUNC_ENABLED);

In this case the check will look like:
return (value & DAAL_VERBOSE_LOGGER_ENABLED);

Contributor

@Vika-F Vika-F Apr 17, 2025

But the logic of set_verbose_from_env would have to be modified to convert from 0, 1, 2, 3... to bit flags in that case.

I do not insist on bit flags, but the constants 1, 2, 3,... have to be named somehow.
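The conversion Vika-F mentions could look like the following sketch, building on the reviewer's constant names; the function name and exact level-to-flag mapping are assumptions for illustration:

```cpp
#include <cassert>

// Reviewer-suggested bit flags (names from the comment above).
constexpr unsigned DAAL_VERBOSE_ENABLED              = 1U << 0;
constexpr unsigned DAAL_VERBOSE_TRACER_ENABLED       = 1U << 1;
constexpr unsigned DAAL_VERBOSE_LOGGER_ENABLED       = 1U << 2;
constexpr unsigned DAAL_VERBOSE_ANALYSER_ENABLED     = 1U << 3;
constexpr unsigned DAAL_VERBOSE_SERVICE_FUNC_ENABLED = 1U << 4;

// Hypothetical conversion from the user-facing 0..5 levels to flags,
// as set_verbose_from_env would need to perform.
unsigned flags_from_level(int level)
{
    switch (level)
    {
        case 1: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED;
        case 2: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_TRACER_ENABLED;
        case 3: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED;
        case 4:
            return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED
                 | DAAL_VERBOSE_TRACER_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED;
        case 5:
            return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED
                 | DAAL_VERBOSE_TRACER_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED
                 | DAAL_VERBOSE_SERVICE_FUNC_ENABLED;
        default: return 0U;
    }
}
```

With this mapping, the logger check reduces to `(flags & DAAL_VERBOSE_LOGGER_ENABLED)` as suggested.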

Comment on lines +20 to +23
- which computational functions are called
- what parameters are passed to them
- how much time is spent to execute the functions
- (for GPU applications) which GPU device the kernel is executed on
Contributor

Suggested change
- which computational functions are called
- what parameters are passed to them
- how much time is spent to execute the functions
- (for GPU applications) which GPU device the kernel is executed on
- Which computational kernels are called
- What parameters are passed to them
- How much time is spent to execute the functions
- Which device (CPU/GPU) the kernel is executed on

{
return ProfilerTask(taskName);
std::ostringstream out;
Contributor

Would there be any issue in writing directly to stdout if the prints from threaded kernels are removed?

- (for GPU applications) which GPU device the kernel is executed on

You can get an application to print this information to a standard output
device by enabling **Intel® oneAPI Data Analytics Library Verbose**.
Contributor

Suggested change
device by enabling **Intel® oneAPI Data Analytics Library Verbose**.
device by enabling **Intel® oneAPI Data Analytics Library Verbose Mode**.

device by enabling **Intel® oneAPI Data Analytics Library Verbose**.

When Verbose mode is active in oneDAL, every call of a verbose-enabled function finishes with
printing a human-readable line describing the call. Even, if your application gets terminated during
Contributor

Suggested change
printing a human-readable line describing the call. Even, if your application gets terminated during
printing a human-readable line describing the call. Even if the application gets terminated during

The first call to a verbose-enabled function also prints a version information line.

For GPU applications, additional information (one or more GPU information lines) will also
be printed by the first call to a verbose-enabled function, following the version information line printed
Contributor

Suggested change
be printed by the first call to a verbose-enabled function, following the version information line printed
be printed on the first call to a verbose-enabled function, following the version information lines printed

const char* comma = strchr(names, ',');
std::string name = comma ? std::string(names, comma) : std::string(names);

name.erase(0, name.find_first_not_of(" \t\n\r"));
Contributor

Maybe could use std::iswspace with a different function instead?
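One way to realize that suggestion is a small trim helper based on std::isspace, so every whitespace class is covered rather than just the four listed characters (the helper name is illustrative):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hedged alternative to erase(0, find_first_not_of(" \t\n\r")):
// strip leading whitespace using std::isspace, which also covers
// characters like '\v' and '\f'.
std::string ltrim(std::string s)
{
    s.erase(s.begin(),
            std::find_if(s.begin(), s.end(),
                         [](unsigned char ch) { return !std::isspace(ch); }));
    return s;
}
```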


name.erase(0, name.find_first_not_of(" \t\n\r"));

std::cerr << name << ": " << value << "; ";
Contributor

Why is this one using cerr while the others use cout?


__itt_task_begin(Profiler::getDomain(), __itt_null, __itt_null, _handle);
std::cout << "Major version: " << ver.majorVersion << std::endl;
Contributor

Would be faster with newline ('\n') instead of std::endl at each call.

{
__itt_task_end(Profiler::getDomain());
#ifdef _MSC_VER
Contributor

Would all of this work on windows if using a compiler other than MSVC?

- how much time is spent to execute the functions
- (for GPU applications) which GPU device the kernel is executed on

You can get an application to print this information to a standard output
Contributor

Worth mentioning that some logs are going to stdout and some to stderr.

}
}

prefix += is_last ? "└── " : "├── ";
Contributor

Is it Ok to have non-ASCII symbols in the code? Maybe it would be safer to replace with the IDs of the respective characters?

Contributor

I think having UTF-8 string literals like that could lead to parser errors if the OS has an encoding that cannot read them. It would never be the default on Linux, but it theoretically allows using encodings like cp1252. They could alternatively be encoded using '\x' escape codes, in which case a system with an unsupported encoding would still parse them but might print incorrect glyphs.
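The escape-code alternative discussed above could look like this sketch: the box-drawing glyphs written as explicit UTF-8 byte escapes so the source file itself stays pure ASCII (the constant names are hypothetical):

```cpp
#include <cstring>

// Hedged sketch: UTF-8 byte escapes for the tree-branch glyphs, keeping
// the source file ASCII-only.
// U+251C (tee) is E2 94 9C; U+2514 (corner) is E2 94 94; U+2500 (dash) is E2 94 80.
const char* const BRANCH_MID  = "\xE2\x94\x9C\xE2\x94\x80\xE2\x94\x80 "; // tee branch + two dashes + space
const char* const BRANCH_LAST = "\xE2\x94\x94\xE2\x94\x80\xE2\x94\x80 "; // corner branch + two dashes + space
```

The call site would then read `prefix += is_last ? BRANCH_LAST : BRANCH_MID;` and behave identically at runtime.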

Comment on lines 43 to +44
#ifndef __SERVICE_PROFILER_H__
#define __SERVICE_PROFILER_H__
#define __SERVICE_PROFILER_H__
Contributor

You don't need those in case there is already #pragma once in the file.

Comment on lines +38 to +52
void print_header() {
daal::services::LibraryVersionInfo ver;

std::cerr << "Major version: " << ver.majorVersion << std::endl;
std::cerr << "Minor version: " << ver.minorVersion << std::endl;
std::cerr << "Update version: " << ver.updateVersion << std::endl;
std::cerr << "Product status: " << ver.productStatus << std::endl;
std::cerr << "Build: " << ver.build << std::endl;
std::cerr << "Build revision: " << ver.build_rev << std::endl;
std::cerr << "Name: " << ver.name << std::endl;
std::cerr << "Processor optimization: " << ver.processor << std::endl;
std::cerr << std::endl;
}

void profiler::end_task(const char* task_name) {}
static void set_verbose_from_env(void) {
Contributor

Why not call these directly from DAAL instead of duplicating them here?

@Vika-F
Contributor

Vika-F commented Apr 17, 2025

@david-cortes-intel I checked and its a threading task(ThreadingTask<algorithmFPType, cpu>::update) I assume that each thread prints own information. @Vika-F I can try to add new type of tasks for threading. @david-cortes-intel Can you limit the number of threads and check changes?

I would expect that the times are grouped together in one number for all threads. Otherwise it is too much info and less useful.

@david-cortes-intel
Contributor

@david-cortes-intel I checked and its a threading task(ThreadingTask<algorithmFPType, cpu>::update) I assume that each thread prints own information. @Vika-F I can try to add new type of tasks for threading. @david-cortes-intel Can you limit the number of threads and check changes?

I would expect that the times are grouped together in one number for all threads. Otherwise it is too much info and less useful.

@Vika-F But do we actually need numbers from per-thread tasks? Before this PR, were those somehow being logged differently than the ones from the main algorithm workflow?
