
feature: onedal verbose profiler #3155


Draft · wants to merge 34 commits into base: main

Conversation

Alexandr-Solovev
Contributor

@Alexandr-Solovev Alexandr-Solovev commented Apr 2, 2025

Summary

This PR introduces internal support for logging, tracing, and analysis in oneDAL. The functionality is conditionally enabled using the ONEDAL_VERBOSE environment variable.

Changes Introduced

  • Added logger, tracer, and analyzer components to the oneDAL internal infrastructure.
  • These tools provide structured tracing of algorithm and kernel execution.
  • Logging behavior is controlled via the ONEDAL_VERBOSE environment variable:
    • ONEDAL_VERBOSE=0 or empty: Logging is disabled (default).
    • ONEDAL_VERBOSE=1: Logger enabled.
    • ONEDAL_VERBOSE=2: Tracer enabled.
    • ONEDAL_VERBOSE=3: Analyzer enabled.
    • ONEDAL_VERBOSE=4: Analyzer, logger, and tracer enabled.
    • ONEDAL_VERBOSE=5: Analyzer, logger, and tracer enabled, plus service functions.
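For illustration, the level table above could be sketched as a simple mapping function. The struct and function names here are hypothetical, not the PR's actual internals:

```cpp
#include <cassert>

// Hypothetical sketch of the ONEDAL_VERBOSE level table above;
// names are illustrative only.
struct VerboseConfig
{
    bool logger;
    bool tracer;
    bool analyzer;
    bool service_functions;
};

VerboseConfig config_from_level(int level)
{
    switch (level)
    {
        case 1: return { true, false, false, false };   // logger only
        case 2: return { false, true, false, false };   // tracer only
        case 3: return { false, false, true, false };   // analyzer only
        case 4: return { true, true, true, false };     // everything but service functions
        case 5: return { true, true, true, true };      // everything
        default: return { false, false, false, false }; // 0/empty: disabled
    }
}
```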

Purpose

These additions are intended to:

  • Help developers debug and profile algorithms and kernels.
  • Analyze performance bottlenecks and nested execution paths.
  • Provide structured insights into internal flow without affecting public APIs.

Additional Notes

  • Logging is zero-cost when disabled (no runtime overhead).
  • All functionality is internal and transparent to end users.
  • Future improvements may extend the analyzer with visualization or export support.

The PR should start as a draft, then move to the ready-for-review state after CI passes and all applicable checkboxes are checked.
This approach ensures that reviewers don't spend extra time asking for routine requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a docs-only PR doesn't require the performance checkboxes, while a PR with any change to actual code should keep them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with the measured data, if a performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev Alexandr-Solovev added the dpc++ Issue/PR related to DPC++ functionality label Apr 8, 2025
@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev
Contributor Author

/intelci: run

@Alexandr-Solovev
Contributor Author

/intelci: run

@david-cortes-intel
Contributor

Thanks for looking into this. A couple comments:

  • Please add documentation about this variable.
  • It doesn't work if the environment variable is set after the library is imported. If this is hard to change, then please document this aspect too.
  • I cannot find any information about execution times in the outputs. For example, this is what I see from linear regression with verbosity level 3:
    auto daal::algorithms::linear_model::normal_equations::training::internal::ThreadingTask<double, daal::avx512>::update(long long, long long, const NumericTable &, const NumericTable &)::(anonymous class)::operator()() const [algorithmFPType = double, cpu = daal::avx512]
    Profiler task_name: computeUpdate.syrkX-----------------------------------------------------------------------------
    auto daal::algorithms::linear_model::normal_equations::training::internal::ThreadingTask<double, daal::avx512>::update(long long, long long, const NumericTable &, const NumericTable &)::(anonymous class)::operator()() const [algorithmFPType = double, cpu = daal::avx512]
    Profiler task_name: computeUpdate.gemm1X
    
  • I'm not sure if this is coming from this PR, but it looks like linear regression is printing a lot more than it should. For example, it is generating one print for each subproblem that calls syrk/gemm, whereas last time I looked at it it had only one call to kernel profiling before the loop that makes multiple of those calls.
  • Given the amount of prints, perhaps this could use pure C++ streams not synchronized with C. Info: https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio
  • Looks like there's quite a bit of commented out code that needs removal.
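The sync_with_stdio suggestion above can be sketched as a one-time startup call; the helper name here is illustrative, not code from the PR:

```cpp
#include <iostream>

// Minimal sketch of the suggestion above: decouple C++ streams from C
// stdio once at startup so high-volume logging avoids per-call syncing.
// Returns the previous synchronization state (true on the first call,
// since streams start out synchronized).
bool disable_stdio_sync()
{
    // After this call, std::cout may buffer independently of printf;
    // prefer '\n' over std::endl to avoid a flush per line.
    return std::ios_base::sync_with_stdio(false);
}
```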

@david-cortes-intel
Contributor

@Alexandr-Solovev I'm still seeing the same kinds of prints without timings after the last updates.

@Alexandr-Solovev
Contributor Author

@david-cortes-intel Thanks for the comments! With the latest commit, when running in Jupyter, times should be available with the tracer (ONEDAL_VERBOSE=2, 4, or 5); you can use 4 or 5 to enable everything.

  • I addressed "Given the amount of prints, perhaps this could use pure C++ streams not synchronized with C" and "It doesn't work if the environment variable is set after the library is imported. If this is hard to change, then please document this aspect too."

  • About "I'm not sure if this is coming from this PR, but it looks like linear regression is printing a lot more than it should. For example, it is generating one print for each subproblem that calls syrk/gemm, whereas last time I looked at it it had only one call to kernel profiling before the loop that makes multiple of those calls." I'm not sure; I kept the same structure, but I can take a look if it's important.

@icfaust
Contributor

icfaust commented Apr 17, 2025

/intelci: run

@david-cortes-intel
Contributor

david-cortes-intel commented Apr 17, 2025

@Alexandr-Solovev Thanks, I can see the times now. But it again looks like it generates one log per each batch of linear regression:

Status ThreadingTask<algorithmFPType, cpu>::update(DAAL_INT startRow, DAAL_INT nRows, const NumericTable & xTable, const NumericTable & yTable)

.. whereas it should be generating one entry for the whole kernel:

@Vika-F Any comments here?

@Alexandr-Solovev
Contributor Author

@david-cortes-intel I checked, and it's a threading task (ThreadingTask<algorithmFPType, cpu>::update).
I assume each thread prints its own information. @Vika-F
I can try to add a new type of task for threading. @david-cortes-intel Can you limit the number of threads and check for changes?

@david-cortes-intel
Contributor

@david-cortes-intel I checked and its a threading task(ThreadingTask<algorithmFPType, cpu>::update) I assume that each thread prints own information. @Vika-F I can try to add new type of tasks for threading. @david-cortes-intel Can you limit the number of threads and check changes?

I get slightly fewer prints when limiting threads, but it still looks like it prints a lot more than it should.

@Alexandr-Solovev
Contributor Author

Alexandr-Solovev commented Apr 17, 2025

@david-cortes-intel I checked and its a threading task(ThreadingTask<algorithmFPType, cpu>::update) I assume that each thread prints own information. @Vika-F I can try to add new type of tasks for threading. @david-cortes-intel Can you limit the number of threads and check changes?

I get slightly fewer prints when limiting threads, but it still looks like it prints a lot more than it should.

It should print just once per PROFILER_TASK call, so maybe something is called in a loop.

@david-cortes-intel
Contributor

I'm not sure what exactly is happening there.

For our use case, it ideally shouldn't print anything about those per-thread tasks, just about the larger task where they are called in a loop.

But still, the amount of prints doesn't seem to correspond with the number of calls.

Currently, it divides the data into batches of 256 rows, so for example if I pass 5k rows, that should amount to 20 batches, but it makes 81 prints instead:

import os
os.environ["ONEDAL_VERBOSE"] = "4"
import numpy as np
from sklearnex.linear_model import LinearRegression
rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(5000,40))
y = rng.standard_normal(size=X.shape[0])
model = LinearRegression().fit(X, y)

Using a single thread reduces those to 42, but that's still more than expected.

@Alexandr-Solovev
Contributor Author

@david-cortes-intel Can you launch it as an example (not through Jupyter) and check the algorithm tree to see what the nested calls are?

@david-cortes-intel
Contributor

david-cortes-intel commented Apr 17, 2025

@Alexandr-Solovev Attached is the full log with the analyzer and prints when running it through the cmake examples, without Jupyter, on the same data. I see 80 prints, which is just one less than before.
log_linreg.txt

int newval = 0;
if (verbose_str)
{
newval = std::atoi(verbose_str);
Contributor

I think atoi() has security issues, as it does not perform error checks.
It would be better to replace it with strtol.
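A minimal sketch of that suggestion, using strtol with explicit error detection instead of atoi's silent fallback to 0 (the helper name is hypothetical, not code from the PR):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hedged sketch: parse an int with strtol and report failure explicitly,
// instead of atoi() silently returning 0 on invalid input.
bool parse_int_checked(const char* s, int& out)
{
    if (s == nullptr || *s == '\0') return false;
    errno = 0;
    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (errno == ERANGE) return false;          // overflow or underflow
    if (end == s || *end != '\0') return false; // no digits, or trailing junk
    out = static_cast<int>(v);
    return true;
}
```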

* 4 enabled with logger tracer and analyzer
* 5 enabled with logger tracer and analyzer with service functions
*/
int * daal_verbose_mode()
Contributor

It's not obvious why it is returned by pointer here. Can you please explain?

{
static const bool logger_value = [] {
int value = *daal_verbose_mode();
return value == 1 || value == 4 || value == 5;
Contributor

Please define constants that represent the values 1, 4, 5 here and in other similar places.
It could be done in the top of the file, near daal_verbose_val declaration. I'd also convert some values into bit flags in order to simplify the logic of the checks.
For example:

constexpr unsigned int DAAL_VERBOSE_ENABLED           = (1U << 0);
constexpr unsigned int DAAL_VERBOSE_TRACER_ENABLED    = (1U << 1);
constexpr unsigned int DAAL_VERBOSE_LOGGER_ENABLED    = (1U << 2);
constexpr unsigned int DAAL_VERBOSE_ANALYSER_ENABLED  = (1U << 3);
constexpr unsigned int DAAL_VERBOSE_SERVICE_FUNC_ENABLED  = (1U << 4);
constexpr unsigned int DAAL_VERBOSE_ALL_ENABLED  = (DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_TRACER_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED);
constexpr unsigned int DAAL_VERBOSE_SERVICE_ENABLED = (DAAL_VERBOSE_ALL_ENABLED | DAAL_VERBOSE_SERVICE_FUNC_ENABLED);

In this case the check will look like:
return (value & DAAL_VERBOSE_LOGGER_ENABLED);

Contributor

@Vika-F Vika-F Apr 17, 2025

But the logic of set_verbose_from_env would have to be modified to convert from 0, 1, 2, 3... to bit flags in that case.

I do not insist on bit flags, but the constants 1, 2, 3,... have to be named somehow.
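The conversion Vika-F mentions could look like the following sketch, building on the reviewer's constant names; the function name and exact level-to-flag mapping are assumptions for illustration:

```cpp
#include <cassert>

// Reviewer-suggested bit flags (names from the comment above).
constexpr unsigned DAAL_VERBOSE_ENABLED              = 1U << 0;
constexpr unsigned DAAL_VERBOSE_TRACER_ENABLED       = 1U << 1;
constexpr unsigned DAAL_VERBOSE_LOGGER_ENABLED       = 1U << 2;
constexpr unsigned DAAL_VERBOSE_ANALYSER_ENABLED     = 1U << 3;
constexpr unsigned DAAL_VERBOSE_SERVICE_FUNC_ENABLED = 1U << 4;

// Hypothetical conversion from the user-facing 0..5 levels to flags,
// as set_verbose_from_env would need to perform.
unsigned flags_from_level(int level)
{
    switch (level)
    {
        case 1: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED;
        case 2: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_TRACER_ENABLED;
        case 3: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED;
        case 4:
            return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED
                 | DAAL_VERBOSE_TRACER_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED;
        case 5:
            return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED
                 | DAAL_VERBOSE_TRACER_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED
                 | DAAL_VERBOSE_SERVICE_FUNC_ENABLED;
        default: return 0U;
    }
}
```

With this mapping, the logger check reduces to `(flags & DAAL_VERBOSE_LOGGER_ENABLED)` as suggested.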

Comment on lines +20 to +23
- which computational functions are called
- what parameters are passed to them
- how much time is spent to execute the functions
- (for GPU applications) which GPU device the kernel is executed on
Contributor

Suggested change
- which computational functions are called
- what parameters are passed to them
- how much time is spent to execute the functions
- (for GPU applications) which GPU device the kernel is executed on
- Which computational kernels are called
- What parameters are passed to them
- How much time is spent to execute the functions
- Which device (CPU/GPU) the kernel is executed on

{
return ProfilerTask(taskName);
std::ostringstream out;
Contributor

Would there be any issue in writing directly to stdout if the prints from threaded kernels are removed?

- (for GPU applications) which GPU device the kernel is executed on

You can get an application to print this information to a standard output
device by enabling **Intel® oneAPI Data Analytics Library Verbose**.
Contributor

Suggested change
device by enabling **Intel® oneAPI Data Analytics Library Verbose**.
device by enabling **Intel® oneAPI Data Analytics Library Verbose Mode**.

device by enabling **Intel® oneAPI Data Analytics Library Verbose**.

When Verbose mode is active in oneDAL, every call of a verbose-enabled function finishes with
printing a human-readable line describing the call. Even, if your application gets terminated during
Contributor

Suggested change
printing a human-readable line describing the call. Even, if your application gets terminated during
printing a human-readable line describing the call. Even if the application gets terminated during

The first call to a verbose-enabled function also prints a version information line.

For GPU applications, additional information (one or more GPU information lines) will also
be printed by the first call to a verbose-enabled function, following the version information line printed
Contributor

Suggested change
be printed by the first call to a verbose-enabled function, following the version information line printed
be printed on the first call to a verbose-enabled function, following the version information lines printed

const char* comma = strchr(names, ',');
std::string name = comma ? std::string(names, comma) : std::string(names);

name.erase(0, name.find_first_not_of(" \t\n\r"));
Contributor

Maybe could use std::iswspace with a different function instead?
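One way to realize that suggestion is a small trim helper based on std::isspace, so every whitespace class is covered rather than just the four listed characters (the helper name is illustrative):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hedged alternative to erase(0, find_first_not_of(" \t\n\r")):
// strip leading whitespace using std::isspace, which also covers
// characters like '\v' and '\f'.
std::string ltrim(std::string s)
{
    s.erase(s.begin(),
            std::find_if(s.begin(), s.end(),
                         [](unsigned char ch) { return !std::isspace(ch); }));
    return s;
}
```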


name.erase(0, name.find_first_not_of(" \t\n\r"));

std::cerr << name << ": " << value << "; ";
Contributor

Why is this one using cerr while the others use cout?


__itt_task_begin(Profiler::getDomain(), __itt_null, __itt_null, _handle);
std::cout << "Major version: " << ver.majorVersion << std::endl;
Contributor

Would be faster with newline ('\n') instead of std::endl at each call.

{
__itt_task_end(Profiler::getDomain());
#ifdef _MSC_VER
Contributor

Would all of this work on windows if using a compiler other than MSVC?

- how much time is spent to execute the functions
- (for GPU applications) which GPU device the kernel is executed on

You can get an application to print this information to a standard output
Contributor

Worth mentioning that some logs are going to stdout and some to stderr.

}
}

prefix += is_last ? "└── " : "├── ";
Contributor

Is it Ok to have non-ASCII symbols in the code? Maybe it would be safer to replace with the IDs of the respective characters?

Contributor

I think having UTF-8 string literals like that could lead to parser errors if the OS has an encoding that cannot read them. It would never be the default on Linux, but it theoretically allows using encodings like cp1252. They could alternatively be encoded using '\x' escape codes, in which case a system with an unsupported encoding would still parse them but might print incorrect glyphs.
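The escape-code alternative discussed above could look like this sketch: the box-drawing glyphs written as explicit UTF-8 byte escapes so the source file itself stays pure ASCII (the constant names are hypothetical):

```cpp
#include <cstring>

// Hedged sketch: UTF-8 byte escapes for the tree-branch glyphs, keeping
// the source file ASCII-only.
// U+251C (tee) is E2 94 9C; U+2514 (corner) is E2 94 94; U+2500 (dash) is E2 94 80.
const char* const BRANCH_MID  = "\xE2\x94\x9C\xE2\x94\x80\xE2\x94\x80 "; // tee branch + two dashes + space
const char* const BRANCH_LAST = "\xE2\x94\x94\xE2\x94\x80\xE2\x94\x80 "; // corner branch + two dashes + space
```

The call site would then read `prefix += is_last ? BRANCH_LAST : BRANCH_MID;` and behave identically at runtime.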

Comment on lines 43 to +44
#ifndef __SERVICE_PROFILER_H__
#define __SERVICE_PROFILER_H__
#define __SERVICE_PROFILER_H__
Contributor

You don't need those in case there is already #pragma once in the file.

Comment on lines +38 to +52
void print_header() {
daal::services::LibraryVersionInfo ver;

std::cerr << "Major version: " << ver.majorVersion << std::endl;
std::cerr << "Minor version: " << ver.minorVersion << std::endl;
std::cerr << "Update version: " << ver.updateVersion << std::endl;
std::cerr << "Product status: " << ver.productStatus << std::endl;
std::cerr << "Build: " << ver.build << std::endl;
std::cerr << "Build revision: " << ver.build_rev << std::endl;
std::cerr << "Name: " << ver.name << std::endl;
std::cerr << "Processor optimization: " << ver.processor << std::endl;
std::cerr << std::endl;
}

void profiler::end_task(const char* task_name) {}
static void set_verbose_from_env(void) {
Contributor

Why not call these directly from DAAL instead of duplicating them here?

@Vika-F
Contributor

Vika-F commented Apr 17, 2025

@david-cortes-intel I checked and its a threading task(ThreadingTask<algorithmFPType, cpu>::update) I assume that each thread prints own information. @Vika-F I can try to add new type of tasks for threading. @david-cortes-intel Can you limit the number of threads and check changes?

I would expect that the times are grouped together in one number for all threads. Otherwise it is too much info and less useful.

@david-cortes-intel
Contributor

@david-cortes-intel I checked and its a threading task(ThreadingTask<algorithmFPType, cpu>::update) I assume that each thread prints own information. @Vika-F I can try to add new type of tasks for threading. @david-cortes-intel Can you limit the number of threads and check changes?

I would expect that the times are grouped together in one number for all threads. Otherwise it is too much info and less useful.

@Vika-F But do we actually need numbers from per-thread tasks? Before this PR, were those somehow being logged differently than the ones from the main algorithm workflow?
