feature: onedal verbose profiler #3155
Conversation
/intelci: run
/intelci: run
/intelci: run
/intelci: run
/intelci: run
/intelci: run
Thanks for looking into this. A couple comments:
@Alexandr-Solovev I'm still seeing the same kinds of prints without timings after the last updates.
@david-cortes-intel Thanks for the comments! With the latest commit, times should be available in Jupyter when the tracer is enabled (ONEDAL_VERBOSE=2, 4, or 5); you can use 4 or 5 to enable everything.
/intelci: run
@Alexandr-Solovev Thanks, I can see the times now. But it again looks like it generates one log per batch of linear regression: oneDAL/cpp/daal/src/algorithms/linear_model/linear_model_train_normeq_update_impl.i, line 72 in 5dd39b8
... whereas it should generate one entry for the whole kernel: oneDAL/cpp/daal/src/algorithms/linear_model/linear_model_train_normeq_update_impl.i, line 234 in 5dd39b8
@Vika-F Any comments here?
@david-cortes-intel I checked and it's a threading task (ThreadingTask<algorithmFPType, cpu>::update).
I get slightly fewer prints when limiting threads, but it still looks like it prints a lot more than it should.
It should print just once for each PROFILER_TASK call, so maybe something is called in a loop.
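(For illustration: a minimal sketch of how a scope-based profiling task of this kind typically reports, which is why a task created inside a per-batch update prints once per invocation. The names and timing details below are assumptions for the sketch, not this PR's actual implementation.)

```cpp
#include <chrono>
#include <iostream>

// Rough sketch (not this PR's implementation): a scope-based profiler task that
// prints one line when it goes out of scope. If such a task is created inside a
// per-batch or per-thread update function, each invocation produces its own line.
struct ProfilerTaskSketch
{
    explicit ProfilerTaskSketch(const char * n) : name(n), start(std::chrono::steady_clock::now()) {}
    ~ProfilerTaskSketch()
    {
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::cerr << name << ": " << us << " us\n"; // one print per task instance
    }
    const char * name;
    std::chrono::steady_clock::time_point start;
};

int main()
{
    for (int batch = 0; batch < 20; ++batch)
    {
        ProfilerTaskSketch task("ThreadingTask::update"); // prints 20 times, once per batch
    }
    ProfilerTaskSketch whole("train_normeq_update"); // prints once for the whole kernel
}
```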
I'm not sure what exactly is happening there. For our use case, it ideally shouldn't print anything about those per-thread tasks, just about the larger task in which they are called in a loop. But still, the number of prints doesn't seem to correspond to the number of calls. Currently it divides the data into batches of 256 rows, so for example if I pass 5k rows, that would amount to 20 batches, but it makes 81 prints instead:

import os
os.environ["ONEDAL_VERBOSE"] = "4"

import numpy as np
from sklearnex.linear_model import LinearRegression

rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(5000, 40))
y = rng.standard_normal(size=X.shape[0])
model = LinearRegression().fit(X, y)

Using a single thread reduces those to 42, but that's still more than expected.
@david-cortes-intel Can you launch it as an example (not through Jupyter) and check the algorithm tree to see what the nested calls are?
@Alexandr-Solovev Attached is the full log with the analyzer and prints when running it through the CMake examples, without Jupyter, on the same data. I see 80 prints, which is just one fewer than before.
int newval = 0;
if (verbose_str)
{
    newval = std::atoi(verbose_str);
I think atoi() has security issues, as it does not perform error checks.
It would be better to replace it with strtol.
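A minimal sketch of that replacement, assuming the raw environment string is available as verbose_str; the function name and the accepted 0–5 range are illustrative, not the PR's actual code:

```cpp
#include <cerrno>
#include <cstdlib>

// Sketch: parse the ONEDAL_VERBOSE string with error checking instead of atoi().
// parse_verbose_level and the accepted 0..5 range are illustrative assumptions.
static int parse_verbose_level(const char * verbose_str)
{
    if (!verbose_str) return 0;
    char * end = nullptr;
    errno = 0;
    const long parsed = std::strtol(verbose_str, &end, 10);
    // Reject empty strings, trailing garbage, overflow, and out-of-range levels.
    if (end == verbose_str || *end != '\0' || errno == ERANGE || parsed < 0 || parsed > 5) return 0;
    return static_cast<int>(parsed);
}
```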
* 4 enabled with logger tracer and analyzer
* 5 enabled with logger tracer and analyzer with service functions
*/
int * daal_verbose_mode()
It's not obvious why it is returned by pointer here. Can you please explain?
{
    static const bool logger_value = [] {
        int value = *daal_verbose_mode();
        return value == 1 || value == 4 || value == 5;
Please define constants that represent the values 1, 4, 5 here and in other similar places.
It could be done at the top of the file, near the daal_verbose_val declaration. I'd also convert some values into bit flags in order to simplify the logic of the checks.
For example:
constexpr unsigned int DAAL_VERBOSE_ENABLED = (1U << 0);
constexpr unsigned int DAAL_VERBOSE_TRACER_ENABLED = (1U << 1);
constexpr unsigned int DAAL_VERBOSE_LOGGER_ENABLED = (1U << 2);
constexpr unsigned int DAAL_VERBOSE_ANALYSER_ENABLED = (1U << 3);
constexpr unsigned int DAAL_VERBOSE_SERVICE_FUNC_ENABLED = (1U << 4);
constexpr unsigned int DAAL_VERBOSE_ALL_ENABLED = (DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_TRACER_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED);
constexpr unsigned int DAAL_VERBOSE_SERVICE_ENABLED = (DAAL_VERBOSE_ALL_ENABLED | DAAL_VERBOSE_SERVICE_FUNC_ENABLED);
In this case the check will look like:
return (value & DAAL_VERBOSE_LOGGER_ENABLED);
But the logic of set_verbose_from_env would have to be modified to convert from 0, 1, 2, 3... to bit flags in that case.
I do not insist on bit flags, but the constants 1, 2, 3,... have to be named somehow.
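For illustration, a rough sketch of that conversion, mapping the documented 0–5 levels onto the bit flags proposed above. The constant names follow the snippet in the previous comment; the exact level-to-component mapping is an assumption based on the PR description.

```cpp
// Sketch only: translate an ONEDAL_VERBOSE level (0..5) into the proposed bit flags.
// The level-to-component mapping follows the PR description and is an assumption.
static unsigned int verbose_level_to_flags(int level)
{
    switch (level)
    {
    case 1: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_LOGGER_ENABLED;
    case 2: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_TRACER_ENABLED;
    case 3: return DAAL_VERBOSE_ENABLED | DAAL_VERBOSE_ANALYSER_ENABLED;
    case 4: return DAAL_VERBOSE_ALL_ENABLED;
    case 5: return DAAL_VERBOSE_SERVICE_ENABLED;
    default: return 0u; // 0, empty, or unrecognized values leave verbose mode disabled
    }
}
```

A check such as the logger one above then reduces to `(flags & DAAL_VERBOSE_LOGGER_ENABLED) != 0`.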
- which computational functions are called
- what parameters are passed to them
- how much time is spent to execute the functions
- (for GPU applications) which GPU device the kernel is executed on
Suggested change:
- Which computational kernels are called
- What parameters are passed to them
- How much time is spent to execute the functions
- Which device (CPU/GPU) the kernel is executed on
{
    return ProfilerTask(taskName);
    std::ostringstream out;
Would there be any issue in writing directly to stdout if the prints from threaded kernels are removed?
- (for GPU applications) which GPU device the kernel is executed on

You can get an application to print this information to a standard output
device by enabling **Intel® oneAPI Data Analytics Library Verbose**.
Suggested change:
device by enabling **Intel® oneAPI Data Analytics Library Verbose Mode**.
device by enabling **Intel® oneAPI Data Analytics Library Verbose**.

When Verbose mode is active in oneDAL, every call of a verbose-enabled function finishes with
printing a human-readable line describing the call. Even, if your application gets terminated during
Suggested change:
printing a human-readable line describing the call. Even if the application gets terminated during
The first call to a verbose-enabled function also prints a version information line.

For GPU applications, additional information (one or more GPU information lines) will also
be printed by the first call to a verbose-enabled function, following the version information line printed
Suggested change:
be printed on the first call to a verbose-enabled function, following the version information lines printed
const char* comma = strchr(names, ',');
std::string name = comma ? std::string(names, comma) : std::string(names);

name.erase(0, name.find_first_not_of(" \t\n\r"));
Maybe this could use std::iswspace with a different function instead?
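For illustration, a trim based on a character classification function rather than an explicit character list. This sketch uses std::isspace, the narrow-character counterpart of the std::iswspace mentioned above; the helper name is illustrative.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Sketch: strip leading whitespace using std::isspace instead of an explicit
// " \t\n\r" character list; ltrim is an illustrative helper name.
static void ltrim(std::string & s)
{
    s.erase(s.begin(), std::find_if(s.begin(), s.end(),
                                    [](unsigned char c) { return !std::isspace(c); }));
}
```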
name.erase(0, name.find_first_not_of(" \t\n\r"));

std::cerr << name << ": " << value << "; ";
Why is this one using cerr while the others use cout?
__itt_task_begin(Profiler::getDomain(), __itt_null, __itt_null, _handle);
std::cout << "Major version: " << ver.majorVersion << std::endl;
Would be faster with a newline ('\n') instead of std::endl at each call.
{
    __itt_task_end(Profiler::getDomain());
#ifdef _MSC_VER
Would all of this work on Windows if using a compiler other than MSVC?
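For context on the question: _MSC_VER is defined only by the MSVC compiler, so a MinGW-based Windows build could take the other branch. A hedged sketch of branching on the platform rather than the compiler:

```cpp
// Sketch: branch on the Windows platform rather than the MSVC compiler, so that
// MinGW or other non-MSVC Windows toolchains take the Windows path as well.
#if defined(_WIN32)
// Windows-specific implementation (any compiler targeting Windows defines _WIN32)
#else
// POSIX implementation
#endif
```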
- how much time is spent to execute the functions
- (for GPU applications) which GPU device the kernel is executed on

You can get an application to print this information to a standard output
Worth mentioning that some logs are going to stdout and some to stderr.
}
}

prefix += is_last ? "└── " : "├── ";
Is it OK to have non-ASCII symbols in the code? Maybe it would be safer to replace them with the IDs of the respective characters?
I think having UTF-8 string literals like that could lead to parser errors if the OS has an encoding that cannot read them. It'd never be the default on Linux, but it theoretically allows using encodings like cp1252. They could alternatively be encoded using '\x' escape codes, in which case a system with an unsupported encoding would still read them but would print incorrect characters.
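For illustration, the same tree glyphs spelled as UTF-8 byte escapes so the source file stays ASCII-only; the byte values are the standard UTF-8 encodings of U+2514/U+251C and U+2500.

```cpp
// Sketch: encode the box-drawing glyphs as explicit UTF-8 byte escapes so the
// source stays ASCII-only; output on a UTF-8 terminal is unchanged.
prefix += is_last ? "\xE2\x94\x94\xE2\x94\x80\xE2\x94\x80 "  // U+2514 U+2500 U+2500 (corner branch)
                  : "\xE2\x94\x9C\xE2\x94\x80\xE2\x94\x80 "; // U+251C U+2500 U+2500 (tee branch)
```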
#ifndef __SERVICE_PROFILER_H__
#define __SERVICE_PROFILER_H__
#define __SERVICE_PROFILER_H__
You don't need those if there is already #pragma once in the file.
void print_header() {
    daal::services::LibraryVersionInfo ver;

    std::cerr << "Major version: " << ver.majorVersion << std::endl;
    std::cerr << "Minor version: " << ver.minorVersion << std::endl;
    std::cerr << "Update version: " << ver.updateVersion << std::endl;
    std::cerr << "Product status: " << ver.productStatus << std::endl;
    std::cerr << "Build: " << ver.build << std::endl;
    std::cerr << "Build revision: " << ver.build_rev << std::endl;
    std::cerr << "Name: " << ver.name << std::endl;
    std::cerr << "Processor optimization: " << ver.processor << std::endl;
    std::cerr << std::endl;
}

void profiler::end_task(const char* task_name) {}
static void set_verbose_from_env(void) {
Why not call those directly from DAAL instead of duplicating them here?
I would expect the times to be grouped together into one number for all threads. Otherwise it is too much info and less useful.
@Vika-F But do we actually need numbers from per-thread tasks? Before this PR, were those somehow being logged differently from the ones in the main algorithm workflow?
Summary
This PR introduces internal support for logging, tracing, and analysis in oneDAL. The functionality is conditionally enabled using the ONEDAL_VERBOSE environment variable.
Changes Introduced
- Adds logger, tracer, and analyzer components to the oneDAL internal infrastructure.
- Behavior is controlled through the ONEDAL_VERBOSE environment variable:
  - ONEDAL_VERBOSE=0/empty: Logging is disabled (default).
  - ONEDAL_VERBOSE=1: Logging enabled.
  - ONEDAL_VERBOSE=2: Tracer enabled.
  - ONEDAL_VERBOSE=3: Analyzer enabled.
  - ONEDAL_VERBOSE=4: Analyzer, logger, and tracer are enabled.
  - ONEDAL_VERBOSE=5: Analyzer, logger, and tracer are enabled with service functions.
Purpose
These additions are intended to:
Additional Notes
The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a docs update doesn't require performance checkboxes, while a PR with any change in actual code should have them and justify how the change is expected to affect performance (or the justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance