Use aligned loads and stores where possible in DAAL memory management #3159

Open
wants to merge 8 commits into base: main

Conversation


@Vika-F (Contributor) commented Apr 7, 2025

Description

  • The service_memset_seq, service_calloc, and service_scalable_calloc functions were modified to use aligned memory loads and stores for aligned data.

  • The C++ standard version was increased from C++11 to C++17 for daal_module in the Bazel build. This was done to simplify the implementations of service_calloc and service_scalable_calloc through C++17 features such as if constexpr. A rough sketch of the combined approach is shown below.
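
As a hedged sketch of how these two changes combine (illustrative only, not the literal PR code; the pragma macros are oneDAL's, given no-op fallbacks here so the sketch compiles standalone):

    #include <climits>
    #include <cstddef>

    // No-op fallbacks for oneDAL's vectorization pragma macros; in the
    // library they expand to compiler hints.
    #ifndef PRAGMA_IVDEP
        #define PRAGMA_IVDEP
        #define PRAGMA_VECTOR_ALWAYS
        #define PRAGMA_VECTOR_ALIGNED
    #endif

    // Illustrative sequential memset over aligned data: when the element
    // count fits into 32 bits, use an unsigned int counter (generally
    // faster) and ask the compiler for aligned, always-vectorized code.
    template <typename T>
    void memset_seq_sketch(T * const ptr, const T value, const std::size_t num)
    {
        if (num < UINT_MAX)
        {
            const unsigned int num32 = static_cast<unsigned int>(num);
            PRAGMA_IVDEP
            PRAGMA_VECTOR_ALWAYS
            PRAGMA_VECTOR_ALIGNED
            for (unsigned int i = 0; i < num32; i++) ptr[i] = value;
        }
        else
        {
            // 64-bit counter fallback: still vectorizable, but without
            // the aligned-store guarantee requested above.
            for (std::size_t i = 0; i < num; i++) ptr[i] = value;
        }
    }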


Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

Performance

  • I have measured performance for the affected algorithms using scikit-learn_bench and provided at least a summary table with the measured data, if a performance change is expected.

The total geomean speedup across the algorithms is 1.15x on a 112-core SPR system.
While there is no significant change in the performance of the 'fit' stage, the 'predict' stage showed a 1.3x geomean improvement due to the use of aligned loads and stores.

  • I have provided justification why performance has changed or why changes are not expected.

The only significant performance drop (up to 3x) occurs in Ridge regression 'fit' and in two cases of KNN regression 'fit'. The drop is stably reproducible.
The reasons are not yet known and will be investigated further.

  • I have provided justification why quality metrics have changed or why changes are not expected.

The quality metrics have not changed because the computation logic is not affected.

@Vika-F added the perf (Performance optimization) label on Apr 7, 2025

Vika-F commented Apr 7, 2025

/intelci: run


Vika-F commented Apr 9, 2025

/intelci: run


Vika-F commented Apr 9, 2025

Review thread on the diff:

    {
    return NULL;
    /// Use aligned stores
    const unsigned int num32 = static_cast<unsigned int>(num);

@avolkov-intel (Contributor) commented Apr 11, 2025


Why do we need this conversion? Is it in general possible to vectorize the loop if num >= UINT_MAX? I am not sure we allocate such big arrays anywhere, but with the current implementation we would handle them much more slowly, because the aligned vectorization will not be applied. Maybe in that case we can split the loop into a few loops so that each of them is vectorizable.


@Vika-F (Author) commented Apr 11, 2025


> Why do we need this conversion?

Because loops with 32-bit counters are in general faster than loops with 64-bit counters.

> Is it in general possible to vectorize the loop if num >= UINT_MAX?

It is possible, and the loop will be vectorized in this case. At least for standard types like float, double, int, etc., there should be no problem vectorizing this loop.

> Maybe in that case we can split the loop into a few loops so that each of them is vectorizable

Yes, I thought about adding more branches, or maybe even splitting the loop into two sub-loops: an unaligned first part and an aligned second part (when possible); see the sketch below.

But the current version already solves the performance problems observed in #3126, and it improves the performance in most of the test cases in our benchmarks.

Further improvements are possible though.
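
A minimal sketch of that splitting idea, assuming the distance from the pointer to the alignment boundary is a whole number of elements (names are illustrative, not from the PR):

    #include <cstddef>
    #include <cstdint>

    // Illustrative only: peel an unaligned prologue, then run the main
    // loop from a 64-byte boundary where aligned stores are safe.
    template <typename T>
    void fill_split_sketch(T * ptr, std::size_t num, const T value)
    {
        constexpr std::size_t alignment = 64; // assumed DAAL alignment
        const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
        std::size_t peel = 0;
        if (addr % alignment != 0)
        {
            // Assumes sizeof(T) divides the distance to the boundary;
            // otherwise the boundary cannot be reached by whole elements.
            peel = (alignment - addr % alignment) / sizeof(T);
            if (peel > num) peel = num;
        }
        for (std::size_t i = 0; i < peel; ++i) ptr[i] = value; // unaligned prologue

        // Main loop starts at an aligned address, so an aligned-store
        // hint (PRAGMA_VECTOR_ALIGNED-style) would be valid here.
        for (std::size_t i = peel; i < num; ++i) ptr[i] = value;
    }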

Contributor:

Let's add TODO?


Review thread on the diff:

    PRAGMA_IVDEP
    PRAGMA_VECTOR_ALWAYS
    PRAGMA_VECTOR_ALIGNED
    for (unsigned int i = 0; i < num32; i++)
Contributor:

Can we apply vectorization in the case where sizeof(T) > our alignment?

Contributor Author:

I think it would depend on the data type.
Also, those pragmas are just guidance for the compiler. Having them here does not force the compiler to vectorize the loop; it only asks it to.
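
For reference, such pragma macros are typically defined along these lines; the exact expansions below are an assumption (shown for the classic Intel compiler, with empty fallbacks elsewhere), not copied from oneDAL:

    // Hints only: the compiler may still decide not to vectorize.
    #if defined(__INTEL_COMPILER)
        #define PRAGMA_IVDEP          _Pragma("ivdep")
        #define PRAGMA_VECTOR_ALWAYS  _Pragma("vector always")
        #define PRAGMA_VECTOR_ALIGNED _Pragma("vector aligned")
    #else
        #define PRAGMA_IVDEP
        #define PRAGMA_VECTOR_ALWAYS
        #define PRAGMA_VECTOR_ALIGNED
    #endif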

Review thread on the diff:

    return NULL;
    }

    if constexpr (std::is_trivially_default_constructible_v<T>)
Contributor:

Not sure if we can use std here (for example, if T is a template class that has cpu as a template parameter). It is likely fine, but maybe worth checking.

@Vika-F (Author) commented Apr 11, 2025

In this particular case it shouldn't be a problem, as this construct is evaluated to true or false at compile time.
Actually, the whole if constexpr is evaluated at compile time into a single call to memset_default or memset, depending on the data type T.
That was the idea, because it is not possible to call the default constructor for a type that doesn't have one.
And setting the memory to all zeros for non-POD types (as was done previously) also looks strange to me. A sketch of this dispatch follows.
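
A minimal sketch of that compile-time dispatch, with illustrative names (the PR's actual memset_default/memset pair may differ in signature); note that the _v trait requires C++17, consistent with the standard bump in this PR:

    #include <cstddef>
    #include <cstring>
    #include <new>
    #include <type_traits>

    // Zero trivially-default-constructible types with a raw memset and
    // run the default constructor for everything else. The branch is
    // resolved at compile time, so only one side is instantiated per T.
    template <typename T>
    void calloc_init_sketch(T * ptr, std::size_t num)
    {
        if constexpr (std::is_trivially_default_constructible_v<T>)
        {
            std::memset(ptr, 0, num * sizeof(T));
        }
        else
        {
            for (std::size_t i = 0; i < num; ++i) new (ptr + i) T();
        }
    }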

@avolkov-intel (Contributor) left a comment

Approved with minor comments: please add the TODO and create a follow-up ticket to investigate the observed performance degradations.


Vika-F commented Apr 14, 2025

/intelci: run
