Use aligned loads and stores where possible in DAAL memory management #3159

Open
wants to merge 8 commits into base: main

Conversation


@Vika-F (Contributor) commented Apr 7, 2025

Description

  • The service_memset_seq, service_calloc, and service_scalable_calloc functions were modified to use aligned memory loads and stores for aligned data.

  • The C++ standard version was increased from C++11 to C++17 for daal_module in the Bazel build. This was done to simplify the implementations of service_calloc and service_scalable_calloc through C++17 features such as if constexpr. A rough sketch of the combined approach is shown below.
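
As a hedged sketch of how these two changes combine (illustrative only, not the literal PR code; the pragma macros are oneDAL's, given no-op fallbacks here so the sketch compiles standalone):

    #include <climits>
    #include <cstddef>

    // No-op fallbacks for oneDAL's vectorization pragma macros; in the
    // library they expand to compiler hints.
    #ifndef PRAGMA_IVDEP
        #define PRAGMA_IVDEP
        #define PRAGMA_VECTOR_ALWAYS
        #define PRAGMA_VECTOR_ALIGNED
    #endif

    // Illustrative sequential memset over aligned data: when the element
    // count fits into 32 bits, use an unsigned int counter (generally
    // faster) and ask the compiler for aligned, always-vectorized code.
    template <typename T>
    void memset_seq_sketch(T * const ptr, const T value, const std::size_t num)
    {
        if (num < UINT_MAX)
        {
            const unsigned int num32 = static_cast<unsigned int>(num);
            PRAGMA_IVDEP
            PRAGMA_VECTOR_ALWAYS
            PRAGMA_VECTOR_ALIGNED
            for (unsigned int i = 0; i < num32; i++) ptr[i] = value;
        }
        else
        {
            // 64-bit counter fallback: still vectorizable, but without
            // the aligned-store guarantee requested above.
            for (std::size_t i = 0; i < num; i++) ptr[i] = value;
        }
    }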


Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.

Performance

  • I have measured performance for the affected algorithms using scikit-learn_bench and provided at least a summary table with the measured data, if a performance change is expected.

The total geomean speedup across the algorithms is 1.15x on a 112-core SPR system.
While there is no significant change in the performance of the 'fit' stage, the 'predict' stage showed a 1.3x geomean improvement due to the use of aligned loads and stores.

  • I have provided justification why performance has changed or why changes are not expected.

The only significant performance drop (up to 3x) occurs in Ridge regression 'fit' and in two cases of KNN regression 'fit'. The drop is stably reproducible.
The reasons are not yet known and will be investigated further.

  • I have provided justification why quality metrics have changed or why changes are not expected.

The quality metrics have not changed because the computation logic is not affected.

@Vika-F added the perf (Performance optimization) label on Apr 7, 2025

Vika-F commented Apr 7, 2025

/intelci: run


Vika-F commented Apr 9, 2025

/intelci: run


Vika-F commented Apr 9, 2025

Review thread on the diff:

    {
    return NULL;
    /// Use aligned stores
    const unsigned int num32 = static_cast<unsigned int>(num);

@avolkov-intel (Contributor) commented Apr 11, 2025


Why do we need this conversion? Is it in general possible to vectorize the loop if num >= UINT_MAX? I am not sure we allocate such big arrays anywhere, but with the current implementation we would handle them much more slowly, because the aligned vectorization will not be applied. Maybe in that case we can split the loop into a few loops so that each of them is vectorizable.


@Vika-F (Author) commented Apr 11, 2025


> Why do we need this conversion?

Because loops with 32-bit counters are in general faster than loops with 64-bit counters.

> Is it in general possible to vectorize the loop if num >= UINT_MAX?

It is possible, and the loop will be vectorized in this case. At least for standard types like float, double, int, etc., there should be no problem vectorizing this loop.

> Maybe in that case we can split the loop into a few loops so that each of them is vectorizable

Yes, I thought about adding more branches, or maybe even splitting the loop into two sub-loops: an unaligned first part and an aligned second part (when possible); see the sketch below.

But the current version already solves the performance problems observed in #3126, and it improves the performance in most of the test cases in our benchmarks.

Further improvements are possible though.
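
A minimal sketch of that splitting idea, assuming the distance from the pointer to the alignment boundary is a whole number of elements (names are illustrative, not from the PR):

    #include <cstddef>
    #include <cstdint>

    // Illustrative only: peel an unaligned prologue, then run the main
    // loop from a 64-byte boundary where aligned stores are safe.
    template <typename T>
    void fill_split_sketch(T * ptr, std::size_t num, const T value)
    {
        constexpr std::size_t alignment = 64; // assumed DAAL alignment
        const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
        std::size_t peel = 0;
        if (addr % alignment != 0)
        {
            // Assumes sizeof(T) divides the distance to the boundary;
            // otherwise the boundary cannot be reached by whole elements.
            peel = (alignment - addr % alignment) / sizeof(T);
            if (peel > num) peel = num;
        }
        for (std::size_t i = 0; i < peel; ++i) ptr[i] = value; // unaligned prologue

        // Main loop starts at an aligned address, so an aligned-store
        // hint (PRAGMA_VECTOR_ALIGNED-style) would be valid here.
        for (std::size_t i = peel; i < num; ++i) ptr[i] = value;
    }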

Contributor:

Let's add TODO?


Review thread on the diff:

    PRAGMA_IVDEP
    PRAGMA_VECTOR_ALWAYS
    PRAGMA_VECTOR_ALIGNED
    for (unsigned int i = 0; i < num32; i++)
Contributor:

Can we apply vectorization in the case where sizeof(T) > our alignment?

Contributor Author:

I think it would depend on the data type.
Also, those pragmas are just guidance for the compiler. Having them here does not force the compiler to vectorize the loop; it only asks it to.
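
For reference, such pragma macros are typically defined along these lines; the exact expansions below are an assumption (shown for the classic Intel compiler, with empty fallbacks elsewhere), not copied from oneDAL:

    // Hints only: the compiler may still decide not to vectorize.
    #if defined(__INTEL_COMPILER)
        #define PRAGMA_IVDEP          _Pragma("ivdep")
        #define PRAGMA_VECTOR_ALWAYS  _Pragma("vector always")
        #define PRAGMA_VECTOR_ALIGNED _Pragma("vector aligned")
    #else
        #define PRAGMA_IVDEP
        #define PRAGMA_VECTOR_ALWAYS
        #define PRAGMA_VECTOR_ALIGNED
    #endif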

Review thread on the diff:

    return NULL;
    }

    if constexpr (std::is_trivially_default_constructible_v<T>)
Contributor:

Not sure if we can use std here (for example, if T is a template class that has cpu as a template parameter). It is likely fine, but maybe worth checking.

@Vika-F (Author) commented Apr 11, 2025

In this particular case it shouldn't be a problem, as this construct is evaluated to true or false at compile time.
Actually, the whole if constexpr is evaluated at compile time into a single call to memset_default or memset, depending on the data type T.
That was the idea, because it is not possible to call the default constructor for a type that doesn't have one.
And setting the memory to all zeros for non-POD types (as was done previously) also looks strange to me. A sketch of this dispatch follows.
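
A minimal sketch of that compile-time dispatch, with illustrative names (the PR's actual memset_default/memset pair may differ in signature); note that the _v trait requires C++17, consistent with the standard bump in this PR:

    #include <cstddef>
    #include <cstring>
    #include <new>
    #include <type_traits>

    // Zero trivially-default-constructible types with a raw memset and
    // run the default constructor for everything else. The branch is
    // resolved at compile time, so only one side is instantiated per T.
    template <typename T>
    void calloc_init_sketch(T * ptr, std::size_t num)
    {
        if constexpr (std::is_trivially_default_constructible_v<T>)
        {
            std::memset(ptr, 0, num * sizeof(T));
        }
        else
        {
            for (std::size_t i = 0; i < num; ++i) new (ptr + i) T();
        }
    }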

@avolkov-intel (Contributor) left a comment

Approved with minor comments: please add the TODO and create a follow-up ticket to investigate the observed performance degradations.


Vika-F commented Apr 14, 2025

/intelci: run
