Use aligned loads and stores where possible in DAAL memory management #3159
base: main
Conversation
/intelci: run
/intelci: run
{
    return NULL;
/// Use aligned stores
const unsigned int num32 = static_cast<unsigned int>(num);
Why do we need this conversion? Is it in general possible to vectorize the loop if num >= UINT_MAX? I'm not sure we allocate such big arrays anywhere, but with the current implementation we will handle them much slower because alignment vectorization will not be applied. Maybe in this case we can split the loop into a few loops so that each of them is vectorizable.
> Why do we need this conversion?

Because loops with 32-bit counters are in general faster than loops with 64-bit counters.

> Is it in general possible to vectorize the loop if num >= UINT_MAX?

It is possible, and the loop will be vectorized in this case. At least for default types like `float`, `double`, `int`, etc. there should be no problems vectorizing this loop.

> Maybe in this case we can split the loop into a few loops so that each of them is vectorizable

Yes, I thought about adding more branches, or maybe even splitting the loop into two sub-loops: an unaligned first part and an aligned second part (when possible).
But the current version already solves the performance problems I have in #3126, and it improves the performance in most of the test cases in our benchmarks.
Further improvements are possible though.
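The chunked variant mentioned above could be sketched as follows. This is a hypothetical illustration, not the oneDAL code: `memset_seq_chunk` and `memset_seq_split` are made-up names, and the point is only that each chunk fits a 32-bit counter, so the inner loop keeps the faster 32-bit induction variable even when `num >= UINT_MAX`.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper (not oneDAL code): fill one chunk whose size is
// guaranteed to fit in an unsigned int, so the induction variable is 32-bit.
template <typename T>
static void memset_seq_chunk(T * ptr, T value, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i) ptr[i] = value;
}

// Split an arbitrarily large fill into 32-bit-counted chunks.
template <typename T>
void memset_seq_split(T * ptr, T value, std::size_t num)
{
    const std::size_t chunk = std::numeric_limits<unsigned int>::max();
    while (num > 0)
    {
        const unsigned int n = static_cast<unsigned int>(num < chunk ? num : chunk);
        memset_seq_chunk(ptr, value, n);
        ptr += n;
        num -= n;
    }
}
```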
Let's add a TODO?
PRAGMA_IVDEP
PRAGMA_VECTOR_ALWAYS
PRAGMA_VECTOR_ALIGNED
for (unsigned int i = 0; i < num32; i++)
Can we apply vectorization in the case where sizeof(T) > our alignment?
I think it would depend on the data type.
Also, those pragmas are just guidance to the compiler. Having them here does not force the compiler to vectorize the loop; it just asks it to.
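For context, macros like these typically expand to compiler-specific pragmas. The mapping below is an assumption for illustration (the actual oneDAL definitions may differ); it also shows why they are only hints: on compilers that don't support a given pragma the macro expands to nothing, and even when it expands, the compiler may still emit a scalar loop.

```cpp
// Assumed, compiler-dependent definitions (illustrative, not oneDAL's):
#if defined(__INTEL_COMPILER)
    #define PRAGMA_IVDEP          _Pragma("ivdep")
    #define PRAGMA_VECTOR_ALWAYS  _Pragma("vector always")
    #define PRAGMA_VECTOR_ALIGNED _Pragma("vector aligned")
#elif defined(__GNUC__)
    #define PRAGMA_IVDEP          _Pragma("GCC ivdep")
    #define PRAGMA_VECTOR_ALWAYS
    #define PRAGMA_VECTOR_ALIGNED
#else
    #define PRAGMA_IVDEP
    #define PRAGMA_VECTOR_ALWAYS
    #define PRAGMA_VECTOR_ALIGNED
#endif

// The hints do not force vectorization: the compiler can still produce a
// scalar loop, e.g. when sizeof(T) exceeds the alignment it may assume.
// Note the caller must actually pass an aligned pointer when the aligned
// hint is honored.
float sum_aligned(const float * a, unsigned int n)
{
    float s = 0.0f;
    PRAGMA_IVDEP
    PRAGMA_VECTOR_ALWAYS
    PRAGMA_VECTOR_ALIGNED
    for (unsigned int i = 0; i < n; ++i) s += a[i];
    return s;
}
```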
return NULL;
}

if constexpr (std::is_trivially_default_constructible_v<T>)
Not sure if we can use std here (for example, if T is a template class that has cpu as a template parameter). Likely it's fine, but maybe worth checking.
In this particular case this shouldn't be a problem, as this construct is converted into `true` or `false` at compile time.
Actually, the whole `if constexpr` is evaluated at compile time into a single call of `memset_default` or `memset`, depending on the data type `T`.
That was the idea, because it is not possible to call the default constructor for a type that doesn't have a default constructor.
And setting the memory to all zeros for non-POD types (as it was done previously) also looks strange to me.
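The compile-time dispatch being discussed can be sketched like this. It is an illustrative stand-in, not the oneDAL implementation (`construct_range` is a hypothetical name): trivially default-constructible types are zero-filled with `memset`, while other types get their default constructor run in place.

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical sketch of the described dispatch (not oneDAL code).
// The if constexpr branch is selected at compile time, so each
// instantiation compiles to exactly one of the two strategies.
template <typename T>
void construct_range(T * ptr, std::size_t num)
{
    if constexpr (std::is_trivially_default_constructible_v<T>)
    {
        // For trivially constructible (POD-like) types, zero-filling
        // the storage is a valid way to value-initialize.
        std::memset(ptr, 0, num * sizeof(T));
    }
    else
    {
        // Non-trivial types must actually run their default constructor;
        // blindly zeroing their storage would be wrong.
        for (std::size_t i = 0; i < num; ++i) new (ptr + i) T();
    }
}
```

Because the branch not taken is discarded at instantiation time, the `memset` path is never even type-checked against a class with a deleted default constructor, which is what makes this simplification possible once C++17 is available.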
Approve with minor comments: please add a TODO and create a follow-up ticket to investigate the observed performance degradations.
Pull changes from main branch
/intelci: run
Description
The `service_memset_seq`, `service_calloc` and `service_scalable_calloc` functions were modified to use aligned memory loads and stores for aligned data.

The C++ standard version was increased from C++11 to C++17 in `daal_module` in the bazel build. This was done to simplify the implementations of `service_calloc` and `service_scalable_calloc` through the use of C++17 features like `if constexpr`.

Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance
The total geomean speedup across the algorithms is 1.15x on a 112-core SPR system.
While there is no significant change in the performance of the 'fit' stage, the performance of the 'predict' stage showed a 1.3x geomean improvement due to the use of aligned loads and stores.
The only significant performance drop (up to 3x) happens in the Ridge regression 'fit' stage and in 2 cases of KNN regression 'fit'. This drop is stably reproducible.
The reasons are not yet known and are to be investigated further.
The quality metrics have not changed because the computation logic is not affected.