AXPY looks bad especially on MacOS (M4) #5230
On MacOS (M4 Pro): It looks like the threading hurts. Setting different OMP_NUM_THREADS values I see:
For comparison:
On "true" Intel (i9-9880H) it actually does not look too bad -- I get shortest time (0.14sec) at 8 threads |
What is your input size here? axpy should already be running single-threaded for any N below 10000 (and the Ryzen 3950X should be using essentially the same daxpy kernel as your i9-9880H).
It is a 1 x 50000 vector.
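For context, a minimal sketch of the kind of timing loop this describes -- an assumption about the setup, not the script actually used; the thread count comes from OMP_NUM_THREADS exported before Octave starts:

```
% hypothetical daxpy-bound timing loop on a 1 x 50000 vector;
% run e.g. as: OMP_NUM_THREADS=4 octave bench.m
x = rand (1, 50000);
y = zeros (1, 50000);
alpha = 2.0;
tic;
for i = 1:10000
  y += alpha * x;   % the axpy pattern y <- y + alpha*x
endfor
toc
```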
I tried MKL. Its best time was also at 4 threads, and it was worse than OpenBLAS's (0.2 s). But it would stay pretty much the same all the way to 32 threads.
Yes, with OpenBLAS it is "historically" either one thread or however many cores you have. But that is not set in stone, and the Arm folks are already planting ramps in the gemm interfaces.
I am doing some benchmarking of 2D convolution in Octave, e.g. for a simple benchmark like this:
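(The snippet itself is missing above; as a stand-in, here is a hypothetical benchmark of that shape -- the matrix and kernel sizes are guesses, not the ones actually used.)

```
a = rand (1000, 1000);   % input image, size assumed
k = rand (7, 7);         % convolution kernel, size assumed
tic;
c = conv2 (a, k);        % 2-D convolution under test
toc
```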
On MacOS (M4) the timing for OpenBLAS is 3.66 s, and for Apple vecLib it is 0.1 s. On x86_64 Linux (Ryzen 3950X) it is also a couple of seconds (and pretty much the same as NETLIB).
I will try to get some other BLAS on it eventually to compare.
The conv code essentially is:
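(The code itself is missing above; what follows is a minimal sketch of the usual shift-and-accumulate formulation, in which every inner update is an axpy -- an assumption about the shape of the code, not the actual listing.)

```
% hypothetical shift-and-accumulate 2-D convolution; each update
% c(...) += k(i,j) * a is the axpy pattern that a profiler would
% attribute to daxpy when it is dispatched to BLAS
function c = conv2_axpy (a, k)
  [ma, na] = size (a);
  [mk, nk] = size (k);
  c = zeros (ma + mk - 1, na + nk - 1);   % "full" convolution size
  for i = 1:mk
    for j = 1:nk
      % accumulate the kernel-scaled, shifted copy of a into c
      c(i:i+ma-1, j:j+na-1) += k(i,j) * a;
    endfor
  endfor
endfunction
```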
and the profiler shows that it is all dominated by daxpy calls.