AXPY looks bad especially on MacOS (M4) #5230

Open · dasergatskov opened this issue Apr 16, 2025 · 5 comments

@dasergatskov commented Apr 16, 2025

I am doing some benchmarking of 2-D convolution in Octave, e.g. with a simple benchmark like this:

r = ones (1, 5e4);
tic;  x1 = conv  (r, r);  time_row_conv  = toc

On macOS (M4) the timing for OpenBLAS is 3.66 s, while for Apple vecLib it is 0.1 s.
On x86_64 Linux (Ryzen 3950X) it also takes a couple of seconds (pretty much the same as Netlib).
I will eventually try some other BLAS libraries on it for comparison.

The conv code essentially is:

    // One daxpy call per kernel element: each call scales a shifted slice
    // of a by b(i,j) and accumulates it into output column k of c.
    const F77_INT len = ma - mb + 1;  // Pre-calculate this value to avoid temporary
    for (F77_INT k = 0; k < na - nb + 1; k++) {
      for (F77_INT j = 0; j < nb; j++) {
        for (F77_INT i = 0; i < mb; i++) {
          double b_val = b[i + j*mb];  // kernel element b(i,j)
          daxpy_(&len, &b_val, &a[mb-i-1 + (k+nb-j-1)*ma], &one, &c[k*len], &one);
        }
      }
    }

and the profiler shows that it is all dominated by the daxpy calls.
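
To make this reproducible outside Octave, here is a minimal standalone sketch of the same pattern: a 1-D full convolution of two length-50000 ones vectors done as one daxpy per kernel element, c[i+j] += a[i] * b[j]. It assumes an OpenBLAS build with the CBLAS header (link with -lopenblas); the sizes and names are mine, not Octave's.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main (void)
    {
      const int n = 50000;                       /* matches r = ones (1, 5e4) */
      double *a = malloc (n * sizeof *a);
      double *b = malloc (n * sizeof *b);
      double *c = calloc (2 * n - 1, sizeof *c); /* full convolution length */
      for (int i = 0; i < n; i++)
        a[i] = b[i] = 1.0;

      struct timespec t0, t1;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int j = 0; j < n; j++)                /* one daxpy per element of b */
        cblas_daxpy (n, b[j], a, 1, c + j, 1);   /* c[j..j+n-1] += b[j] * a   */
      clock_gettime (CLOCK_MONOTONIC, &t1);

      printf ("daxpy conv: %.3f s\n",
              (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
      free (a); free (b); free (c);
      return 0;
    }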

@dasergatskov (Author)

On macOS (M4 Pro) it looks like the threading hurts. Setting different OMP_NUM_THREADS values I see:

% OMP_NUM_THREADS=1 ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 0.4625

% OMP_NUM_THREADS=2 ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 0.8415

% OMP_NUM_THREADS=4 ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 1.0977

% OMP_NUM_THREADS=8 ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 2.1310

For comparison:

% FLEXIBLAS=NETLIB ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 1.0877

On "true" Intel (i9-9880H) it actually does not look too bad -- I get the shortest time (0.14 s) at 8 threads
(the CPU has 8 physical cores and 16 logical). Netlib gives 0.8 s there.
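
To sweep thread counts without relaunching the process, one can also do it in-process via openblas_set_num_threads(), an OpenBLAS-specific call declared in its cblas.h. A small sketch (the sizes, 50000 calls of length-50000 daxpy, are just my stand-in for the conv workload):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    /* Time n_calls daxpy updates of length n at the given thread count. */
    static double bench (int threads, int n, int n_calls)
    {
      openblas_set_num_threads (threads);  /* OpenBLAS extension */
      double *x = malloc (n * sizeof *x);
      double *y = calloc (n, sizeof *y);
      for (int i = 0; i < n; i++)
        x[i] = 1.0;

      struct timespec t0, t1;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int k = 0; k < n_calls; k++)
        cblas_daxpy (n, 1.0, x, 1, y, 1);
      clock_gettime (CLOCK_MONOTONIC, &t1);

      free (x); free (y);
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main (void)
    {
      for (int t = 1; t <= 8; t *= 2)
        printf ("%d thread(s): %.3f s\n", t, bench (t, 50000, 50000));
      return 0;
    }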

@martin-frbg (Collaborator)

What is your input size here? axpy should already be running single-threaded for any N below 10000 (and the Ryzen 3950X should be using essentially the same daxpy kernel as your i9-9880H).
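
Paraphrased, the dispatch being described is a hard cutover, not a ramp (illustrative sketch only, not the literal OpenBLAS source):

    /* Illustrative only: one thread below a fixed size threshold,
       all available cores above it. */
    static int axpy_nthreads (long n, int ncores)
    {
      if (n <= 10000)   /* small vectors: threading overhead would dominate */
        return 1;
      return ncores;    /* otherwise jump straight to every core */
    }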

@dasergatskov (Author) commented Apr 17, 2025

It is a 1 × 50000 vector.
On the Ryzen (16 cores/32 threads) performance is OK, but the best time is at 4 threads (0.15 s). I usually run it at 16 threads (0.23 s) when I remember to set that up, or at 32 (0.49 s) when I don't, which is why I thought something was definitely wrong.
For reference, Netlib on the Ryzen gives 0.53 s, so I expected to be somewhat faster than 0.15 s at the optimum, but maybe I am hitting the next bottleneck there.

@dasergatskov (Author) commented Apr 17, 2025

I tried MKL. Its best time was also at 4 threads, and it was worse than OpenBLAS's (0.2 s). But it stayed pretty much the same all the way up to 32 threads.
I was wondering if OpenBLAS ramps up threads too fast. It looks like it uses one thread below the threshold and all of them above it. Maybe it should be a gentle ramp, like Nthreads = (N/N_0)^2.5, or something more scientific :)
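
For what it's worth, a gentle ramp could look something like the sketch below; N0 and the exponent are made-up tuning knobs (a sublinear exponent actually ramps more gently than the 2.5 above), not anything OpenBLAS implements today.

    #include <math.h>

    /* Hypothetical gradual ramp: grow the thread count with problem size
       instead of jumping from 1 to ncores at the threshold. */
    static int ramped_nthreads (long n, int ncores)
    {
      const long N0 = 10000;                    /* single-thread threshold */
      if (n <= N0)
        return 1;
      int t = (int) pow ((double) n / N0, 0.5); /* sublinear growth */
      if (t < 1) t = 1;
      if (t > ncores) t = ncores;
      return t;
    }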

@martin-frbg (Collaborator)

Yes, with OpenBLAS it is "historically" either one thread or however many cores you have. But that is not set in stone, and the Arm folks are already planting ramps in the gemm interfaces.
