AXPY looks bad especially on MacOS (M4) #5230

Open · dasergatskov opened this issue Apr 16, 2025 · 5 comments

@dasergatskov commented Apr 16, 2025

I am doing some benchmarking of 2-D convolution in Octave, e.g. with a simple benchmark like this:

r = ones (1, 5e4);
tic;  x1 = conv  (r, r);  time_row_conv  = toc

On macOS (M4) the timing for OpenBLAS is 3.66 s, while for Apple vecLib it is 0.1 s.
On x86_64 Linux (Ryzen 3950X) it also takes a couple of seconds (pretty much the same as Netlib).
I will eventually try some other BLAS libraries on it for comparison.

The conv code essentially is:

    // One daxpy call per kernel element: each call scales a shifted slice
    // of a by b(i,j) and accumulates it into output column k of c.
    const F77_INT len = ma - mb + 1;  // Pre-calculate this value to avoid temporary
    for (F77_INT k = 0; k < na - nb + 1; k++) {
      for (F77_INT j = 0; j < nb; j++) {
        for (F77_INT i = 0; i < mb; i++) {
          double b_val = b[i + j*mb];  // kernel element b(i,j)
          daxpy_(&len, &b_val, &a[mb-i-1 + (k+nb-j-1)*ma], &one, &c[k*len], &one);
        }
      }
    }

and the profiler shows that it is all dominated by the daxpy calls.
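
To make this reproducible outside Octave, here is a minimal standalone sketch of the same pattern: a 1-D full convolution of two length-50000 ones vectors done as one daxpy per kernel element, c[i+j] += a[i] * b[j]. It assumes an OpenBLAS build with the CBLAS header (link with -lopenblas); the sizes and names are mine, not Octave's.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main (void)
    {
      const int n = 50000;                       /* matches r = ones (1, 5e4) */
      double *a = malloc (n * sizeof *a);
      double *b = malloc (n * sizeof *b);
      double *c = calloc (2 * n - 1, sizeof *c); /* full convolution length */
      for (int i = 0; i < n; i++)
        a[i] = b[i] = 1.0;

      struct timespec t0, t1;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int j = 0; j < n; j++)                /* one daxpy per element of b */
        cblas_daxpy (n, b[j], a, 1, c + j, 1);   /* c[j..j+n-1] += b[j] * a   */
      clock_gettime (CLOCK_MONOTONIC, &t1);

      printf ("daxpy conv: %.3f s\n",
              (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
      free (a); free (b); free (c);
      return 0;
    }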

@dasergatskov (Author)

On macOS (M4 Pro) it looks like the threading hurts. Setting different OMP_NUM_THREADS values I see:

% OMP_NUM_THREADS=1 ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 0.4625

% OMP_NUM_THREADS=2 ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 0.8415

% OMP_NUM_THREADS=4 ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 1.0977

% OMP_NUM_THREADS=8 ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 2.1310

For comparison:

% FLEXIBLAS=NETLIB ./run-octave -qf
octave:1> r = ones (1, 5e4);
octave:2> tic;  x1 = conv  (r, r);  time_row_conv  = toc
time_row_conv = 1.0877

On "true" Intel (i9-9880H) it actually does not look too bad -- I get the shortest time (0.14 s) at 8 threads
(the CPU has 8 physical cores and 16 logical). Netlib gives 0.8 s there.
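
To sweep thread counts without relaunching the process, one can also do it in-process via openblas_set_num_threads(), an OpenBLAS-specific call declared in its cblas.h. A small sketch (the sizes, 50000 calls of length-50000 daxpy, are just my stand-in for the conv workload):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    /* Time n_calls daxpy updates of length n at the given thread count. */
    static double bench (int threads, int n, int n_calls)
    {
      openblas_set_num_threads (threads);  /* OpenBLAS extension */
      double *x = malloc (n * sizeof *x);
      double *y = calloc (n, sizeof *y);
      for (int i = 0; i < n; i++)
        x[i] = 1.0;

      struct timespec t0, t1;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int k = 0; k < n_calls; k++)
        cblas_daxpy (n, 1.0, x, 1, y, 1);
      clock_gettime (CLOCK_MONOTONIC, &t1);

      free (x); free (y);
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main (void)
    {
      for (int t = 1; t <= 8; t *= 2)
        printf ("%d thread(s): %.3f s\n", t, bench (t, 50000, 50000));
      return 0;
    }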

@martin-frbg (Collaborator)

What is your input size here? axpy should already be running single-threaded for any N below 10000 (and the Ryzen 3950X should be using essentially the same daxpy kernel as your i9-9880H).
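
Paraphrased, the dispatch being described is a hard cutover, not a ramp (illustrative sketch only, not the literal OpenBLAS source):

    /* Illustrative only: one thread below a fixed size threshold,
       all available cores above it. */
    static int axpy_nthreads (long n, int ncores)
    {
      if (n <= 10000)   /* small vectors: threading overhead would dominate */
        return 1;
      return ncores;    /* otherwise jump straight to every core */
    }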

@dasergatskov (Author) commented Apr 17, 2025

It is a 1 × 50000 vector.
On the Ryzen (16 cores/32 threads) performance is OK, but the best time is at 4 threads (0.15 s). I usually run it at 16 threads (0.23 s) when I remember to set that up, or at 32 (0.49 s) when I don't, which is why I thought something was definitely wrong.
For reference, Netlib on the Ryzen gives 0.53 s, so I expected to be somewhat faster than 0.15 s at the optimum, but maybe I am hitting the next bottleneck there.

@dasergatskov (Author) commented Apr 17, 2025

I tried MKL. Its best time was also at 4 threads, and it was worse than OpenBLAS's (0.2 s). But it stayed pretty much the same all the way up to 32 threads.
I was wondering if OpenBLAS ramps up threads too fast. It looks like it uses one thread below the threshold and all of them above it. Maybe it should be a gentle ramp, like Nthreads = (N/N_0)^2.5, or something more scientific :)
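
For what it's worth, a gentle ramp could look something like the sketch below; N0 and the exponent are made-up tuning knobs (a sublinear exponent actually ramps more gently than the 2.5 above), not anything OpenBLAS implements today.

    #include <math.h>

    /* Hypothetical gradual ramp: grow the thread count with problem size
       instead of jumping from 1 to ncores at the threshold. */
    static int ramped_nthreads (long n, int ncores)
    {
      const long N0 = 10000;                    /* single-thread threshold */
      if (n <= N0)
        return 1;
      int t = (int) pow ((double) n / N0, 0.5); /* sublinear growth */
      if (t < 1) t = 1;
      if (t > ncores) t = ncores;
      return t;
    }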

@martin-frbg (Collaborator)

Yes, with OpenBLAS it is "historically" either one thread or however many cores you have. But that is not set in stone, and the Arm folks are already planting ramps in the gemm interfaces.
