AVX 256 functions accepting `__m{128,256}i_u` should accept `__m{128,256}i` #24076

PaperStrike · 2025-04-09T03:32:52Z

Version of emscripten/emsdk:

emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 4.0.6 (1ddaae4d2d6dfbb678ecc193bc988820d1fc4633)
clang version 21.0.0git (https:/github.com/llvm/llvm-project 4775e6d9099467df9363e1a3cd5950cc3d2fde05)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /data/home/sliphua/open/emsdk/upstream/bin

Failing command line in full:
It fails to compile astc-encoder x86-64 AVX2 version. Below is the error and a working solution. The other steps to adapt to emscripten AVX seem irrelevant, so they are not written here.

The error:

$ cd ~/open/astc-encoder/build_avx2
$ emcmake cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ -DASTCENC_ISA_AVX2=ON ..
$ emmake make
make: make
[  2%] Building CXX object Source/CMakeFiles/astcenc-avx2-static.dir/astcenc_averages_and_directions.cpp.o
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:137:7: error: no matching function for call to '_mm256_loadu_si256'
  137 |                 m = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
      |                     ^~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1276:1: note: candidate function not viable: no known conversion from 'const __m256i *' to 'const __m256i_u *' for 1st argument
 1276 | _mm256_loadu_si256(__m256i_u const* __p) {
      | ^                  ~~~~~~~~~~~~~~~~~~~~
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:189:16: error: no matching function for call to '_mm256_lddqu_si256'
  189 |                 return vint8(_mm256_lddqu_si256(reinterpret_cast<const __m256i*>(p)));
      |                              ^~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1284:1: note: candidate function not viable: no known conversion from 'const __m256i *' to 'const __m256i_u *' for 1st argument
 1284 | _mm256_lddqu_si256(__m256i_u const* __p) {
      | ^                  ~~~~~~~~~~~~~~~~~~~~
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:522:2: error: no matching function for call to '_mm256_storeu_si256'
  522 |         _mm256_storeu_si256(reinterpret_cast<__m256i*>(p), a.m);
      |         ^~~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1322:1: note: candidate function not viable: no known conversion from '__m256i *' to '__m256i_u *' for 1st argument
 1322 | _mm256_storeu_si256(__m256i_u* __p, __m256i __a) {
      | ^                   ~~~~~~~~~~~~~~
3 errors generated.
make[2]: *** [Source/CMakeFiles/astcenc-avx2-static.dir/build.make:77: Source/CMakeFiles/astcenc-avx2-static.dir/astcenc_averages_and_directions.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:860: Source/CMakeFiles/astcenc-avx2-static.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
emmake: error: 'make' failed (returned 2)

A working solution:
Manually adding the __m256i* overloads of the functions suggested by the error message to avxintrin.h resolves the error.

SIMDe uses void*, so it appears they don't have the issue.
https://github.com/simd-everywhere/simde/blob/2af3dce9b2481b6b32139b1022cdfc02a633c898/simde/x86/avx.h#L4361

The text was updated successfully, but these errors were encountered:

sbc100 · 2025-04-09T17:59:16Z

Would you have time to send a PR?

Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.

PaperStrike · 2025-04-10T11:12:31Z

Would you have time to send a PR?

I would like to, but I need to spend some time understanding why SIMDe chose void* and why __m{128,256}i_u works in GCC/Clang before I can submit a PR with a good solution.

Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.

Yes, astc-encoder has various build modes including sse2, sse4.1, avx2, neno, and "none" which use different SIMD instruction set. Except for "none", each mode requires some minor adjustments to the cmake or cpp files to be built successfully. I've managed to find the fastest mode by building all the modes and comparing the encoding speed of the same large image file, and they do have noticeable perfomance differences.

It's interesting that the SSE4.1 wasm build performs optimally across both x86-64 and ARM.

	Intel i7-13700	Apple M3 Pro
none (no intrinsics)	7.49s	3.57s
sse2	3.94s	3.03s
sse4.1	3.51s	2.53s
avx2 (without f16c)	4.17s	2.95s
neon	4.25s	2.86s

sbc100 · 2025-04-10T16:19:41Z

Would you have time to send a PR?

I would like to, but I need to spend some time understanding why SIMDe chose void* and why __m{128,256}i_u works in GCC/Clang before I can submit a PR with a good solution.

@tlively ?

Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.

Yes, astc-encoder has various build modes including sse2, sse4.1, avx2, neno, and "none" which use different SIMD instruction set. Except for "none", each mode requires some minor adjustments to the cmake or cpp files to be built successfully. I've managed to find the fastest mode by building all the modes and comparing the encoding speed of the same large image file, and they do have noticeable perfomance differences.

It's interesting that the SSE4.1 wasm build performs optimally across both x86-64 and ARM.

Intel i7-13700 Apple M3 Pro
none (no intrinsics) 7.49s 3.57s
sse2 3.94s 3.03s
sse4.1 3.51s 2.53s
avx2 (without f16c) 4.17s 2.95s
neon 4.25s 2.86s

Are you talking about wasm performance or native build performance?

PaperStrike · 2025-04-11T02:10:30Z

Are you talking about wasm performance or native build performance?

The wasm performance when running in Chrome 135. The table shows the avg time they took to encode a large image file.

sbc100 · 2025-04-14T17:27:00Z

@tlively are these number what we might expect? I suspect the reason is that sse4.1 is closest to wasm-simd and therefore has the smallest emulation overhead?

olokobayusuf · 2025-04-14T19:36:28Z

From #24095 , I'm curious whether a more ideal fix (i.e. more inline with other compilers) is to somehow unify the definitions of __m256i and __m256i_u. Other compilers (Xcode, GCC, Clang) do this, but in Emscripten it's a bit more complicated because the __m256i* types contain corresponding __m128i* types.

If we could somehow unify both definitions, then we could typedef both __m256i and __m256i_u to the same underlying type with only differing alignments. This obviates the need to define intrinsic overloads which might be cumbersome.

olokobayusuf · 2025-04-14T19:40:18Z

@sbc100 will something like this work?

 typedef int64_t __m128i_u __attribute__((__vector_size__(16), __aligned__(1))); 

 typedef struct { 
   __m128i_u v0; 
   __m128i_u v1; 
 } __m256i_base;
 
typedef __m256i_base __m256i_u __attribute__((__aligned__(1)));
typedef __m256i_base __m256i __attribute__((__aligned__(32)));

sbc100 mentioned this issue Apr 14, 2025

AVX emulation uses distinct types for unaligned vector types #24095

Closed

sbc100 marked this as a duplicate of #24095 Apr 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AVX 256 functions accepting `__m{128,256}i_u` should accept `__m{128,256}i` #24076

AVX 256 functions accepting `__m{128,256}i_u` should accept `__m{128,256}i` #24076

PaperStrike commented Apr 9, 2025

sbc100 commented Apr 9, 2025

PaperStrike commented Apr 10, 2025 •

edited

Loading

sbc100 commented Apr 10, 2025

PaperStrike commented Apr 11, 2025

sbc100 commented Apr 14, 2025

olokobayusuf commented Apr 14, 2025

olokobayusuf commented Apr 14, 2025

AVX 256 functions accepting __m{128,256}i_u should accept __m{128,256}i #24076

AVX 256 functions accepting __m{128,256}i_u should accept __m{128,256}i #24076

Comments

PaperStrike commented Apr 9, 2025

sbc100 commented Apr 9, 2025

PaperStrike commented Apr 10, 2025 • edited Loading

sbc100 commented Apr 10, 2025

PaperStrike commented Apr 11, 2025

sbc100 commented Apr 14, 2025

olokobayusuf commented Apr 14, 2025

olokobayusuf commented Apr 14, 2025

AVX 256 functions accepting `__m{128,256}i_u` should accept `__m{128,256}i` #24076

AVX 256 functions accepting `__m{128,256}i_u` should accept `__m{128,256}i` #24076

PaperStrike commented Apr 10, 2025 •

edited

Loading