Skip to content

AVX 256 functions accepting __m{128,256}i_u should accept __m{128,256}i #24076

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
PaperStrike opened this issue Apr 9, 2025 · 7 comments
Open

Comments

@PaperStrike
Copy link

Version of emscripten/emsdk:

emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 4.0.6 (1ddaae4d2d6dfbb678ecc193bc988820d1fc4633)
clang version 21.0.0git (https:/github.com/llvm/llvm-project 4775e6d9099467df9363e1a3cd5950cc3d2fde05)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /data/home/sliphua/open/emsdk/upstream/bin

Failing command line in full:
It fails to compile astc-encoder x86-64 AVX2 version. Below is the error and a working solution. The other steps to adapt to emscripten AVX seem irrelevant, so they are not written here.

The error:

$ cd ~/open/astc-encoder/build_avx2
$ emcmake cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ -DASTCENC_ISA_AVX2=ON ..
$ emmake make
make: make
[  2%] Building CXX object Source/CMakeFiles/astcenc-avx2-static.dir/astcenc_averages_and_directions.cpp.o
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:137:7: error: no matching function for call to '_mm256_loadu_si256'
  137 |                 m = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
      |                     ^~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1276:1: note: candidate function not viable: no known conversion from 'const __m256i *' to 'const __m256i_u *' for 1st argument
 1276 | _mm256_loadu_si256(__m256i_u const* __p) {
      | ^                  ~~~~~~~~~~~~~~~~~~~~
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:189:16: error: no matching function for call to '_mm256_lddqu_si256'
  189 |                 return vint8(_mm256_lddqu_si256(reinterpret_cast<const __m256i*>(p)));
      |                              ^~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1284:1: note: candidate function not viable: no known conversion from 'const __m256i *' to 'const __m256i_u *' for 1st argument
 1284 | _mm256_lddqu_si256(__m256i_u const* __p) {
      | ^                  ~~~~~~~~~~~~~~~~~~~~
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_averages_and_directions.cpp:23:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_internal.h:35:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_mathlib.h:460:
In file included from /data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib.h:86:
/data/home/sliphua/open/astc-encoder/Source/astcenc_vecmathlib_avx2_8.h:522:2: error: no matching function for call to '_mm256_storeu_si256'
  522 |         _mm256_storeu_si256(reinterpret_cast<__m256i*>(p), a.m);
      |         ^~~~~~~~~~~~~~~~~~~
/data/home/sliphua/open/emsdk/upstream/emscripten/cache/sysroot/include/compat/avxintrin.h:1322:1: note: candidate function not viable: no known conversion from '__m256i *' to '__m256i_u *' for 1st argument
 1322 | _mm256_storeu_si256(__m256i_u* __p, __m256i __a) {
      | ^                   ~~~~~~~~~~~~~~
3 errors generated.
make[2]: *** [Source/CMakeFiles/astcenc-avx2-static.dir/build.make:77: Source/CMakeFiles/astcenc-avx2-static.dir/astcenc_averages_and_directions.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:860: Source/CMakeFiles/astcenc-avx2-static.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
emmake: error: 'make' failed (returned 2)

A working solution:
Manually adding the __m256i* overloads of the functions suggested by the error message to avxintrin.h resolves the error.

SIMDe uses void*, so it appears they don't have the issue.
https://github.com/simd-everywhere/simde/blob/2af3dce9b2481b6b32139b1022cdfc02a633c898/simde/x86/avx.h#L4361

@sbc100
Copy link
Collaborator

sbc100 commented Apr 9, 2025

Would you have time to send a PR?

Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.

@PaperStrike
Copy link
Author

PaperStrike commented Apr 10, 2025

Would you have time to send a PR?

I would like to, but I need to spend some time understanding why SIMDe chose void* and why __m{128,256}i_u works in GCC/Clang before I can submit a PR with a good solution.

Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.

Yes, astc-encoder has various build modes including sse2, sse4.1, avx2, neno, and "none" which use different SIMD instruction set. Except for "none", each mode requires some minor adjustments to the cmake or cpp files to be built successfully. I've managed to find the fastest mode by building all the modes and comparing the encoding speed of the same large image file, and they do have noticeable perfomance differences.

It's interesting that the SSE4.1 wasm build performs optimally across both x86-64 and ARM.

Intel i7-13700 Apple M3 Pro
none (no intrinsics) 7.49s 3.57s
sse2 3.94s 3.03s
sse4.1 3.51s 2.53s
avx2 (without f16c) 4.17s 2.95s
neon 4.25s 2.86s

@sbc100
Copy link
Collaborator

sbc100 commented Apr 10, 2025

Would you have time to send a PR?

I would like to, but I need to spend some time understanding why SIMDe chose void* and why __m{128,256}i_u works in GCC/Clang before I can submit a PR with a good solution.

@tlively ?

Also, have you looked into disabled the AVX2 code in the library in question? The emulation path is unlikely to give you great results in wasm anyway.

Yes, astc-encoder has various build modes including sse2, sse4.1, avx2, neno, and "none" which use different SIMD instruction set. Except for "none", each mode requires some minor adjustments to the cmake or cpp files to be built successfully. I've managed to find the fastest mode by building all the modes and comparing the encoding speed of the same large image file, and they do have noticeable perfomance differences.

It's interesting that the SSE4.1 wasm build performs optimally across both x86-64 and ARM.

Intel i7-13700 Apple M3 Pro
none (no intrinsics) 7.49s 3.57s
sse2 3.94s 3.03s
sse4.1 3.51s 2.53s
avx2 (without f16c) 4.17s 2.95s
neon 4.25s 2.86s

Are you talking about wasm performance or native build performance?

@PaperStrike
Copy link
Author

Are you talking about wasm performance or native build performance?

The wasm performance when running in Chrome 135. The table shows the avg time they took to encode a large image file.

@sbc100
Copy link
Collaborator

sbc100 commented Apr 14, 2025

@tlively are these number what we might expect? I suspect the reason is that sse4.1 is closest to wasm-simd and therefore has the smallest emulation overhead?

@olokobayusuf
Copy link

From #24095 , I'm curious whether a more ideal fix (i.e. more inline with other compilers) is to somehow unify the definitions of __m256i and __m256i_u. Other compilers (Xcode, GCC, Clang) do this, but in Emscripten it's a bit more complicated because the __m256i* types contain corresponding __m128i* types.

If we could somehow unify both definitions, then we could typedef both __m256i and __m256i_u to the same underlying type with only differing alignments. This obviates the need to define intrinsic overloads which might be cumbersome.

@olokobayusuf
Copy link

@sbc100 will something like this work?

 typedef int64_t __m128i_u __attribute__((__vector_size__(16), __aligned__(1))); 

 typedef struct { 
   __m128i_u v0; 
   __m128i_u v1; 
 } __m256i_base;
 
typedef __m256i_base __m256i_u __attribute__((__aligned__(1)));
typedef __m256i_base __m256i __attribute__((__aligned__(32)));

@sbc100 sbc100 marked this as a duplicate of #24095 Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants