Description
PR #143607 for libc replaces a builtin_memset with an open-coded equivalent and reports substantial performance gains (along with a minor register count increase). That shouldn't be possible.
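For illustration only (this is not the actual diff from that PR), the kind of change being described looks roughly like the following, where a call to the compiler builtin is swapped for a hand-written byte loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Before: the compiler builtin, which the compiler lowers to the memset
   intrinsic and the backend is then free to expand however it sees fit. */
static inline void zero_builtin(void *dst, size_t len) {
  __builtin_memset(dst, 0, len);
}

/* After (sketch only, not the PR's code): an open-coded equivalent,
   a plain byte-store loop executed by whichever lanes are active. */
static inline void zero_open_coded(void *dst, size_t len) {
  uint8_t *p = (uint8_t *)dst;
  for (size_t i = 0; i < len; ++i)
    p[i] = 0;
}
```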
memset (and memcpy, memmove, etc.) have completely specified semantics, and the corresponding intrinsics carry alignment metadata and similar annotations, so the backend could be lowering them optimally. In particular, enabling the inactive lanes in the exec mask and clearing them again afterwards is likely to outperform doing the operation on whatever subset of the warp is active at the call site, at least when the pointers are uniform and so forth.
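As a rough sketch of the lowering this asks for, assuming wave64 and a uniform destination pointer and length: each lane stores a strided subset of the bytes, and the backend forces the exec mask to all-ones around the loop (that part cannot be expressed in C and is only indicated in comments). The `WAVE_SIZE` constant and `lane` parameter below are illustrative placeholders, not an existing API.

```c
#define WAVE_SIZE 64u /* assumption: wave64 */

/* Sketch of the per-lane body a wave-cooperative memset expansion might
   use. 'lane' is this thread's index within the wave (0..63). */
static void memset_wave_cooperative(unsigned char *dst, unsigned char val,
                                    unsigned long len, unsigned lane) {
  /* backend: save EXEC, then s_mov_b64 exec, -1 to enable all lanes */
  for (unsigned long i = lane; i < len; i += WAVE_SIZE)
    dst[i] = val; /* each enabled lane covers a strided slice of the bytes */
  /* backend: restore the saved exec mask */
}
```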
Leaving this issue as a reminder to look into this, fix it, then move libc back to using builtin_memset.