Excessive moves when returning array from nested struct #46458
Comments
(I'm not sure this component is correct!)
My first guess is that this is a pass-ordering problem. If I feed the IR for function 'c' to "opt -memcpyopt -instcombine", it reduces to the expected single memset call. So we have the ability to catch this, but it's not happening in the -O3 case shown in the godbolt link above.
I guess I don't know enough about LLVM's pass architecture to help move this forward. I imagine that "just re-run those passes" would a) work, but b) have downsides (e.g. compile time), and that re-ordering the passes would just pessimize some other code. Is there a third option?
Sure - if we can make -memcpyopt smarter, it might be able to handle whatever is stopping it currently. That, or your 1st option, is our best hope. We just have to do some sanity-checking of compile-time and perf measurements if we do need to adjust the pass pipeline. It may come down to adding extra passes only at -O3, but hopefully not. If you can get a snapshot of the current IR going into the last -memcpyopt in the current pipeline and compare that to the IR after -O2 or -O3, that would be helpful. Then we need to figure out which pass(es) are altering the IR (and what they are doing) to make it amenable to an extra round of -memcpyopt.
Before:
FPM.addPass(MergedLoadStoreMotionPass());
// Specially optimize memory movement as it doesn't look like dataflow in SSA.
FPM.addPass(MemCpyOptPass());
// Sparse conditional constant propagation.
// Delete dead bit computations (instcombine runs after to fold away the dead
// Run instcombine after redundancy and dead bit elimination to exploit

After:
FPM.addPass(MergedLoadStoreMotionPass());
// Sparse conditional constant propagation.
// Delete dead bit computations (instcombine runs after to fold away the dead
// Run instcombine after redundancy and dead bit elimination to exploit
// Specially optimize memory movement as it doesn't look like dataflow in SSA.
FPM.addPass(MemCpyOptPass());

Looking at the log, this should fix your testcase.
(And it makes sense: memcpyopt should run after some cleanup pass.)
Ah, great - if that's all it takes, that's easier than anything I was imagining. :) Let's make sure we're not duplicating effort. Who wants to draft the patch and run some tests?
Feel free to take it :)
I might be doing something wrong in my local experiment: if I only move the MemCpyOpt pass later in the pipeline, it does not solve the problem. I do get the expected optimization if I add an extra run of MemCpyOpt after the next InstCombine. |
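For illustration, here is a minimal sketch of that experiment against the new pass manager's function simplification pipeline. The pass classes and headers are real LLVM, but the helper function name is made up, the surrounding passes are abbreviated, and the exact placement is the idea being tested rather than an actual patch:

// Sketch only: an extra MemCpyOpt run after InstCombine, mirroring the
// local experiment described above.
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/InstCombine/InstCombine.h"
#include "llvm/Transforms/Scalar/BDCE.h"
#include "llvm/Transforms/Scalar/MemCpyOptimizer.h"
#include "llvm/Transforms/Scalar/MergedLoadStoreMotion.h"
#include "llvm/Transforms/Scalar/SCCP.h"

using namespace llvm;

// Hypothetical helper showing where the extra MemCpyOpt run would sit.
static void addMemOptAndCleanup(FunctionPassManager &FPM) {
  FPM.addPass(MergedLoadStoreMotionPass());
  // Existing placement: optimize memory movement before the cleanup passes.
  FPM.addPass(MemCpyOptPass());
  // Sparse conditional constant propagation.
  FPM.addPass(SCCPPass());
  // Delete dead bit computations.
  FPM.addPass(BDCEPass());
  // InstCombine cleans up after the passes above...
  FPM.addPass(InstCombinePass());
  // ...and a second MemCpyOpt run then sees the simplified IR, which is what
  // made the testcase fold to a single memset in the experiment above.
  FPM.addPass(MemCpyOptPass());
}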
Ah, yes :/ MemCpyOpt changes

call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(24) %9, i8* nonnull align 8 dereferenceable(24) %5, i64 24, i1 false), !dbg !49, !tbaa.struct !60

to

call void @llvm.memset.p0i8.i64(i8* align 8 %9, i8 0, i64 24, i1 false), !dbg !49

Then InstCombine does cleanup, and then the extra MemCpyOpt does the job.
Maybe it would be enough to copy the memcpy/memset-related transformations from instcombine and run them in the memcpyopt pass as well, e.g.:
Instruction *InstCombiner::SimplifyAnyMemTransfer |
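To make that concrete, here is a minimal sketch (not the actual SimplifyAnyMemTransfer or MemCpyOpt code) of one fold in that family: a memcpy whose source bytes were entirely written by an earlier memset can itself become a memset. The helper name is made up, and the legality check is deliberately oversimplified; a real transform must also prove there is no intervening write to the source between the two intrinsics.

// Hypothetical sketch of a memcpy-of-memset fold; not the actual LLVM code.
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

// If `Copy` reads memory that `Set` filled with a constant byte, and the copy
// does not read past the memset'd region, replace the memcpy with an
// equivalent memset of the copy's destination.  Returns true on change.
static bool foldMemCpyOfMemSet(MemCpyInst *Copy, MemSetInst *Set) {
  auto *CopyLen = dyn_cast<ConstantInt>(Copy->getLength());
  auto *SetLen = dyn_cast<ConstantInt>(Set->getLength());
  if (!CopyLen || !SetLen)
    return false;
  if (Copy->getSource() != Set->getDest() ||
      CopyLen->getZExtValue() > SetLen->getZExtValue())
    return false;

  // Rewrite: memset the destination with the same byte value and length.
  IRBuilder<> B(Copy);
  B.CreateMemSet(Copy->getRawDest(), Set->getValue(), Copy->getLength(),
                 Copy->getDestAlign(), Copy->isVolatile());
  Copy->eraseFromParent();
  return true;
}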
Extended Description
Extracted from: rust-lang/rust#74267
Given the following C code, the b and c functions are behaviorally identical (https://godbolt.org/z/5eKxE5):

#include <stdlib.h>
#include <stdint.h>
#define N 2
typedef struct {
size_t length;
size_t capacity;
uint8_t* data;
} String;
static String new_string() {
String s = {0, 0, NULL};
return s;
}
struct Arr {
String data[N];
};
struct Arr b() {
struct Arr data;
for (size_t i = 0; i < N; i++) {
data.data[i] = new_string();
}
return data;
}
struct PartialArr {
struct Arr value;
};
struct Arr c() {
struct PartialArr data;
String (*slots)[N] = &data.value.data;
for (size_t i = 0; i < N; i++) {
(*slots)[i] = new_string();
}
return data.value;
}
However, they end up optimized very differently:
b: # @b
mov rax, rdi
vxorps xmm0, xmm0, xmm0
vmovups xmmword ptr [rdi], xmm0
mov qword ptr [rdi + 16], 0
vmovups xmmword ptr [rdi + 24], xmm0
mov qword ptr [rdi + 40], 0
ret
c: # @c
vxorps xmm0, xmm0, xmm0
vmovaps xmmword ptr [rsp - 56], xmm0
mov qword ptr [rsp - 40], 0
vmovups xmmword ptr [rsp - 32], xmm0
mov rax, rdi
mov qword ptr [rsp - 16], 0
vmovups xmm0, xmmword ptr [rsp - 56]
vmovups xmmword ptr [rdi], xmm0
mov rcx, qword ptr [rsp - 40]
mov qword ptr [rdi + 16], rcx
mov rcx, qword ptr [rsp - 32]
mov qword ptr [rdi + 24], rcx
mov rcx, qword ptr [rsp - 40]
mov qword ptr [rdi + 16], rcx
vmovups xmm0, xmmword ptr [rsp - 32]
vmovups xmmword ptr [rdi + 24], xmm0
mov rcx, qword ptr [rsp - 16]
mov qword ptr [rdi + 40], rcx
ret
GCC is able to optimize this better:
b:
mov QWORD PTR [rdi], 0
mov QWORD PTR [rdi+8], 0
mov QWORD PTR [rdi+16], 0
mov QWORD PTR [rdi+24], 0
mov QWORD PTR [rdi+32], 0
mov QWORD PTR [rdi+40], 0
mov rax, rdi
ret
c:
mov QWORD PTR [rdi], 0
mov QWORD PTR [rdi+8], 0
mov QWORD PTR [rdi+16], 0
mov QWORD PTR [rdi+24], 0
mov QWORD PTR [rdi+32], 0
mov QWORD PTR [rdi+40], 0
mov rax, rdi
ret