
Excessive moves when returning array from nested struct #46458


Open
alex mannequin opened this issue Aug 11, 2020 · 12 comments
Labels: bugzilla (Issues migrated from bugzilla), llvm:optimizations

Comments


alex mannequin commented Aug 11, 2020

Bugzilla Link: 47114
Version: trunk
OS: All
CC: @comex, @CryZe, @davidbolvansky, @dtolnay, @jplatte, @nelhage, @rotateright

Extended Description

Extracted from: rust-lang/rust#74267

Given the following C code, the b and c functions are behaviorally identical (https://godbolt.org/z/5eKxE5):

#include <stdlib.h>
#include <stdint.h>

#define N 2

typedef struct {
    size_t length;
    size_t capacity;
    uint8_t* data;
} String;

static String new_string() {
    String s = {0, 0, NULL};
    return s;
}

struct Arr {
    String data[N];
};

struct Arr b() {
    struct Arr data;
    for (size_t i = 0; i < N; i++) {
        data.data[i] = new_string();
    }
    return data;
}

struct PartialArr {
    struct Arr value;
};

struct Arr c() {
    struct PartialArr data;
    String (*slots)[N] = &data.value.data;

    for (size_t i = 0; i < N; i++) {
        (*slots)[i] = new_string();
    }
    return data.value;
}

However, they end up optimized very differently:

b: # @b
        mov rax, rdi
        vxorps xmm0, xmm0, xmm0
        vmovups xmmword ptr [rdi], xmm0
        mov qword ptr [rdi + 16], 0
        vmovups xmmword ptr [rdi + 24], xmm0
        mov qword ptr [rdi + 40], 0
        ret
c: # @c
        vxorps xmm0, xmm0, xmm0
        vmovaps xmmword ptr [rsp - 56], xmm0
        mov qword ptr [rsp - 40], 0
        vmovups xmmword ptr [rsp - 32], xmm0
        mov rax, rdi
        mov qword ptr [rsp - 16], 0
        vmovups xmm0, xmmword ptr [rsp - 56]
        vmovups xmmword ptr [rdi], xmm0
        mov rcx, qword ptr [rsp - 40]
        mov qword ptr [rdi + 16], rcx
        mov rcx, qword ptr [rsp - 32]
        mov qword ptr [rdi + 24], rcx
        mov rcx, qword ptr [rsp - 40]
        mov qword ptr [rdi + 16], rcx
        vmovups xmm0, xmmword ptr [rsp - 32]
        vmovups xmmword ptr [rdi + 24], xmm0
        mov rcx, qword ptr [rsp - 16]
        mov qword ptr [rdi + 40], rcx
        ret

GCC is able to optimize this better:

b:
        mov QWORD PTR [rdi], 0
        mov QWORD PTR [rdi+8], 0
        mov QWORD PTR [rdi+16], 0
        mov QWORD PTR [rdi+24], 0
        mov QWORD PTR [rdi+32], 0
        mov QWORD PTR [rdi+40], 0
        mov rax, rdi
        ret
c:
        mov QWORD PTR [rdi], 0
        mov QWORD PTR [rdi+8], 0
        mov QWORD PTR [rdi+16], 0
        mov QWORD PTR [rdi+24], 0
        mov QWORD PTR [rdi+32], 0
        mov QWORD PTR [rdi+40], 0
        mov rax, rdi
        ret


alex mannequin commented Aug 11, 2020

(I'm not sure this component is correct!)

@rotateright

My first guess is a pass phase ordering problem.

If I feed the IR for function 'c' to "opt -memcpyopt -instcombine", it reduces to the expected single memset call.

So we have the ability to catch this, but it's not happening in the -O3 case shown in the godbolt link above.


alex mannequin commented Aug 13, 2020

I guess I don't know enough about LLVM's pass architecture to help move this forward. I imagine "just re-run those passes" would a) work but b) have downsides (e.g. compile time), and I imagine re-ordering the passes would just pessimize some other code. Is there a third option?

@rotateright

I guess I don't know enough about LLVM's pass architecture to help move this forward. I imagine "just re-run those passes" would a) work but b) have downsides (e.g. compile time), and I imagine re-ordering the passes would just pessimize some other code. Is there a third option?

Sure - if we can make -memcpyopt smarter, it might be able to handle whatever is stopping it currently.

That or your 1st option is our best hope. We just have to do some sanity-checking of compile-time and perf measurements if we do need to adjust the pass pipeline. It may come down to adding extra passes only at -O3, but hopefully not.

If you can get a snapshot of the current IR going into the last -memcpyopt in the current pipeline and compare that to the IR after -O2 or -O3, that would be helpful. Then we need to figure out which pass(es) are altering the IR (and what they are doing) to make it amenable to an extra round of -memcpyopt.

@davidbolvansky

FPM.addPass(MergedLoadStoreMotionPass());
if (RunNewGVN)
  FPM.addPass(NewGVNPass());
else
  FPM.addPass(GVN());

// Specially optimize memory movement as it doesn't look like dataflow in SSA.
FPM.addPass(MemCpyOptPass());

// Sparse conditional constant propagation.
// FIXME: It isn't clear why we do this after loop passes rather than
// before...
FPM.addPass(SCCPPass());

// Delete dead bit computations (instcombine runs after to fold away the dead
// computations, and then ADCE will run later to exploit any new DCE
// opportunities that creates).
FPM.addPass(BDCEPass());

// Run instcombine after redundancy and dead bit elimination to exploit
// opportunities opened up by them.
FPM.addPass(InstCombinePass());

=>

FPM.addPass(MergedLoadStoreMotionPass());
if (RunNewGVN)
  FPM.addPass(NewGVNPass());
else
  FPM.addPass(GVN());

// Sparse conditional constant propagation.
// FIXME: It isn't clear why we do this after loop passes rather than
// before...
FPM.addPass(SCCPPass());

// Delete dead bit computations (instcombine runs after to fold away the dead
// computations, and then ADCE will run later to exploit any new DCE
// opportunities that creates).
FPM.addPass(BDCEPass());

// Run instcombine after redundancy and dead bit elimination to exploit
// opportunities opened up by them.
FPM.addPass(InstCombinePass());

// Specially optimize memory movement as it doesn't look like dataflow in SSA.
FPM.addPass(MemCpyOptPass());

Looking at the log, this should fix your test case.

@davidbolvansky

(And it makes sense, memcpyopt should run after some cleanup pass)

@rotateright

(And it makes sense, memcpyopt should run after some cleanup pass)

Ah, great - if that's all it takes, that's easier than anything I was imagining. :)

Let's make sure we're not duplicating effort. Who wants to draft the patch and run some tests?

@davidbolvansky

Feel free to take it :)

@rotateright

I might be doing something wrong in my local experiment: if I only move the MemCpyOpt pass later in the pipeline, it does not solve the problem.

I do get the expected optimization if I add an extra run of MemCpyOpt after the next InstCombine.
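
(For reference, the local experiment described above amounts to roughly the change sketched below, against the function simplification pipeline snippet quoted earlier. This is only a sketch, not a proposed patch; the exact insertion point, and whether the extra run should be gated on -O3, would need to be settled when drafting the real change.)

// Existing passes from the snippet quoted above.
FPM.addPass(BDCEPass());

// Run instcombine after redundancy and dead bit elimination to exploit
// opportunities opened up by them.
FPM.addPass(InstCombinePass());

// Extra run (the experiment): by this point InstCombine has cleaned up the
// memset that the first MemCpyOpt run introduced, so a second MemCpyOpt run
// can finish the job.
FPM.addPass(MemCpyOptPass());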

@davidbolvansky

Ah, yes :/

MemCpyOpt changes

call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(24) %9, i8* nonnull align 8 dereferenceable(24) %5, i64 24, i1 false), !dbg !49, !tbaa.struct !60

to

call void @llvm.memset.p0i8.i64(i8* align 8 %9, i8 0, i64 24, i1 false), !dbg !49

Then InstCombine does the cleanup, and the extra MemCpyOpt run does the job.

@davidbolvansky

Maybe it would be enough to copy the memcpy/memset-related transformations from instcombine and run them in the memcpyopt pass as well.

@davidbolvansky

Instruction *InstCombiner::SimplifyAnyMemTransfer
Instruction *InstCombiner::SimplifyAnyMemSet
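
(For context, the flavor of those InstCombine folds is sketched below. This is illustrative only, not the actual LLVM code: SimplifyAnyMemSet, for example, rewrites a small constant-length memset as a single wide store, and the suggestion above is that MemCpyOpt could perform that kind of simplification itself rather than waiting for the next InstCombine run. The function name simplifySmallMemSet is made up for the sketch.)

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Support/MathExtras.h"

using namespace llvm;

// Illustrative sketch only: rewrite a non-volatile memset with a small,
// power-of-two constant length as a single integer store, roughly what
// InstCombine's SimplifyAnyMemSet does.
static bool simplifySmallMemSet(MemSetInst *MS) {
  auto *Len = dyn_cast<ConstantInt>(MS->getLength());
  auto *Byte = dyn_cast<ConstantInt>(MS->getValue());
  if (!Len || !Byte || MS->isVolatile())
    return false;

  uint64_t Size = Len->getZExtValue();
  if (Size == 0 || Size > 8 || !isPowerOf2_64(Size))
    return false;

  IRBuilder<> B(MS);
  // Splat the byte value across the full store width (e.g. 0 -> i64 0).
  IntegerType *Ty = B.getIntNTy(Size * 8);
  APInt Splat = APInt::getSplat(Size * 8, Byte->getValue().zextOrTrunc(8));
  Value *Dst = B.CreateBitCast(MS->getRawDest(), Ty->getPointerTo());
  B.CreateAlignedStore(ConstantInt::get(Ty, Splat), Dst, MS->getDestAlign());
  MS->eraseFromParent();
  return true;
}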

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021