Excessive moves when returning array from nested struct #46458
Comments
(I'm not sure this component is correct!)
My first guess is that this is a pass-ordering problem. If I feed the IR for function 'c' to "opt -memcpyopt -instcombine", it reduces to the expected single memset call. So we have the ability to catch this, but it's not happening in the -O3 case shown in the godbolt link above.
I guess I don't know enough about LLVM's pass architecture to help move this forward. I imagine that "just re-run those passes" would a) work, but b) have downsides (e.g. compile time), and that re-ordering the passes would just pessimize some other code. Is there a third option?
Sure - if we can make -memcpyopt smarter, it might be able to handle whatever is stopping it currently. That, or your 1st option, is our best hope. We just have to do some sanity-checking of compile-time and perf measurements if we do need to adjust the pass pipeline. It may come down to adding extra passes only at -O3, but hopefully not. If you can get a snapshot of the current IR going into the last -memcpyopt in the current pipeline and compare that to the IR after -O2 or -O3, that would be helpful. Then we need to figure out which pass(es) are altering the IR (and what they are doing) to make it amenable to an extra round of -memcpyopt.
Before:
FPM.addPass(MergedLoadStoreMotionPass());
// Specially optimize memory movement as it doesn't look like dataflow in SSA.
FPM.addPass(MemCpyOptPass());
// Sparse conditional constant propagation.
// Delete dead bit computations (instcombine runs after to fold away the dead
// Run instcombine after redundancy and dead bit elimination to exploit

After:
FPM.addPass(MergedLoadStoreMotionPass());
// Sparse conditional constant propagation.
// Delete dead bit computations (instcombine runs after to fold away the dead
// Run instcombine after redundancy and dead bit elimination to exploit
// Specially optimize memory movement as it doesn't look like dataflow in SSA.
FPM.addPass(MemCpyOptPass());

Looking at the log, this should fix your testcase.
(And it makes sense: memcpyopt should run after some cleanup pass.)
Ah, great - if that's all it takes, that's easier than anything I was imagining. :) Let's make sure we're not duplicating effort. Who wants to draft the patch and run some tests?
Feel free to take it :)
I might be doing something wrong in my local experiment: if I only move the MemCpyOpt pass later in the pipeline, it does not solve the problem. I do get the expected optimization if I add an extra run of MemCpyOpt after the next InstCombine. |
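For illustration, here is a minimal sketch of that experiment against the new pass manager's function simplification pipeline. The pass classes and headers are real LLVM, but the helper function name is made up, the surrounding passes are abbreviated, and the exact placement is the idea being tested rather than an actual patch:

// Sketch only: an extra MemCpyOpt run after InstCombine, mirroring the
// local experiment described above.
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/InstCombine/InstCombine.h"
#include "llvm/Transforms/Scalar/BDCE.h"
#include "llvm/Transforms/Scalar/MemCpyOptimizer.h"
#include "llvm/Transforms/Scalar/MergedLoadStoreMotion.h"
#include "llvm/Transforms/Scalar/SCCP.h"

using namespace llvm;

// Hypothetical helper showing where the extra MemCpyOpt run would sit.
static void addMemOptAndCleanup(FunctionPassManager &FPM) {
  FPM.addPass(MergedLoadStoreMotionPass());
  // Existing placement: optimize memory movement before the cleanup passes.
  FPM.addPass(MemCpyOptPass());
  // Sparse conditional constant propagation.
  FPM.addPass(SCCPPass());
  // Delete dead bit computations.
  FPM.addPass(BDCEPass());
  // InstCombine cleans up after the passes above...
  FPM.addPass(InstCombinePass());
  // ...and a second MemCpyOpt run then sees the simplified IR, which is what
  // made the testcase fold to a single memset in the experiment above.
  FPM.addPass(MemCpyOptPass());
}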
Ah, yes :/ MemCpyOpt changes

call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(24) %9, i8* nonnull align 8 dereferenceable(24) %5, i64 24, i1 false), !dbg !49, !tbaa.struct !60

to

call void @llvm.memset.p0i8.i64(i8* align 8 %9, i8 0, i64 24, i1 false), !dbg !49

Then InstCombine does cleanup, and then the extra MemCpyOpt does the job.
Maybe it would be enough to copy the memcpy/memset-related transformations from instcombine and run them in the memcpyopt pass as well, e.g.:
Instruction *InstCombiner::SimplifyAnyMemTransfer |
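To make that concrete, here is a minimal sketch (not the actual SimplifyAnyMemTransfer or MemCpyOpt code) of one fold in that family: a memcpy whose source bytes were entirely written by an earlier memset can itself become a memset. The helper name is made up, and the legality check is deliberately oversimplified; a real transform must also prove there is no intervening write to the source between the two intrinsics.

// Hypothetical sketch of a memcpy-of-memset fold; not the actual LLVM code.
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

// If `Copy` reads memory that `Set` filled with a constant byte, and the copy
// does not read past the memset'd region, replace the memcpy with an
// equivalent memset of the copy's destination.  Returns true on change.
static bool foldMemCpyOfMemSet(MemCpyInst *Copy, MemSetInst *Set) {
  auto *CopyLen = dyn_cast<ConstantInt>(Copy->getLength());
  auto *SetLen = dyn_cast<ConstantInt>(Set->getLength());
  if (!CopyLen || !SetLen)
    return false;
  if (Copy->getSource() != Set->getDest() ||
      CopyLen->getZExtValue() > SetLen->getZExtValue())
    return false;

  // Rewrite: memset the destination with the same byte value and length.
  IRBuilder<> B(Copy);
  B.CreateMemSet(Copy->getRawDest(), Set->getValue(), Copy->getLength(),
                 Copy->getDestAlign(), Copy->isVolatile());
  Copy->eraseFromParent();
  return true;
}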
Extended Description
Extracted from: rust-lang/rust#74267
Given the following C code, the b and c functions are behaviorally identical (https://godbolt.org/z/5eKxE5):

#include <stdlib.h>
#include <stdint.h>
#define N 2
typedef struct {
size_t length;
size_t capacity;
uint8_t* data;
} String;
static String new_string() {
String s = {0, 0, NULL};
return s;
}
struct Arr {
String data[N];
};
struct Arr b() {
struct Arr data;
for (size_t i = 0; i < N; i++) {
data.data[i] = new_string();
}
return data;
}
struct PartialArr {
struct Arr value;
};
struct Arr c() {
struct PartialArr data;
String (*slots)[N] = &data.value.data;
for (size_t i = 0; i < N; i++) {
(*slots)[i] = new_string();
}
return data.value;
}
However, they end up optimized very differently:
b: # @b
mov rax, rdi
vxorps xmm0, xmm0, xmm0
vmovups xmmword ptr [rdi], xmm0
mov qword ptr [rdi + 16], 0
vmovups xmmword ptr [rdi + 24], xmm0
mov qword ptr [rdi + 40], 0
ret
c: # @c
vxorps xmm0, xmm0, xmm0
vmovaps xmmword ptr [rsp - 56], xmm0
mov qword ptr [rsp - 40], 0
vmovups xmmword ptr [rsp - 32], xmm0
mov rax, rdi
mov qword ptr [rsp - 16], 0
vmovups xmm0, xmmword ptr [rsp - 56]
vmovups xmmword ptr [rdi], xmm0
mov rcx, qword ptr [rsp - 40]
mov qword ptr [rdi + 16], rcx
mov rcx, qword ptr [rsp - 32]
mov qword ptr [rdi + 24], rcx
mov rcx, qword ptr [rsp - 40]
mov qword ptr [rdi + 16], rcx
vmovups xmm0, xmmword ptr [rsp - 32]
vmovups xmmword ptr [rdi + 24], xmm0
mov rcx, qword ptr [rsp - 16]
mov qword ptr [rdi + 40], rcx
ret
GCC is able to optimize this better:
b:
mov QWORD PTR [rdi], 0
mov QWORD PTR [rdi+8], 0
mov QWORD PTR [rdi+16], 0
mov QWORD PTR [rdi+24], 0
mov QWORD PTR [rdi+32], 0
mov QWORD PTR [rdi+40], 0
mov rax, rdi
ret
c:
mov QWORD PTR [rdi], 0
mov QWORD PTR [rdi+8], 0
mov QWORD PTR [rdi+16], 0
mov QWORD PTR [rdi+24], 0
mov QWORD PTR [rdi+32], 0
mov QWORD PTR [rdi+40], 0
mov rax, rdi
ret