Skip to content

Orbit release build of glium takes a very long time in codegen passes #34166

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jonas-schievink opened this issue Jun 8, 2016 · 26 comments
Closed
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times.

Comments

@jonas-schievink
Copy link
Contributor

Steps to reproduce:

  • clone glium and check out glium/glium@11a6b7b (most recent commit at the time of writing)
  • try a -Zorbit release build with the current nightly (2016-06-07):
    • For me, this involved env RUSTFLAGS="-Zorbit" rustup run nightly cargo rustc --release -- -Ztime-passes
  • Observe compilation progress (very slowly but steadily, in old Rust tradition) until llvm module passes, which itself takes about 24 seconds on my machine, and then get stuck somewhere in LLVM codegen

cc @eddyb (I heard you like nothing more than MIR trans and LLVM bugs)

@eddyb eddyb added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-compiletime Issue: Problems and improvements with respect to compile times. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html labels Jun 8, 2016
@jonas-schievink
Copy link
Contributor Author

Output of perf after a while:

 51,06%  rustc    librustc_llvm-8102e29f.so             [.] llvm::SpillPlacement::iterate                                                                            
   9,49%  rustc    librustc_llvm-8102e29f.so             [.] _ZN4llvm14BlockFrequencypLES0_@plt                                                                       
   9,28%  rustc    librustc_llvm-8102e29f.so             [.] llvm::BlockFrequency::operator+=                                                                         
   4,89%  rustc    librustc_llvm-8102e29f.so             [.] llvm::BlockFrequency::operator+                                                                          
   4,42%  rustc    librustc_llvm-8102e29f.so             [.] _ZNK4llvm14BlockFrequencyplES0_@plt                                                                      
   1,49%  rustc    librustc_llvm-8102e29f.so             [.] llvm::AssumptionCache::scanFunction    

@arielb1
Copy link
Contributor

arielb1 commented Jun 8, 2016

Could you -C llvm-args=-print-before-all?

@jonas-schievink
Copy link
Contributor Author

@arielb1 That somehow managed to lock up rustc (using 0% cpu) before getting to codegen (or even optimizations). Fun!

@jonas-schievink
Copy link
Contributor Author

@jonas-schievink
Copy link
Contributor Author

jonas-schievink commented Jun 11, 2016

So it seems like there are multiple issues at play here. Let me try to sum up what @eddyb and me managed to dig up a few days ago:

  • gl-rs generates Rust code for loading OpenGL function pointers, and the generated code isn't quite optimal:
    • in fact, it generates 1624 lines of code for initializing the Gl struct, each of which performs 2 function calls
    • this can be seen in target/debug/build/glium-*/out/gl_bindings.rs when building manually
  • MIR trans generates a lot more LLVM values than old trans
    • in fact, the bad function results in 13001 values for old trans, but in 64978 values (!) for MIR trans (all unoptimized)
    • this is still fine, since the unoptimized MIR build works
    • after optimizations, old trans uses 53594 values, while MIR trans uses 61714 - MIR trans uses less than before, so this point might be moot (⬅️this might also indicate that the LLVM inliner should take the caller's size into account when deciding whether to inline or not)
  • LLVM's greedy register allocator seems to have some bad performance characteristics
    • in fact, the build spends over 5 minutes in there until I cancelled it (see the gist linked below)
    • specifying a different allocator by using llc --regalloc=X makes the code compile successfully

I'm really not sure why this all works out without optimizations but fails with them, but it seems like the greedy register allocator is partially at fault here. Of course it's the greedy allocator, it is used by default for an optimized build, while the "fast" allocator is used for debug builds.

For details, refer to this gist.

@eddyb
Copy link
Member

eddyb commented Jun 11, 2016

cc @alexcrichton Any chance this was already found and fixed upstream?

@alexcrichton
Copy link
Member

Hm doesn't look familiar to me, so it may or may not have been (sorry not much help on this one).

If codegen is generating so much code though it seems like this is something that should arguably be fixed in upstream at the same time the MIR regression is fixed...

@eddyb
Copy link
Member

eddyb commented Jun 11, 2016

Found something https://llvm.org/bugs/show_bug.cgi?id=19217 - and an older one https://llvm.org/bugs/show_bug.cgi?id=11811.
None of these bugs seem to have had much progress on them.
The cleanup one would suggest a potential reason for why MIR trans displays the problem.

bors added a commit that referenced this issue Jun 16, 2016
don't generate drop ladder steps for fields that don't need dropping

cc @eddyb

This should help with #34166
@arielb1
Copy link
Contributor

arielb1 commented Jun 16, 2016

I blame the LLVM inliner. This can be fixed by moving metaloadfn to a #[inline(never)] function:

                #[inline(never)]
                fn do_metaloadfn(loadfn: &mut FnMut(&str) -> *const __gl_imports::raw::c_void,
                                 symbol: &str,
                                 symbols: &[&str])
                                 -> *const __gl_imports::raw::c_void
                {
                    let mut ptr = loadfn(symbol);
                    if ptr.is_null() {
                        for &sym in symbols {
                            ptr = loadfn(sym);
                            if !ptr.is_null() { break; }
                        }
                    }
                    ptr
                }
                let mut metaloadfn = |symbol: &str, symbols: &[&str]| {
                    do_metaloadfn(&mut loadfn, symbol, symbols)
                };

That still leaves MIR ~20% slower, so there may be still some issue, but if at least finishes.

@arielb1
Copy link
Contributor

arielb1 commented Jun 16, 2016

The unoptimized "unit cell" we generate is

    bb8: {
        tmp34 = &mut var1;               // scope 2 at ../target/release/build/g
lium-ea08b897ed8f243c/out/gl_bindings.rs:4168:25: 4168:35
        tmp37 = const "glTexCoord3iv";   // scope 2 at ../target/release/build/g
lium-ea08b897ed8f243c/out/gl_bindings.rs:4168:36: 4168:51
        tmp36 = &(*tmp37);               // scope 2 at ../target/release/build/g
lium-ea08b897ed8f243c/out/gl_bindings.rs:4168:36: 4168:51
        tmp40 = promoted1620;            // scope 2 at ../target/release/build/g
lium-ea08b897ed8f243c/out/gl_bindings.rs:4168:53: 4168:56
        tmp39 = &(*tmp40);               // scope 2 at ../target/release/build/g
lium-ea08b897ed8f243c/out/gl_bindings.rs:4168:53: 4168:56
        tmp38 = tmp39 as &[&str] (Unsize); // scope 2 at ../target/release/build
/glium-ea08b897ed8f243c/out/gl_bindings.rs:4168:53: 4168:56
        tmp35 = (tmp36, tmp38);          // scope 2 at ../target/release/build/g
lium-ea08b897ed8f243c/out/gl_bindings.rs:4168:25: 4168:57
        tmp33 = <[closure@../target/release/build/glium-ea08b897ed8f243c/out/gl_bindings.rs:4161:38: 4163:18 loadfn:&mut F] as std::ops::FnMut<(&str, &[&str])>>::call_mut(tmp34, tmp35) -> [return: bb9, unwind: bb2]; // scope 2 at ../target/release/build/glium-ea08b897ed8f243c/out/gl_bindings.rs:4168:25: 4168:57
    }


    bb9: {
        tmp32 = FnPtr::new(tmp33) -> [return: bb10, unwind: bb2]; // scope 2 at ../target/release/build/glium-ea08b897ed8f243c/out/gl_bindings.rs:4168:14: 4168:58
    }

In LLVM, that becomes

  %95 = getelementptr inbounds %str_slice, %str_slice* %temp37, i32 0, i32 0
  store i8* getelementptr inbounds ([13 x i8], [13 x i8]* @str63338, i32 0, i32 0), i8** %95
  %96 = getelementptr inbounds %str_slice, %str_slice* %temp37, i32 0, i32 1
  store i64 13, i64* %96
  %97 = getelementptr inbounds %str_slice, %str_slice* %temp37, i32 0, i32 0
  %98 = load i8*, i8** %97
  %99 = getelementptr inbounds %str_slice, %str_slice* %temp37, i32 0, i32 1
  %100 = load i64, i64* %99
  store [0 x %str_slice]* @ref63335, [0 x %str_slice]** %temp40
  %101 = load [0 x %str_slice]*, [0 x %str_slice]** %temp40, !nonnull !0
  %102 = bitcast [0 x %str_slice]* %101 to %str_slice*
  %103 = getelementptr inbounds { %str_slice, { %str_slice*, i64 } }, { %str_slice, { %str_slice*, i64 } }* %temp35, i32 0, i32 0
  %104 = getelementptr inbounds %str_slice, %str_slice* %103, i32 0, i32 0
  store i8* %98, i8** %104
  %105 = getelementptr inbounds %str_slice, %str_slice* %103, i32 0, i32 1
  store i64 %100, i64* %105
  %106 = getelementptr inbounds { %str_slice, { %str_slice*, i64 } }, { %str_slice, { %str_slice*, i64 } }* %temp35, i32 0, i32 1
  %107 = getelementptr inbounds { %str_slice*, i64 }, { %str_slice*, i64 }* %106, i32 0, i32 0
  store %str_slice* %102, %str_slice** %107
  %108 = getelementptr inbounds { %str_slice*, i64 }, { %str_slice*, i64 }* %106, i32 0, i32 1
  store i64 0, i64* %108
  %109 = getelementptr inbounds { %str_slice, { %str_slice*, i64 } }, { %str_slice, { %str_slice*, i64 } }* %temp35, i32 0, i32 0
  %110 = getelementptr inbounds %str_slice, %str_slice* %109, i32 0, i32 0
  %111 = load i8*, i8** %110
  %112 = getelementptr inbounds %str_slice, %str_slice* %109, i32 0, i32 1
  %113 = load i64, i64* %112
  %114 = getelementptr inbounds { %str_slice, { %str_slice*, i64 } }, { %str_slice, { %str_slice*, i64 } }* %temp35, i32 0, i32 1
  %115 = getelementptr inbounds { %str_slice*, i64 }, { %str_slice*, i64 }* %114, i32 0, i32 0
  %116 = load %str_slice*, %str_slice** %115
  %117 = getelementptr inbounds { %str_slice*, i64 }, { %str_slice*, i64 }* %114, i32 0, i32 1
  %118 = load i64, i64* %117
  %119 = invoke i8* @"_ZN5glium2gl2Gl9load_with28_$u7b$$u7b$closure$u7d$$u7d$17h26487c07ccd16bb4E"(%closure.4* dereferenceable(8) %metaloadfn, i8* noalias nonnull readonly %111, i64 %113, %str_slice* noalias nonnull readonly %116, i64 %118)
          to label %bb9 unwind label %cleanup

bb9:                                              ; preds = %bb8
  invoke void @_ZN5glium2gl5FnPtr3new17h33e17555c3c378b0E(%"gl::FnPtr"* noalias nocapture sret dereferenceable(16) %tmp_ret3, i8* %119)
          to label %bb10 unwind label %cleanup

bb10:                                             ; preds = %bb9
  %120 = getelementptr inbounds %"gl::FnPtr", %"gl::FnPtr"* %tmp_ret3, i32 0, i32 0
  %121 = getelementptr inbounds %"gl::FnPtr", %"gl::FnPtr"* %tmp_ret3, i32 0, i32 1
  %122 = load i8*, i8** %120
  %123 = load i8, i8* %121, !range !14
  %124 = trunc i8 %123 to i1

Very ugly. Pair-passing style + lack of DPS means we have to spill out lots of things.

NRVO should fix this.

@MagaTailor
Copy link

I commented in #25478, as the compilation is able to pull through. However it's probably related as I use -Zorbit by default.

@eddyb eddyb added this to the Launch MIR into Orbit milestone Jul 12, 2016
@eddyb
Copy link
Member

eddyb commented Jul 12, 2016

Added to the milestone so it's tracked until it gets fixed.

@nikomatsakis
Copy link
Contributor

@arielb1 can you elaborate a bit on your comment. For example:

The unoptimized "unit cell" we generate is ...[MIR]

What do you mean by "unit cell"? What is the Rust code from which this is generated?

Pair-passing style + lack of DPS means we have to spill out lots of things.

Same question :) what is the rust code? What is "pair-passing style"?

Let me take my best guess :). I think by "pair passing style" you are referring to the rust-call convention, where we are tupling up the arguments? Certainly I see that in your snippet, e.g. in the creation of tmp35 = (tmp36, tmp38);. But I'm not clear on how DPS would help in this case!

@nikomatsakis
Copy link
Contributor

Is this in any way platform specific?

@jonas-schievink
Copy link
Contributor Author

This happened on x64 Arch Linux, but I can't reproduce it anymore:

  time: 1.154; rss: 290MB   llvm function passes [0]
  time: 23.388; rss: 306MB  llvm module passes [0]
  time: 10.041; rss: 305MB  codegen passes [0]
  time: 0.004; rss: 306MB   codegen passes [0]

@eddyb
Copy link
Member

eddyb commented Jul 28, 2016

@jonas-schievink Hangs Takes ages in release + -Z orbit, both on glium master and the original revision, for me.

For master, i.e. glium/glium@66d0b5b:

time: 7.424; rss: 554MB translation
time: 0.003; rss: 554MB assert dep graph
time: 0.000; rss: 554MB serialize dep graph
  time: 1.513; rss: 491MB       llvm function passes [0]
  time: 34.153; rss: 491MB      llvm module passes [0]
  time: 3127.262; rss: 495MB    codegen passes [0]
  time: 0.007; rss: 495MB       codegen passes [0]
time: 3163.323; rss: 495MB      LLVM passes
time: 0.276; rss: 495MB linking

EDIT: This was with an older Cargo.lock, and Cargo didn't react to me updating the repo itself.
It finishes in about a minute otherwise.

@nikomatsakis
Copy link
Contributor

Discussed in compiler team meeting. @arielb1 will try to produce a stand-alone test case. Seems like doing some "NRVO-style" optimization may be the best answer here.

@nikomatsakis
Copy link
Contributor

One question is whether the workaround that @arielb1 suggested has been applied for glium? @arielb1 at least is of the opinion that (if so) this need not block moving to MIR more broadly, as there are many possible O(n^2) cases we could hit, even in old trans (even if this code was not among them). Though I'd still like to see it fixed =)

@eddyb eddyb changed the title Orbit release build of glium gets stuck in codegen passes Orbit release build of glium takes a very long time in codegen passes Jul 28, 2016
@jonas-schievink
Copy link
Contributor Author

jonas-schievink commented Jul 28, 2016

Glium is still on gl_generator 0.5.1, the fix is in 0.5.2

EDIT: Ah nevermind, Cargo will update to 0.5.2 anyways :)

@arielb1
Copy link
Contributor

arielb1 commented Jul 28, 2016

self-contained, but not minified, example at www.cs.tau.ac.il/~arielb1/gl-bindings-self-contained-1.rs.xz

Because of how the O(n^2)-ess works, this only takes a few seconds to codegen, but still an anomalously high time,

@eddyb
Copy link
Member

eddyb commented Jul 28, 2016

@jonas-schievink Oops! Totally my bad, I had a stale Cargo.lock.

@arielb1
Copy link
Contributor

arielb1 commented Jul 28, 2016

minified example: http://sprunge.us/fGMM (2kLoC - macro magic should be reduce this)

@arielb1
Copy link
Contributor

arielb1 commented Jul 28, 2016

Template:

pub struct Gl {
    pub data: [FnPtr; LEN]
}


pub struct FnPtr {
    pub f: *const u8,
    pub is_loaded: bool,
}

fn missing_fn_panic() -> ! {
    panic!("gl function was not loaded")
}

impl FnPtr {
    fn new(ptr: *const u8) -> FnPtr {
        if ptr.is_null() {
            FnPtr {
                f: missing_fn_panic as *const u8,
                is_loaded: false
            }
        } else {
            FnPtr { f: ptr, is_loaded: true }
        }
    }
}


pub fn load_with(loadfn: &mut FnMut(&str) -> *const u8) -> Gl {
    let mut metaloadfn = |symbol: &str, symbols: &[&str]| {
        let mut ptr = loadfn(symbol);
        if ptr.is_null() {
            for &sym in symbols {
                ptr = loadfn(sym);
                if !ptr.is_null() { break; }
            }
        }
        ptr
    };

    Gl { data: [
i=0;
while [[ $i -lt $j ]]; do
    echo '        FnPtr::new(metaloadfn("'"function"'", &[])),';
    i=$[i+1]
done >> test-slow.rs
sed -i s/LEN/$j/ test-slow.rs
echo ']}}' >> test-slow.rs

@nikomatsakis
Copy link
Contributor

nikomatsakis commented Aug 1, 2016

Removing from the "launch MIR into orbit" milestone for the following reasons:

  • The problem with glium specifically is not present in the latest release.
  • As @arielb1 argues, there are many sources of "O(n^2)" behavior in LLVM (not to argue we shouldn't try to fix this one), some of which are present in old trans, some in MIR trans.
  • We do plan to address this source by working on optimizations like move-forwarding etc
  • Gaining experience with MIR in nightly seems very valuable

@nikomatsakis nikomatsakis removed this from the Launch MIR into Orbit milestone Aug 1, 2016
@tafia
Copy link
Contributor

tafia commented Nov 28, 2016

Is there any news on this?

I have a project (closed source) that can't compile on nightly/release but compiles fine on stable and nightly/debug.
I've stop waiting after more than 20 mins (stable compiles in about 1 min).

When running cargo rustc --release -- -Ztime-passes, the last step I see is:

time: 1.014; rss: 121MB       llvm function passes [0]

So I assume the next step (which is llvm module passes [0] on stable) is having a problem on my system.
I'm using

$ rustc -V
rustc 1.15.0-nightly (2217bd771 2016-11-25)
$ uname -r
2.6.32-431.29.2.el6.x86_64

Is there any command I can run to explore in more details what is happening?
Also I don't know how to try another allocator as specified in this comment

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 25, 2017
@ishitatsuyuki
Copy link
Contributor

Closing this due to inactivity. This was probably a LLVM bug, which may have been already fixed today.

If the issue happens on the same code, please request to reopen. If a similar issue happens with different code, please open a new issue with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times.
Projects
None yet
Development

No branches or pull requests

9 participants