Performance regression with threads #14417


Closed
kpamnany opened this issue Dec 15, 2015 · 12 comments
Labels
multithreading Base.Threads and related functionality performance Must go faster regression Regression in behavior compared to a previous version

Comments

@kpamnany
Contributor

Compared to the old threading branch, there is a 50% performance regression for multithreading performance on master for Laplace 3D:

$ ./julia.threading test/perf/threads/laplace3d/laplace3d.jl
elapsed time: 21.330972965 seconds (2624364 bytes allocated, 0.10% gc time)
elapsed time: 22.722768437 seconds (228491524 bytes allocated, 0.19% gc time)
$ ./julia test/perf/threads/laplace3d/laplace3d.jl
 32.323305 seconds (63.30 k allocations: 2.932 MB, 0.00% gc time)
 33.433139 seconds (1.65 M allocations: 262.402 MB, 0.21% gc time)

For the code snippet here, the slowdown is over 100X:

$ ./julia.threading ../code/julia/workloads/gaussianblur.jl
elapsed time: 0.142544649 seconds (42301044 bytes allocated)
$ ./julia ../code/julia/workloads/gaussianblur.jl
 12.242326 seconds (65.43 M allocations: 1.042 GB, 5.06% gc time)

I'll try to figure out where this happened.

@kpamnany kpamnany added performance Must go faster regression Regression in behavior compared to a previous version multithreading Base.Threads and related functionality labels Dec 15, 2015
@kpamnany
Contributor Author

Cc: @JeffBezanson, @vtjnash, @tkelman, @yuyichao.

@yuyichao
Contributor

FYI, in your second example, removing the @threads gives a 2x speedup =(....

Also, the issue with that example is that the `+` chain in stencil2d is too long, and type inference gives up on the long argument tuple. Adding parentheses speeds it up by 20x and gets rid of the allocations. (The non-threaded version is still faster, though...)

@vtjnash
Member

vtjnash commented Dec 15, 2015

fwiw, with #12904 merged now, profiling is multi-thread aware and Profile.print() / ProfileView should be giving a multi-threaded view of the world

@kpamnany
Contributor Author

This regression still exists at the point jn/threading was merged. :-(

@yuyichao: can you clarify the + issue? I added parentheses but still see lots of boxes in the typed AST. Wouldn't this be a problem on the old threading branch also?

@yuyichao
Contributor

The box intrinsics aren't the issue. The ultimate issue is that +(......) doesn't get inlined (which is bad on its own), and it also generates a jlcall, which requires boxing the arguments. (Type inference might also have given up on the precise type and done something worse, but I didn't verify too carefully.)

I think this is due to some tweak in inlining/type inference, so it might not affect the old threading branch. My version of stencil2d is like this; the @inline may not be necessary.

@inline function stencil2d(buf, x, y)
    return (buf[x-2,y-2] * 0.0030 + buf[x-1,y-2] * 0.0133 + buf[x,y-2] * 0.0219 + buf[x+1,y-2] * 0.0133 + buf[x+2,y-2] * 0.0030) +
           (buf[x-2,y-1] * 0.0133 + buf[x-1,y-1] * 0.0596 + buf[x,y-1] * 0.0983 + buf[x+1,y-1] * 0.0596 + buf[x+2,y-1] * 0.0133) +
           (buf[x-2,y  ] * 0.0219 + buf[x-1,y  ] * 0.0983 + buf[x,y  ] * 0.1621 + buf[x+1,y  ] * 0.0983 + buf[x+2,y  ] * 0.0219) +
           (buf[x-2,y+1] * 0.0133 + buf[x-1,y+1] * 0.0596 + buf[x,y+1] * 0.0983 + buf[x+1,y+1] * 0.0596 + buf[x+2,y+1] * 0.0133) +
           (buf[x-2,y+2] * 0.0030 + buf[x-1,y+2] * 0.0133 + buf[x,y+2] * 0.0219 + buf[x+1,y+2] * 0.0133 + buf[x+2,y+2] * 0.0030)
end
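As a quick sanity check (a hedged sketch: `grouped_sum` below is a smaller stand-in for stencil2d, not the real kernel), the grouped form should be allocation-free once compiled:

```julia
# Compact stand-in for a grouped stencil; like the grouped stencil2d,
# each parenthesized row keeps the outer `+` to a handful of arguments,
# so the whole chain inlines and nothing gets boxed.
@inline grouped_sum(buf, x, y) =
    (buf[x-1, y-1] + buf[x, y-1] + buf[x+1, y-1]) +
    (buf[x-1, y  ] + buf[x, y  ] + buf[x+1, y  ]) +
    (buf[x-1, y+1] + buf[x, y+1] + buf[x+1, y+1])

const A = rand(8, 8)
grouped_sum(A, 4, 4)                             # warm up (compile)
@assert (@allocated grouped_sum(A, 4, 4)) == 0   # no boxing after warmup
```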

@kpamnany
Contributor Author

This is not very nice. Your version of stencil2d is quick. The following one isn't:

@inline function stencil2d(buf, x, y)
    return ((buf[x-2,y-2] * 0.0030) + (buf[x-1,y-2] * 0.0133) + (buf[x,y-2] * 0.0219) + (buf[x+1,y-2] * 0.0133) + (buf[x+2,y-2] * 0.0030) +
            (buf[x-2,y-1] * 0.0133) + (buf[x-1,y-1] * 0.0596) + (buf[x,y-1] * 0.0983) + (buf[x+1,y-1] * 0.0596) + (buf[x+2,y-1] * 0.0133) +
            (buf[x-2,y  ] * 0.0219) + (buf[x-1,y  ] * 0.0983) + (buf[x,y  ] * 0.1621) + (buf[x+1,y  ] * 0.0983) + (buf[x+2,y  ] * 0.0219) +
            (buf[x-2,y+1] * 0.0133) + (buf[x-1,y+1] * 0.0596) + (buf[x,y+1] * 0.0983) + (buf[x+1,y+1] * 0.0596) + (buf[x+2,y+1] * 0.0133) +
            (buf[x-2,y+2] * 0.0030) + (buf[x-1,y+2] * 0.0133) + (buf[x,y+2] * 0.0219) + (buf[x+1,y+2] * 0.0133) + (buf[x+2,y+2] * 0.0030))
end

Which doesn't seem very reasonable?

@yuyichao
Contributor

Well, the issue is the number of arguments to +(args...), and just adding parentheses around each individual term doesn't make a difference in that regard.
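A quick way to see the arity point is to inspect the parsed expressions (a minimal sketch; `a`–`e` are placeholder symbols, not from the workload):

```julia
# A flat chain of `+` parses as a single n-ary call, so every term is one
# argument to that call; only grouping sub-sums reduces the arity.
ex_flat    = :(a + b + c + d + e)      # +(a, b, c, d, e)       -- 5 arguments
ex_grouped = :((a + b + c) + (d + e))  # +(+(a, b, c), +(d, e)) -- 2 arguments

println(length(ex_flat.args) - 1)      # prints 5 (args[1] is the :+ itself)
println(length(ex_grouped.args) - 1)   # prints 2
```

This is why wrapping each `buf[...] * c` term in parentheses changes nothing, while parenthesizing each row of the stencil does.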

@kpamnany
Contributor Author

Sure, but this is pretty non-intuitive for your average programmer. But anyway, that's a separate issue.

@yuyichao
Contributor

> but this is pretty non-intuitive for your average programmer

Agree. Ref #5402, #13359, and also the callsite inlining hint (I can't find an open issue for it).

@JeffBezanson
Member

May be caused by #13735

@kpamnany
Contributor Author

Focusing on the Gaussian blur code here running on master, serial performance (i.e. no @threads on the outer for loop) is 0.273327 seconds. Multithreaded performance with JULIA_NUM_THREADS=1 is 1.132893 seconds. The multithreaded code actually does scale -- performance with JULIA_NUM_THREADS=4 is 0.371554 seconds, and with JULIA_NUM_THREADS=8 is 0.231501 seconds.

Ignore the threading branch; that's a red herring -- serial performance is identical to performance with JULIA_NUM_THREADS=1 there.

The typed ASTs for the serial and multithreaded code are in the same gist and look pretty different. Do they help in understanding this?
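For reference, the serial-vs-threaded comparison above can be reproduced with a pattern like this (a hedged sketch: `work` is a toy per-element kernel standing in for the blur, since gaussianblur.jl itself isn't shown in this thread; absolute timings depend on JULIA_NUM_THREADS):

```julia
using Base.Threads

work(x) = muladd(x, 1.000001, 0.5)   # toy stand-in for the per-pixel stencil

function run_serial!(a)
    for i in eachindex(a)
        a[i] = work(a[i])
    end
end

function run_threaded!(a)
    @threads for i in eachindex(a)   # same loop body, outer loop threaded
        a[i] = work(a[i])
    end
end

a = zeros(10^7)
run_serial!(a); run_threaded!(a)     # warm up: compile both paths
t_s = @elapsed run_serial!(a)
t_t = @elapsed run_threaded!(a)
println("serial: $t_s s, threaded ($(nthreads()) threads): $t_t s")
```

Comparing t_s against t_t at JULIA_NUM_THREADS=1 isolates the fixed overhead of the @threads machinery from the scaling behavior.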

@JeffBezanson
Member

Fixed on master.


4 participants