Performance regression with threads #14417


Closed
kpamnany opened this issue Dec 15, 2015 · 12 comments
Labels
multithreading Base.Threads and related functionality performance Must go faster regression Regression in behavior compared to a previous version

Comments

@kpamnany
Contributor

Compared to the old threading branch, there is a 50% performance regression for multithreading performance on master for Laplace 3D:

$ ./julia.threading test/perf/threads/laplace3d/laplace3d.jl
elapsed time: 21.330972965 seconds (2624364 bytes allocated, 0.10% gc time)
elapsed time: 22.722768437 seconds (228491524 bytes allocated, 0.19% gc time)
$ ./julia test/perf/threads/laplace3d/laplace3d.jl
 32.323305 seconds (63.30 k allocations: 2.932 MB, 0.00% gc time)
 33.433139 seconds (1.65 M allocations: 262.402 MB, 0.21% gc time)

For the code snippet here, the slowdown is over 100X:

$ ./julia.threading ../code/julia/workloads/gaussianblur.jl
elapsed time: 0.142544649 seconds (42301044 bytes allocated)
$ ./julia ../code/julia/workloads/gaussianblur.jl
 12.242326 seconds (65.43 M allocations: 1.042 GB, 5.06% gc time)

I'll try to figure out where this happened.

@kpamnany kpamnany added performance Must go faster regression Regression in behavior compared to a previous version multithreading Base.Threads and related functionality labels Dec 15, 2015
@kpamnany
Contributor Author

Cc: @JeffBezanson, @vtjnash, @tkelman, @yuyichao.

@yuyichao
Contributor

FYI, in your second example, removing the @threads gives a 2x speedup =(....

Also, the issue with that example is that the `+` chain in stencil2d is too long, and type inference gives up on the long argument tuple. Adding parentheses speeds it up by 20x and gets rid of the allocations. (The non-threaded version is still faster, though...)

@vtjnash
Member

vtjnash commented Dec 15, 2015

fwiw, with #12904 merged now, profiling is multi-thread aware and Profile.print() / ProfileView should be giving a multi-threaded view of the world

@kpamnany
Contributor Author

This regression still exists at the point jn/threading was merged. :-(

@yuyichao: can you clarify the + issue? I added parentheses but still see lots of boxes in the typed AST. Wouldn't this be a problem on the old threading branch also?

@yuyichao
Contributor

The box intrinsics aren't the issue. The ultimate issue is that +(......) doesn't get inlined (which is bad on its own), and it also generates a jlcall, which requires boxing the arguments. (Type inference might also have given up on the precise type and done something worse, but I didn't verify too carefully.)

I think this is due to some tweak in inlining/type inference, so it might not affect the old threading branch. My version of stencil2d is like this; the @inline may not be necessary.

@inline function stencil2d(buf, x, y)
    return (buf[x-2,y-2] * 0.0030 + buf[x-1,y-2] * 0.0133 + buf[x,y-2] * 0.0219 + buf[x+1,y-2] * 0.0133 + buf[x+2,y-2] * 0.0030) +
           (buf[x-2,y-1] * 0.0133 + buf[x-1,y-1] * 0.0596 + buf[x,y-1] * 0.0983 + buf[x+1,y-1] * 0.0596 + buf[x+2,y-1] * 0.0133) +
           (buf[x-2,y  ] * 0.0219 + buf[x-1,y  ] * 0.0983 + buf[x,y  ] * 0.1621 + buf[x+1,y  ] * 0.0983 + buf[x+2,y  ] * 0.0219) +
           (buf[x-2,y+1] * 0.0133 + buf[x-1,y+1] * 0.0596 + buf[x,y+1] * 0.0983 + buf[x+1,y+1] * 0.0596 + buf[x+2,y+1] * 0.0133) +
           (buf[x-2,y+2] * 0.0030 + buf[x-1,y+2] * 0.0133 + buf[x,y+2] * 0.0219 + buf[x+1,y+2] * 0.0133 + buf[x+2,y+2] * 0.0030)
end
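As a quick sanity check (a hedged sketch: `grouped_sum` below is a smaller stand-in for stencil2d, not the real kernel), the grouped form should be allocation-free once compiled:

```julia
# Compact stand-in for a grouped stencil; like the grouped stencil2d,
# each parenthesized row keeps the outer `+` to a handful of arguments,
# so the whole chain inlines and nothing gets boxed.
@inline grouped_sum(buf, x, y) =
    (buf[x-1, y-1] + buf[x, y-1] + buf[x+1, y-1]) +
    (buf[x-1, y  ] + buf[x, y  ] + buf[x+1, y  ]) +
    (buf[x-1, y+1] + buf[x, y+1] + buf[x+1, y+1])

const A = rand(8, 8)
grouped_sum(A, 4, 4)                             # warm up (compile)
@assert (@allocated grouped_sum(A, 4, 4)) == 0   # no boxing after warmup
```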

@kpamnany
Contributor Author

This is not very nice. Your version of stencil2d is quick. The following one isn't:

@inline function stencil2d(buf, x, y)
    return ((buf[x-2,y-2] * 0.0030) + (buf[x-1,y-2] * 0.0133) + (buf[x,y-2] * 0.0219) + (buf[x+1,y-2] * 0.0133) + (buf[x+2,y-2] * 0.0030) +
            (buf[x-2,y-1] * 0.0133) + (buf[x-1,y-1] * 0.0596) + (buf[x,y-1] * 0.0983) + (buf[x+1,y-1] * 0.0596) + (buf[x+2,y-1] * 0.0133) +
            (buf[x-2,y  ] * 0.0219) + (buf[x-1,y  ] * 0.0983) + (buf[x,y  ] * 0.1621) + (buf[x+1,y  ] * 0.0983) + (buf[x+2,y  ] * 0.0219) +
            (buf[x-2,y+1] * 0.0133) + (buf[x-1,y+1] * 0.0596) + (buf[x,y+1] * 0.0983) + (buf[x+1,y+1] * 0.0596) + (buf[x+2,y+1] * 0.0133) +
            (buf[x-2,y+2] * 0.0030) + (buf[x-1,y+2] * 0.0133) + (buf[x,y+2] * 0.0219) + (buf[x+1,y+2] * 0.0133) + (buf[x+2,y+2] * 0.0030))
end

Which doesn't seem very reasonable?

@yuyichao
Contributor

Well, the issue is the number of arguments to +(args...), and just adding parentheses around each individual term doesn't make a difference in that regard.
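A quick way to see the arity point is to inspect the parsed expressions (a minimal sketch; `a`–`e` are placeholder symbols, not from the workload):

```julia
# A flat chain of `+` parses as a single n-ary call, so every term is one
# argument to that call; only grouping sub-sums reduces the arity.
ex_flat    = :(a + b + c + d + e)      # +(a, b, c, d, e)       -- 5 arguments
ex_grouped = :((a + b + c) + (d + e))  # +(+(a, b, c), +(d, e)) -- 2 arguments

println(length(ex_flat.args) - 1)      # prints 5 (args[1] is the :+ itself)
println(length(ex_grouped.args) - 1)   # prints 2
```

This is why wrapping each `buf[...] * c` term in parentheses changes nothing, while parenthesizing each row of the stencil does.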

@kpamnany
Contributor Author

Sure, but this is pretty non-intuitive for your average programmer. But anyway, that's a separate issue.

@yuyichao
Contributor

> but this is pretty non-intuitive for your average programmer

Agree. Ref #5402, #13359, and also the callsite inlining hint (I can't find an open issue for it).

@JeffBezanson
Member

May be caused by #13735

@kpamnany
Contributor Author

Focusing on the Gaussian blur code here running on master, serial performance (i.e. no @threads on the outer for loop) is 0.273327 seconds. Multithreaded performance with JULIA_NUM_THREADS=1 is 1.132893 seconds. The multithreaded code actually does scale -- performance with JULIA_NUM_THREADS=4 is 0.371554 seconds, and with JULIA_NUM_THREADS=8 is 0.231501 seconds.

Ignore the threading branch; that's a red herring -- serial performance is identical to performance with JULIA_NUM_THREADS=1 there.

The typed ASTs for the serial and multithreaded code are in the same gist and look pretty different. Do they help in understanding this?
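For reference, the serial-vs-threaded comparison above can be reproduced with a pattern like this (a hedged sketch: `work` is a toy per-element kernel standing in for the blur, since gaussianblur.jl itself isn't shown in this thread; absolute timings depend on JULIA_NUM_THREADS):

```julia
using Base.Threads

work(x) = muladd(x, 1.000001, 0.5)   # toy stand-in for the per-pixel stencil

function run_serial!(a)
    for i in eachindex(a)
        a[i] = work(a[i])
    end
end

function run_threaded!(a)
    @threads for i in eachindex(a)   # same loop body, outer loop threaded
        a[i] = work(a[i])
    end
end

a = zeros(10^7)
run_serial!(a); run_threaded!(a)     # warm up: compile both paths
t_s = @elapsed run_serial!(a)
t_t = @elapsed run_threaded!(a)
println("serial: $t_s s, threaded ($(nthreads()) threads): $t_t s")
```

Comparing t_s against t_t at JULIA_NUM_THREADS=1 isolates the fixed overhead of the @threads machinery from the scaling behavior.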

@JeffBezanson
Member

Fixed on master.


4 participants