muladd may unexpectedly optimize multiplications with LLVM 8 and 9 #34988

Closed
kimikage opened this issue Mar 4, 2020 · 15 comments

@kimikage (Contributor) commented Mar 4, 2020

Since I'm not familiar with Julia's compiler or LLVM, I don't know whether this is a bug. There are three main parts to this issue.

background

I wrote a fast version of the method for division by 60 in JuliaGraphics/Colors.jl#407, as follows:

div60(x) = x / 60
div60(x::T) where T <: Union{Float32, Float64} = muladd(x, T(1/960), x * T(0x1p-6))
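
The split is exact in rational arithmetic, since 1/64 + 1/960 = 15/960 + 1/960 = 16/960 = 1/60; a quick sanity check:

julia> 1//64 + 1//960 == 1//60  # the constant split is exact as rationals
true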

It works fine on Julia v1.0.5 and v1.3.1 (on Windows, Linux and macOS with x86_64 processors).

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> 90.0f0 * inv(60.0f0) # simple reciprocal multiplication
1.5000001f0

julia> div60(90.0f0) # the method above
1.5f0

However, it may not work as expected on Julia v1.4.0-rc2 and the nightly builds. (All the examples below are on v1.4.0-rc2.)

julia> versioninfo()
Julia Version 1.4.0-rc2.0
Commit b99ed72c95 (2020-02-24 16:51 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

julia> div60(90.0f0)
1.5000001f0

julia> @code_native debuginfo=:none div60(90.0f0) # x * (1/960 + 0x1p-6)
        .text
        pushq   %rbp
        movq    %rsp, %rbp
        movabsq $708909116, %rax        # imm = 0x2A41183C
        vmulss  (%rax), %xmm0, %xmm0
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)
        nop

part 1

The purpose of the muladd function is to use the FMA instruction, but I do not want to use the fma function for the following reasons:

  • some systems have no hardware FMA support
  • 32-bit (i686) systems use fma_libm instead of fma_llvm, even if the system supports the FMA instructions

The first part is a pure question: is this conditional branch also appropriate on LLVM 8 or 9?

julia/base/floatfuncs.jl, lines 327–329 at 3b53f54:

if (Sys.ARCH !== :i686 && fma_llvm(1.0000305f0, 1.0000305f0, -1.0f0) == 6.103609f-5 &&
(fma_llvm(1.0000000009313226, 1.0000000009313226, -1.0) ==
1.8626451500983188e-9) && 0.1 + 0.2 == 0.30000000000000004)

part 2

The second part may be related to the LLVM backend.

By abandoning the use of the FMA instruction, we can use the following workaround:

julia> div60_mul(x::T) where T <: Union{Float32, Float64} = x * T(0x1p-6) + x * T(1/960);

julia> div60_mul(90.0f0)
1.5f0

However, it is inconsistent that div60_mul is not optimized into a single multiplication (while div60 is). The only difference is the contract flags:

julia> @code_llvm debuginfo=:none div60(90.0f0)

; Function Attrs: uwtable
define float @julia_div60_17363(float) #0 {
top:
  %1 = fmul float %0, 1.562500e-02
  %2 = fmul contract float %0, 0x3F51111120000000
  %3 = fadd contract float %2, %1
  ret float %3
}

julia> @code_llvm debuginfo=:none div60_mul(90.0f0)

; Function Attrs: uwtable
define float @julia_div60_mul_17511(float) #0 {
top:
  %1 = fmul float %0, 1.562500e-02
  %2 = fmul float %0, 0x3F51111120000000
  %3 = fadd float %1, %2
  ret float %3
}

Although I don't know the details of contract, it is counterintuitive that adding contract flags allows the contracted fmul to be fused with the un-contracted fmul.

Does this have anything to do with issue #34093?

part 3

The third part concerns the context dependency of compilation, and this is not a bug. The following workaround does not work.

_div60(x::T) where T = muladd(x, T(1/960), x * T(0x1p-6))
if _div60(90.0f0) == 1.5f0  # in principle, this is true
    div60(x::T) where T <: Union{Float32, Float64} = _div60(x)
else
    # force two-step multiplication
    div60(x::T) where T <: Union{Float32, Float64} = x * T(0x1p-6) + x * T(1/960)
end

Perhaps when _div60(90.0f0) == 1.5f0 is evaluated at the top level, what the LLVM backend will eventually compile _div60 to is not yet determined.
I think that is inevitable given how the compiler works. Is there any workaround?
Edit:
Of course, in the case of div60, the difference is acceptable, so this is not a fatal problem. However, I am concerned that such contextual dependencies can cause segfaults (cf. #34121).
Edit2:
I'm going to use an ad-hoc check, if reduce(max, _div60.((90.0f0,))) == 1.5f0, but it is not a fundamental fix.
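Another idea, just an untested sketch assuming the probe result is at least consistent within a session, is to defer the probe to runtime via a flag set in __init__:

const _use_muladd = Ref(false)

function __init__()
    # probe at runtime, so the branch reflects the natively compiled _div60
    _use_muladd[] = _div60(90.0f0) == 1.5f0
end

div60(x::T) where T <: Union{Float32, Float64} =
    _use_muladd[] ? _div60(x) : x * T(0x1p-6) + x * T(1/960)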

@yuyichao (Contributor) commented Mar 4, 2020

Note that this is documented. It's done this way since this kind of optimization is much easier for LLVM to do than for us.

@yuyichao (Contributor) commented Mar 4, 2020

However, I am concerned that such contextual dependencies can cause segfaults (cf. #34121).

Why do you think that's the case?

I'm going to use an ad-hoc measure if reduce(max, _div60.((90.0f0,))) == 1.5f0, but it is not a drastic measure.

No you couldn’t possibly check this at compile time in julia

@kimikage (Contributor, Author) commented Mar 4, 2020

Note that this is documented.

Yes, my reasons for not wanting to use the fma function follow the documentation. I believe that the optimization itself is fine. However, I need more explanation for the inconsistency (part 2).

Why do you think that's the case?

It is generally dangerous for the result of a conditional branch to differ between precompilation and runtime compilation. div60 looks "pure" to me. Of course, it is reasonable not to write such code in the first place.

No you couldn’t possibly check this at compile time in julia

In this case, I don't want a perfect solution because the numerical errors are not fatal.

@yuyichao (Contributor) commented Mar 4, 2020

Yes, my reasons for not wanting to use the fma function follow the documentation. I believe that the optimization itself is fine. However, I need more explanation for the inconsistency (part 2).

What's the inconsistency? From the code you wrote, one of them allowed fma to be used whereas the other didn't, and the generated code reflects that. Are you just confused by what the LLVM IR means?

It is generally dangerous for the result of a conditional branch to differ between precompilation and runtime compilation.

That, on its own, you can pretty much always expect to happen whenever architecture-dependent optimizations are allowed. This includes muladd, potentially libm functions, and definitely BLAS functions and everything that uses them.

div60 looks "pure" to me. Of course, it is reasonable not to write such code in the first place.

My question was: is that what you identified as the cause of the crash in the first place? If you didn't write any unsafe code there should not be a crash, and I don't see why this behavior could lead to one.

@yuyichao (Contributor) commented Mar 4, 2020

The first part is a pure question: is this conditional branch also appropriate on LLVM 8 or 9?

That part is no more wrong on LLVM 8 or 9 than it was on older LLVM versions. It has been wrong ever since function multiversioning was enabled, and I think the only way to do it correctly is to intercept the LLVM lowering of llvm.fma.

@kimikage (Contributor, Author) commented Mar 4, 2020

What's the inconsistency?

In the example using muladd, all three instructions are replaced with a single multiplication, but in the example without muladd no fusion is done.
I do not fully understand the meaning of contract. There is no problem with fusing the fadd and the fmul. However, why is the other fmul also fused? It is hard to imagine that the effect of muladd spreads so far.

My question was: is that what you identified as the cause of the crash in the first place?

I didn't mean there was an identified problem. I just picked it up as the worst case. I’m sorry if I misled you.

@kimikage (Contributor, Author) commented Mar 4, 2020

That part is no more wrong on LLVM 8 or 9 than it was on older LLVM versions.

Thank you for the information. This is just my impression, but it is strange that i686 is handled specially.

@kimikage (Contributor, Author) commented Mar 4, 2020

For anyone familiar with LLVM: if this is not worth explaining or documenting, please close this issue.

@yuyichao (Contributor) commented Mar 4, 2020

OK, so

However, it is inconsistent that div60_mul is not optimized into a single multiplication (while div60 is).

is not the inconsistency. There's nothing wrong with this one NOT being optimized. div60 is also not expected to be "consistent" with anything else since it is not the same code.

The question instead seems to be why it contracted the operations into a single multiplication rather than a multiplication and an fma. That seems to be because LLVM got smarter and contracted (a * b + c * b) into (a + c) * b. It's a question for LLVM whether this should be allowed.

I didn't mean there was an identified problem. I just picked it up as the worst case. I’m sorry if I misled you.

No, that's not really how it works. It's only the worst case in the sense that anything can lead to a crash. It's not more likely to lead to a crash than anything else.

@kimikage (Contributor, Author) commented Mar 4, 2020

It's a question for LLVM whether this should be allowed.

I see. Does this mean that there is no way to control it from a Julia script? In general, (a * b + c * b) != (a + c) * b in floating-point arithmetic. Up to Julia v1.3.1 (LLVM 6) we could control it by using parentheses or fma/muladd. Should it really be allowed?
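
For example, with the rounded Float32 constants the two groupings really do round differently:

julia> 90.0f0 * Float32(0x1p-6) + 90.0f0 * Float32(1/960)  # two products, then one add
1.5f0

julia> 90.0f0 * (Float32(0x1p-6) + Float32(1/960))         # pre-summed constant
1.5000001f0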

@yuyichao (Contributor) commented Mar 4, 2020

That seems to be because LLVM got smarter and contracted (a * b + c * b) into (a + c) * b. It's a question for LLVM whether this should be allowed.

Quoting the C standard (page 56 of the archived final draft of C18, page 75 of the PDF), which should be what LLVM is following,

A floating expression may be contracted, that is, evaluated as though it were a single operation, thereby omitting rounding errors implied by the source code and the expression evaluation method.

so LLVM should be allowed to transform a * b + c * b into one operation with a single rounding, which is equivalent to (a + c) * b with a single rounding. I think allowing such a transformation in a Julia context is also OK.

However, that does not seem to be what's happening. Doing a single rounding is actually OK in this case.

julia> a = big(Float32(1 / 960))
0.001041666720993816852569580078125

julia> b = big(Float32(0x1p-6))
0.015625

julia> x = big(90f0)
90.0

julia> Float32(x * a + x * b)
1.5f0

But the transformation to x * (a + b) actually has two roundings, which is what causes the error in this case. I'll leave further discussion to people with more knowledge of floating-point math, but if this analysis is correct then I think it should be reported to LLVM.
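
Reproducing the two roundings of x * (a + b) with the same values:

julia> s = Float32(a + b)  # first rounding: the summed constant
0.016666668f0

julia> Float32(x * s)      # second rounding
1.5000001f0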

@yuyichao (Contributor) commented Mar 4, 2020

I see. Does this mean that there is no way to control it from a Julia script?

No, this contraction only happens when you ask for it with muladd. It won't happen without you asking for it, even if it is a valid contraction.

The only question is whether this is a valid contraction when you enable it with muladd.

@kimikage (Contributor, Author) commented Mar 4, 2020

The only question is whether this is a valid contraction when you enable it with muladd.

If we do not need to use muladd, we can control it by not using muladd. So, you are right.

I intuitively think that the effects of muladd (contract) are overkill, but I don't have enough LLVM knowledge to explain why. :sob:

@StefanKarpinski (Member) commented

cc @stevengj, @Keno

@kimikage (Contributor, Author) commented

This may be an issue with LLVM rather than Julia, and for many Julia users no additional documentation is needed, so I'll close this issue. Thanks!
