Re-integrate NNlibCUDA as a package extension #445

Closed · wants to merge 1 commit

Conversation

@ToucheSir
Member

Follow-up to/dependency of FluxML/Flux.jl#2132.

Note that tests live under NNlib's test dir and not under the extension's. I couldn't find any guidance on the Pkg docs about this, so I just followed what PGFPlotsX.jl (the first registered package using extensions) does.

PR Checklist

  • Tests are added
  • Documentation, if applicable

@IanButterworth
Contributor

"I couldn't find any guidance on the Pkg docs about this"

Were you looking at the master docs? There's actually a really nice section guiding how to do all this and be backwards compatible (maybe you are already there; I haven't reviewed the code).

@ToucheSir
Member Author

Yes, that's exactly the section I read and it was extremely helpful. My statement above was only referring to the question of where in the repo tests for package extensions should live, but a look at the PGFPlotsX.jl codebase helped with that :)

@@ -0,0 +1,5 @@
# NNlib + CUDA.jl

This is a glue package which extends functions from [NNlib.jl](https://github.com/FluxML/NNlib.jl) to work with [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl). It should be loaded automatically when NNlib and CUDA are loaded.
Member

We should avoid calling it a package.

Member Author

The glue package terminology is from the NNlibCUDA README, which was written back when the package extension proposal was still using "glue package". Will change.

@@ -100,4 +89,23 @@ include("impl/depthwiseconv_im2col.jl")
include("impl/pooling_direct.jl")
include("deprecations.jl")

function __init__()
    @require NNPACK_jll="a6bfbf70-4841-5cb9-aa18-3a8ad3c413ee" begin
        if isdefined(NNPACK_jll, :libnnpack)
Member

this could become an extension (in another PR)

Member Author
@ToucheSir · Dec 11, 2022

I'd rather drop it altogether in the next breaking release because neither we nor any users appear to be using it (NNPACK itself has been basically unmaintained since 2020, and NNlib no longer lists it as a dep as of 0.7.4).

Comment on lines +104 to +108
@static if !isdefined(Base, :get_extension)
    @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" begin
        include("../ext/CUDAExt/CUDAExt.jl")
    end
end
Member

this makes this PR a breaking change since it would conflict with NNlibCUDA.
The alternative is to remove these lines altogether and have people explicitly load NNlibCUDA on Julia < v1.9, but I think the @require here makes life easier for downstream packages.
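
As an aside, on Julia 1.9+ downstream code can check that the extension actually loaded instead of depending on NNlibCUDA; a minimal sketch (using the CUDAExt name from this PR):

using NNlib, CUDA  # with both loaded, Julia 1.9+ activates the extension automatically

# Base.get_extension returns the extension module, or `nothing` if it isn't loaded.
ext = Base.get_extension(NNlib, :CUDAExt)
@assert ext !== nothing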

Member Author

It is indeed breaking. IMO putting this in NNlib 0.9 is preferable to splitting the codebase and maintaining two sets of instructions for users. RE downstream, this PR also seeks to implement FluxML/Flux.jl#2132 (comment).

Member

I made a 0.9 milestone. We should think a bit about any other changes that ought to be bundled.

Member Author

There's already a TODO in this PR for one of them. Can't think of any others off the top of my head.

Comment on lines +1 to +2
using NNlib: DenseConvDims
import NNlib: conv!, ∇conv_filter!, ∇conv_data!, conv_bias_act!
Member

not for this PR, but we should move imports to the top file of the module

Member Author

Yeah, there's plenty of cleanup that should happen on the NNlibCUDA side (I already deleted a couple of useless files). Didn't want to change too much though since that introduces extra testing/review burden.

@@ -0,0 +1,67 @@
if isdefined(Base, :get_extension)
Member

can we avoid the ugly duplication below? Shouldn't

using CUDA.CUDNN: ...

work in both cases?

Member Author

Unfortunately no, and I found this out the hard way because I kept missing imports and making the import system freak out. This syntax is documented in https://pkgdocs.julialang.org/dev/creating-packages/#Requires.jl.
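
For reference, the pattern documented there branches the imports on the Julia version; a sketch with CUDA substituted in (the exact names imported in this PR may differ):

if isdefined(Base, :get_extension)
    # Julia >= 1.9: the extension sees CUDA as a real (weak) dependency.
    using CUDA
    using CUDA.CUDNN
else
    # Julia < 1.9: the file is include()d into NNlib via Requires.jl,
    # so CUDA must be reached relative to the enclosing module.
    using ..CUDA
    using ..CUDA.CUDNN
end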

@testset "CUDA" begin
Pkg.test("NNlibCUDA")
include("cuda/runtests.jl")
Member

Change the folder structure to

include("ext/cuda/runtests.jl")

?

Member Author

The issue is that Pkg expects a flat structure for each extension (i.e. ext/CUDAExt is the src/ folder), so there's no good place for tests. Ref. https://pkgdocs.julialang.org/dev/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions)
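
To make that concrete, the resulting layout is roughly the following (an illustrative sketch assembled from paths mentioned in this PR, not a full listing):

NNlib.jl/
├── src/                     # package code, including __init__ with the @require fallback
├── ext/
│   └── CUDAExt/
│       ├── CUDAExt.jl       # extension entry point located by Pkg
│       └── batchedmul.jl    # plus the other former NNlibCUDA files
└── test/
    └── cuda/
        └── runtests.jl      # extension tests, included from NNlib's test/runtests.jl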

Member

Wait, is the suggestion /test/ext/cuda/runtests.jl or something like /ext/CUDAExt/...?

The former just seems more verbose.

The latter is a bit messy, as there's no /ext/CUDAExt/src/ folder, unlike independent packages in the same repository. Maybe that's intentional: extensions don't have an independent existence, so they don't really deserve tests you can run separately?

Member Author
@ToucheSir · Dec 11, 2022

ext/CUDAExt is required for the extension mechanism to work, AIUI. So if we were to move the tests into the extension dir, they'd have to be under ext/CUDAExt/test/... and that's (as you note) messy.

Member

The suggestion is to create an ext folder inside the test one, i.e. /test/ext/cuda/runtests.jl instead of /test/cuda/runtests.jl, also considering that we will likely have other extensions in the future. Not really important though.

Member

Ok. It doesn't seem awful to have, say, /test/cuda/ and /test/amd/ at top level, without organising these into another layer of hierarchy.

They could be named /test/CUDAExt/ etc. to stress the connection, I guess.

Member Author

Seems like this is something we can put a pin in for the day when we get extensions which aren't just backends.

@mcabbott
Member

mcabbott commented Dec 11, 2022

Attempting to time this on Julia 1.8:

julia> @time using CUDA; @time cu(ones(3)) .+ 1;
  7.658277 seconds (9.66 M allocations: 688.737 MiB, 3.56% gc time, 14.19% compilation time: 69% of which was recompilation)
 25.145222 seconds (21.11 M allocations: 1.101 GiB, 1.98% gc time, 35.12% compilation time)

julia> @time @eval using NNlib, NNlibCUDA  # tagged version
  0.792670 seconds (404.08 k allocations: 27.676 MiB, 74.92% compilation time: 95% of which was recompilation)
# or
julia> @time @eval using NNlib  # https://github.com/FluxML/NNlib.jl#bc/pkgext
  5.517386 seconds (6.12 M allocations: 322.021 MiB, 2.49% gc time, 86.05% compilation time: 12% of which was recompilation)

julia> A, B, C = let n = 100
         [cu(rand(n,n,n)) for _ in 1:3]
       end;

julia> @time batched_mul!(C,A,B);  # first run, no difference
  2.235857 seconds (1.28 M allocations: 64.905 MiB, 2.07% gc time, 61.84% compilation time)  # tagged
  2.227217 seconds (1.28 M allocations: 64.869 MiB, 2.04% gc time, 62.05% compilation time)  # PR

julia> @time batched_mul!(C,A,B);  # second run, just noise
  0.000216 seconds
  0.000118 seconds

Does this look right? 5s for Requires sounds like a lot. But perhaps not catastrophic given 30s for CUDA. (This is on quite a slow CPU.)

Attempting to have a look with Julia nightly, I can't get this example to run:

(jl_9wgDIS) pkg> st
Status `/tmp/jl_9wgDIS/Project.toml`
  [872c559c] NNlib v0.8.11 `https://github.com/FluxML/NNlib.jl#bc/pkgext`

julia> VERSION
v"1.10.0-DEV.122"

julia> @time batched_mul!(C,A,B);
ERROR: MethodError: no method matching _batched_gemm!(::Type{CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}, ::Char, ::Char, ::Float32, ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, ::Float32, ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})

Closest candidates are:
  _batched_gemm!(::Type{<:Array}, ::Char, ::Char, ::Number, ::Any, ::Any, ::Number, ::Any)
   @ NNlib ~/.julia/packages/NNlib/DYfcJ/src/batched/batchedmul.jl:253

@ToucheSir
Member Author

ToucheSir commented Dec 11, 2022

If this is the same machine as FluxML/Flux.jl#2132 (comment), then 5s seems like a lot. @IanButterworth any tips for investigating what's taking up all that time?
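
One readily available tool (a suggestion, not something tried in this thread): Julia 1.8+ ships InteractiveUtils.@time_imports, which prints a per-package breakdown of load time and might show whether Requires itself or some dependency dominates those 5s:

using InteractiveUtils  # already exported in the REPL

@time_imports using NNlib  # prints the load time of each package as it is imported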

I don't understand what's going on with the batchedmul error. The line which should be adding the overload is attaching it directly to the NNlib namespace, so an incorrect import shouldn't be the culprit. Does NNlibCUDA work on nightly? I know CUDA.jl often won't support brand new nightlies, but it would be strange if that lack of support manifested this way.

@mcabbott
Member

Yes, same machine. I did not try the Flux and NNlib PRs together.

5s is quite a lot of extra loading time. TTFG (time to first gradient) is mostly after loading though. Even my simple .+ 1 operation takes ages, and gradients of Flux models take longer.

And using tagged versions of everything, I could install NNlibCUDA & run the same operation. I don't see what causes the failure above.

(@v1.10) pkg> add NNlibCUDA
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
    Updating `~/.julia/environments/v1.10/Project.toml`
  [a00861dc] + NNlibCUDA v0.2.4
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`

julia> @time @eval using NNlib, NNlibCUDA
  0.143792 seconds (105.42 k allocations: 8.808 MiB)

julia> A, B, C = let n = 100
         [cu(rand(n,n,n)) for _ in 1:3]
       end;

julia> @time batched_mul!(C,A,B);  # first run
  2.638759 seconds (792.60 k allocations: 52.614 MiB, 4.29% gc time, 79.42% compilation time)

julia> @time batched_mul!(C,A,B);
  0.000155 seconds

@mcabbott added this to the v0.9 milestone · Dec 11, 2022
@ToucheSir
Member Author

I see this on the latest nightly:

julia> methods(NNlib._batched_gemm!)
# 2 methods for generic function "_batched_gemm!" from NNlib:
 [1] _batched_gemm!(::Type{<:Array}, transA::Char, transB::Char, α::Number, A, B, β::Number, C)
     @ ~/.julia/dev/NNlib/src/batched/batchedmul.jl:253
 [2] _batched_gemm!(::Type{<:CuArray}, transA::Char, transB::Char, α::Number, A, B, β::Number, C)
     @ CUDAExt ~/.julia/dev/NNlib/ext/CUDAExt/batchedmul.jl:3

Is there possibly some bug where the 2nd method isn't looked up correctly?

@ToucheSir
Member Author

ToucheSir commented Dec 11, 2022

Looking for invalidations:

julia> using SnoopCompile, CUDA

julia> invalidations = @snoopr begin
           using NNlib
       end

julia> trees = invalidation_trees(invalidations)
3-element Vector{SnoopCompile.MethodInvalidations}:
 inserting broadcasted(::typeof(identity), x::CuArray{T}) where T<:Union{Float16, Float32, Float64} in NNlib.CUDAExt at ~/.julia/dev/NNlib/ext/CUDAExt/cudnn/activations.jl:34 invalidated:
   backedges: 1: superseding broadcasted(f, arg1, args...) in Base.Broadcast at broadcast.jl:1295 with MethodInstance for Base.Broadcast.broadcasted(::typeof(identity), ::Any) (2 children)

 inserting collect(b::Union{Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}} where T, Union{Base.LogicalIndex{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, Base.ReinterpretArray{T, N, <:Any, <:Union{SubArray{<:Any, <:Any, var"#s14"}, var"#s14"}} where var"#s14"<:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}, Base.ReshapedArray{T, N, <:Union{Base.ReinterpretArray{<:Any, <:Any, <:Any, <:Union{SubArray{<:Any, <:Any, var"#s15"}, var"#s15"}}, SubArray{<:Any, <:Any, var"#s15"}, var"#s15"}} where var"#s15"<:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}, SubArray{T, N, <:Union{Base.ReinterpretArray{<:Any, <:Any, <:Any, <:Union{SubArray{<:Any, <:Any, var"#s16"}, var"#s16"}}, Base.ReshapedArray{<:Any, <:Any, <:Union{Base.ReinterpretArray{<:Any, <:Any, <:Any, <:Union{SubArray{<:Any, <:Any, var"#s16"}, var"#s16"}}, SubArray{<:Any, <:Any, var"#s16"}, var"#s16"}}, var"#s16"}} where var"#s16"<:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}, LinearAlgebra.Adjoint{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.Diagonal{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.LowerTriangular{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.Symmetric{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.Transpose{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.Tridiagonal{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.UnitLowerTriangular{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.UnitUpperTriangular{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.UpperTriangular{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, PermutedDimsArray{T, N, <:Any, <:Any, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}} where {T, N}}) in NNlib.CUDAExt at ~/.julia/dev/NNlib/ext/CUDAExt/batchedadjtrans.jl:17 invalidated:
   backedges: 1: superseding collect(X::GPUArraysCore.AnyGPUArray) in GPUArrays at ~/.julia/packages/GPUArrays/fqD8z/src/host/abstractarray.jl:40 with MethodInstance for collect(::Base.LogicalIndex{T, A} where {T, A<:(GPUArraysCore.AbstractGPUArray{Bool})}) (1 children)
              2: superseding collect(L::Base.LogicalIndex) in Base at multidimensional.jl:768 with MethodInstance for collect(::Base.LogicalIndex) (11 children)

 inserting convert(::Type{T}, b::Union{Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}} where T, Union{Base.LogicalIndex{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, Base.ReinterpretArray{T, N, <:Any, <:Union{SubArray{<:Any, <:Any, var"#s14"}, var"#s14"}} where var"#s14"<:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}, Base.ReshapedArray{T, N, <:Union{Base.ReinterpretArray{<:Any, <:Any, <:Any, <:Union{SubArray{<:Any, <:Any, var"#s15"}, var"#s15"}}, SubArray{<:Any, <:Any, var"#s15"}, var"#s15"}} where var"#s15"<:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}, SubArray{T, N, <:Union{Base.ReinterpretArray{<:Any, <:Any, <:Any, <:Union{SubArray{<:Any, <:Any, var"#s16"}, var"#s16"}}, Base.ReshapedArray{<:Any, <:Any, <:Union{Base.ReinterpretArray{<:Any, <:Any, <:Any, <:Union{SubArray{<:Any, <:Any, var"#s16"}, var"#s16"}}, SubArray{<:Any, <:Any, var"#s16"}, var"#s16"}}, var"#s16"}} where var"#s16"<:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}, LinearAlgebra.Adjoint{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.Diagonal{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.LowerTriangular{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.Symmetric{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.Transpose{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.Tridiagonal{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.UnitLowerTriangular{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.UnitUpperTriangular{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, LinearAlgebra.UpperTriangular{T, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}, PermutedDimsArray{T, N, <:Any, <:Any, <:Union{NNlib.BatchedAdjoint{T, <:CuArray{T}}, NNlib.BatchedTranspose{T, <:CuArray{T}}}}} where {T, N}}) where T<:Array in NNlib.CUDAExt at ~/.julia/dev/NNlib/ext/CUDAExt/batchedadjtrans.jl:15 invalidated:
   mt_backedges: 1: signature Tuple{typeof(convert), Type{Vector}, Any} triggered MethodInstance for Pkg.REPLMode.Command(::Nothing, ::Dict{Symbol, Any}, ::Any) (0 children)
                 2: signature Tuple{typeof(convert), Type{Vector}, Any} triggered MethodInstance for Pkg.REPLMode.Command(::Pkg.REPLMode.CommandSpec, ::Dict{Symbol, Any}, ::Any) (0 children)
                 3: signature Tuple{typeof(convert), Type{Vector{Any}}, Any} triggered MethodInstance for setindex!(::Vector{Vector{Any}}, ::Any, ::Int64) (14 children)
   backedges: 1: superseding convert(::Type{T}, a::AbstractArray) where T<:Array in Base at array.jl:617 with MethodInstance for convert(::Type, ::AbstractArray) (1 children)

Nothing seems untoward or new, so is the latency more due to https://discourse.julialang.org/t/categoricalarrays-forces-openspecfun-jll-recompilation/87759/18? What I still don't understand is how to actually implement the @require @eval suggestion in that thread, so help wanted.

Edit: JuliaPackaging/Requires.jl#65 might have a lead; is there a 2022-Julia-friendly version of the same?
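
For concreteness, one untested reading of that suggestion, sketched against this PR's __init__ (whether it actually helps here is exactly the open question above):

function __init__()
    @static if !isdefined(Base, :get_extension)
        # One reading of the @eval trick: the include is run through eval
        # when CUDA loads, rather than being compiled into the @require callback.
        @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" @eval include("../ext/CUDAExt/CUDAExt.jl")
    end
end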

@CarloLucibello
Member

Superseded by #492.
