scitype(X) is slow for large tables #12
Comments
It should be easy to add an Even if |
@nalimilan Yes, that's the way to go.
Yes. But in an "immutable" version of CategoricalArrays I would make that a type parameter too :-) |
Yeah, but after very long discussions CategoricalArrays was designed not to hardcode the number of levels, as in most situations that's not a good idea for performance (e.g. need to recompile all functions for each number of levels). One should probably use an enum when levels are hardcoded like that. |
Resolved by PR #23:
julia> Xsmall = Tables.table(rand(100, 10));
julia> Xbig = Tables.table(rand(10000, 10));
julia>
julia> @time scitype(Xsmall);
0.000117 seconds (18 allocations: 9.125 KiB)
julia>
julia> scitype(Xsmall);
julia> @time scitype(Xsmall)
0.000123 seconds (18 allocations: 9.125 KiB)
Table{AbstractArray{Continuous,1}}
julia> @time scitype(Xbig)
0.000766 seconds (28 allocations: 782.406 KiB)
Table{AbstractArray{Continuous,1}}
|
Re #mlj slack discussion https://julialang.slack.com/archives/CC57ZE7EY/p1569005060012000 and this issue JuliaAI/MLJModels.jl#63:
julia> @time X = MLJBase.table(randn(5_000, 50));
0.002304 seconds (370 allocations: 1.929 MiB)
julia> @time m = machine(Standardizer(), X)
0.001994 seconds (145 allocations: 1.913 MiB)
Machine{Standardizer} @ 7…74
julia> @time fit!(m)
[ Info: Training Machine{Standardizer} @ 7…74.
0.005042 seconds (425 allocations: 3.845 MiB)
Machine{Standardizer} @ 7…74
julia> @time transform(m, X);
0.089879 seconds (5.87 k allocations: 97.647 MiB, 20.16% gc time)
So more sensible. |
Fantastic! |
For 0.2.2 release
Reason for slow down
By definition, scitype(::AbstractArray) = AbstractArray{T}, where T is the union of element scitypes. Computing the individual element scitypes and taking the union is O(N), where N is the number of elements. In most cases the eltype of the array already determines what this union should be, but, under the mlj convention, the scitype of a CategoricalValue is not inferable from the machine type of the object, because the type does not say whether it is ordered or not (see JuliaData/CategoricalArrays.jl#184).
Suggestion
We could, in the mlj convention, overload scitype(::AbstractArray{T}) to infer the scitype from T for those particular T that we know actually do determine the union scitype (basically, everything that is not a subtype of CategoricalValue or CategoricalString). Not sure we can do much for these troublesome types, however.
More radical, but motivated by unresolved issue JuliaData/CategoricalArrays.jl#199, is to abandon CategoricalArrays altogether and write our own categorical element package where order is a type parameter (and levels/pools are immutable).
Any other suggestions?
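To make the first suggestion concrete, here is a minimal, self-contained sketch of the eltype shortcut. It is only an illustration under stated assumptions: the scitype names are toy stand-ins defined locally, and ToyCat, toy_scitype, eltype_scitype, slow_scitype and fast_scitype are hypothetical names, not the package's API or the code that ended up in any PR.

```julia
# Toy scientific types (hypothetical stand-ins defined here, not the package's own).
abstract type ToySci end
abstract type Continuous <: ToySci end
abstract type Count <: ToySci end
abstract type Multiclass <: ToySci end
abstract type OrderedFactor <: ToySci end

# Toy categorical element: orderedness is a field, so the machine type
# alone does not determine the scitype (mimics the CategoricalValue problem).
struct ToyCat
    level::String
    ordered::Bool
end

# Scitype of a single element under this toy convention.
toy_scitype(::AbstractFloat) = Continuous
toy_scitype(::Integer) = Count
toy_scitype(c::ToyCat) = c.ordered ? OrderedFactor : Multiclass

# Slow path: visit every element and accumulate the union, O(N).
function slow_scitype(A::AbstractArray)
    T = Union{}
    for x in A
        T = Union{T, toy_scitype(x)}
    end
    return AbstractArray{T, ndims(A)}
end

# Element types whose scitype is known from the type alone.
eltype_scitype(::Type{<:AbstractFloat}) = Continuous
eltype_scitype(::Type{<:Integer}) = Count
eltype_scitype(::Type) = nothing   # not inferable: caller must fall back

# Fast path: use the eltype when it is decisive, otherwise fall back to the scan.
function fast_scitype(A::AbstractArray{T,N}) where {T,N}
    S = eltype_scitype(T)
    return S === nothing ? slow_scitype(A) : AbstractArray{S,N}
end

x = rand(10_000)
fast_scitype(x) == slow_scitype(x)         # both are AbstractArray{Continuous,1}
c = [ToyCat("a", true), ToyCat("b", true)]
fast_scitype(c)                            # still needs the O(N) scan
```

In this sketch the float and integer arrays skip the scan entirely, while the categorical array still pays the O(N) cost, which is the "troublesome types" caveat above. Making order a type parameter, as in the more radical suggestion, would let such a type join the fast path too.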