Skip to content

scitype(X) is slow for large tables #12

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ablaom opened this issue Sep 23, 2019 · 6 comments
Closed

scitype(X) is slow for large tables #12

ablaom opened this issue Sep 23, 2019 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@ablaom
Copy link
Member

ablaom commented Sep 23, 2019

julia> using ScientificTypes, Tables

julia> Xsmall = Tables.table(rand(100, 10));

julia> Xbig = Tables.table(rand(10000, 10));j
julia> @time scitype(Xsmall);

julia> scitype(Xsmall);
julia> @time scitype(Xsmall)
  0.004922 seconds (48 allocations: 9.594 KiB)
Table{AbstractArray{Continuous,1}}

julia> @time scitype(Xbig)
  0.279894 seconds (58 allocations: 782.875 KiB)
Table{AbstractArray{Continuous,1}}

Reason for slow down

By definition, scitype(::AbstractArray) = AbstractArray{T} where T is the union of element scitypes. Computing the individual element scitypes and taking the union is o(N) where N is the number of elements. In most cases the eltype of the array already determines what this union should be, but, under the mlj convention, the scitype of a CategoricalValue is not inferable from the machine type of the object, because the type does not say whether it is ordered or not (see JuliaData/CategoricalArrays.jl#184 ).

Suggestion

We could, in the mlj convention, overload scitype(::<:AbstractArray{T}) to infer the scitype from T for those particular T we know actually do determine the union scitype (basically, everything that is not a subtype of CategoricalValue or CategoricalString). Not sure we can do much for these troublesome types, however.

More radical, but motivated by unresolved issue JuliaData/CategoricalArrays.jl#199, is to abandon CategoricalArrays altogether and write our own categorical element package where order is a type parameter (and levels/pools are immutable).

Any other suggestions?

@nalimilan
Copy link

It should be easy to add an scitype(A::CategoricalArray, ::Val) method for efficiency, instead of calling reduce. (BTW, even for general AbstractArray{T}, you could imagine adding a fast path for when isconcrete(T) and the scientific type is determined by the type and doesn't depend on actual values.)

Even if ordered was a type parameter, wouldn't the scientific type depend on the number of levels anyway?

@ablaom
Copy link
Member Author

ablaom commented Sep 24, 2019

@nalimilan Yes, that's the way to go.

Even if ordered was a type parameter, wouldn't the scientific type depend on the number of levels anyway?

Yes. But in an "immutable" vs of CategoricalArrays I would make that a type parameter too :-)

@ablaom ablaom self-assigned this Sep 24, 2019
@nalimilan
Copy link

Yes. But in an "immutable" vs of CategoricalArrays I would make that a type parameter too :-)

Yeah, but after very long discussions CategoricalArrays was designed not to hardcode the number of levels, as in most situations that's not a good idea for performance (e.g. need to recompile all functions for each number of levels). One should probably use an enum when levels are hardcoded like that.

@ablaom
Copy link
Member Author

ablaom commented Oct 16, 2019

Resolved by PR #23:

julia> Xsmall = Tables.table(rand(100, 10));

julia> Xbig = Tables.table(rand(10000, 10));

julia> 

julia> @time scitype(Xsmall);
  0.000117 seconds (18 allocations: 9.125 KiB)

julia> 

julia> scitype(Xsmall);

julia> @time scitype(Xsmall)
  0.000123 seconds (18 allocations: 9.125 KiB)
Table{AbstractArray{Continuous,1}}

julia> @time scitype(Xbig)
  0.000766 seconds (28 allocations: 782.406 KiB)
Table{AbstractArray{Continuous,1}}

@ablaom
Copy link
Member Author

ablaom commented Oct 16, 2019

Re #mlj slack discussion https://julialang.slack.com/archives/CC57ZE7EY/p1569005060012000 and this issue JuliaAI/MLJModels.jl#63:

julia> @time X = MLJBase.table(randn(5_000, 50));
  0.002304 seconds (370 allocations: 1.929 MiB)

julia> @time m = machine(Standardizer(), X)
  0.001994 seconds (145 allocations: 1.913 MiB)
Machine{Standardizer} @ 774


julia> @time fit!(m)
[ Info: Training Machine{Standardizer} @ 774.
  0.005042 seconds (425 allocations: 3.845 MiB)
Machine{Standardizer} @ 774


julia> @time transform(m, X);
  0.089879 seconds (5.87 k allocations: 97.647 MiB, 20.16% gc time)

So more sensible.

@tlienart
Copy link
Collaborator

Fantastic!

@ablaom ablaom mentioned this issue Oct 16, 2019
2 tasks
@ablaom ablaom closed this as completed in 10c0578 Oct 16, 2019
juliohm pushed a commit to juliohm/ScientificTypes.jl that referenced this issue Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants