Skip to content
This repository was archived by the owner on May 4, 2019. It is now read-only.

Implement reductions with optional skipna argument #101

Merged
merged 7 commits into from
Jul 17, 2014
Merged

Conversation

simonster
Copy link
Member

This implements reductions with an optional skipna argument, fixing #3, fixing JuliaData/DataFrames.jl#259, and fixing JuliaData/DataFrames.jl#354, and mostly superseding #32 (except for skewness and kurtosis). The following reductions are implemented, all in terms of mapreduce:

  • sum
  • prod
  • maximum
  • minimum
  • Base.sumabs
  • Base.sumabs2
  • var
  • varm
  • std
  • stdm

With skipna=false, for some reductions that are guaranteed to return NA when any input is NA, we first check for NAs and then call _mapreduce from Base on the data Array if there are none. This has basically no overhead. For other reductions, we simply call the implementations in Base on the DataArray, which is slow due to type instability in indexing, but is the most obvious way to guarantee correctness.

With skipna=true, we first check if there are NA values. If not, we call _mapreduce from Base on the data Array. If there are, we use either an algorithm that branches on NA or an algorithm that does not branch on NA depending on the types and functors involved. For summation, we use a pairwise algorithm that divides blocks along BitArray chunk boundaries based on the number of non-NA elements. For blocks that contain no NAs, we call the implementation in Base. This has little overhead when there are few NA elements.

The tests currently fail on Travis because I based the implementation of var off of JuliaLang/julia#7502 and so it has slightly different semantics than the current Julia master, but the tests here are pretty comprehensive, and as soon as that is merged this should be good to go.

@simonster
Copy link
Member Author

It seems like some of the nightlies have not been updated in 11 days, so this is going to have to wait.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.13%) when pulling a0ad165 on sjk/reduce into e0e4a70 on master.

@simonster
Copy link
Member Author

Now with reductions across dimensions! (although not yet for mean and var)

@simonster
Copy link
Member Author

What should reductions do if there are no non-NA values and skipna=true? In Base:

  • sum of an empty array returns 0, and prod returns 1
  • minimum or maximum of an empty array throws
  • mean of an empty array returns NaN

Should the same rules apply when reducing a non-empty vector that contains NAs? R does the same things with sum and mean when na.rm=TRUE, and there's some logic to that choice, since it means a reduction with skipna gives exactly the same output as the reduction on the data with the NAs removed.

However, generalization to reductions across dimensions suggests that maybe minimum and maximum should not throw. The output of minimum(A, 1) is still useful even if one column is all NA. We can't really look to R for guidance on this, since it doesn't have a form of minimum/maximum that reduces across dimensions, but its standard max/min functions show a warning and return typemin/typemax if a vector is empty or all NA.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.83%) when pulling 55ec26e on sjk/reduce into e0e4a70 on master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.47%) when pulling 2d1eec7 on sjk/reduce into e0e4a70 on master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.59%) when pulling 0381da7 on sjk/reduce into e0e4a70 on master.

@simonster
Copy link
Member Author

Okay, this should be ready to merge. With few NAs, performance is quite good, typically within 50% of Base for both skipna=true and skipna=false. For the torture test case of 50% NA, which results in an unpredictable branch, performance can be up to 5x worse. It's possible more can be done, but for now I think this is adequate.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.59%) when pulling 0381da7 on sjk/reduce into e0e4a70 on master.

simonster added a commit that referenced this pull request Jul 17, 2014
Implement reductions with optional skipna argument
@simonster simonster merged commit d78a6d3 into master Jul 17, 2014
@simonster simonster deleted the sjk/reduce branch July 17, 2014 13:54
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants