-
Notifications
You must be signed in to change notification settings - Fork 194
Adding support for different weight vector types #250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! I've made a few comments, without having reviewed everything yet.
src/weights.jl
Outdated
bias::Int | ||
end | ||
|
||
immutable ProbabilityWeights{S<:Real, T<:Real, V<:RealVector} <: AbstractWeights{S, T, V} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out "sampling weights" is a much more common expression than "probability weights". Maybe we should use the former term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have very strong feelings either way, but sampling just seemed more ambiguous to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've heard both. In my line of work I don't come across these kinds of weights much so I can't speak to the prominence of either term overall, but I think "probability weights" is what I learned to call them in grad school.
src/weights.jl
Outdated
If omitted, `wsum` is computed. | ||
""" | ||
function WeightVec{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs)) | ||
return WeightVec{S, eltype(vs), V}(vs, s) | ||
function Weights{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs); corrected::Bool=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it make more sense to decide whether to apply the correction when calling std
? For most functions (like mean
), the correction does not apply and the results will be the same whatever the kind of weights you have. It seems a bit weird to require people to decide about correction in advance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess that would fit with base better. I was initially thinking that it might make sense to precompute the value, but we only use it in a few places so it probably wouldn't cost us much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this change in my next commit, but it ended up being a lot of little changes.
src/weights.jl
Outdated
""" | ||
weights(vs::RealVector) = WeightVec(vs) | ||
weights(vs::RealArray) = WeightVec(vec(vs)) | ||
frequency(vs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
frequency
and exponential
are really too broad terms to be claimed for weights. We should find more specific names, like fweights
and expweights
. One (smallà advantage is that it would mirror Stata (with also pweights
and aweights
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to prefer full word function names. I'd be fine with keeping frequency
and exponential
, but only exporting the aliases of pweights
, eweights
, fweights
. That way folks can choose to use the full word versions, but we won't be cluttering up the global namespace on them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have two ways to access the same function. IMO we should choose one interface and stick to it. I'm with Milan on this one; I don't think we should use frequency
and exponential
for these, as those names are too general. freqweights
and expweights
seem reasonable enough names to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like scipy also uses the aweights
and fweights
names https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, why not just use the type constructors rather than separate functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iono, cause FrequencyWeights
seems long...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suppose so. Perhaps it's best then to follow suit with Stata and SciPy and use aweights
and fweights
.
src/weights.jl
Outdated
end | ||
|
||
# TODO: constructor for ProbabilityWeights, but I'm not familiar with how bias correction works with these | ||
# types of weights or if bias correction even makes sense. | ||
# https://en.wikipedia.org/wiki/Inverse_probability_weighting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/ for a summary regarding analytical weights (aweights) and probability weights (pweights). Wikipedia has interesting data on frequency weights and analytical/precision weights: https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance
I'll try to find more references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it stands, my understanding is that these weights are intended to provide inferential statistics about the population (vs sample)? Also, I guess analytical/precision weights = reliability weights on the wiki page?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it stands, my understanding is that these weights are intended to provide inferential statistics about the population (vs sample)?
Yes, probability/sampling weights reflect the fact that some population groups were over/under-sampled, so individuals do not all represent the same number of people in the target population.
Also, I guess analytical/precision weights = reliability weights on the wiki page?
Right. The most explicit name is "inverse-variance weights", and this is what the link given on Wikipedia mentions.
src/weights.jl
Outdated
immutable WeightVec{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T} | ||
abstract AbstractWeights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T} | ||
|
||
immutable Weights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractWeights{S, T, V} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we still need this. Do we expect people to need kinds of weights which cannot have their own type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I get the impression that the kinds of weights should be kept relatively small and folks might still want to dispatch based on the storage of those weights (e.g., NullableArray vs Vector).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant is that the Weights
family might not be needed since all use cases should fall into a well-defined weights type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah, I've renamed Weights
to AnalyticWeights
in the next commit to be more specific. I've also added an aweights
method and made weights
an alias to aweights
, but maybe I should just deprecate the weights
method in favour of making folks think more about what type of weights they want to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's deprecate it, as that's the best way to make people discover the new weight types and choose the appropriate one for them. If it turns out some people need other weight types, they will let us know.
@deprecate _moment3(v::RealArray, m::Real, wv::AbstractWeights) _moment3(v, wv, m) | ||
@deprecate _moment4(v::RealArray, m::Real, wv::AbstractWeights) _moment4(v, wv, m) | ||
@deprecate _momentk(v::RealArray, k::Int, m::Real, wv::AbstractWeights) _momentk(v, k, wv, m) | ||
@deprecate moment(v::RealArray, k::Int, m::Real, wv::AbstractWeights) moment(v, k, wv, m) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If WeightVec
is going away it should have a deprecation, and if it isn't then these deprecations should maintain the existing signatures. (They've probably been deprecated long enough that we can just remove them, but that's an aside.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've deprecated WeightVec
and I was contemplating deprecating weights
in favour of aweights
to be more specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, the deprecation warning should mention all weight types. I'm also not sure we can call aweights
as a replacement, since the current cov
does not apply any correction, the deprecation should do the same: maybe we need to keep the WeightsVec
type (deprecating) to avoid breaking packages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable and would also help notify people about the other changes occurring with weights without introducing breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do folks want to handle internal calls to weights(x)
if it (and WeightVec
) are deprecated?
- We leave the code as is and accept that functions like
wsample
,wmedian
, etc will throw deprecation warnings when they need to creating aWeightVec
from an array. - Change the default behaviour to creating
AnalyticWeights
which may break existing behaviour for some people.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK the kind of weights doesn't matter for wsample
or wmedian
, it only makes a difference for var
and similar functions. IOW the type of weights created for the former functions is an implementation detail, we can choose any type we want (though I would choose frequency weights, as they are simpler to understand).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess wsample
and wmedian
weren't good examples. Alright, I'll switch that over to avoid the warnings. Thanks.
src/weights.jl
Outdated
@@ -2,43 +2,112 @@ | |||
###### Weight vector ##### | |||
|
|||
if VERSION < v"0.6.0-dev.2123" | |||
immutable WeightVec{S<:Real, T<:Real, V<:RealVector} <: RealVector{T} | |||
abstract AbstractWeights{S<:Real, T<:Real, V<:RealVector} <: RealVector{T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be @compat abstract type ... end
. Also why is this being conditioned on the version? Looks like that predates this PR, but I don't see why we would need to condition for these definitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AbstractWeights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}
isn't valid syntax cause 0.5 doesn't support triangular dispatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh dang you're right. 😞
src/weights.jl
Outdated
# Arguments | ||
* `n::Integer`: the desired length of the `Weights` | ||
* `λ::Real`: is a smoothing factor or rate paremeter between 0 .. 1. | ||
As this value approaches 0 the resulting weights will be almost equal(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the parentheses after "equal"? Also should remove "is" on the previous line, "paremeter" -> "parameter", and ".." -> "and".
src/weights.jl
Outdated
while values closer to 1 will put higher weight on the end elements of the vector. | ||
""" | ||
function exponential(n::Integer, λ::Real=0.99) | ||
@assert 0 <= λ <= 1 && n > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically for argument checking it's better to use condition || throw(ArgumentError("..."))
. In this case,
n > 0 || throw(ArgumentError("cannot construct weights of length < 1"))
0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1"))
That provides more descriptive error messages for the user, plus it's better to use specifically-typed exceptions where possible.
src/weights.jl
Outdated
""" | ||
wmedian(v::RealVector, w::RealVector) = median(v, weights(w)) | ||
wmedian{W<:Real}(v::RealVector, w::WeightVec{W}) = median(v, w) | ||
wmedian(v::RealVector, w::RealVector) = median(v, weights(w, false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the weights
API is changing to require a boolean parameter, the existing API should have a formal deprecation.
I have found more references for the variance correction to apply to different types of weights. Regarding sampling/probability weights, a good reference is n = count(!iszero, w)
sw = sum(w)
xbar = sum(w .* x)/sw
sum(w .* (x - xbar).^2)/sw * n/(n - 1) (Of course this is just a Julia illustration of the algorithm. This code should be optimized to avoid making copies.) A good reference for analytical/precision weights is n = length(w) # Could be count(!iszero, w) instead
w = w * n/sum(w)
sw = sum(w) # This is now equal to n, but maybe we should support non-normalized weights?
xbar = sum(w .* x)/sw
sum(w * (x - xbar).^2)/(sw - sum(w.^2)/sw) This corresponds to the Wikipedia formula, and indeed it was written by the same person. Regarding the actual implementation, we could normalize the weights when creating the vector if that's always needed. Another interesting reference (for analytical/precision weights) is SAS: |
src/weights.jl
Outdated
bias::Real | ||
end | ||
|
||
immutable FrequencyWeights{S<:Integer, T<:Integer, V<:IntegerVector} <: AbstractWeights{S, T, V} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should restrict frequency weights to integers. While this seems obvious at first, formulas generalize to non-integer weights, and other software generally accepts arbitrary values. This can be useful in some cases (e.g. sometimes you need to split observations between groups). Also this will be more practical if your variable is stored as a float for some reason.
Sorry about the code churn in these commits folks. Turns out moving the Changes:
|
src/weights.jl
Outdated
end | ||
|
||
""" | ||
`@weights name` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backticks shouldn't be used for indented blocks like this, since Markdown code formatting is applied by virtue of being indented. So this will actually show up in the docstring and docs like
`@weights name`
src/weights.jl
Outdated
bias(w::ProbabilityWeights, [corrected]) | ||
|
||
```math | ||
\fraction{n}{∑w × (n - 1)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be \frac{n}{(n - 1) \sum w}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Similarly \frac
and LaTeX commands rather than Unicode elsewhere)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with just using LaTeX commands, but the julia docs recommend using unicode characters.
Use Unicode characters rather than their LaTeX escape sequence, i.e.
α = 1
rather than\\alpha = 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, weird. Anyway, should still be \frac
rather than \fraction
, though I guess with the backslash escaped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that remark applies to \sum
, as it does more than just printing the ∑
character: it also uses the right size for the symbol. "characters" in this context mostly means "letters" AFAICT.
test/cov.jl
Outdated
@@ -88,62 +88,62 @@ end | |||
# weighted covariance | |||
|
|||
if VERSION < v"0.5.0-dev+679" | |||
@test cov(X, wv1) ≈ S1w ./ sum(wv1) | |||
@test cov(X, wv2; vardim=2) ≈ S2w ./ sum(wv2) | |||
@test cov(X, wv1; corrected=false) ≈ S1w ./ sum(wv1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having to change all of the tests suggests that this is a breaking change. To keep from breaking users' code, we should probably make corrected=false
the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with doing that for this PR, but it would be nice to switch it back to corrected=true
in a later release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's tricky to change things like this, since we'd need some kind of deprecation to avoid suddenly and silently breaking others' code. I think @nalimilan has gone through such a process recently; perhaps he could be of help here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I just noted in a comment above, I think the best deprecation path is to deprecate WeightsVec
but keep it around so that existing code does not use the correction, and point to new weights types in the deprecation warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan To clarify, you'd like var
, std
and cov
to always return the uncorrected version for WeightVec
s even if corrected=true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The corrected
argument wouldn't be supported at all for WeightVec
, or at least and error would be raised if corrected=true
is passed.
src/StatsBase.jl
Outdated
AbstractWeights, # the abstract type to represent any weight vector | ||
AnalyticWeights, # the default type for representing a analytic/precision/reliability weight vectors | ||
FrequencyWeights, # the type for representing a frequency weight vectors | ||
ProbabilityWeights,# the type for representing a probability/sampling weight vectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space before #
. Also align with the rest of the block.
src/moments.jl
Outdated
Base.varm!(R::AbstractArray, A::RealArray, wv::WeightVec, M::RealArray, dim::Int) = | ||
scale!(_wsum_centralize!(R, @functorize(abs2), A, values(wv), M, dim, true), inv(sum(wv))) | ||
function Base.varm!(R::AbstractArray, A::RealArray, wv::AbstractWeights, M::RealArray, dim::Int; corrected=true) | ||
scale!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put the first argument on this line, and the closing parenthesis on the same line as the second argument. Same below.
BTW, @functorize
is no longer needed with Julia 0.5 and above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, maybe a style guide would be helpful. The argument I've heard for spreading args (for long function calls) across multiple lines is that it minimizes the size of you diffs later on. I'm not saying that's the right approach, but given that there are multiple styles floating around and the julia style guide doesn't discuss them maybe it's worth clarifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everyone has a different preferred style; it's mostly about being consistent within a particular package. That said, I'd love for the style guide in the manual to be more specific about things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind of thing is worth clarifying, but I think it's pretty clear from the existing code base of JuliaStats packages that this style is against the convention.
src/moments.jl
Outdated
@@ -15,10 +15,12 @@ whereas it's `length(x)-1` in `Base.varm`. The impact is that this is not a | |||
weighted estimate of the population variance based on the sample; it's the weighted | |||
variance of the sample. | |||
""" | |||
Base.varm(v::RealArray, wv::WeightVec, m::Real) = _moment2(v, wv, m) | |||
function Base.varm(v::RealArray, wv::AbstractWeights, m::Real; corrected=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, you can also add a line break while keeping the short form without function... end
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm aware that the short form support line breaks, but I think that's harder for people to parse. The function ... end
syntax with proper indentation provides a more consistent read through. I tend to have similar views on ternaries that cover multiple lines (particularly if they're nested). I'm fine with keeping to the short form if that's the desired style for this repo, but maybe we should add a style guide to make that clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really care. There's a short style guide in Julia's CONTRIBUTING.md, but I don't think it's mentioned.
src/moments.jl
Outdated
n = length(v) | ||
s = 0.0 | ||
w = values(wv) | ||
for i = 1:n | ||
@inbounds z = v[i] - m | ||
@inbounds s += (z * z) * w[i] | ||
end | ||
s / sum(wv) | ||
|
||
result = s * bias(wv, corrected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result
isn't needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Sorry, this is leftover from some debugging I was doing (removed the print statements, but forgot to reverse that).
src/moments.jl
Outdated
n = length(v) | ||
s = 0.0 | ||
for i = 1:n | ||
@inbounds z = v[i] - m | ||
s += z * z | ||
end | ||
s / n | ||
s * bias(n, corrected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiplying by a bias is backwards. Maybe call that function cov_correction
, moment2_correction
or something like that? It's also good to choose a specific name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that bias
isn't really descriptive of what it represents anymore. Would something like bias_factor
, scale_factor
, bias_coef
, scale_coef
, etc make more sense? The nice thing about keeping a relatively general name is that we can use the same function call var
, std
and cov
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var
and std
are in some way based on covariance, so cov
should be fine.
src/weights.jl
Outdated
@weights AnalyticWeights | ||
|
||
""" | ||
AnalyticWeights(vs, [wsum]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wsum=sum(vs)
, without brackets.
src/weights.jl
Outdated
end | ||
|
||
""" | ||
aweights(vs) | ||
|
||
Construct a `AnalyticWeights` type from a given array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type -> vector.
Maybe add something like "See the documentation for that type for more information"?
src/weights.jl
Outdated
Construct a `AnalyticWeights` type from a given array. | ||
""" | ||
aweights(vs::RealVector) = AnalyticWeights(vs) | ||
aweights(vs::RealArray) = AnalyticWeights(vec(vs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method really needed? I don't think we perform this kind of conversion automatically in general.
src/weights.jl
Outdated
n > 0 || throw(ArgumentError("cannot construct weights of length < 1")) | ||
0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1")) | ||
w0 = map(i -> λ * (1 - λ)^(1 - i), 1:n) | ||
return weights(w0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be ExponentialWeights
? But why do we need a specific weights type: don't them enter into one of the above families? Or maybe they should just use a generic weights type (without any correction)?
src/weights.jl
Outdated
function bias(w::AbstractWeights, corrected=true) | ||
s = sum(w) | ||
if corrected | ||
return inv(s * (1 - sum(normalize(values(w), 1) .^ 2))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
normalize
and .^2
are going to create copies of the weights, which should be avoided. Formula should be adapted so that no allocation happens at all. Maybe you'll need a loop for that.
I just found WeightVec and the much cleaner reworking above. I will use the new abstraction and its support to hold the vector of weights used with rolling, windowed stats. Usually, these weights are normalized in some way and then used. It would be pleasant to do (should this as yet not be). my_weights = new_abstraction_above( my_weights_as_values_in_sequence )
my_normalized_weights = normalize(my_weights)
my_normalized_weights = normalize(my_weights, p_norm = 2)
normalize!(my_weights)
normalize!(my_weights, p_norm = 1.618) without reaching inside your type to access the sum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still isn't ready yet, but I thought I should push what I have and point out a couple spots where I could use a second opinion. @nalimilan and @ararslan Thanks for bearing with this PR, I know it's been a lot of work.
src/StatsBase.jl
Outdated
AnalyticWeights, # the default type for representing a analytic/precision/reliability weight vectors | ||
FrequencyWeights, # the type for representing a frequency weight vectors | ||
ProbabilityWeights, # the type for representing a probability/sampling weight vectors | ||
ExponentialWeights, # the type for representing exponential weights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to delete ExponentialWeights
from export.
src/deprecates.jl
Outdated
@@ -43,3 +37,28 @@ findat(a::AbstractArray, b::AbstractArray) = findat!(Array{Int}(size(b)), a, b) | |||
|
|||
@deprecate df(obj::StatisticalModel) dof(obj) | |||
@deprecate df_residual(obj::StatisticalModel) dof_residual(obj) | |||
|
|||
@weights WeightVec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this WeightVec
should go here or in weights.jl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's deprecated it should live in this file, just needs @deprecate
as appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing using depwarn
inside the WeightVec
and weights
methods is the correct approach here since we want to deprecate the functionality rather than the signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. There's also @deprecate_binding
to deprecate WeightVec
itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't @deprecate_binding
only work if you're just changing the name of the binding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it's needed to print a warning when people use WeightVec
directly. You can deprecate it in favor of FrequencyWeights
, even if that's not fully correct it's better than nothing.
src/weights.jl
Outdated
(ie: [Bessel's correction](https://en.wikipedia.org/wiki/Bessel's_correction)), | ||
otherwise it will return ``\\frac{1}{n}``. | ||
""" | ||
cfactor(n::Integer, corrected=false) = 1 / (n - Int(corrected)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cfactor
(for correction factor) seemed like the best choice for this function now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of this name either: "factor" really is secondary here, what matters is 1) that it's a correction, 2) that it applies to var/cov/std.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
varcorrection
, maybe biascorrection
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
varcorrection
seems a little long, maybe cvar
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's even less descriptive than cfactor
, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, varcorrection
it is then. I don't really care about the name since we're not exporting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, what about varden
(variance denominator)? I feel like the fact that it takes a corrected
argument should cover that it may provide bias correction. I'd need to slightly change what's returned, but that might be more understandable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like varcorrection
better. Or maybe bessel_correction
?
src/weights.jl
Outdated
``\\frac{1}{\sum w}`` | ||
""" | ||
cfactor(wv::AbstractWeights, ::Type{Val{false}}) = 1 / sum(wv) | ||
cfactor(wv::AbstractWeights, ::Type{Val{true}}) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd understand if folks are opposed to the Type{Val{True}}
, but this means subtypes only need to implement the correction condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Val
is really not what you want here: since the value of corrected
will only be known at runtime, you're forcing the call to cfactor
to go via dispatch at runtime, while it would have been inlined if you used ::Bool
.
Again, I don't think we should provide a fallback correction for AbstractWeight
: it's unlikely to be valid, which is worse than raising an error. Creating new weight types shouldn't be a frequent need, people can afford the cost of defining this simple method if it applies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I had missed that you throw an error. But still I think it's better to repeat the uncorrected equation in each method than to use Val
.
test/scalarstats.jl
Outdated
@test zscore(a) ≈ zscore(a, mean(a), std(a)) | ||
@test zscore(a, 1) ≈ zscore(a, mean(a,1), std(a,1)) | ||
@test zscore(a, 2) ≈ zscore(a, mean(a,2), std(a,2)) | ||
@test zscore(a) ≈ zscore(a, mean(a), std(a; corrected=false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The corrected=false
is because mean_and std
now has corrected=false
. In order to support all of the existing behaviour we may need to mix and match which methods have corrected=true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand: why not adapt mean_and_std
to default to corrected=true
? To avoid breaking existing code, we could keep corrected=false
for a while, but print a warning when corrected
isn't specified so that we can make the switch later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm clearly missing something here. How am I suppose to check if a keyword argument is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: We could add that warning if we chose to make corrected
a regular (vs keyword) argument. We'd just need to have a function which doesn't take the corrected argument, so that it can print a warning and call the function with the default setting? I initially thought making corrected
a keyword argument was the best way to stay consistent with base, but var
, std
and cov
aren't even consistent with each other in base, so it might make more sense to do whatever is best for our use case.
std(
...
std(A::AbstractArray, region; corrected, mean) in Base at statistics.jl:263
julia> cov(
cov(x::AbstractArray{T,1} where T, corrected::Bool) in Base at statistics.jl:346
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just have corrected=nothing
to detect whether a keyword argument has been left to its default. Then you just need to replace its value with false
if it's equal to nothing
after printing the warning using Base.depwarn
.
src/StatsBase.jl
Outdated
wmedian, # weighted median | ||
wquantile, # weighted quantile | ||
AbstractWeights, # the abstract type to represent any weight vector | ||
AnalyticWeights, # the default type for representing a analytic/precision/reliability weight vectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Singular "vector". "an analytic". Since these lines are wider than 92 chars, make them a bit shorter e.g. by removing "the", "default", and "for representing"/"to represent".
src/StatsBase.jl
Outdated
FrequencyWeights, # the type for representing a frequency weight vectors | ||
ProbabilityWeights, # the type for representing a probability/sampling weight vectors | ||
ExponentialWeights, # the type for representing exponential weights | ||
weights, # alias for aweights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we still want to export it (even if it's deprecated) to avoid breaking functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deprecate
automatically exports for this reason.
src/cov.jl
Outdated
scattermat(x::DenseMatrix, wv::WeightVec, vardim::Int=1) = | ||
scattermatm(x, Base.mean(x, wv, vardim), wv, vardim) | ||
## weighted cov | ||
function Base.covm(x::DenseMatrix, mean, wv::AbstractWeights, vardim::Int=1, corrected::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is wider than 92 chars, either remove function
or split arguments on multiple lines. Since this file seems to be using the short form for one-line method definitions, it would be more consistent to use it at least here.
src/cov.jl
Outdated
@@ -58,66 +58,41 @@ cov | |||
|
|||
|
|||
""" | |||
mean_and_cov(x, [wv::WeightVec]; vardim=1) -> (mean, cov) | |||
mean_and_cov(x, [wv::AbstractWeights]; vardim=1) -> (mean, cov) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to document corrected
. Same for mean_and_std
and mean_and_var
, var
, varm
, std
, stdm
, cov
, covm
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was holding off updating these docstrings until I was more comfortable with the behaviour. Since, folks don't seem to have an issue with the corrected=false
keyword for all these functions I'll update that.
src/hist.jl
Outdated
@@ -249,14 +247,15 @@ function append!{T,N}(h::AbstractHistogram{T,N}, vs::NTuple{N,AbstractVector}) | |||
end | |||
h | |||
end | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unrelated, and it's not justified since there's no line break before the last append!
definition either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that was a typo from fixing conflicts when I rebased with master.
src/weights.jl
Outdated
eweights(n, [λ]) | ||
|
||
Constructs an `AnalyticWeights` vector with a desired length `n` and smoothing factor `λ`, | ||
where each element is set to ``λ * (1 - λ)^(1 - i)``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"element in position ``i`` "
src/weights.jl
Outdated
|
||
# Arguments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in the guidelines, the "arguments" section isn't recommended, except when there are many arguments. Here, the description of n
is redundant with what is said above. You can just keep the last sentence about λ in the main description or in a separate paragraph.
src/weights.jl
Outdated
Base.getindex(wv::WeightVec, i) = getindex(wv.values, i) | ||
Base.size(wv::WeightVec) = size(wv.values) | ||
""" | ||
eweights(n, [λ]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
λ=0.99
, and drop the brackets. Or maybe drop the default value, as it seems quite arbitrary and unlikely to be exactly what people need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess the 0.99 is pretty arbitrary.
src/weights.jl
Outdated
n > 0 || throw(ArgumentError("cannot construct weights of length < 1")) | ||
0 <= λ <= 1 || throw(ArgumentError("smoothing factor must be between 0 and 1")) | ||
w0 = map(i -> λ * (1 - λ)^(1 - i), 1:n) | ||
aweights(w0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these really analytical weights? The docs should mention this if that's the case. Can you explain what exponential weights are used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I often use exponential weights to describe the relative importance of observations in temporal data. For example, I often want to get summary statistics about some lookback data, but putting greater value on more recent observations (as they better reflect the "current" state of the system). Since, these types of weights definitely aren't probability or frequency weights I got the impression that analytic or reliability weights best fit this. I'm fine with removing this a niche use case, but my understanding is that exponential weights are pretty common when working with temporal data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, these aren't strictly exponential weights. Maybe this means we should keep a generic Weights
type... Feel free to reintroduce it (to replace WeightsVec
), though I'm not sure what's the best name for them.
test/moments.jl
Outdated
# AnalyticWeights | ||
@test var(x, aweights(ones(10)); corrected=true) ≈ var(x) | ||
|
||
w = aweights(rand(10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than applying the formulas, this should use fixed values and hardcode the result. Then I can check that it's correct in other software if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I can check that it's correct in other software if you like
That's a great idea, I'd really appreciate that! What kind of software were you thinking of testing against? This PR is probably big enough as it is, but I was wondering if it would make sense to automate testing against R using RCall
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of the survey and Hmisc R packages, of Stata and maybe SAS. I wouldn't worry about using RCall, we can just hardcode the results for a few cases.
@JeffreySarnoff |
I just want to say thanks so much for your hard work and perseverence here, @rofinn. You're doing fantastic work and I'm excited to see this get merged. |
Is it possible to turn deprecation warnings on and off during testing in julia? It would be nice to check that I haven't broken any existing behaviour without producing a bunch of deprecation warning during testing. |
|
src/cov.jl
Outdated
@@ -44,80 +37,73 @@ that the data are centered and hence there's no need to subtract the mean. | |||
When `vardim = 1`, the variables are considered columns with observations in rows; | |||
when `vardim = 2`, variables are in rows with observations in columns. | |||
""" | |||
function scattermat end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line was actually correct since the docstring documents several methods, not just the following one. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but is that line even necessary (calling help on the method name still works the same)? Also, the other line is just cov
vs function cov end
, what would the preference be to maintain consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cov
refers to an existing object; function cov end
creates a function with 0 methods. Typically the latter is only used for forward-declaring functions to applying docstrings. I think it's usually preferable to use the former when the function has already been defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general it doesn't really make a big difference whether you attach the docstring to a method or to the function, but it matters in some cases: the docstring for the function will (or should) always appear first, and when a link to the source is provided (in the online manual) it wouldn't be really correct to point to a particular method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay, to clarify, the preference moving forward is that we want the docstring attached to function <name> end
with all other methods below that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just keep the existing organization. Always better to avoid making changes unrelated to the PR, else it's very hard to review. For example, moving scattermat_zm
makes it very hard to see where its code changed.
My point was that the docstring should be attached to the method whose signature matches that shown in the first line. If that method doesn't exist, then attach the docstring to the function itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just keep the existing organization. Always better to avoid making changes unrelated to the PR, else it's very hard to review. For example, moving scattermat_zm makes it very hard to see where its code changed.
Fair point. I do tend to get carried away when I'm editing code.
src/cov.jl
Outdated
mean == nothing ? scattermat_zm(x .- Base.mean(x, wv, vardim), wv, vardim) : | ||
scattermat_zm(x .- mean, wv, vardim) | ||
end | ||
* AnalyticWeights: ``\\frac{1}{\sum w - \sum {w^2} / \sum{w}^2}`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formula is incorrect AFAICT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it should be \\frac{1}{\sum w - \sum {w^2} / \sum w}
, which wouldn't be a problem if w
was prenormalized, but still.
julia> cvar1(w) = sum(w) * (1 - sum(normalize(w, 1) .^ 2))
cvar1 (generic function with 1 method)
julia> cvar2(w) = sum(w) - (sum(w .^2) / sum(w))
cvar2 (generic function with 1 method)
julia> w = rand(10)
10-element Array{Float64,1}:
0.608948
0.980859
0.656496
0.994447
0.912615
0.654867
0.0928164
0.308836
0.782151
0.952136
julia> cvar1(w)
6.132433099237098
julia> cvar2(w)
6.132433099237098
Alright, apart from a couple remaining points about how to use @deprecate_binding and documenting groups of methods with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I have check other software with the data you used in the tests, and here are the (positive) results:
AnalyticalWeights
: Confirmed 0.0694434 with R'sHmisc::wtd.var
andnorm=TRUE
.FrequencyWeights
: Confirmed 0.054666 with R'sHmisc::wtd.var
andnorm=FALSE
, with Stata'ssummarize x [iweight w]
(sincefweight
does not accept non-integer weights), and with SAS'sproc means ... vardef=wdf
and with custom computation code.ProbabilityWeights
: confirmed 0.06628969 with R'ssurvey::svyvar
, with Stata'ssvy: mean x; estat sd
andmean x [aweight=w]
(which confusingly is known to give the correct estimation for population variance with... probability weights).
I still have many comments, but the core features are OK.
test/weights.jl
Outdated
@test wquantile(data[1], weights(w), 0.5) ≈ answer | ||
@test quantile(data[1], fweights(w), 0.5) ≈ answer | ||
@test wquantile(data[1], fweights(w), [0.5]) ≈ [answer] | ||
@test wquantile(data[1], fweights(w), 0.5) ≈ answer | ||
@test wquantile(data[1], w, [0.5]) ≈ [answer] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve alignment.
test/weights.jl
Outdated
@test isa(weights([1, 2, 3]), WeightVec{Int}) | ||
@test isa(weights([1., 2., 3.]), WeightVec{Float64}) | ||
@test isa(weights([1 2 3; 4 5 6]), WeightVec{Int}) | ||
@test isa(fweights([1, 2, 3]), AbstractWeights{Int}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the tests in this file should be put inside a loop and run for all types of weights (AbstractWeights
should be replaced with the specific type).
test/deprecates.jl
Outdated
|
||
@testset "StatsBase.Deprecates" begin | ||
|
||
@testset "Deprecates WeightVec and weights" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we don't test deprecated features, we just test their replacement. So just remove this file since you test the new types with a similar code below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like a good idea to test deprecated features as it ensures the deprecations don't break functionality. Adding these tests actually helped me catch a few issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, doing this will print lots of deprecation warnings, which will make the output hard to read, and deprecations from Base that need to be fixed won't be easy to spot. As long as you've tested it once locally, it's OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to keep the ENV["TEST_DEPRECATES"]
behaviour around to make testing locally easier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not. Deprecated code is supposed to be removed relatively soon anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll add an item to my checklist to remove it before dropping the "[WIP]" from this PR, but please just ignore that until then.
src/weights.jl
Outdated
|
||
|
||
""" | ||
wquantile(v, w, p) | ||
|
||
Compute the `p`th quantile(s) of `v` with weights `w`, given as either a vector | ||
or a `WeightVec`. | ||
or a `AbstractWeights`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"An AbstractWeights
object/vector" (since vec is no longer in the name).
src/deprecates.jl
Outdated
|
||
Construct a `WeightVec` with weight values `vs` and sum of weights `wsum`. | ||
""" | ||
function WeightVec{S<:Real, V<:RealVector}(vs::V, s::S=sum(vs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All code for WeightVec
should be moved to deprecated.jl
. That way it's easy to remove in the next version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, this is in deprecates.jl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not sure how I missed that...
test/moments.jl
Outdated
@test moment(x, 4, 4.0) ≈ sum((x .- 4).^4) / length(x) | ||
@test moment(x, 5, 4.0) ≈ sum((x .- 4).^5) / length(x) | ||
|
||
w = fweights([1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test other weight types too?
test/deprecates.jl
Outdated
end | ||
end | ||
|
||
@testset "Covariance" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should check corrected=true
too, and for all weights types. That will be easier if you just hardcode the expected values, or call scattermat
instead of copying the full formulas here.
test/runtests.jl
Outdated
@@ -1,5 +1,13 @@ | |||
using StatsBase | |||
|
|||
opts = Base.JLOptions() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed either.
test/weights.jl
Outdated
@testset "eweights" begin | ||
λ = 0.2 | ||
wv = eweights(4, λ) | ||
@test round(values(wv), 4) == [0.2, 0.25, 0.3125, 0.3906] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy the full precision numbers rather than rounding. That could help spotting precision issues if somebody tweaks the formula.
Should also check the type of wv
.
test/weights.jl
Outdated
|
||
## the sum and mean syntax | ||
@testset "Sum" begin | ||
@test sum([1.0, 2.0, 3.0], fweights([1.0, 0.5, 0.5])) ≈ 3.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this inside a loop and test all weights types? Same below (and everywhere this applies).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll note that this causes tests to take significantly longer to run (particularly test/weights.jl
). Would it make sense to reduce the median and quantile testsets slightly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless some tests are really redundant, I'd rather have too many tests than too few of them.
src/weights.jl
Outdated
""" | ||
varcorrection(w::ProbabilityWeights, corrected=false) | ||
|
||
``\\frac{n}{(n - 1) \sum w}`` where `n = length(w)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we're at it, we should use the more correct n = count(!iszero, w)
. The code and the other docstring will need to be adjusted. To test this, you could simply add an element with a zero weight to the data: the computed variance must remain the same.
As a future optimization, we could store the number of non-zero weights at construction, when the sum is computed, but that's not a priority for this PR.
Alright, I think that addresses all the comments from the last review iteration. Please ignore the |
… a brief description of different weight types.
…ize) or `f` where appropriate.
…tions and decided to only reference the `var`, `std` and `cov` docstrings.
…testset` changes (`cov` was kind of a lost cause).
…e in `var`, `std` and `cov` docstrings.
Yeah, that's painful, sorry about that. Unfortunately, the new .rst docstrings need some more adjustments: some single backticks need to be changed to double (around Julia expressions), lists should use |
Thanks! If you're still willing to work on weighting, it would be great to use the new types in GLM.jl, which interprets weights as frequency weights and should therefore only accept |
Awesome, thanks @nalimilan and @ararslan for putting so much work into reviewing this! Hopefully, my future PRs will be a bit smoother. |
@nalimilan I'll take a look at GLM.jl, but I might not have time to get that PR ready till later this week. |
Base.std(v::RealArray, w::AbstractWeights; mean=nothing, corrected::DepBool=nothing) = | ||
sqrt.(var(v, w; mean=mean, corrected=depcheck(:std, corrected))) | ||
|
||
Base.stdm(v::RealArray, m::RealArray, dim::Int; corrected::DepBool=nothing) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
guess this signature was here before, but it's a bit of piracy, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's been there for a while, but I didn't notice when I updated it. We had a discussion on #248 about it and I think the plan was to move this method to base julia and add a version check, but I haven't gotten around to doing it yet.
Great pull request! Thanks everyone for doing it. Just to be sure, are we sure about the plural form Weights (used in R) rather than the singular form Weight (used in SAS, Stata, Python)? |
Yes, because a vector of weights contains more than one weight. 😉 |
This PR is intended to address concerns raised in issues #53 and #249
Requirements:
Weights
.