Skip to content

Weighted sem #754

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 66 commits into from
Feb 6, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
66 commits
Select commit Hold shift + click to select a range
f2f83cf
Add weighted `sem`
Jan 15, 2022
46be952
add weighted sem method
Jan 15, 2022
158fb35
Correct missing method
Jan 17, 2022
43c7419
bug fix
Jan 17, 2022
e770708
Bug fix
Jan 17, 2022
e5a03a8
Update src/scalarstats.jl
Jan 17, 2022
82e4de2
Update src/scalarstats.jl
Jan 17, 2022
5eba9d4
Update src/scalarstats.jl
Jan 17, 2022
2ef18eb
change μ to mean
Jan 18, 2022
1307200
Update src/scalarstats.jl
Jan 18, 2022
7aafe38
broadcast weights
Jan 18, 2022
aed6531
Update src/scalarstats.jl
Jan 19, 2022
e49430c
Work with arbitrary iterators, fuse loops
Jan 19, 2022
99d8505
Update src/scalarstats.jl
Jan 19, 2022
5cd2e4d
Update src/scalarstats.jl
Jan 19, 2022
43e354f
Apply review comments
Jan 19, 2022
44d0b0e
RealArray instead of Array{<:Number}
Jan 19, 2022
094c30c
Improve citation
Jan 22, 2022
a3c7e61
Apply code review suggestions
Jan 23, 2022
5d25c7a
whitespace
Jan 23, 2022
59f8b8e
Update src/scalarstats.jl
Jan 23, 2022
73cfb64
Update src/scalarstats.jl
Jan 23, 2022
af78ffa
Update src/scalarstats.jl
Jan 23, 2022
9264d0a
Update src/scalarstats.jl
Jan 23, 2022
50ee853
Update src/scalarstats.jl
Jan 23, 2022
c83beea
Update src/scalarstats.jl
Jan 23, 2022
4f25268
Update src/scalarstats.jl
Jan 23, 2022
9ef970a
Apply suggestions
Jan 23, 2022
1ea8ef6
formatting
Jan 23, 2022
3f2361f
Update src/scalarstats.jl
Jan 23, 2022
b882c49
Drop real requirement
Jan 24, 2022
18cf255
Update src/scalarstats.jl
Jan 24, 2022
8b556bc
Test mean keyword with FrequencyWeights
Jan 24, 2022
dcff0b1
Update src/scalarstats.jl
Jan 24, 2022
162631b
Update src/scalarstats.jl
Jan 24, 2022
d213c93
Update src/scalarstats.jl
Jan 24, 2022
a93a15d
Use original `sem` implementation
Jan 24, 2022
74d2c3c
use original SEM implementation
Jan 24, 2022
adaf472
original sem implementation
Jan 24, 2022
acf81ea
original sem implementation
Jan 24, 2022
303c514
test fix
Jan 24, 2022
5bd4a91
Dispatch on mean
Jan 27, 2022
8d904b5
remove internal method for weights
Jan 27, 2022
f762e37
Bug
Jan 27, 2022
529cd21
Apply code review
Jan 29, 2022
8afc950
Update src/scalarstats.jl
Jan 29, 2022
989a383
Fix tests
Jan 29, 2022
b469d6d
Return NaN for FrequencyWeights
Jan 29, 2022
e25c4b0
Type instability fix
Jan 30, 2022
d56a98f
Fix array version to return NaN instead of error
Jan 30, 2022
22c8cc0
Remove useless tests
Jan 30, 2022
6b90e77
Update src/scalarstats.jl
Jan 31, 2022
f3c9714
Update src/scalarstats.jl
nalimilan Feb 1, 2022
cfba9a1
Update test/scalarstats.jl
nalimilan Feb 1, 2022
d024475
Update src/scalarstats.jl
Feb 2, 2022
b64d44d
Add test
Feb 2, 2022
e31748a
Update src/scalarstats.jl
Feb 2, 2022
72d1e48
Fix type instability
Feb 2, 2022
6ceeac4
Bug fix
Feb 2, 2022
5779144
Test for type stability
Feb 2, 2022
a520d20
Fix type instability
Feb 3, 2022
7b40f1c
mean type fix
Feb 3, 2022
4683ec4
test for type inference
Feb 3, 2022
ab29aa7
Update src/scalarstats.jl
Feb 5, 2022
1f32003
Update src/scalarstats.jl
Feb 5, 2022
632e212
Update src/scalarstats.jl
Feb 5, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 90 additions & 25 deletions src/scalarstats.jl
Original file line number Diff line number Diff line change
Expand Up @@ -261,36 +261,101 @@ realXcY(x::Real, y::Real) = x*y
realXcY(x::Complex, y::Complex) = real(x)*real(y) + imag(x)*imag(y)

"""
sem(x)
sem(x; mean=nothing)
sem(x::AbstractArray[, weights::AbstractWeights]; mean=nothing)

Return the standard error of the mean of collection `x`,
i.e. `sqrt(var(x, corrected=true) / length(x))`.
Return the standard error of the mean for a collection `x`.
A pre-computed `mean` may be provided.

When not using weights, this is the (sample) standard deviation
divided by the sample size. If weights are used, the
variance of the sample mean is calculated as follows:

* `AnalyticWeights`: Not implemented.
* `FrequencyWeights`: ``\\frac{\\sum_{i=1}^n w_i (x_i - \\bar{x_i})^2}{(\\sum w_i) (\\sum w_i - 1)}``
* `ProbabilityWeights`: ``\\frac{n}{n-1} \\frac{\\sum_{i=1}^n w_i^2 (x_i - \\bar{x_i})^2}{\\left( \\sum w_i \\right)^2}``

The standard error is then the square root of the above quantities.

# References

Carl-Erik Särndal, Bengt Swensson, Jan Wretman (1992). Model Assisted Survey Sampling.
New York: Springer. pp. 51-53.
"""
function sem(x)
y = iterate(x)
if y === nothing
function sem(x; mean=nothing)
if isempty(x)
# Return the NaN of the type that we would get for a nonempty x
T = eltype(x)
# Return the NaN of the type that we would get, had this collection
# contained any elements (this is consistent with std)
return oftype(sqrt((abs2(zero(T)) + abs2(zero(T)))/2), NaN)
end
count = 1
value, state = y
y = iterate(x, state)
# Use Welford algorithm as seen in (among other places)
# Knuth's TAOCP, Vol 2, page 232, 3rd edition.
M = value / 1
S = real(zero(M))
while y !== nothing
_mean = mean === nothing ? zero(T) / 1 : mean
z = abs2(zero(T) - _mean)
return oftype((z + z) / 2, NaN)
elseif mean === nothing
n = 0
y = iterate(x)
value, state = y
# Use Welford algorithm as seen in (among other places)
# Knuth's TAOCP, Vol 2, page 232, 3rd edition.
_mean = value / 1
sse = real(zero(_mean))
while y !== nothing
value, state = y
y = iterate(x, state)
n += 1
new_mean = _mean + (value - _mean) / n
sse += realXcY(value - _mean, value - new_mean)
_mean = new_mean
end
else
n = 1
y = iterate(x)
value, state = y
y = iterate(x, state)
count += 1
new_M = M + (value - M) / count
S = S + realXcY(value - M, value - new_M)
M = new_M
sse = abs2(value - mean)
while (y = iterate(x, state)) !== nothing
value, state = y
n += 1
sse += abs2(value - mean)
end
end
variance = sse / (n - 1)
return sqrt(variance / n)
end

function sem(x::AbstractArray; mean=nothing)
if isempty(x)
# Return the NaN of the type that we would get for a nonempty x
T = eltype(x)
_mean = mean === nothing ? zero(T) / 1 : mean
z = abs2(zero(T) - _mean)
return oftype((z + z) / 2, NaN)
end
return sqrt(var(x; mean=mean, corrected=true) / length(x))
end

function sem(x::AbstractArray, weights::UnitWeights; mean=nothing)
if length(x) ≠ length(weights)
throw(DimensionMismatch("array and weights do not have the same length"))
end
return sem(x; mean=mean)
end


# Weighted methods for the above
sem(x::AbstractArray, weights::FrequencyWeights; mean=nothing) =
sqrt(var(x, weights; mean=mean, corrected=true) / sum(weights))

function sem(x::AbstractArray, weights::ProbabilityWeights; mean=nothing)
if isempty(x)
# Return the NaN of the type that we would get for a nonempty x
return var(x, weights; mean=mean, corrected=true) / 0
else
_mean = mean === nothing ? Statistics.mean(x, weights) : mean
# sum of squared errors = sse
sse = sum(Broadcast.instantiate(Broadcast.broadcasted(x, weights) do x_i, w
return abs2(w * (x_i - _mean))
end))
n = count(!iszero, weights)
return sqrt(sse * n / (n - 1)) / sum(weights)
end
var = S / (count - 1)
return sqrt(var/count)
end

# Median absolute deviation
Expand Down
34 changes: 30 additions & 4 deletions test/scalarstats.jl
Original file line number Diff line number Diff line change
Expand Up @@ -107,10 +107,36 @@ z2 = [8. 2. 3. 1.; 24. 10. -1. -1.; 20. 12. 1. -2.]
@test variation([1:5;]) ≈ 0.527046276694730
@test variation(skipmissing([missing; 1:5; missing])) ≈ 0.527046276694730

@test sem([1:5;]) ≈ 0.707106781186548
@test sem(skipmissing([missing; 1:5; missing])) ≈ 0.707106781186548
@test sem(Int[]) === NaN
@test sem(skipmissing(Union{Int,Missing}[missing, missing])) === NaN
@test @inferred(sem([1:5;])) ≈ 0.707106781186548
@test @inferred(sem(skipmissing([missing; 1:5; missing]))) ≈ 0.707106781186548
@test @inferred(sem(skipmissing([missing; 1:5; missing]), mean=3.0)) ≈ 0.707106781186548
@test @inferred(sem([1:5;], UnitWeights{Int}(5))) ≈ 0.707106781186548
@test @inferred(sem([1:5;], UnitWeights{Int}(5); mean=mean(1:5))) ≈ 0.707106781186548
@test_throws DimensionMismatch sem(1:5, UnitWeights{Int}(4))
@test @inferred(sem([1:5;], ProbabilityWeights([1:5;]))) ≈ 0.6166 rtol=.001
μ = mean(1:5, ProbabilityWeights([1:5;]))
@test @inferred(sem([1:5;], ProbabilityWeights([1:5;]); mean=μ)) ≈ 0.6166 rtol=.001
@test @inferred(sem([10; 1:5;], ProbabilityWeights([0; 1:5;]); mean=μ)) ≈ 0.6166 rtol=.001
x = sort!(vcat([5:-1:i for i in 1:5]...))
μ = mean(x)
@test @inferred(sem([1:5;], FrequencyWeights([1:5;]))) ≈ sem(x)
@test @inferred(sem([1:5;], FrequencyWeights([1:5;]); mean=μ)) ≈ sem(x)

@inferred sem([1:5f0;]; mean=μ) ≈ sem(x)
@inferred sem([1:5f0;], ProbabilityWeights([1:5;]); mean=μ)
@inferred sem([1:5f0;], FrequencyWeights([1:5;]); mean=μ)
# Broken: Bug to do with Statistics.jl's implementation of `var`
# @inferred sem([1:5f0;], UnitWeights{Int}(5); mean=μ)

@test @inferred(isnan(sem(Int[])))
@test @inferred(isnan(sem(Int[], FrequencyWeights(Int[]))))
@test @inferred(isnan(sem(Int[], ProbabilityWeights(Int[]))))

@test @inferred(isnan(sem(Int[]; mean=0f0)))
@test @inferred(isnan(sem(Int[], FrequencyWeights(Int[]); mean=0f0)))
@test @inferred(isnan(sem(Int[], ProbabilityWeights(Int[]); mean=0f0)))

@test @inferred(isnan(sem(skipmissing(Union{Int,Missing}[missing, missing]))))
@test_throws MethodError sem(Any[])
@test_throws MethodError sem(skipmissing([missing]))

Expand Down