Skip to content

Commit 6a78bce

Browse files
rofinnnalimilan
authored andcommitted
Add support for different weight vector types (#250)
Introduce four different weight types with different bias correction factors for variance, covariance and standard deviation.
1 parent ba476dc commit 6a78bce

21 files changed

+1289
-791
lines changed

docs/source/counts.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ Counting over an Integer Range
88

99
.. function:: counts(x, a:b[, wv])
1010

11-
Count the number of times (or total weights if a weight vector ``wv`` is given) values in ``a:b`` appear in array ``x``. Here, the optional argument ``wv`` should be a weight vector of type ``WeightVec`` (see :ref:`weightvec`).
11+
Count the number of times (or total weights if a weight vector ``wv`` is given) values in ``a:b`` appear in array ``x``. Here, the optional argument ``wv`` should be a weight vector of type ``AbstractWeights`` (see :ref:`weightvec`).
1212

1313
This function returns a vector ``r`` of length ``n``, with ``n = length(a:b) = b-a+1``. In particular, we have
1414

docs/source/cov.rst

Lines changed: 13 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -22,14 +22,21 @@ This package implements functions for computing scatter matrix, as well as weigh
2222

2323
.. function:: scatter(X, wv[; vardim=..., mean=...])
2424

25-
Weighted scatter matrix. The weights are given by a weight vector ``wv`` of type ``WeightVec`` (see :ref:`weightvec`).
25+
Weighted scatter matrix. The weights are given by a weight vector ``wv`` of type ``AbstractWeights`` (see :ref:`weightvec`).
2626

27-
.. function:: cov(X, wv[; vardim=..., mean=...])
27+
.. function:: cov(X, w[; vardim=..., mean=..., corrected=...])
2828

29-
Weighted covariance matrix.
29+
Compute the weighted covariance matrix. Similar to ``var`` and ``std`` the biased covariance matrix (``corrected=false``) is computed by multiplying ``scattermat(X, w)`` by :math:`\frac{1}{\sum{w}}` to normalize.
30+
However, the unbiased covariance matrix (``corrected=true``) is dependent on the type of weights used:
3031

31-
**Note:** By default, the covariance is normalized by the sum of weights, that is, ``cov(X, wv)`` is equal to ``scatter(X, wv) / sum(wv)``.
32+
* ``AnalyticWeights``: :math:`\frac{1}{\sum w - \sum {w^2} / \sum w}`
33+
* ``FrequencyWeights``: :math:`\frac{1}{\sum{w} - 1}`
34+
* ``ProbabilityWeights``: :math:`\frac{n}{(n - 1) \sum w}` where ``n`` equals ``count(!iszero, w)``
35+
* ``Weights``: ``ArgumentError`` (bias correction not supported)
3236

33-
.. function:: mean_and_cov(x[, wv][; vardim=...])
37+
.. function:: mean_and_cov(x[, wv][; vardim=..., corrected=...])
3438

35-
Jointly compute the mean and covariance of ``x``.
39+
Jointly compute the mean and covariance matrix as a tuple.
40+
A weighting vector ``wv`` can be specified. ``vardim`` that designates whether the variables are columns in the matrix (``1``) or rows (``2``).
41+
Finally, bias correction is applied to the covariance calculation if ``corrected=true``.
42+
See ``cov`` documentation for more details.

docs/source/empirical.rst

Lines changed: 5 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -14,12 +14,12 @@ Histograms can be fitted to data using the ``fit`` method.
1414

1515
**Arguments:**
1616

17-
``data``
17+
``data``
1818
is either a vector (for a 1-dimensional histogram), or a tuple of
1919
vectors of equal length (for an *n*-dimensional histogram).
2020

2121
``weight``
22-
is an optional ``:ref:`weightvec` WeightVec``` (of the same length as the
22+
is an optional ``:ref:`weightvec` AbstractWeights``` (of the same length as the
2323
data vectors), denoting the weight each observation contributes to the
2424
bin. If no weight vector is supples, each observation has weight 1.
2525

@@ -30,7 +30,7 @@ Histograms can be fitted to data using the ``fit`` method.
3030

3131
**Keyword arguments:**
3232

33-
``closed=:left/:right``
33+
``closed=:left/:right``
3434
determines whether the bin intervals are left-closed [a,b), or right-closed
3535
(a,b] (default = ``:right``).
3636

@@ -48,7 +48,7 @@ Histograms can be fitted to data using the ``fit`` method.
4848
h = fit(Histogram, rand(100), weights(rand(100)), 0:0.1:1.0)
4949
h = fit(Histogram, [20], 0:20:100)
5050
h = fit(Histogram, [20], 0:20:100, closed=:left)
51-
51+
5252
# Multivariate
5353
h = fit(Histogram, (rand(100),rand(100)))
5454
h = fit(Histogram, (rand(100),rand(100)),nbins=10)
@@ -60,7 +60,6 @@ Empirical Cumulative Distribution Function
6060

6161
.. function:: ecdf(x)
6262

63-
Return an empirical cumulative distribution function based on a vector of samples given in ``x``.
63+
Return an empirical cumulative distribution function based on a vector of samples given in ``x``.
6464

6565
**Note:** this is a higher-level function that returns a function, which can then be applied to evaluate CDF values on other samples.
66-

docs/source/means.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ The package provides functions to compute means of different kinds.
3232
3333
.. function:: mean(x, w)
3434

35-
The ``mean`` function is also extended to accept a weight vector of type ``WeightVec`` (see :ref:`weightvec`) to compute weighted mean.
35+
The ``mean`` function is also extended to accept a weight vector of type ``AbstractWeights`` (see :ref:`weightvec`) to compute weighted mean.
3636

3737
**Examples:**
3838

@@ -43,7 +43,7 @@ The package provides functions to compute means of different kinds.
4343
4444
.. function:: mean(x, w, dim)
4545

46-
Compute weighted means of ``x`` along a certain dimension (specified by an integer ``dim``). The weights are given by a weight vector ``w`` (of type ``WeightVec``).
46+
Compute weighted means of ``x`` along a certain dimension (specified by an integer ``dim``). The weights are given by a weight vector ``w`` (of type ``AbstractWeights``).
4747

4848
.. function:: mean!(dst, x, w, dim)
4949

docs/source/sampling.rst

Lines changed: 13 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -9,12 +9,11 @@ The package provides functions for sampling from a given population (with or wit
99
.. function:: sample([rng], a)
1010

1111
Randomly draw an element from an array ``a``.
12-
1312
Optionally specify a random number generator ``rng`` as the first argument (defaults to ``Base.GLOBAL_RNG``).
1413

15-
.. function:: sample([rng], a, n[; replace=true, ordered=false])
14+
.. function:: sample([rng], a, n[; replace=true, ordered=false])
1615

17-
Randomly draw ``n`` elements from ``a``.
16+
Randomly draw ``n`` elements from ``a``.
1817

1918
Optionally specify a random number generator ``rng`` as the first argument (defaults to ``Base.GLOBAL_RNG``).
2019

@@ -26,14 +25,13 @@ The package provides functions for sampling from a given population (with or wit
2625
.. function:: sample!([rng], a, x[; replace=true, ordered=false])
2726

2827
Draw ``length(x)`` elements from ``a`` and write them to a pre-allocated array ``x``.
29-
3028
Optionally specify a random number generator ``rng`` as the first argument (defaults to ``Base.GLOBAL_RNG``).
3129

32-
.. function:: sample([rng], wv)
30+
.. function:: sample([rng], wv)
3331

34-
Draw an integer in ``1:length(wv)`` with probabilities proportional to the weights given in ``wv``.
32+
Draw an integer in ``1:length(wv)`` with probabilities proportional to the weights given in ``wv``.
3533

36-
Here, ``wv`` should be a weight vector of type ``WeightVec`` (see :ref:`weightvec`).
34+
Here, ``wv`` should be a weight vector of type ``AbstractWeights`` (see :ref:`weightvec`).
3735

3836
Optionally specify a random number generator ``rng`` as the first argument (defaults to ``Base.GLOBAL_RNG``).
3937

@@ -52,7 +50,7 @@ The package provides functions for sampling from a given population (with or wit
5250
**Keyword arguments**
5351

5452
- ``replace``: indicates whether to have replacement (default = ``true``).
55-
- ``ordered``: indicates whether to arrange the samples in ascending order (default = ``false``).
53+
- ``ordered``: indicates whether to arrange the samples in ascending order (default = ``false``).
5654

5755
.. function:: sample!([rng], a, wv, x[; replace=true, ordered=false])
5856

@@ -74,7 +72,7 @@ Here are a list of algorithms implemented in the package. The functions below ar
7472

7573
- ``a``: source array representing the population
7674
- ``x``: the destination array
77-
- ``wv``: the weight vector (of type ``WeightVec``), for weighted sampling
75+
- ``wv``: the weight vector (of type ``AbstractWeights``), for weighted sampling
7876
- ``n``: the length of ``a``
7977
- ``k``: the length of ``x``. For sampling without replacement, ``k`` must not exceed ``n``.
8078
- ``rng``: optional random number generator (defaults to ``Base.GLOBAL_RNG``)
@@ -108,7 +106,7 @@ All following functions write results to ``x`` (pre-allocated) and return ``x``.
108106

109107
.. function:: fisher_yates_sample!([rng], a, x)
110108

111-
*Fisher-Yates shuffling* (with early termination).
109+
*Fisher-Yates shuffling* (with early termination).
112110

113111
Pseudo-code ::
114112

@@ -118,15 +116,15 @@ All following functions write results to ``x`` (pre-allocated) and return ``x``.
118116
swap inds[i] with a random one in inds[i:n]
119117
set x[i] = a[inds[i]]
120118
end
121-
119+
122120

123121
This algorithm consumes ``k`` random numbers. It uses an integer array of length ``n`` internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm especially when ``n`` is greater than ``k``.
124122

125123
.. function:: self_avoid_sample!([rng], a, x)
126124

127-
Use a set to maintain the index that has been sampled. Each time draw a new index, if the index has already been sampled, redraw until it draws an unsampled one.
125+
Use a set to maintain the index that has been sampled. Each time draw a new index, if the index has already been sampled, redraw until it draws an unsampled one.
128126

129-
This algorithm consumes about (or slightly more than) ``k`` random numbers, and requires ``O(k)`` memory to store the set of sampled indices. Very fast when ``n >> k``.
127+
This algorithm consumes about (or slightly more than) ``k`` random numbers, and requires ``O(k)`` memory to store the set of sampled indices. Very fast when ``n >> k``.
130128

131129
However, if ``k`` is large and approaches ``n``, the rejection rate would increase drastically, resulting in poorer performance.
132130

@@ -153,7 +151,7 @@ All following functions write results to ``x`` (pre-allocated) and return ``x``.
153151

154152
*Direct sampling.*
155153

156-
Draw each sample by scanning the weight vector.
154+
Draw each sample by scanning the weight vector.
157155

158156
This algorithm: (1) consumes ``k`` random numbers; (2) has time complexity ``O(n k)``, as scanning the weight vector each time takes ``O(n)``; and (3) requires no additional memory space.
159157

@@ -173,5 +171,4 @@ All following functions write results to ``x`` (pre-allocated) and return ``x``.
173171

174172
It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.
175173

176-
This algorithm consumes ``O(k)`` random numbers, and has overall time complexity ``O(n k)``.
177-
174+
This algorithm consumes ``O(k)`` random numbers, and has overall time complexity ``O(n k)``.

docs/source/scalarstats.rst

Lines changed: 41 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -6,54 +6,73 @@ The package implements functions for computing various statistics over an array
66
Moments
77
---------
88

9-
.. function:: var(x, wv[; mean=...])
9+
.. function:: var(x, w, [dim][; mean=..., corrected=...])
1010

11-
Compute weighted variance.
11+
Compute the variance of a real-valued array ``x``, optionally over a dimension ``dim``.
12+
Observations in ``x`` are weighted using weight vector ``w``.
13+
The uncorrected (when ``corrected=false``) sample variance is defined as:
1214

13-
One can set the keyword argument ``mean``, which can be either ``nothing`` (to compute the mean value within the function), ``0``, or a pre-computed mean value.
15+
:math:`\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - m}\right)^2 }`
1416

15-
**Note:** the result is normalized by ``sum(wv)`` without correction.
17+
where ``n`` is the length of the input and ``m`` is the mean.
18+
The unbiased estimate (when ``corrected=true``) of the population variance is computed by
19+
replacing :math:`\frac{1}{\sum{w}}` with a factor dependent on the type of weights used:
1620

17-
.. function:: var(x, wv, dim[; mean=...])
21+
* ``AnalyticWeights``: :math:`\frac{1}{\sum w - \sum {w^2} / \sum w}`
22+
* ``FrequencyWeights``: :math:`\frac{1}{\sum{w} - 1}`
23+
* ``ProbabilityWeights``: :math:`\frac{n}{(n - 1) \sum w}` where ``n`` equals ``count(!iszero, w)``
24+
* ``Weights``: ``ArgumentError`` (bias correction not supported)
1825

19-
Weighted variance along a specific dimension.
26+
.. function:: std(v, w, [dim][; mean=..., corrected=...])
2027

21-
.. function:: std(x, wv[; mean=...])
28+
Compute the standard deviation of a real-valued array ``x``, optionally over a dimension ``dim``.
29+
Observations in ``x`` are weighted using weight vector ``w``.
30+
The uncorrected (when ``corrected=false``) sample standard deviation is defined as:
2231

23-
Compute weighted standard deviation.
32+
:math:`\sqrt{\frac{1}{\sum{w}} \sum_{i=1}^n {w_i\left({x_i - m}\right)^2 }}`
2433

25-
One can set the keyword argument ``mean``, which can be either ``nothing`` (to compute the mean value within the function), ``0``, or a pre-computed mean value.
34+
where ``n`` is the length of the input and ``m`` is the mean.
35+
The unbiased estimate (when ``corrected=true``) of the population standard deviation is
36+
computed by replacing :math:`\frac{1}{\sum{w}}` with a factor dependent on the type of
37+
weights used:
2638

27-
.. function:: std(x, wv, dim[; mean=...])
39+
* ``AnalyticWeights``: :math:`\frac{1}{\sum w - \sum {w^2} / \sum w}`
40+
* ``FrequencyWeights``: :math:`\frac{1}{\sum{w} - 1}`
41+
* ``ProbabilityWeights``: :math:`\frac{n}{(n - 1) \sum w}` where ``n`` equals ``count(!iszero, w)``
42+
* ``Weights``: ``ArgumentError`` (bias correction not supported)
2843

29-
Weighted standard deviation along a specific dimension.
44+
.. function:: mean_and_var(x[, w][, dim][; corrected=...])
3045

31-
.. function:: mean_and_var(x[, wv][, dim])
46+
Jointly compute the mean and variance of a real-valued array ``x``, optionally over a dimension ``dim``, as a tuple.
47+
Observations in ``x`` can be weighted using weight vector ``w``.
48+
Finally, bias correction is be applied to the variance calculation if ``corrected=true``.
49+
See ``var`` documentation for more details.
3250

33-
Jointly compute the mean and variance of ``x``.
51+
.. function:: mean_and_std(x[, w][, dim][; corrected=...])
3452

35-
.. function:: mean_and_std(x[, wv][, dim])
36-
37-
Jointly compute the mean and standard deviation of ``x``.
53+
Jointly compute the mean and standard deviation of a real-valued array ``x``, optionally over a dimension ``dim``, as a tuple.
54+
A weighting vector ``w`` can be specified to weight the estimates.
55+
Finally, bias correction is applied to the standard deviation calculation if ``corrected=true``.
56+
See ``std`` documentation for more details.
3857

3958
.. function:: skewness(x[, wv])
4059

4160
Compute the (standardized) `skewness <http://en.wikipedia.org/wiki/Skewness>`_ of ``x``.
4261

43-
One can optionally supply a weight vector of type ``WeightVec`` (see :ref:`weightvec`).
62+
One can optionally supply a weight vector of type ``AbstractWeights`` (see :ref:`weightvec`).
4463

4564
.. function:: kurtosis(x[, wv])
4665

4766
Compute the (excessive) `kurtosis <http://en.wikipedia.org/wiki/Kurtosis>`_ of ``x``.
4867

49-
One can optionally supply a weight vector of type ``WeightVec`` (see :ref:`weightvec`).
68+
One can optionally supply a weight vector of type ``AbstractWeights`` (see :ref:`weightvec`).
5069

5170
.. function:: moment(x, k[, m][, wv])
5271

5372
Compute the ``k``-th order central moment of the values in `x`. It is the sample mean of
5473
``(x - mean(x)).^k``.
5574

56-
One can optionally supply the center ``m``, and/or a weight vector of type ``WeightVec`` (see :ref:`weightvec`).
75+
One can optionally supply the center ``m``, and/or a weight vector of type ``AbstractWeights`` (see :ref:`weightvec`).
5776

5877

5978
Measurements of Variation
@@ -160,7 +179,7 @@ Quantile and Friends
160179

161180
.. function:: median(x, w)
162181

163-
Compute the weighted median of ``x``, using weights given by a weight vector ``w`` (of type ``WeightVec``). The weight and data vectors must have the same length. The weighted median :math:`x_k` is the element of ``x`` that satisfies :math:`\sum_{x_i < x_k} w_i \le \frac{1}{2} \sum_{j} w_j` and :math:`\sum_{x_i > x_k} w_i \le \frac{1}{2} \sum_{j} w_j`. If a weight has value zero, then its associated data point is ignored. If none of the weights are positive, an error is thrown. ``NaN`` is returned if ``x`` contains any ``NaN`` values. An error is raised if ``w`` contains any ``NaN`` values.
182+
Compute the weighted median of ``x``, using weights given by a weight vector ``w`` (of type ``AbstractWeights``). The weight and data vectors must have the same length. The weighted median :math:`x_k` is the element of ``x`` that satisfies :math:`\sum_{x_i < x_k} w_i \le \frac{1}{2} \sum_{j} w_j` and :math:`\sum_{x_i > x_k} w_i \le \frac{1}{2} \sum_{j} w_j`. If a weight has value zero, then its associated data point is ignored. If none of the weights are positive, an error is thrown. ``NaN`` is returned if ``x`` contains any ``NaN`` values. An error is raised if ``w`` contains any ``NaN`` values.
164183

165184
**Examples:**
166185

@@ -171,8 +190,8 @@ Quantile and Friends
171190
172191
.. function:: quantile(x, w, p)
173192

174-
Compute the weighted quantiles of a vector ``x`` at a specified set of probability values ``p``, using weights given by a weight vector ``w`` (of type ``WeightVec``). Weights must not be negative. The weights and data vectors must have the same length. The quantile for :math:`p` is defined as follows. Denoting :math:`S_k = (k-1)w_k + (n-1) \sum_{i<k}w_i`, define :math:`x_{k+1}` the smallest element of ``x`` such that :math:`S_{k+1}/S_{n}` is strictly superior to :math:`p`. The function returns :math:`(1-\gamma) x_k + \gamma x_{k+1}` with :math:`\gamma = (pS_n- S_k)/(S_{k+1}-S_k)`. This corresponds to R-7, Excel, SciPy-(1,1), Maple-6 when ``w`` is one (see https://en.wikipedia.org/wiki/Quantile).
175-
193+
Compute the weighted quantiles of a vector ``x`` at a specified set of probability values ``p``, using weights given by a weight vector ``w`` (of type ``AbstractWeights``). Weights must not be negative. The weights and data vectors must have the same length. The quantile for :math:`p` is defined as follows. Denoting :math:`S_k = (k-1)w_k + (n-1) \sum_{i<k}w_i`, define :math:`x_{k+1}` the smallest element of ``x`` such that :math:`S_{k+1}/S_{n}` is strictly superior to :math:`p`. The function returns :math:`(1-\gamma) x_k + \gamma x_{k+1}` with :math:`\gamma = (pS_n- S_k)/(S_{k+1}-S_k)`. This corresponds to R-7, Excel, SciPy-(1,1), Maple-6 when ``w`` is one (see https://en.wikipedia.org/wiki/Quantile).
194+
176195
Mode and Modes
177196
---------------
178197

0 commit comments

Comments
 (0)