Description
Helps #961
Follow up for #543 and #1060
Our current percentile implementation is based on the Nearest Rank with rounding down for non-numbers and a form of linear interpolation for number types.
However, this implementation makes some bold assumptions that can influence the expected outcome, especially for smaller data sizes.
Small background into the topic:
If I mention quantiles; they are the generic form of cutting up a range into continuous intervals with equal probabilities (sub-range size). A percentile is thus a 100-quantile; a subdivision of a range into a hundred pieces of equal size. A quartile is like a 4-quantile. A median can thus be seen as 2-quantile or the 2nd quartile, or 50th percentile, etc.
There are, however, 9 commonly used algorithms to calculate the i-th percentile/quartile/etc. They were collected by Hyndman, Rob & Fan, Yanan. (1996). Sample Quantiles in Statistical Packages. The American Statistician. 50. 361-365. 10.1080/00031305.1996.10473566. and used by almost all libraries/tools that can calculate quantiles, like R, Numpy, SciPy, Apache commons-math (legacy), Apache commons-statistics, Wolfram Mathematica, Matlab
Adapted from Wikipedia:
All 9 methods compute Qp, the (estimate for) the p-quantile.
(When talking about the k-th q-quantile, you get "p = k/q". So for the median, this can be p = 50 / 100 = 2, aka, the 2-quantile)
This is computed from a sample of size N by computing a real valued index h. When h is a whole number, the h-th smallest of the N values, xh, is the quantile estimate. Otherwise a rounding or interpolation scheme is used to compute the quantile estimate from h, x⌊h⌋, and x⌈h⌉, so by rounding down or up from h, respectively.
Type | h | Qp | Notes |
---|---|---|---|
R‑1, SAS‑3, Maple‑1 | Np | x⌈h⌉ | Inverse of empirical distribution function. |
R‑2, SAS‑5, Maple‑2, Stata | Np + 1/2 | (x⌈h – 1/2⌉ + x⌊h + 1/2⌋) / 2 | The same as R-1, but with averaging at discontinuities. |
R‑3, SAS‑2 | Np − 1/2 | x⌊h⌉ | The observation numbered closest to Np. Here, ⌊h⌉ indicates rounding to the nearest integer, choosing the even integer in the case of a tie. |
R‑4, SAS‑1, SciPy‑(0,1), Julia‑(0,1), Maple‑3 | Np | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Linear interpolation of the inverse of the empirical distribution function. |
R‑5, SciPy‑(1/2,1/2), Julia‑(1/2,1/2), Maple‑4 | Np + 1/2 | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Piecewise linear function where the knots are the values midway through the steps of the empirical distribution function. |
R‑6, Excel, Python, SAS‑4, SciPy‑(0,0), Julia-(0,0), Maple‑5, Stata‑altdef | (N + 1)p | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Linear interpolation of the expectations for the order statistics for the uniform distribution on [0,1]. That is, it is the linear interpolation between points (ph, xh), where ph = h/(N+1) is the probability that the last of (N+1) randomly drawn values will not exceed the h-th smallest of the first N randomly drawn values. |
R‑7, Excel, Python, SciPy‑(1,1), Julia-(1,1), Maple‑6, NumPy | (N − 1)p + 1 | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Linear interpolation of the modes for the order statistics for the uniform distribution on [0,1]. |
R‑8, SciPy‑(1/3,1/3), Julia‑(1/3,1/3), Maple‑7 | (N + 1/3)p + 1/3 | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Linear interpolation of the approximate medians for order statistics. |
R‑9, SciPy‑(3/8,3/8), Julia‑(3/8,3/8), Maple‑8 | (N + 1/4)p + 3/8 | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | The resulting quantile estimates are approximately unbiased for the expected order statistics if x is normally distributed. |
- R‑1 through R‑3 are piecewise constant, with discontinuities. (useful for self-comparables)
- R‑4 and following are piecewise linear, without discontinuities, but differ in how h is computed. (useful for primitive numbers)
- R‑3 and R‑4 are not symmetric in that they do not give h = (N + 1) / 2 when p = 1/2.
It might also make sense to create a quantile
function as main entry-point and let percentile, median, quartile, (decile?), call into it with the right value for q in q-quantile.
Now, of course it might take some time to fully implement this. So, until we fully implement this, we should at least be ready for it and take into account that there are multiple options. This could be done by, say, sticking to R-3 and R-7 dependent on the data type and later offer users more choice. These choices should be mentioned in the functions median
and percentile
.