Percentile/quantile estimation types

Helps https://github.com/Kotlin/dataframe/issues/961
Follow up for https://github.com/Kotlin/dataframe/issues/543 and https://github.com/Kotlin/dataframe/pull/1060

Our current percentile implementation is based on the [Nearest Rank](https://en.wikipedia.org/wiki/Percentile#The_nearest-rank_method) with rounding down for non-numbers and a form of [linear interpolation](https://en.wikipedia.org/wiki/Percentile#The_linear_interpolation_between_closest_ranks_method) for number types.
However, this implementation makes some bold assumptions that can influence the expected outcome, especially for smaller data sizes.

Small background into the topic:
If I mention [quantiles](https://en.wikipedia.org/wiki/Quantile); they are the generic form of cutting up a range into continuous intervals with equal probabilities (sub-range size). A [percentile](https://en.wikipedia.org/wiki/Percentile) is thus a 100-quantile; a subdivision of a range into a hundred pieces of equal size. A [quartile](https://en.wikipedia.org/wiki/Quantile#Examples) is like a 4-quantile. A [median](https://en.wikipedia.org/wiki/Median) can thus be seen as 2-quantile or the 2nd quartile, or 50th percentile, etc.

There are, however, [9 commonly used algorithms](https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample) to calculate the i-th percentile/quartile/etc. They were collected by [Hyndman, Rob & Fan, Yanan. (1996). Sample Quantiles in Statistical Packages. The American Statistician. 50. 361-365. 10.1080/00031305.1996.10473566.](https://www.researchgate.net/publication/222105754_Sample_Quantiles_in_Statistical_Packages) and used by almost all libraries/tools that can calculate quantiles, like [R](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html), [Numpy](https://numpy.org/doc/2.1/reference/generated/numpy.quantile.html#re01cd3f3acfe-1), [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mquantiles.html), [Apache commons-math (legacy)](https://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html), [Apache commons-statistics](https://commons.apache.org/proper/commons-statistics/commons-statistics-docs/apidocs/org/apache/commons/statistics/descriptive/Quantile.EstimationMethod.html), [Wolfram Mathematica](https://reference.wolfram.com/language/ref/Quantile.html#DetailsAndOptions), [Matlab](https://github.com/IoSR-Surrey/MatlabToolbox/blob/master/%2Biosr/%2Bstatistics/quantile.m)


Adapted from [Wikipedia](https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample):

All 9 methods compute Qp, the (estimate for) the p-quantile. 
(When talking about the k-th q-quantile, you get "p = k/q". So for the median, this can be p = 50 / 100 = 2, aka, the 2-quantile)

This is computed from a sample of size N by computing a real valued index h. When h is a whole number, the h-th smallest of the N values, xh, is the quantile estimate. Otherwise a rounding or interpolation scheme is used to compute the quantile estimate from h, x⌊h⌋, and x⌈h⌉, so by rounding down or up from h, respectively.

| Type | h | Qp | Notes |
|------|---|---------------|-------|
| R‑1, SAS‑3, Maple‑1 | Np | x⌈h⌉ | Inverse of empirical distribution function. |
| R‑2, SAS‑5, Maple‑2, Stata | Np + 1/2 | (x⌈h – 1/2⌉ + x⌊h + 1/2⌋) / 2 | The same as R-1, but with averaging at discontinuities. |
| R‑3, SAS‑2 | Np − 1/2 | x⌊h⌉ | The observation numbered closest to Np. Here, ⌊h⌉ indicates rounding to the nearest integer, choosing the even integer in the case of a tie. |
| R‑4, SAS‑1, SciPy‑(0,1), Julia‑(0,1), Maple‑3 | Np | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Linear interpolation of the inverse of the empirical distribution function. |
| R‑5, SciPy‑(1/2,1/2), Julia‑(1/2,1/2), Maple‑4 | Np + 1/2 | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Piecewise linear function where the knots are the values midway through the steps of the empirical distribution function. |
| R‑6, Excel, Python, SAS‑4, SciPy‑(0,0), Julia-(0,0), Maple‑5, Stata‑altdef | (N + 1)p | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Linear interpolation of the expectations for the order statistics for the uniform distribution on [0,1]. That is, it is the linear interpolation between points (ph, xh), where ph = h/(N+1) is the probability that the last of (N+1) randomly drawn values will not exceed the h-th smallest of the first N randomly drawn values. |
| R‑7, Excel, Python, SciPy‑(1,1), Julia-(1,1), Maple‑6, NumPy | (N − 1)p + 1 | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Linear interpolation of the modes for the order statistics for the uniform distribution on [0,1]. |
| R‑8, SciPy‑(1/3,1/3), Julia‑(1/3,1/3), Maple‑7 | (N + 1/3)p + 1/3 | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | Linear interpolation of the approximate medians for order statistics. |
| R‑9, SciPy‑(3/8,3/8), Julia‑(3/8,3/8), Maple‑8 | (N + 1/4)p + 3/8 | x⌊h⌋ + (h − ⌊h⌋) (x⌈h⌉ − x⌊h⌋) | The resulting quantile estimates are approximately unbiased for the expected order statistics if x is normally distributed. |

- R‑1 through R‑3 are piecewise constant, with discontinuities. (useful for self-comparables)
- R‑4 and following are piecewise linear, without discontinuities, but differ in how h is computed. (useful for primitive numbers)
- R‑3 and R‑4 are not symmetric in that they do not give h = (N + 1) / 2 when p = 1/2.

It might also make sense to create a `quantile` function as main entry-point and let percentile, median, quartile, (decile?), call into it with the right value for q in q-quantile.

Now, of course it might take some time to fully implement this. So, until we fully implement this, we should at least be ready for it and take into account that there are multiple options. This could be done by, say, sticking to R-3 and R-7 dependent on the data type and later offer users more choice. These choices should be mentioned in the functions `median` and `percentile`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Percentile/quantile estimation types #1121

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Type	h	Q_p	Notes
R‑1, SAS‑3, Maple‑1	Np	x_⌈h⌉	Inverse of empirical distribution function.
R‑2, SAS‑5, Maple‑2, Stata	Np + 1/2	(x_{⌈h – 1/2⌉} + x_{⌊h + 1/2⌋}) / 2	The same as R-1, but with averaging at discontinuities.
R‑3, SAS‑2	Np − 1/2	x_⌊h⌉	The observation numbered closest to Np. Here, ⌊h⌉ indicates rounding to the nearest integer, choosing the even integer in the case of a tie.
R‑4, SAS‑1, SciPy‑(0,1), Julia‑(0,1), Maple‑3	Np	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Linear interpolation of the inverse of the empirical distribution function.
R‑5, SciPy‑(1/2,1/2), Julia‑(1/2,1/2), Maple‑4	Np + 1/2	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Piecewise linear function where the knots are the values midway through the steps of the empirical distribution function.
R‑6, Excel, Python, SAS‑4, SciPy‑(0,0), Julia-(0,0), Maple‑5, Stata‑altdef	(N + 1)p	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Linear interpolation of the expectations for the order statistics for the uniform distribution on [0,1]. That is, it is the linear interpolation between points (p_h, x_h), where p_h = h/(N+1) is the probability that the last of (N+1) randomly drawn values will not exceed the h-th smallest of the first N randomly drawn values.
R‑7, Excel, Python, SciPy‑(1,1), Julia-(1,1), Maple‑6, NumPy	(N − 1)p + 1	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Linear interpolation of the modes for the order statistics for the uniform distribution on [0,1].
R‑8, SciPy‑(1/3,1/3), Julia‑(1/3,1/3), Maple‑7	(N + 1/3)p + 1/3	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	Linear interpolation of the approximate medians for order statistics.
R‑9, SciPy‑(3/8,3/8), Julia‑(3/8,3/8), Maple‑8	(N + 1/4)p + 3/8	x_⌊h⌋ + (h − ⌊h⌋) (x_⌈h⌉ − x_⌊h⌋)	The resulting quantile estimates are approximately unbiased for the expected order statistics if x is normally distributed.

Percentile/quantile estimation types #1121

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions