diff --git a/parquet/src/bloom_filter/mod.rs b/parquet/src/bloom_filter/mod.rs
index d99c7251902e..ebdb39994244 100644
--- a/parquet/src/bloom_filter/mod.rs
+++ b/parquet/src/bloom_filter/mod.rs
@@ -16,7 +16,61 @@
 // under the License.
 
 //! Bloom filter implementation specific to Parquet, as described
-//! in the [spec](https://github.com/apache/parquet-format/blob/master/BloomFilter.md).
+//! in the [spec][parquet-bf-spec].
+//!
+//! # Bloom Filter Size
+//!
+//! Parquet uses the [Split Block Bloom Filter][sbbf-paper] (SBBF) as its bloom filter
+//! implementation. For each column on which bloom filters are enabled, the offset and length of an
+//! SBBF are stored in the metadata for each row group in the parquet file. The size of each filter
+//! is calculated from the desired number of distinct values (NDV) and false
+//! positive probability (FPP). The FPP for an SBBF can be approximated as[1][bf-formulae]:
+//!
+//! ```text
+//! f = (1 - e^(-k * n / m))^k
+//! ```
+//!
+//! Where `f` is the FPP, `k` the number of hash functions, `n` the NDV, and `m` the total number
+//! of bits in the bloom filter. This can be rearranged to determine the total number of bits
+//! required to achieve a given FPP and NDV:
+//!
+//! ```text
+//! m = -k * n / ln(1 - f^(1/k))
+//! ```
+//!
+//! SBBFs use eight hash functions so that they fit cleanly in SIMD lanes[2][sbbf-paper]; therefore
+//! `k` is set to 8. The SBBF spreads those `m` bits across a set of `b` blocks that
+//! are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen as:
+//!
+//! ```text
+//! b = NP2(m/8) / 32
+//! ```
+//!
+//! Where `NP2` denotes *the next power of two*, and `m` is divided by 8 to convert bits to bytes.
+//!
+//! Here is a table of calculated sizes for various FPP and NDV:
+//!
+//! | NDV       | FPP       | b       | Size (KB) |
+//! |-----------|-----------|---------|-----------|
+//! | 10,000    | 0.1       | 256     | 8         |
+//! | 10,000    | 0.01      | 512     | 16        |
+//! | 10,000    | 0.001     | 1,024   | 32        |
+//! | 10,000    | 0.0001    | 1,024   | 32        |
+//! | 100,000   | 0.1       | 4,096   | 128       |
+//! | 100,000   | 0.01      | 4,096   | 128       |
+//! | 100,000   | 0.001     | 8,192   | 256       |
+//! | 100,000   | 0.0001    | 16,384  | 512       |
+//! | 100,000   | 0.00001   | 16,384  | 512       |
+//! | 1,000,000 | 0.1       | 32,768  | 1,024     |
+//! | 1,000,000 | 0.01      | 65,536  | 2,048     |
+//! | 1,000,000 | 0.001     | 65,536  | 2,048     |
+//! | 1,000,000 | 0.0001    | 131,072 | 4,096     |
+//! | 1,000,000 | 0.00001   | 131,072 | 4,096     |
+//! | 1,000,000 | 0.000001  | 262,144 | 8,192     |
+//!
+//! [parquet-bf-spec]: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
+//! [sbbf-paper]: https://arxiv.org/pdf/2101.01719
+//! [bf-formulae]: http://tfk.mit.edu/pdf/bloom.pdf
 
 use crate::data_type::AsBytes;
 use crate::errors::ParquetError;
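
The sizing formulas in the new doc comment can be checked with a short standalone sketch. The function names `sbbf_num_bits` and `sbbf_num_blocks` below are hypothetical and are not taken from the crate; the code simply evaluates `m = -k * n / ln(1 - f^(1/k))` with `k = 8` and then `b = NP2(m/8) / 32`, assuming the byte count is rounded up before taking the next power of two.

```rust
/// Hypothetical helper (not the crate's API): total bits `m` needed for
/// `ndv` distinct values at false positive probability `fpp`, with k = 8.
fn sbbf_num_bits(ndv: u64, fpp: f64) -> f64 {
    const K: f64 = 8.0;
    // m = -k * n / ln(1 - f^(1/k))
    -K * ndv as f64 / (1.0 - fpp.powf(1.0 / K)).ln()
}

/// Hypothetical helper (not the crate's API): number of 32-byte blocks,
/// b = NP2(m / 8) / 32.
fn sbbf_num_blocks(ndv: u64, fpp: f64) -> u64 {
    // Convert bits to bytes, rounding up (an assumption of this sketch).
    let num_bytes = (sbbf_num_bits(ndv, fpp) / 8.0).ceil() as u64;
    // This sketch applies no minimum or maximum size; for very small NDV
    // the formula alone can yield fewer than one block.
    num_bytes.next_power_of_two() / 32
}
```

For example, `sbbf_num_blocks(10_000, 0.1)` evaluates to 256 blocks, i.e. 8 KB, which agrees with the first row of the table.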