Description
Background
I've been exploring statistics collection in DataFusion, particularly for Parquet, in the infer_stats method of datafusion/datasource-parquet/src/file_format.rs. I noticed that DataFusion collects statistics such as:
- Row counts
- Null counts
- Min/max values
- Total byte size
There doesn't appear to be any logic for computing NDV (Number of Distinct Values). The distinct_count field is explicitly set to Precision::Absent.
Is there existing NDV computation?
- Is there another mechanism in DataFusion for computing NDV that I've missed?
- Are there plans to implement NDV computation in the future?
Impact on Query Optimization
Without NDV statistics, the query optimizer may struggle to choose good join orders, especially for queries with multiple joins. In traditional optimizers, for example, NDV is crucial for estimating join cardinalities and selecting a join order. If NDV computation isn't currently available, how does DataFusion ensure accurate join ordering for TPC-H queries? Are there alternative statistics or hints in use?
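
To make the concern concrete, here is a minimal sketch of the textbook equi-join estimate that relies on NDV: |R JOIN S| is roughly |R| * |S| / max(ndv(R.k), ndv(S.k)). This is not DataFusion's code. ColumnStats and estimate_join_rows are hypothetical names made up for illustration, and an Option stands in for an absent statistic.

```rust
/// Hypothetical per-column statistics, invented for this sketch.
/// These are NOT DataFusion's own types.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    /// Number of rows in the table that owns the join key column.
    row_count: u64,
    /// Number of distinct values in the join key column, if known.
    distinct_count: Option<u64>,
}

/// Textbook equi-join cardinality estimate:
///   |R| * |S| / max(ndv(R.k), ndv(S.k))
/// Returns None when either NDV is unknown, which is the situation
/// described above when distinct_count is absent.
fn estimate_join_rows(left: ColumnStats, right: ColumnStats) -> Option<u64> {
    let ndv = left.distinct_count?.max(right.distinct_count?);
    if ndv == 0 {
        // No distinct keys on either side means no matching rows.
        return Some(0);
    }
    // Widen to u128 so the row-count product cannot overflow.
    let product = (left.row_count as u128) * (right.row_count as u128);
    // Saturate instead of truncating if the estimate exceeds u64.
    Some(u64::try_from(product / ndv as u128).unwrap_or(u64::MAX))
}
```

With 1,000,000 rows joined to 1,000 rows on a key with 1,000 distinct values on each side, the estimate is 1,000,000 rows. With either NDV missing, the function returns None, so an optimizer in that position would need some other heuristic (for example, row counts alone) to order joins.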