Question about Statistics Collection (specifically NDV) #15265

Open
@xudong963

Background

I've been exploring statistics collection in DataFusion, particularly for Parquet, in the infer_stats method in datafusion/datasource-parquet/src/file_format.rs. I noticed that DataFusion collects statistics such as:

  • Row counts
  • Null counts
  • Min/max values
  • Total byte size

However, there doesn't appear to be any logic for computing NDV (number of distinct values). The distinct_count field is explicitly set to Precision::Absent.
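
To make the question concrete, here is a rough sketch of what an exact NDV computation over a single column could look like. This is not DataFusion code: the Precision enum below is a local stand-in that only mirrors the shape of DataFusion's type, and exact_distinct_count is a hypothetical helper name. It assumes NULLs are excluded from the count.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Local stand-in with the same variants as DataFusion's `Precision`
/// (an assumption for illustration; not imported from the real crate).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical helper: exact number of distinct non-null values in a column.
/// `None` entries model NULLs and are skipped.
fn exact_distinct_count<T: Eq + Hash>(values: &[Option<T>]) -> Precision<usize> {
    // Collect references to the non-null values into a set to deduplicate them.
    let distinct: HashSet<&T> = values.iter().flatten().collect();
    Precision::Exact(distinct.len())
}
```

One caveat with this approach: a hash set costs memory proportional to the NDV itself, and exact per-file counts are not additive across files (the combined NDV of two files lies anywhere between the larger of the two counts and their sum), so a real implementation would likely need a sketch-based estimate or an Inexact result.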

Is there existing NDV computation?

  1. Is there another mechanism in DataFusion for computing NDV that I've missed?
  2. Are there plans to implement NDV computation in the future?

Impact on Query Optimization

Without NDV statistics, the query optimizer may struggle to choose a good join order, especially for queries with multiple joins. In traditional cost-based optimizers, for example, NDV is crucial for estimating join cardinalities and selecting a join order. If NDV computation isn't currently available, how does DataFusion ensure accurate join ordering for TPC-H queries? Are there alternative statistics or hints being used?
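
For reference, this is the kind of estimate I have in mind: the textbook formula for an inner equi-join, |R ⋈ S| ≈ |R| · |S| / max(NDV(R.a), NDV(S.b)). The sketch below is only an illustration of that formula, not how DataFusion estimates joins; estimate_join_rows is a hypothetical name, and a missing NDV (modeled as None) makes the estimate unavailable.

```rust
/// Hypothetical helper: textbook inner equi-join cardinality estimate,
/// rows(R) * rows(S) / max(NDV(R.a), NDV(S.b)).
/// Returns `None` when either NDV is unknown.
fn estimate_join_rows(
    left_rows: u64,
    right_rows: u64,
    left_ndv: Option<u64>,
    right_ndv: Option<u64>,
) -> Option<u64> {
    // Without both NDVs the formula cannot be applied.
    let max_ndv = left_ndv?.max(right_ndv?);
    if max_ndv == 0 {
        // No non-null join keys on one side: no rows can match.
        return Some(0);
    }
    // Widen to u128 so the row-count product cannot overflow.
    let product = (left_rows as u128) * (right_rows as u128);
    // Saturate at u64::MAX rather than truncating.
    Some(u64::try_from(product / max_ndv as u128).unwrap_or(u64::MAX))
}
```

With made-up numbers (1,500,000 rows joined to 150,000 rows, NDVs of 100,000 and 150,000 on the join keys), the estimate is 1,500,000 rows. With either NDV absent, there is nothing to divide by, which is the situation I'm asking about.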
