Make aggr fuzzer query builder more configurable #15851

Rachelint · 2025-04-25T07:08:23Z

Which issue does this PR close?

Closes #.

Rationale for this change

When adding fuzzy tests for #15591 , I found there is the requirement to costom the generated sql.

And the QueryBuilder is still not configurable to do that.

What changes are included in this PR?

Improve aggr fuzzer query builder more configurable, also some simple refactors.

Are these changes tested?

Not need.

Are there any user-facing changes?

No.

Rachelint · 2025-04-25T07:10:35Z

@waynexia

jayzhan211 · 2025-04-25T07:40:52Z

datafusion/core/tests/fuzz_cases/aggregation_fuzzer/query_builder.rs

+    ///
+    /// Limited to 3 group by columns to ensure coverage for large groups. With
+    /// larger numbers of columns, each group has many fewer values.
+    fn random_group_by(&self) -> Vec<String> {


I use this recently and want to control the minimum group by args, for example I want to test multi group by so I need this to be at least 2.

fn random_group_by(&self, min_group_by:Option<usize>)

It would be great if this adds in too

ok, I will add it soon.

Have added min_group_by_columns in QueryBuilder, it will control the num group by columns together with
max_group_by_columns.

For example:

min_group_by_columns: 2 max_group_by_columns: 4 generated sql will group by 2~4 columns.

jayzhan211 · 2025-04-25T07:48:36Z

datafusion/core/tests/fuzz_cases/aggregation_fuzzer/query_builder.rs

+
+    /// Add sort keys of dataset if any, then the builder will generate queries basing on it
+    /// to cover the sort-optimization cases
+    pub fn with_dataset_sort_keys(mut self, dataset_sort_keys: Vec<Vec<String>>) -> Self {


Another issue I encounter is that it is not easy to control the ordering of the columns by this API.

For example, we can generate query like this, but there is no easy way to determine the ordering of the group by columns.

pub fn generate_multi_group_query(&self) -> String { let group_by_cols = self.random_group_by(Some(2)); format!( "SELECT sum({}) FROM {} GROUP BY {}, {}", group_by_cols[0], self.table_name, group_by_cols[0], group_by_cols[1] ) }

The workaround I did is to limit the number column I provided like this
, I only have single column so it is easy to add it to sort key sets

let columns = get_supported_types_columns(rng.gen()); for column in columns { let sort_key = column.name.to_string(); let data_gen_config = DatasetGeneratorConfig { columns: vec![column], rows_num_range, sort_keys_set: vec![vec![sort_key]], }; .... }

But if I have many columns and different ordering requirement like

select a, b, c from t group by a(ordered), b(ordered), c(non-ordered)

It is not that straightforward anymore

@jayzhan211 I am not sure if I understand the cases rightly?

Actually it will specially generate queries grouping by only the columns in the sort_keys_set.

For example,
If we define a dataset configs like:

columns: [a, b, c, d, e] sort_keys: [a,b]

And the query builder will firstly generate sql with all columns, and in this round the generated sqls are possibly not grouping on the sort keys:

SELECT aggr from t GROUP BY c SELECT aggr from t GROUP BY c,d SELECT aggr from t GROUP BY e ...

But the special round generating sql gouping on the sort keys([a,b] here) exist:

SELECT aggr from t GROUP BY a SELECT aggr from t GROUP BY b SELECT aggr from t GROUP BY a,b

However, the sql picking logic indeed has some problems now as I noticed.

For example, assume dataset t1 sorted by [a, b], and t2 sorted by [c,d].

It is possible that

we randomly picked sql select xxx from xxx group by c,d for t1

but select xxx from xxx group by a,b for t2

And finally it can not cover the sort optimization path..

Rachelint · 2025-04-25T19:11:19Z

@jayzhan211 Maybe we can improve the testing of sorted cases like that?

In data generation part, we randomly generate the sort keys both number and type, rather than only [u8_low] and [utf8_low, u8_low] right now.

And in sql generation part:

We need to ensure that the specific generated sql must be picked by the right sorted dataset.
Like SELECT xxx from xxx GROUP BY a,b must be picked by dataset sorted by a,b.
We should ensure all ordering situations can be covered.
At least 1 full ordering + 1 partial ordering + 1 no ordering .
Like dataset sorted by a,b, can ensure at least following three cases:

SELECT xxx from xxx GROUP BY a,b
SELECT xxx from xxx GROUP BY a,c
SELECT xxx from xxx GROUP BY b

But I want to improve it in an new pr, for not blocking #15591 , is it ok?

jayzhan211 · 2025-04-26T00:47:31Z

But I want to improve it in an new pr, for not blocking #15591 , is it ok?

Sure.

We should ensure all ordering situations can be covered.
At least 1 full ordering + 1 partial ordering + 1 no ordering .
Like dataset sorted by a,b, can ensure at least following three cases:

We need to generate random queries covered all the cases easily AND
We need to generate random specific query easily too.

I think the current implementation issue is that the order is defined once in the dataset level with column name, so all the generated query has the ordering information based on the dataset.

Dataset1: a, b ordered, c not ordered.
Any generated query has the same ordering as dataset defined

SELECT xxx from xxx GROUP BY a,b (a, b ordered)
SELECT xxx from xxx GROUP BY a,c (a ordered, c ordered)
SELECT xxx from xxx GROUP BY b (b ordered)

But I want ordering with specific index for each query, assume I want first column ordered and second column unorderd.

Dataset2: a,b,c,d,e ....

These are random generated query with fixed columns (2).

SELECT xxx from xxx GROUP BY a,b (a ordered, b unorderd)
SELECT xxx from xxx GROUP BY a,c (a ordered, c unorderd)
SELECT xxx from xxx GROUP BY c,d (c ordered, d unorderd)
SELECT xxx from xxx GROUP BY d,e (d ordered, e unorderd)

I expect is to be index-specific ordering not based on the column name

With index-specific, we can expect the generated query is always full order or partial order. In the current implementation, we can't guarantee this.

Rachelint · 2025-04-26T18:01:40Z

But I want to improve it in an new pr, for not blocking #15591 , is it ok?

Sure.

We should ensure all ordering situations can be covered.
At least 1 full ordering + 1 partial ordering + 1 no ordering .
Like dataset sorted by a,b, can ensure at least following three cases:
1. We need to generate random queries covered all the cases easily AND

2. We need to generate random specific query easily too.
I think the current implementation issue is that the order is defined once in the dataset level with column name, so all the generated query has the ordering information based on the dataset.

Dataset1: a, b ordered, c not ordered. Any generated query has the same ordering as dataset defined
SELECT xxx from xxx GROUP BY a,b (a, b ordered)
SELECT xxx from xxx GROUP BY a,c (a ordered, c ordered)
SELECT xxx from xxx GROUP BY b (b ordered)
But I want ordering with specific index for each query, assume I want first column ordered and second column unorderd.

Dataset2: a,b,c,d,e ....

These are random generated query with fixed columns (2).
SELECT xxx from xxx GROUP BY a,b (a ordered, b unorderd)
SELECT xxx from xxx GROUP BY a,c (a ordered, c unorderd)
SELECT xxx from xxx GROUP BY c,d (c ordered, d unorderd)
SELECT xxx from xxx GROUP BY d,e (d ordered, e unorderd)
I expect is to be index-specific ordering not based on the column name

With index-specific, we can expect the generated query is always full order or partial order. In the current implementation, we can't guarantee this.

Thanks @jayzhan211
Make sense, it is indeed better sql define dataset sorting rather than pre-sorted dataset define sql.
I will try to make it in follow up prs.

Rachelint added 3 commits April 25, 2025 14:57

refactor and make QueryBuilder more configurable.

2dcea97

fix tests.

1e48505

fix clippy.

aa5c1c8

github-actions bot added the core Core DataFusion crate label Apr 25, 2025

extract QueryBuilder to a dedicated module.

3d66d5e

Rachelint force-pushed the improve-aggr-fuzzer-query-builder branch from 17294e4 to 3d66d5e Compare April 25, 2025 07:19

jayzhan211 reviewed Apr 25, 2025

View reviewed changes

Rachelint mentioned this pull request Apr 25, 2025

Implement intermeidate result blocked approach sketch #15591

Open

add min_group_by_columns, and fix some bugs.

eb2074a

Rachelint force-pushed the improve-aggr-fuzzer-query-builder branch from f7e8dba to eb2074a Compare April 26, 2025 17:57

jayzhan211 approved these changes Apr 27, 2025

View reviewed changes

Rachelint merged commit 74dc419 into apache:main Apr 27, 2025
27 checks passed

Rachelint mentioned this pull request Apr 27, 2025

Tracking: improve aggreagation fuzzer #15870

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make aggr fuzzer query builder more configurable #15851

Make aggr fuzzer query builder more configurable #15851

Rachelint commented Apr 25, 2025

Rachelint commented Apr 25, 2025

jayzhan211 Apr 25, 2025 •

edited

Loading

Rachelint Apr 25, 2025

Rachelint Apr 26, 2025

jayzhan211 Apr 25, 2025

jayzhan211 Apr 25, 2025 •

edited

Loading

jayzhan211 Apr 25, 2025

This comment was marked as outdated.

Rachelint Apr 25, 2025 •

edited

Loading

Rachelint Apr 25, 2025 •

edited

Loading

Rachelint commented Apr 25, 2025 •

edited

Loading

jayzhan211 commented Apr 26, 2025 •

edited

Loading

Rachelint commented Apr 26, 2025

Make aggr fuzzer query builder more configurable #15851

Make aggr fuzzer query builder more configurable #15851

Conversation

Rachelint commented Apr 25, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Rachelint commented Apr 25, 2025

jayzhan211 Apr 25, 2025 • edited Loading

Choose a reason for hiding this comment

Rachelint Apr 25, 2025

Choose a reason for hiding this comment

Rachelint Apr 26, 2025

Choose a reason for hiding this comment

jayzhan211 Apr 25, 2025

Choose a reason for hiding this comment

jayzhan211 Apr 25, 2025 • edited Loading

Choose a reason for hiding this comment

jayzhan211 Apr 25, 2025

Choose a reason for hiding this comment

This comment was marked as outdated.

Rachelint Apr 25, 2025 • edited Loading

Choose a reason for hiding this comment

Rachelint Apr 25, 2025 • edited Loading

Choose a reason for hiding this comment

Rachelint commented Apr 25, 2025 • edited Loading

jayzhan211 commented Apr 26, 2025 • edited Loading

Rachelint commented Apr 26, 2025

jayzhan211 Apr 25, 2025 •

edited

Loading

jayzhan211 Apr 25, 2025 •

edited

Loading

Rachelint Apr 25, 2025 •

edited

Loading

Rachelint Apr 25, 2025 •

edited

Loading

Rachelint commented Apr 25, 2025 •

edited

Loading

jayzhan211 commented Apr 26, 2025 •

edited

Loading