Skip to content

Make aggr fuzzer query builder more configurable #15851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 27, 2025

Conversation

Rachelint
Copy link
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

When adding fuzzy tests for #15591 , I found there is the requirement to costom the generated sql.

And the QueryBuilder is still not configurable to do that.

What changes are included in this PR?

Improve aggr fuzzer query builder more configurable, also some simple refactors.

Are these changes tested?

Not need.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the core Core DataFusion crate label Apr 25, 2025
@Rachelint
Copy link
Contributor Author

@waynexia

@Rachelint Rachelint force-pushed the improve-aggr-fuzzer-query-builder branch from 17294e4 to 3d66d5e Compare April 25, 2025 07:19
///
/// Limited to 3 group by columns to ensure coverage for large groups. With
/// larger numbers of columns, each group has many fewer values.
fn random_group_by(&self) -> Vec<String> {
Copy link
Contributor

@jayzhan211 jayzhan211 Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use this recently and want to control the minimum group by args, for example I want to test multi group by so I need this to be at least 2.

fn random_group_by(&self, min_group_by:Option<usize>)

It would be great if this adds in too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will add it soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added min_group_by_columns in QueryBuilder, it will control the num group by columns together with
max_group_by_columns.

For example:

min_group_by_columns: 2
max_group_by_columns: 4

generated sql will group by 2~4 columns.


/// Add sort keys of dataset if any, then the builder will generate queries basing on it
/// to cover the sort-optimization cases
pub fn with_dataset_sort_keys(mut self, dataset_sort_keys: Vec<Vec<String>>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue I encounter is that it is not easy to control the ordering of the columns by this API.

For example, we can generate query like this, but there is no easy way to determine the ordering of the group by columns.

pub fn generate_multi_group_query(&self) -> String {
        let group_by_cols = self.random_group_by(Some(2));
        format!(
            "SELECT sum({}) FROM {} GROUP BY {}, {}",
            group_by_cols[0], self.table_name, group_by_cols[0], group_by_cols[1]
        )
    }

Copy link
Contributor

@jayzhan211 jayzhan211 Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workaround I did is to limit the number column I provided like this
, I only have single column so it is easy to add it to sort key sets

    let columns = get_supported_types_columns(rng.gen());

    for column in columns {
        let sort_key = column.name.to_string();
        let data_gen_config = DatasetGeneratorConfig {
            columns: vec![column],
            rows_num_range,
            sort_keys_set: vec![vec![sort_key]],
        };
        ....
     }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if I have many columns and different ordering requirement like

select a, b, c from t group by a(ordered), b(ordered), c(non-ordered)

It is not that straightforward anymore

This comment was marked as outdated.

Copy link
Contributor Author

@Rachelint Rachelint Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayzhan211 I am not sure if I understand the cases rightly?

Actually it will specially generate queries grouping by only the columns in the sort_keys_set.

For example,
If we define a dataset configs like:

columns: [a, b, c, d, e]
sort_keys: [a,b]

And the query builder will firstly generate sql with all columns, and in this round the generated sqls are possibly not grouping on the sort keys:

SELECT aggr from t GROUP BY c
SELECT aggr from t GROUP BY c,d
SELECT aggr from t GROUP BY e
...

But the special round generating sql gouping on the sort keys([a,b] here) exist:

SELECT aggr from t GROUP BY a
SELECT aggr from t GROUP BY b
SELECT aggr from t GROUP BY a,b

Copy link
Contributor Author

@Rachelint Rachelint Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the sql picking logic indeed has some problems now as I noticed.

For example, assume dataset t1 sorted by [a, b], and t2 sorted by [c,d].

It is possible that

  • we randomly picked sql select xxx from xxx group by c,d for t1
  • but select xxx from xxx group by a,b for t2

And finally it can not cover the sort optimization path..

@Rachelint
Copy link
Contributor Author

Rachelint commented Apr 25, 2025

@jayzhan211 Maybe we can improve the testing of sorted cases like that?

In data generation part, we randomly generate the sort keys both number and type, rather than only [u8_low] and [utf8_low, u8_low] right now.

And in sql generation part:

  • We need to ensure that the specific generated sql must be picked by the right sorted dataset.
    Like SELECT xxx from xxx GROUP BY a,b must be picked by dataset sorted by a,b.

  • We should ensure all ordering situations can be covered.
    At least 1 full ordering + 1 partial ordering + 1 no ordering .
    Like dataset sorted by a,b, can ensure at least following three cases:

SELECT xxx from xxx GROUP BY a,b
SELECT xxx from xxx GROUP BY a,c
SELECT xxx from xxx GROUP BY b

But I want to improve it in an new pr, for not blocking #15591 , is it ok?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Apr 26, 2025

But I want to improve it in an new pr, for not blocking #15591 , is it ok?

Sure.

We should ensure all ordering situations can be covered.
At least 1 full ordering + 1 partial ordering + 1 no ordering .
Like dataset sorted by a,b, can ensure at least following three cases:

  1. We need to generate random queries covered all the cases easily AND
  2. We need to generate random specific query easily too.

I think the current implementation issue is that the order is defined once in the dataset level with column name, so all the generated query has the ordering information based on the dataset.

Dataset1: a, b ordered, c not ordered.
Any generated query has the same ordering as dataset defined

SELECT xxx from xxx GROUP BY a,b (a, b ordered)
SELECT xxx from xxx GROUP BY a,c (a ordered, c ordered)
SELECT xxx from xxx GROUP BY b (b ordered)

But I want ordering with specific index for each query, assume I want first column ordered and second column unorderd.

Dataset2: a,b,c,d,e ....

These are random generated query with fixed columns (2).

SELECT xxx from xxx GROUP BY a,b (a ordered, b unorderd)
SELECT xxx from xxx GROUP BY a,c (a ordered, c unorderd)
SELECT xxx from xxx GROUP BY c,d (c ordered, d unorderd)
SELECT xxx from xxx GROUP BY d,e (d ordered, e unorderd)

I expect is to be index-specific ordering not based on the column name

With index-specific, we can expect the generated query is always full order or partial order. In the current implementation, we can't guarantee this.

@Rachelint Rachelint force-pushed the improve-aggr-fuzzer-query-builder branch from f7e8dba to eb2074a Compare April 26, 2025 17:57
@Rachelint
Copy link
Contributor Author

But I want to improve it in an new pr, for not blocking #15591 , is it ok?

Sure.

We should ensure all ordering situations can be covered.
At least 1 full ordering + 1 partial ordering + 1 no ordering .
Like dataset sorted by a,b, can ensure at least following three cases:

1. We need to generate random queries covered all the cases easily AND

2. We need to generate random specific query easily too.

I think the current implementation issue is that the order is defined once in the dataset level with column name, so all the generated query has the ordering information based on the dataset.

Dataset1: a, b ordered, c not ordered. Any generated query has the same ordering as dataset defined

SELECT xxx from xxx GROUP BY a,b (a, b ordered)
SELECT xxx from xxx GROUP BY a,c (a ordered, c ordered)
SELECT xxx from xxx GROUP BY b (b ordered)

But I want ordering with specific index for each query, assume I want first column ordered and second column unorderd.

Dataset2: a,b,c,d,e ....

These are random generated query with fixed columns (2).

SELECT xxx from xxx GROUP BY a,b (a ordered, b unorderd)
SELECT xxx from xxx GROUP BY a,c (a ordered, c unorderd)
SELECT xxx from xxx GROUP BY c,d (c ordered, d unorderd)
SELECT xxx from xxx GROUP BY d,e (d ordered, e unorderd)

I expect is to be index-specific ordering not based on the column name

With index-specific, we can expect the generated query is always full order or partial order. In the current implementation, we can't guarantee this.

Thanks @jayzhan211
Make sense, it is indeed better sql define dataset sorting rather than pre-sorted dataset define sql.
I will try to make it in follow up prs.

@Rachelint Rachelint merged commit 74dc419 into apache:main Apr 27, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants