-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Make aggr fuzzer query builder more configurable #15851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make aggr fuzzer query builder more configurable #15851
Conversation
17294e4
to
3d66d5e
Compare
/// | ||
/// Limited to 3 group by columns to ensure coverage for large groups. With | ||
/// larger numbers of columns, each group has many fewer values. | ||
fn random_group_by(&self) -> Vec<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use this recently and want to control the minimum group by args, for example I want to test multi group by so I need this to be at least 2.
fn random_group_by(&self, min_group_by:Option<usize>)
It would be great if this adds in too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I will add it soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added min_group_by_columns
in QueryBuilder
, it will control the num group by columns
together with
max_group_by_columns
.
For example:
min_group_by_columns: 2
max_group_by_columns: 4
generated sql will group by 2~4 columns.
|
||
/// Add sort keys of dataset if any, then the builder will generate queries basing on it | ||
/// to cover the sort-optimization cases | ||
pub fn with_dataset_sort_keys(mut self, dataset_sort_keys: Vec<Vec<String>>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue I encounter is that it is not easy to control the ordering of the columns by this API.
For example, we can generate query like this, but there is no easy way to determine the ordering of the group by columns.
pub fn generate_multi_group_query(&self) -> String {
let group_by_cols = self.random_group_by(Some(2));
format!(
"SELECT sum({}) FROM {} GROUP BY {}, {}",
group_by_cols[0], self.table_name, group_by_cols[0], group_by_cols[1]
)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workaround I did is to limit the number column I provided like this
, I only have single column so it is easy to add it to sort key sets
let columns = get_supported_types_columns(rng.gen());
for column in columns {
let sort_key = column.name.to_string();
let data_gen_config = DatasetGeneratorConfig {
columns: vec![column],
rows_num_range,
sort_keys_set: vec![vec![sort_key]],
};
....
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if I have many columns and different ordering requirement like
select a, b, c from t group by a(ordered), b(ordered), c(non-ordered)
It is not that straightforward anymore
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 I am not sure if I understand the cases rightly?
Actually it will specially generate queries grouping by only the columns in the sort_keys_set
.
For example,
If we define a dataset configs like:
columns: [a, b, c, d, e]
sort_keys: [a,b]
And the query builder will firstly generate sql with all columns
, and in this round the generated sqls are possibly not grouping on the sort keys:
SELECT aggr from t GROUP BY c
SELECT aggr from t GROUP BY c,d
SELECT aggr from t GROUP BY e
...
But the special round generating sql gouping on the sort keys([a,b] here) exist:
SELECT aggr from t GROUP BY a
SELECT aggr from t GROUP BY b
SELECT aggr from t GROUP BY a,b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the sql picking logic indeed has some problems now as I noticed.
For example, assume dataset t1 sorted by [a, b], and t2 sorted by [c,d].
It is possible that
- we randomly picked sql
select xxx from xxx group by c,d
fort1
- but
select xxx from xxx group by a,b
fort2
And finally it can not cover the sort optimization path..
@jayzhan211 Maybe we can improve the testing of sorted cases like that? In And in
But I want to improve it in an new pr, for not blocking #15591 , is it ok? |
Sure.
I think the current implementation issue is that the order is defined once in the dataset level with column name, so all the generated query has the ordering information based on the dataset. Dataset1: a, b ordered, c not ordered.
But I want ordering with specific index for each query, assume I want first column ordered and second column unorderd. Dataset2: a,b,c,d,e .... These are random generated query with fixed columns (2).
I expect is to be index-specific ordering not based on the column name With |
f7e8dba
to
eb2074a
Compare
Thanks @jayzhan211 |
Which issue does this PR close?
Rationale for this change
When adding fuzzy tests for #15591 , I found there is the requirement to costom the generated sql.
And the
QueryBuilder
is still not configurable to do that.What changes are included in this PR?
Improve aggr fuzzer query builder more configurable, also some simple refactors.
Are these changes tested?
Not need.
Are there any user-facing changes?
No.