Minor: add a sql_planner benchmark to reflect selecting many fields from a huge table #9536


Merged
merged 1 commit into apache:main from haohuaijin:add_benchmarks on Mar 12, 2024

Conversation

haohuaijin
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

I find that the current sql_planner benchmarks don't reflect the query type that selects many fields from a huge table (more than 1000 columns). Maybe this can also reflect the query type described in #7698.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Mar 10, 2024
@@ -194,6 +196,16 @@ fn criterion_benchmark(c: &mut Criterion) {
b.iter(|| physical_plan(&ctx, "SELECT c1 FROM t700"))
});

// Test selecting all columns from a wide (1000-column) table
c.bench_function("logical_select_all_from_1000", |b| {
    b.iter(|| logical_plan(&ctx, "SELECT * FROM t1000"))
});
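
The diff shown here is truncated. Judging from the benchmark output later in the thread (physical_select_all_from_1000), the PR presumably also adds a matching physical-plan benchmark along these lines, mirroring the existing t700 benchmarks (a sketch, not the exact committed code):

c.bench_function("physical_select_all_from_1000", |b| {
    b.iter(|| physical_plan(&ctx, "SELECT * FROM t1000"))
});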
@simonvandel
Contributor
Obviously 1000 > 700, but I'm wondering if we should just reuse the t700 table, which is already intended to be a "huge" table?

So the query would be "SELECT * FROM t700"

@alamb
Contributor

Looks good to me @haohuaijin -- thank you. I think @simonvandel's suggestion to use the 700-column table is worth considering too.

@haohuaijin
Contributor Author

@simonvandel and @alamb, I chose 1000 instead of 700 because I noticed that the plan time does not increase linearly with the number of columns, as the benchmarks below show. I separately ran benchmarks for SELECT * FROM t700, t1000, and t1500: the physical plan time grew from 289 ms for t700 to 1.3 s for t1500, roughly a 5x slowdown even though the number of columns only roughly doubled. That said, I don't have a strong opinion on keeping 1000; 700 also means selecting many columns from a huge table. Do I need to change it?

logical_select_all_from_700
                        time:   [39.054 ms 39.098 ms 39.144 ms]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Benchmarking physical_select_all_from_700: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.9s, or reduce sample count to 10.
physical_select_all_from_700
                        time:   [287.53 ms 289.03 ms 291.39 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

Benchmarking logical_select_all_from_1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, or reduce sample count to 50.
logical_select_all_from_1000
                        time:   [85.450 ms 85.689 ms 85.945 ms]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Benchmarking physical_select_all_from_1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 62.7s, or reduce sample count to 10.
physical_select_all_from_1000
                        time:   [626.48 ms 627.34 ms 628.22 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Benchmarking logical_select_all_from_1500: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 17.9s, or reduce sample count to 20.
logical_select_all_from_1500
                        time:   [179.09 ms 179.34 ms 179.59 ms]

Benchmarking physical_select_all_from_1500: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 132.7s, or reduce sample count to 10.
physical_select_all_from_1500
                        time:   [1.3271 s 1.3289 s 1.3308 s]
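
Since the discussion hinges on how wide the benchmark table is, here is a minimal, self-contained sketch of how such a wide table can be made available to planner benchmarks. The register_wide_table helper, the Int64 column types, and the use of MemTable are assumptions for illustration, not the actual setup in sql_planner.rs; an empty table is enough because only plan construction is measured.

use std::sync::Arc;

use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// Hypothetical helper (not the PR's code): register an empty table with
// `num_columns` Int64 columns so the planner can resolve "SELECT * FROM t1000".
// No data is needed because only planning time is benchmarked.
fn register_wide_table(ctx: &SessionContext, name: &str, num_columns: usize) -> Result<()> {
    let fields: Vec<Field> = (0..num_columns)
        .map(|i| Field::new(format!("c{i}"), DataType::Int64, true))
        .collect();
    let schema = Arc::new(Schema::new(fields));
    // A single empty partition keeps the table trivially cheap to create.
    let table = MemTable::try_new(schema, vec![vec![]])?;
    ctx.register_table(name, Arc::new(table))?;
    Ok(())
}

// Usage during benchmark setup, e.g.:
//   let ctx = SessionContext::new();
//   register_wide_table(&ctx, "t1000", 1000)?;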

@alamb
Contributor

alamb commented Mar 12, 2024

That said, I don't have a strong opinion on keeping 1000; 700 also means selecting many columns from a huge table. Do I need to change it?

Your rationale makes a lot of sense to me. I think 1000 is totally fine -- thanks @haohuaijin and thank you @simonvandel

@alamb alamb merged commit ef9bc90 into apache:main Mar 12, 2024
@haohuaijin haohuaijin deleted the add_benchmarks branch March 12, 2024 09:22
Labels
core Core DataFusion crate
3 participants