Description
Is your feature request related to a problem or challenge?
Follow-on to #7052.
There is an interesting database benchmark called the "H2O.ai database-like benchmark" that DuckDB seems to have revived (perhaps because the original went dormant with very old / slow DuckDB results). More background here: https://duckdb.org/2023/04/14/h2oai.html#results
@Dandandan added a new solution for datafusion here: duckdblabs/db-benchmark#18
However, there is no easy way to run the h2o benchmark within the datafusion repo. There is an old version of some of these benchmarks in the code: https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/h2o.rs
Describe the solution you'd like
I would like someone to make it easy to run the h2o.ai benchmark in the datafusion repo.
Ideally this would look like:

```shell
# generate data
./benchmarks/bench.sh data h2o.ai
# run
./benchmarks/bench.sh run h2o.ai
```

I would expect to be able to run individual queries like this:

```shell
cargo run --bin dfbench -- h2o.ai --query=3
```
Some steps might be:
- port the existing benchmark script to dfbench, following the model in "Add parquet-filter and sort benchmarks to dfbench" #7120
- update `bench.sh`, following the model of existing benchmarks
- update the documentation
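If it helps whoever picks this up, here is a very rough sketch of what the ported dfbench subcommand could look like. Everything in it is an assumption: the `RunOpt` name, the clap-based options, registering the data as a single CSV table, and the placeholder query. The real code should follow whatever structure the existing dfbench benchmarks use and port the SQL from `h2o.rs`.

```rust
// Rough sketch only: names and option layout are hypothetical, and the CLI
// framework should match whatever dfbench already uses (clap or structopt).
use clap::Parser;
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

/// Run the h2o.ai benchmark queries (hypothetical subcommand).
#[derive(Debug, Parser)]
pub struct RunOpt {
    /// Query number to run (runs all queries if omitted)
    #[arg(long)]
    query: Option<usize>,

    /// Path to the generated h2o data file (a single CSV in this sketch)
    #[arg(long)]
    path: String,
}

impl RunOpt {
    pub async fn run(self) -> Result<()> {
        let ctx = SessionContext::new();
        ctx.register_csv("x", &self.path, CsvReadOptions::new())
            .await?;

        // Placeholder: the real query list would be ported from
        // benchmarks/src/bin/h2o.rs.
        let queries = vec!["SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1"];

        for (i, sql) in queries.iter().enumerate() {
            // Skip queries other than the one requested, if any.
            if let Some(q) = self.query {
                if q != i + 1 {
                    continue;
                }
            }
            let start = std::time::Instant::now();
            let results = ctx.sql(sql).await?.collect().await?;
            println!(
                "Query {} returned {} batches in {:?}",
                i + 1,
                results.len(),
                start.elapsed()
            );
        }
        Ok(())
    }
}
```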
Describe alternatives you've considered
We could also simply remove the h2o.ai benchmark script, as it is not clear how important it will be long term.
Additional context
I think this is a good first issue, as the task is clear and there are existing patterns in `bench.sh`, `dfbench`, and the existing `h2o.rs` script to follow.