
Add H2O.ai Database-like Ops benchmark to dfbench #7209

Closed
@alamb

Description


Is your feature request related to a problem or challenge?

Follow-on to #7052
There is an interesting database benchmark called the "H2O.ai database-like benchmark" that DuckDB seems to have revived (perhaps because the original went dormant with very old / slow DuckDB results). More background here: https://duckdb.org/2023/04/14/h2oai.html#results

@Dandandan added a new solution for datafusion here: duckdblabs/db-benchmark#18

However, there is no easy way to run the h2o benchmark within the datafusion repo. There is an old version of some of these benchmarks in the code: https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/h2o.rs

Describe the solution you'd like

I would like someone to make it easy to run the h2o.ai benchmark in the datafusion repo.

Ideally this would look like

# generate data
./benchmarks/bench.sh data h2o.ai
# run
./benchmarks/bench.sh run h2o.ai

I would expect to be able to run the individual queries like this

cargo run  --bin dfbench -- h2o.ai --query=3
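
The real dfbench binary presumably defines its subcommands with a derive-style argument parser, but the `--query` handling above can be sketched self-containedly with just the standard library. `parse_query_flag` is an illustrative name, not the actual dfbench API:

```rust
use std::env;

/// Hypothetical sketch: extract `N` from a `--query=N` flag, as a
/// `dfbench h2o.ai --query=3` style invocation would pass it.
fn parse_query_flag(args: &[String]) -> Option<usize> {
    args.iter().find_map(|a| {
        a.strip_prefix("--query=")
            .and_then(|n| n.parse::<usize>().ok())
    })
}

fn main() {
    let args: Vec<String> = env::args().collect();
    match parse_query_flag(&args) {
        // Run only the requested query, matching the other dfbench benchmarks.
        Some(q) => println!("running h2o query {q}"),
        // No flag: run the whole query set.
        None => println!("running all h2o queries"),
    }
}
```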

Some steps might be

  1. Port the existing benchmark script to dfbench, following the model in Add parquet-filter and sort benchmarks to dfbench #7120
  2. Update bench.sh, following the model of existing benchmarks
  3. Update the documentation

Describe alternatives you've considered

We could also simply remove the h2o.ai benchmark script, as it is not clear how important it will be long term.

Additional context

I think this is a good first issue, as the task is clear and there are existing patterns to follow in bench.sh and dfbench.
