
Commit 58d15c7

Add bench.sh script to automate benchmarking DataFusion against itself (#6131)
* Add bench script to benchmark datafusion against itself
* improve docs
1 parent fd785b2 commit 58d15c7

4 files changed

+411
-59
lines changed

benchmarks/.gitignore

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1,2 @@
1-
data
1+
data
2+
results

benchmarks/README.md

Lines changed: 126 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -19,29 +19,139 @@
1919

2020
# DataFusion Benchmarks
2121

22-
This crate contains benchmarks based on popular public data sets and open source benchmark suites, making it easy to
23-
run real-world benchmarks to help with performance and scalability testing and for comparing performance with other Arrow
24-
implementations as well as other query engines.
22+
This crate contains benchmarks based on popular public data sets and
23+
open source benchmark suites, making it easy to run more realistic
24+
benchmarks to help with performance and scalability testing of DataFusion.
2525

26-
## Benchmark derived from TPC-H
26+
# Benchmarks Against Other Engines
2727

28-
These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers:
29-
https://github.com/databricks/tpch-dbgen.git, based on [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) version of TPC-H.
28+
DataFusion is included in the benchmark setups for several popular
29+
benchmarks that compare performance with other engines. For example:
3030

31-
## Generating Test Data
31+
* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
32+
* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](db-benchmark) directory
3233

33-
TPC-H data can be generated using the `tpch-gen.sh` script, which creates a Docker image containing the TPC-DS data
34-
generator.
34+
[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
35+
[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
3536

36-
```bash
37-
# scale_factor: scale of the database population. scale 1.0 represents ~1 GB of data
38-
./tpch-gen.sh <scale_factor>
37+
# Running the benchmarks
38+
39+
## Running Benchmarks
40+
41+
The easiest way to run benchmarks from DataFusion source checkouts is
42+
to use the [bench.sh](bench.sh) script. Usage instructions can be
43+
found with:
44+
45+
```shell
46+
# show usage
47+
./bench.sh
48+
```
49+
50+
## Generating Data
51+
52+
You can create data for all these benchmarks using the [bench.sh](bench.sh) script:
53+
54+
```shell
55+
./bench.sh data
56+
```
57+
58+
Data is generated in the `data` subdirectory and will not be checked
59+
in because this directory has been added to the `.gitignore` file.
60+
61+
62+
## Example to compare performance on main to a branch
63+
64+
```shell
65+
git checkout main
66+
67+
# Create the data
68+
./benchmarks/bench.sh data
69+
70+
# Gather baseline data for tpch benchmark
71+
./benchmarks/bench.sh run tpch
72+
73+
# Switch to the branch (here named mybranch) and gather data
74+
git checkout mybranch
75+
./benchmarks/bench.sh run tpch
76+
77+
# Compare results in the two branches:
78+
./benchmarks/bench.sh compare main mybranch
3979
```
4080

41-
Data will be generated into the `data` subdirectory and will not be checked in because this directory has been added
42-
to the `.gitignore` file.
81+
This produces results like:
82+
83+
```shell
84+
Comparing main and mybranch
85+
--------------------
86+
Benchmark tpch.json
87+
--------------------
88+
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
89+
┃ Query ┃ main ┃ mybranch ┃ Change ┃
90+
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
91+
│ QQuery 1 │ 2520.52ms │ 2795.09ms │ 1.11x slower │
92+
│ QQuery 2 │ 222.37ms │ 216.01ms │ no change │
93+
│ QQuery 3 │ 248.41ms │ 239.07ms │ no change │
94+
│ QQuery 4 │ 144.01ms │ 129.28ms │ +1.11x faster │
95+
│ QQuery 5 │ 339.54ms │ 327.53ms │ no change │
96+
│ QQuery 6 │ 147.59ms │ 138.73ms │ +1.06x faster │
97+
│ QQuery 7 │ 605.72ms │ 631.23ms │ no change │
98+
│ QQuery 8 │ 326.35ms │ 372.12ms │ 1.14x slower │
99+
│ QQuery 9 │ 579.02ms │ 634.73ms │ 1.10x slower │
100+
│ QQuery 10 │ 403.38ms │ 420.39ms │ no change │
101+
│ QQuery 11 │ 201.94ms │ 212.12ms │ 1.05x slower │
102+
│ QQuery 12 │ 235.94ms │ 254.58ms │ 1.08x slower │
103+
│ QQuery 13 │ 738.40ms │ 789.67ms │ 1.07x slower │
104+
│ QQuery 14 │ 198.73ms │ 206.96ms │ no change │
105+
│ QQuery 15 │ 183.32ms │ 179.53ms │ no change │
106+
│ QQuery 16 │ 168.57ms │ 186.43ms │ 1.11x slower │
107+
│ QQuery 17 │ 2032.57ms │ 2108.12ms │ no change │
108+
│ QQuery 18 │ 1912.80ms │ 2134.82ms │ 1.12x slower │
109+
│ QQuery 19 │ 391.64ms │ 368.53ms │ +1.06x faster │
110+
│ QQuery 20 │ 648.22ms │ 691.41ms │ 1.07x slower │
111+
│ QQuery 21 │ 866.25ms │ 1020.37ms │ 1.18x slower │
112+
│ QQuery 22 │ 115.94ms │ 117.27ms │ no change │
113+
└──────────────┴──────────────┴──────────────┴───────────────┘
114+
--------------------
115+
Benchmark tpch_mem.json
116+
--------------------
117+
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
118+
┃ Query ┃ main ┃ mybranch ┃ Change ┃
119+
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
120+
│ QQuery 1 │ 2182.44ms │ 2390.39ms │ 1.10x slower │
121+
│ QQuery 2 │ 181.16ms │ 153.94ms │ +1.18x faster │
122+
│ QQuery 3 │ 98.89ms │ 95.51ms │ no change │
123+
│ QQuery 4 │ 61.43ms │ 66.15ms │ 1.08x slower │
124+
│ QQuery 5 │ 260.20ms │ 283.65ms │ 1.09x slower │
125+
│ QQuery 6 │ 24.24ms │ 23.39ms │ no change │
126+
│ QQuery 7 │ 545.87ms │ 653.34ms │ 1.20x slower │
127+
│ QQuery 8 │ 147.48ms │ 136.00ms │ +1.08x faster │
128+
│ QQuery 9 │ 371.53ms │ 363.61ms │ no change │
129+
│ QQuery 10 │ 197.91ms │ 190.37ms │ no change │
130+
│ QQuery 11 │ 197.91ms │ 183.70ms │ +1.08x faster │
131+
│ QQuery 12 │ 100.32ms │ 103.08ms │ no change │
132+
│ QQuery 13 │ 428.02ms │ 440.26ms │ no change │
133+
│ QQuery 14 │ 38.50ms │ 27.11ms │ +1.42x faster │
134+
│ QQuery 15 │ 101.15ms │ 63.25ms │ +1.60x faster │
135+
│ QQuery 16 │ 171.15ms │ 142.44ms │ +1.20x faster │
136+
│ QQuery 17 │ 1885.05ms │ 1953.58ms │ no change │
137+
│ QQuery 18 │ 1549.92ms │ 1914.06ms │ 1.23x slower │
138+
│ QQuery 19 │ 106.53ms │ 104.28ms │ no change │
139+
│ QQuery 20 │ 532.11ms │ 610.62ms │ 1.15x slower │
140+
│ QQuery 21 │ 723.39ms │ 823.34ms │ 1.14x slower │
141+
│ QQuery 22 │ 91.84ms │ 89.89ms │ no change │
142+
└──────────────┴──────────────┴──────────────┴───────────────┘
143+
```
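The `Change` column in the tables above appears to follow a simple rule: take the ratio of the two timings, report "no change" inside a noise band, and otherwise report the ratio as a speedup or slowdown. The sketch below reproduces the sample rows under that reading. The `classify_change` helper and the 5% band are assumptions inferred from the sample output, not code taken from `bench.sh`.

```shell
# Hypothetical helper, not part of bench.sh.
# Usage: classify_change <baseline_ms> <comparison_ms>
# Assumes a non-zero baseline and a 5% noise band (inferred from the sample rows).
classify_change() {
  awk -v base="$1" -v cand="$2" 'BEGIN {
    ratio = cand / base              # > 1 means the comparison run took longer
    if (ratio >= 0.95 && ratio <= 1.05) {
      print "no change"
    } else if (ratio < 1) {
      printf "+%.2fx faster\n", 1 / ratio
    } else {
      printf "%.2fx slower\n", ratio
    }
  }'
}

classify_change 144.01 129.28     # tpch Query 4: prints "+1.11x faster"
classify_change 2520.52 2795.09   # tpch Query 1: prints "1.11x slower"
classify_change 222.37 216.01     # tpch Query 2: prints "no change"
```

Under this reading, changes of less than about 5% in either direction are treated as noise, which is why rows such as Query 7 (605.72ms vs 631.23ms) show "no change".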
144+
145+
146+
# Benchmark Descriptions:
147+
148+
## `tpch` Benchmark derived from TPC-H
149+
150+
These benchmarks are derived from the [TPC-H][1] benchmark. The following repo is the source of tpch-gen and the answers:
151+
https://github.com/databricks/tpch-dbgen.git, based on [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) version of TPC-H.
152+
43153

44-
## Running the DataFusion Benchmarks
154+
### Running the DataFusion Benchmarks Manually
45155

46156
The benchmark can then be run (assuming the data created from `dbgen` is in `./data`) with a command such as:
47157

@@ -126,7 +236,7 @@ This will produce output like
126236
└──────────────┴──────────────┴──────────────┴───────────────┘
127237
```
128238

129-
## Expected output
239+
### Expected output
130240

131241
The result of query 1 should produce the following output when executed against the SF=1 dataset.
132242
