|
19 | 19 |
|
20 | 20 | # DataFusion Benchmarks
|
21 | 21 |
|
22 |
| -This crate contains benchmarks based on popular public data sets and open source benchmark suites, making it easy to |
23 |
| -run real-world benchmarks to help with performance and scalability testing and for comparing performance with other Arrow |
24 |
| -implementations as well as other query engines. |
| 22 | +This crate contains benchmarks based on popular public data sets and |
| 23 | +open source benchmark suites, making it easy to run more realistic |
| 24 | +benchmarks to help with performance and scalability testing of DataFusion. |
25 | 25 |
|
26 |
| -## Benchmark derived from TPC-H |
| 26 | +# Benchmarks Against Other Engines |
27 | 27 |
|
28 |
| -These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers: |
29 |
| -https://github.com/databricks/tpch-dbgen.git, based on [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) version of TPC-H. |
| 28 | +DataFusion is included in the benchmark setups for several popular |
| 29 | +benchmarks that compare performance with other engines. For example: |
30 | 30 |
|
31 |
| -## Generating Test Data |
| 31 | +* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion) |
| 32 | +* [H2o.ai `db-benchmark`] scripts are in the [db-benchmark](db-benchmark) directory |
32 | 33 |
|
33 |
| -TPC-H data can be generated using the `tpch-gen.sh` script, which creates a Docker image containing the TPC-DS data |
34 |
| -generator. |
| 34 | +[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main |
| 35 | +[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark |
35 | 36 |
|
36 |
| -```bash |
37 |
| -# scale_factor: scale of the database population. scale 1.0 represents ~1 GB of data |
38 |
| -./tpch-gen.sh <scale_factor> |
| 37 | +# Running the benchmarks |
| 38 | + |
| 39 | +## Running Benchmarks |
| 40 | + |
| 41 | +The easiest way to run benchmarks from DataFusion source checkouts is |
| 42 | +to use the [bench.sh](bench.sh) script. Usage instructions can be |
| 43 | +found with: |
| 44 | + |
| 45 | +```shell |
| 46 | +# show usage |
| 47 | +./bench.sh |
| 48 | +``` |
| 49 | + |
| 50 | +## Generating Data |
| 51 | + |
| 52 | +You can create data for all these benchmarks using the [bench.sh](bench.sh) script: |
| 53 | + |
| 54 | +```shell |
| 55 | +./bench.sh data |
| 56 | +``` |
| 57 | + |
| 58 | +Data is generated in the `data` subdirectory and will not be checked |
| 59 | +in because this directory has been added to the `.gitignore` file. |
| 60 | + |
| 61 | + |
| 62 | +## Example to compare performance on main to a branch |
| 63 | + |
| 64 | +```shell |
| 65 | +git checkout main |
| 66 | + |
| 67 | +# Create the data |
| 68 | +./benchmarks/bench.sh data |
| 69 | + |
| 70 | +# Gather baseline data for tpch benchmark |
| 71 | +./benchmarks/bench.sh run tpch |
| 72 | + |
| 73 | +# Switch to the branch (named mybranch here) and gather data |
| 74 | +git checkout mybranch |
| 75 | +./benchmarks/bench.sh run tpch |
| 76 | + |
| 77 | +# Compare results in the two branches: |
| 78 | +./benchmarks/bench.sh compare main mybranch |
39 | 79 | ```
|
40 | 80 |
|
41 |
| -Data will be generated into the `data` subdirectory and will not be checked in because this directory has been added |
42 |
| -to the `.gitignore` file. |
| 81 | +This produces results like: |
| 82 | + |
| 83 | +```shell |
| 84 | +Comparing main and mybranch |
| 85 | +-------------------- |
| 86 | +Benchmark tpch.json |
| 87 | +-------------------- |
| 88 | +┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ |
| 89 | +┃ Query ┃ main ┃ mybranch ┃ Change ┃ |
| 90 | +┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ |
| 91 | +│ QQuery 1 │ 2520.52ms │ 2795.09ms │ 1.11x slower │ |
| 92 | +│ QQuery 2 │ 222.37ms │ 216.01ms │ no change │ |
| 93 | +│ QQuery 3 │ 248.41ms │ 239.07ms │ no change │ |
| 94 | +│ QQuery 4 │ 144.01ms │ 129.28ms │ +1.11x faster │ |
| 95 | +│ QQuery 5 │ 339.54ms │ 327.53ms │ no change │ |
| 96 | +│ QQuery 6 │ 147.59ms │ 138.73ms │ +1.06x faster │ |
| 97 | +│ QQuery 7 │ 605.72ms │ 631.23ms │ no change │ |
| 98 | +│ QQuery 8 │ 326.35ms │ 372.12ms │ 1.14x slower │ |
| 99 | +│ QQuery 9 │ 579.02ms │ 634.73ms │ 1.10x slower │ |
| 100 | +│ QQuery 10 │ 403.38ms │ 420.39ms │ no change │ |
| 101 | +│ QQuery 11 │ 201.94ms │ 212.12ms │ 1.05x slower │ |
| 102 | +│ QQuery 12 │ 235.94ms │ 254.58ms │ 1.08x slower │ |
| 103 | +│ QQuery 13 │ 738.40ms │ 789.67ms │ 1.07x slower │ |
| 104 | +│ QQuery 14 │ 198.73ms │ 206.96ms │ no change │ |
| 105 | +│ QQuery 15 │ 183.32ms │ 179.53ms │ no change │ |
| 106 | +│ QQuery 16 │ 168.57ms │ 186.43ms │ 1.11x slower │ |
| 107 | +│ QQuery 17 │ 2032.57ms │ 2108.12ms │ no change │ |
| 108 | +│ QQuery 18 │ 1912.80ms │ 2134.82ms │ 1.12x slower │ |
| 109 | +│ QQuery 19 │ 391.64ms │ 368.53ms │ +1.06x faster │ |
| 110 | +│ QQuery 20 │ 648.22ms │ 691.41ms │ 1.07x slower │ |
| 111 | +│ QQuery 21 │ 866.25ms │ 1020.37ms │ 1.18x slower │ |
| 112 | +│ QQuery 22 │ 115.94ms │ 117.27ms │ no change │ |
| 113 | +└──────────────┴──────────────┴──────────────┴───────────────┘ |
| 114 | +-------------------- |
| 115 | +Benchmark tpch_mem.json |
| 116 | +-------------------- |
| 117 | +┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ |
| 118 | +┃ Query ┃ main ┃ mybranch ┃ Change ┃ |
| 119 | +┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ |
| 120 | +│ QQuery 1 │ 2182.44ms │ 2390.39ms │ 1.10x slower │ |
| 121 | +│ QQuery 2 │ 181.16ms │ 153.94ms │ +1.18x faster │ |
| 122 | +│ QQuery 3 │ 98.89ms │ 95.51ms │ no change │ |
| 123 | +│ QQuery 4 │ 61.43ms │ 66.15ms │ 1.08x slower │ |
| 124 | +│ QQuery 5 │ 260.20ms │ 283.65ms │ 1.09x slower │ |
| 125 | +│ QQuery 6 │ 24.24ms │ 23.39ms │ no change │ |
| 126 | +│ QQuery 7 │ 545.87ms │ 653.34ms │ 1.20x slower │ |
| 127 | +│ QQuery 8 │ 147.48ms │ 136.00ms │ +1.08x faster │ |
| 128 | +│ QQuery 9 │ 371.53ms │ 363.61ms │ no change │ |
| 129 | +│ QQuery 10 │ 197.91ms │ 190.37ms │ no change │ |
| 130 | +│ QQuery 11 │ 197.91ms │ 183.70ms │ +1.08x faster │ |
| 131 | +│ QQuery 12 │ 100.32ms │ 103.08ms │ no change │ |
| 132 | +│ QQuery 13 │ 428.02ms │ 440.26ms │ no change │ |
| 133 | +│ QQuery 14 │ 38.50ms │ 27.11ms │ +1.42x faster │ |
| 134 | +│ QQuery 15 │ 101.15ms │ 63.25ms │ +1.60x faster │ |
| 135 | +│ QQuery 16 │ 171.15ms │ 142.44ms │ +1.20x faster │ |
| 136 | +│ QQuery 17 │ 1885.05ms │ 1953.58ms │ no change │ |
| 137 | +│ QQuery 18 │ 1549.92ms │ 1914.06ms │ 1.23x slower │ |
| 138 | +│ QQuery 19 │ 106.53ms │ 104.28ms │ no change │ |
| 139 | +│ QQuery 20 │ 532.11ms │ 610.62ms │ 1.15x slower │ |
| 140 | +│ QQuery 21 │ 723.39ms │ 823.34ms │ 1.14x slower │ |
| 141 | +│ QQuery 22 │ 91.84ms │ 89.89ms │ no change │ |
| 142 | +└──────────────┴──────────────┴──────────────┴───────────────┘ |
| 143 | +``` |
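As a reading aid for the `Change` column above, here is a minimal sketch of how each label can be reproduced from the two timings. The helper `change_label` is hypothetical and is not part of `bench.sh`. The 5% "no change" band is an assumption inferred from the sample rows, not taken from the script itself.

```shell
# Hypothetical helper, not part of bench.sh: reproduce a "Change" label
# from a baseline time and a comparison time (both in milliseconds).
# Assumption: ratios within 5% of 1.0 are reported as "no change".
change_label() {
  awk -v base="$1" -v cmp="$2" 'BEGIN {
    ratio = cmp / base
    if (ratio > 1.05)      printf "%.2fx slower\n", ratio
    else if (ratio < 0.95) printf "+%.2fx faster\n", 1 / ratio
    else                   print "no change"
  }'
}

change_label 2520.52 2795.09   # tpch query 1  -> 1.11x slower
change_label 144.01 129.28     # tpch query 4  -> +1.11x faster
change_label 222.37 216.01     # tpch query 2  -> no change
```

Under this assumption, "slower" is the comparison time divided by the baseline time, and "faster" is the inverse ratio.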
| 144 | + |
| 145 | + |
| 146 | +# Benchmark Descriptions: |
| 147 | + |
| 148 | +## `tpch` Benchmark derived from TPC-H |
| 149 | + |
| 150 | +These benchmarks are derived from the [TPC-H][1] benchmark. The data generator (tpch-gen) and the expected answers come from |
| 151 | +https://github.com/databricks/tpch-dbgen.git, which is based on version [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) of TPC-H. |
| 152 | + |
43 | 153 |
|
44 |
| -## Running the DataFusion Benchmarks |
| 154 | +### Running the DataFusion Benchmarks Manually |
45 | 155 |
|
46 | 156 | The benchmark can then be run (assuming the data created from `dbgen` is in `./data`) with a command such as:
|
47 | 157 |
|
@@ -126,7 +236,7 @@ This will produce output like
|
126 | 236 | └──────────────┴──────────────┴──────────────┴───────────────┘
|
127 | 237 | ```
|
128 | 238 |
|
129 |
| -## Expected output |
| 239 | +### Expected output |
130 | 240 |
|
131 | 241 | The result of query 1 should produce the following output when executed against the SF=1 dataset.
|
132 | 242 |
|
|