Commit 11b7b5c

Add ClickBench queries to DataFusion benchmark runner (#7060)
* Add clickbench query runner to benchmarks, update docs
* Fix numbering so it goes from 0 to 42
1 parent 5f03146 commit 11b7b5c

11 files changed: +354 −90 lines changed

benchmarks/README.md

Lines changed: 55 additions & 36 deletions
````diff
@@ -20,11 +20,14 @@
 # DataFusion Benchmarks

 This crate contains benchmarks based on popular public data sets and
-open source benchmark suites, making it easy to run more realistic
-benchmarks to help with performance and scalability testing of DataFusion.
+open source benchmark suites, to help with performance and scalability
+testing of DataFusion.

-# Benchmarks Against Other Engines

+## Other engines
+
+The benchmarks measure changes to DataFusion itself, rather than
+its performance against other engines. For competitive benchmarking,
 DataFusion is included in the benchmark setups for several popular
 benchmarks that compare performance with other engines. For example:
@@ -36,30 +39,35 @@ benchmarks that compare performance with other engines. For example:

 # Running the benchmarks

-## Running Benchmarks
+## `bench.sh`

-The easiest way to run benchmarks from DataFusion source checkouts is
-to use the [bench.sh](bench.sh) script. Usage instructions can be
-found with:
+The easiest way to run benchmarks is the [bench.sh](bench.sh)
+script. Usage instructions can be found with:

 ```shell
 # show usage
 ./bench.sh
 ```

-## Generating Data
+## Generating data
+
+You can create / download the data for these benchmarks using the [bench.sh](bench.sh) script:

-You can create data for all these benchmarks using the [bench.sh](bench.sh) script:
+Create / download all datasets

 ```shell
 ./bench.sh data
 ```

-Data is generated in the `data` subdirectory and will not be checked
-in because this directory has been added to the `.gitignore` file.
+Create / download a specific dataset (TPCH)

+```shell
+./bench.sh data tpch
+```

-## Example to compare peformance on main to a branch
+Data is placed in the `data` subdirectory.
+
+## Comparing performance of main and a branch

 ```shell
 git checkout main
@@ -143,40 +151,24 @@ Benchmark tpch_mem.json
 ```


-# Benchmark Descriptions:
-
-## `tpch` Benchmark derived from TPC-H
-
-These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers:
-https://github.com/databricks/tpch-dbgen.git, based on [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) version of TPC-H.
+### Running Benchmarks Manually

-
-### Running the DataFusion Benchmarks Manually
-
-The benchmark can then be run (assuming the data created from `dbgen` is in `./data`) with a command such as:
+Assuming data in the `data` directory, the `tpch` benchmark can be run with a command like this

 ```bash
-cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
+cargo run --release --bin dfbench -- tpch --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
 ```

-If you omit `--query=<query_id>` argument, then all benchmarks will be run one by one (from query 1 to query 22).
+See the help for more details

-```bash
-cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --batch-size 4096
-```
+### Different features

 You can enable the features `simd` (to use SIMD instructions, `cargo nightly` is required.) and/or `mimalloc` or `snmalloc` (to use either the mimalloc or snmalloc allocator) as features by passing them in as `--features`:

 ```
 cargo run --release --features "simd mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
 ```

-If you want to disable collection of statistics (and thus cost based optimizers), you can pass `--disable-statistics` flag.
-
-```bash
-cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path /mnt/tpch-parquet --format parquet --query 17 --disable-statistics
-```
-
 The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from `tbl`
 (generated by the `dbgen` utility) to CSV and Parquet.

@@ -188,9 +180,10 @@ Or if you want to verify and run all the queries in the benchmark, you can just

 ### Comparing results between runs

-Any `tpch` execution with `-o <dir>` argument will produce a summary file right under the `<dir>`
-directory. It is a JSON serialized form of all the runs that happened as well as the runtime metadata
-(number of cores, DataFusion version, etc.).
+Any `dfbench` execution with `-o <dir>` argument will produce a
+summary JSON in the specified directory. This file contains a
+serialized form of all the runs that happened and runtime
+metadata (number of cores, DataFusion version, etc.).

 ```shell
 $ git checkout main
@@ -253,6 +246,32 @@ Query 1 iteration 0 took 1956.1 ms
 Query 1 avg time: 1956.11 ms
 ```

+# Benchmark Descriptions
+
+## `dfbench`
+
+The `dfbench` program contains subcommands to run various benchmarks.
+
+Full help can be found in the relevant sub command. For example to get help for tpch,
+run `cargo run --release --bin dfbench tpch --help`
+
+```shell
+cargo run --release --bin dfbench --help
+...
+datafusion-benchmarks 27.0.0
+benchmark command
+
+USAGE:
+    dfbench <SUBCOMMAND>
+
+SUBCOMMANDS:
+    clickbench      Run the clickbench benchmark
+    help            Prints this message or the help of the given subcommand(s)
+    tpch            Run the tpch benchmark.
+    tpch-convert    Convert tpch .slt files to .parquet or .csv files
+
+```
 ## NYC Taxi Benchmark

 These benchmarks are based on the [New York Taxi and Limousine Commission][2] data set.
````
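The summary JSON described in "Comparing results between runs" can be diffed between a `main` run and a branch run. A minimal sketch of that comparison follows; note the field names (`queries`, `query`, `iterations`, `elapsed`) are assumptions for illustration, not the documented `dfbench` output format — check a real summary file before relying on them.

```python
import json

def load_avg_times(path):
    # Average each query's iteration times from a dfbench "-o" summary.
    # NOTE: this schema ("queries" -> "query"/"iterations"/"elapsed") is a
    # guess for illustration only.
    with open(path) as f:
        summary = json.load(f)
    return {
        q["query"]: sum(i["elapsed"] for i in q["iterations"]) / len(q["iterations"])
        for q in summary["queries"]
    }

def compare(main_path, branch_path):
    # Print per-query branch/main time ratios, in the spirit of the
    # compare step shown in the README.
    main, branch = load_avg_times(main_path), load_avg_times(branch_path)
    for query in sorted(main.keys() & branch.keys()):
        ratio = branch[query] / main[query]
        print(f"Query {query}: {main[query]:.1f} ms vs {branch[query]:.1f} ms ({ratio:.2f}x)")
```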

benchmarks/bench.sh

Lines changed: 9 additions & 3 deletions
```diff
@@ -35,7 +35,7 @@ BENCHMARK=all
 DATAFUSION_DIR=${DATAFUSION_DIR:-$SCRIPT_DIR/..}
 DATA_DIR=${DATA_DIR:-$SCRIPT_DIR/data}
 #CARGO_COMMAND=${CARGO_COMMAND:-"cargo run --release"}
-CARGO_COMMAND=${CARGO_COMMAND:-"cargo run --profile release-nonlto"} # TEMP: for faster iterations
+CARGO_COMMAND=${CARGO_COMMAND:-"cargo run --profile release-nonlto"} # for faster iterations

 usage() {
     echo "
@@ -386,12 +386,18 @@ data_clickbench_partitioned() {

 # Runs the clickbench benchmark with a single large parquet file
 run_clickbench_1() {
-    echo "NOTICE: ClickBench (1 parquet file) is not yet supported"
+    RESULTS_FILE="${RESULTS_DIR}/clickbench_1.json"
+    echo "RESULTS_FILE: ${RESULTS_FILE}"
+    echo "Running clickbench (1 file) benchmark..."
+    $CARGO_COMMAND --bin dfbench -- clickbench --iterations 10 --path "${DATA_DIR}/hits.parquet" --queries-path "${SCRIPT_DIR}/queries/clickbench/queries.sql" -o "${RESULTS_FILE}"
 }

 # Runs the clickbench benchmark with the partitioned (100 file) parquet data
 run_clickbench_partitioned() {
-    echo "NOTICE: ClickBench (1 parquet file) is not yet supported"
+    RESULTS_FILE="${RESULTS_DIR}/clickbench_partitioned.json"
+    echo "RESULTS_FILE: ${RESULTS_FILE}"
+    echo "Running clickbench (partitioned, 100 files) benchmark..."
+    $CARGO_COMMAND --bin dfbench -- clickbench --iterations 10 --path "${DATA_DIR}/hits_partitioned" --queries-path "${SCRIPT_DIR}/queries/clickbench/queries.sql" -o "${RESULTS_FILE}"
 }

 compare_benchmarks() {
```
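The two `run_clickbench_*` functions differ only in their dataset path and results filename. A small sketch of the command line they assemble (paths are illustrative, and a distinct results filename per variant is used here so the two runs do not overwrite each other):

```python
from pathlib import Path

def clickbench_cmd(data_dir, script_dir, results_dir, partitioned=False):
    # Mirrors the dfbench invocation issued by run_clickbench_1 /
    # run_clickbench_partitioned in bench.sh.
    name = "clickbench_partitioned" if partitioned else "clickbench_1"
    data = Path(data_dir) / ("hits_partitioned" if partitioned else "hits.parquet")
    return [
        "cargo", "run", "--profile", "release-nonlto", "--bin", "dfbench", "--",
        "clickbench", "--iterations", "10",
        "--path", str(data),
        "--queries-path", str(Path(script_dir) / "queries" / "clickbench" / "queries.sql"),
        "-o", str(Path(results_dir) / f"{name}.json"),
    ]
```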
Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+Downloaded from https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql
```
Lines changed: 43 additions & 0 deletions
```diff
@@ -0,0 +1,43 @@
+SELECT COUNT(*) FROM hits;
+SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;
+SELECT SUM("AdvEngineID"), COUNT(*), AVG("ResolutionWidth") FROM hits;
+SELECT AVG("UserID") FROM hits;
+SELECT COUNT(DISTINCT "UserID") FROM hits;
+SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;
+SELECT MIN("EventDate"::INT::DATE), MAX("EventDate"::INT::DATE) FROM hits;
+SELECT "AdvEngineID", COUNT(*) FROM hits WHERE "AdvEngineID" <> 0 GROUP BY "AdvEngineID" ORDER BY COUNT(*) DESC;
+SELECT "RegionID", COUNT(DISTINCT "UserID") AS u FROM hits GROUP BY "RegionID" ORDER BY u DESC LIMIT 10;
+SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10;
+SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
+SELECT "MobilePhone", "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhone", "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
+SELECT "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
+SELECT "SearchPhrase", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY u DESC LIMIT 10;
+SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;
+SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC LIMIT 10;
+SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
+SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
+SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
+SELECT "UserID" FROM hits WHERE "UserID" = 435090932899640449;
+SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';
+SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
+SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
+SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;
+SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;
+SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
+SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY to_timestamp_seconds("EventTime"), "SearchPhrase" LIMIT 10;
+SELECT "CounterID", AVG(length("URL")) AS l, COUNT(*) AS c FROM hits WHERE "URL" <> '' GROUP BY "CounterID" HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
+SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
+SELECT SUM("ResolutionWidth"), SUM("ResolutionWidth" + 1), SUM("ResolutionWidth" + 2), SUM("ResolutionWidth" + 3), SUM("ResolutionWidth" + 4), SUM("ResolutionWidth" + 5), SUM("ResolutionWidth" + 6), SUM("ResolutionWidth" + 7), SUM("ResolutionWidth" + 8), SUM("ResolutionWidth" + 9), SUM("ResolutionWidth" + 10), SUM("ResolutionWidth" + 11), SUM("ResolutionWidth" + 12), SUM("ResolutionWidth" + 13), SUM("ResolutionWidth" + 14), SUM("ResolutionWidth" + 15), SUM("ResolutionWidth" + 16), SUM("ResolutionWidth" + 17), SUM("ResolutionWidth" + 18), SUM("ResolutionWidth" + 19), SUM("ResolutionWidth" + 20), SUM("ResolutionWidth" + 21), SUM("ResolutionWidth" + 22), SUM("ResolutionWidth" + 23), SUM("ResolutionWidth" + 24), SUM("ResolutionWidth" + 25), SUM("ResolutionWidth" + 26), SUM("ResolutionWidth" + 27), SUM("ResolutionWidth" + 28), SUM("ResolutionWidth" + 29), SUM("ResolutionWidth" + 30), SUM("ResolutionWidth" + 31), SUM("ResolutionWidth" + 32), SUM("ResolutionWidth" + 33), SUM("ResolutionWidth" + 34), SUM("ResolutionWidth" + 35), SUM("ResolutionWidth" + 36), SUM("ResolutionWidth" + 37), SUM("ResolutionWidth" + 38), SUM("ResolutionWidth" + 39), SUM("ResolutionWidth" + 40), SUM("ResolutionWidth" + 41), SUM("ResolutionWidth" + 42), SUM("ResolutionWidth" + 43), SUM("ResolutionWidth" + 44), SUM("ResolutionWidth" + 45), SUM("ResolutionWidth" + 46), SUM("ResolutionWidth" + 47), SUM("ResolutionWidth" + 48), SUM("ResolutionWidth" + 49), SUM("ResolutionWidth" + 50), SUM("ResolutionWidth" + 51), SUM("ResolutionWidth" + 52), SUM("ResolutionWidth" + 53), SUM("ResolutionWidth" + 54), SUM("ResolutionWidth" + 55), SUM("ResolutionWidth" + 56), SUM("ResolutionWidth" + 57), SUM("ResolutionWidth" + 58), SUM("ResolutionWidth" + 59), SUM("ResolutionWidth" + 60), SUM("ResolutionWidth" + 61), SUM("ResolutionWidth" + 62), SUM("ResolutionWidth" + 63), SUM("ResolutionWidth" + 64), SUM("ResolutionWidth" + 65), SUM("ResolutionWidth" + 66), SUM("ResolutionWidth" + 67), SUM("ResolutionWidth" + 68), SUM("ResolutionWidth" + 69), SUM("ResolutionWidth" + 70), SUM("ResolutionWidth" + 71), SUM("ResolutionWidth" + 72), SUM("ResolutionWidth" + 73), SUM("ResolutionWidth" + 74), SUM("ResolutionWidth" + 75), SUM("ResolutionWidth" + 76), SUM("ResolutionWidth" + 77), SUM("ResolutionWidth" + 78), SUM("ResolutionWidth" + 79), SUM("ResolutionWidth" + 80), SUM("ResolutionWidth" + 81), SUM("ResolutionWidth" + 82), SUM("ResolutionWidth" + 83), SUM("ResolutionWidth" + 84), SUM("ResolutionWidth" + 85), SUM("ResolutionWidth" + 86), SUM("ResolutionWidth" + 87), SUM("ResolutionWidth" + 88), SUM("ResolutionWidth" + 89) FROM hits;
+SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;
+SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
+SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;
+SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 10;
+SELECT 1, "URL", COUNT(*) AS c FROM hits GROUP BY 1, "URL" ORDER BY c DESC LIMIT 10;
+SELECT "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3, COUNT(*) AS c FROM hits GROUP BY "ClientIP", "ClientIP" - 1, "ClientIP" - 2, "ClientIP" - 3 ORDER BY c DESC LIMIT 10;
+SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "URL" <> '' GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10;
+SELECT "Title", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "DontCountHits" = 0 AND "IsRefresh" = 0 AND "Title" <> '' GROUP BY "Title" ORDER BY PageViews DESC LIMIT 10;
+SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "IsLink" <> 0 AND "IsDownload" = 0 GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
+SELECT "TraficSourceID", "SearchEngineID", "AdvEngineID", CASE WHEN ("SearchEngineID" = 0 AND "AdvEngineID" = 0) THEN "Referer" ELSE '' END AS Src, "URL" AS Dst, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 GROUP BY "TraficSourceID", "SearchEngineID", "AdvEngineID", Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
+SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "TraficSourceID" IN (-1, 6) AND "RefererHash" = 3594120000172545465 GROUP BY "URLHash", "EventDate"::INT::DATE ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
+SELECT "WindowClientWidth", "WindowClientHeight", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "DontCountHits" = 0 AND "URLHash" = 2868770270353813622 GROUP BY "WindowClientWidth", "WindowClientHeight" ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;
+SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', M) LIMIT 10 OFFSET 1000;
```
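The queries file keeps one semicolon-terminated statement per line, which lets a runner number them 0 through 42 (as the commit message says, "Fix numbering so it goes from 0 to 42"). A minimal sketch of that parsing, assuming the one-query-per-line layout:

```python
def read_queries(path):
    # One SQL statement per line, semicolon-terminated; blank lines are
    # skipped. Returns (query_id, sql) pairs numbered from 0.
    with open(path) as f:
        lines = [line.strip() for line in f]
    return list(enumerate(line for line in lines if line.endswith(";")))
```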

benchmarks/src/bin/dfbench.rs

Lines changed: 3 additions & 1 deletion
```diff
@@ -28,13 +28,14 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
 #[global_allocator]
 static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;

-use datafusion_benchmarks::tpch;
+use datafusion_benchmarks::{clickbench, tpch};

 #[derive(Debug, StructOpt)]
 #[structopt(about = "benchmark command")]
 enum Options {
     Tpch(tpch::RunOpt),
     TpchConvert(tpch::ConvertOpt),
+    Clickbench(clickbench::RunOpt),
 }

 // Main benchmark runner entrypoint
@@ -45,5 +46,6 @@ pub async fn main() -> Result<()> {
     match Options::from_args() {
         Options::Tpch(opt) => opt.run().await,
         Options::TpchConvert(opt) => opt.run().await,
+        Options::Clickbench(opt) => opt.run().await,
     }
 }
```
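For readers less familiar with StructOpt, the pattern in this diff — one enum variant per subcommand, dispatched through a single `match` — corresponds to argparse subparsers in Python terms. A rough analogy (the handler names here are invented for illustration):

```python
import argparse

def run_tpch(args):
    return f"tpch: {args.iterations} iterations"

def run_clickbench(args):
    # The new subcommand: adding it means one more parser + handler,
    # just as the Rust enum gains Clickbench(clickbench::RunOpt).
    return f"clickbench: {args.iterations} iterations"

def build_parser():
    parser = argparse.ArgumentParser(prog="dfbench", description="benchmark command")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, handler in [("tpch", run_tpch), ("clickbench", run_clickbench)]:
        p = sub.add_parser(name)
        p.add_argument("--iterations", type=int, default=3)
        p.set_defaults(handler=handler)
    return parser

args = build_parser().parse_args(["clickbench", "--iterations", "10"])
print(args.handler(args))  # dispatch, like the match on Options
```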

benchmarks/src/bin/tpch.rs

Lines changed: 3 additions & 2 deletions
```diff
@@ -43,9 +43,10 @@ enum TpchOpt {
     Convert(tpch::ConvertOpt),
 }

-/// 'tpch' entry point, with tortured command line arguments
+/// 'tpch' entry point, with tortured command line arguments. Please
+/// use `dfbench` instead.
 ///
-/// This is kept to be backwards compatible with the benchmark names prior to
+/// Note: this is kept to be backwards compatible with the benchmark names prior to
 /// <https://github.com/apache/arrow-datafusion/issues/6994>
 #[tokio::main]
 async fn main() -> Result<()> {
```
