Commit 2ec0bc1

Add parquet-filter and sort benchmarks to dfbench (#7120)
* Add parquet-filter and sort benchmarks to dfbench
* fix
* fix docs
* fix ci bench
* Update docs
1 parent 563a1dc commit 2ec0bc1

13 files changed (+593, -397 lines)

benchmarks/README.md

Lines changed: 84 additions & 66 deletions
@@ -229,31 +229,14 @@ This will produce output like
 └──────────────┴──────────────┴──────────────┴───────────────┘
 ```
 
-### Expected output
+# Benchmark Runner
 
-The result of query 1 should produce the following output when executed against the SF=1 dataset.
+The `dfbench` program contains subcommands to run the various
+benchmarks. When benchmarking, it should always be built in release
+mode using `--release`.
 
-```
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-| A | F | 37734107 | 56586554400.73001 | 53758257134.870026 | 55909065222.82768 | 25.522005853257337 | 38273.12973462168 | 0.049985295838396455 | 1478493 |
-| N | F | 991417 | 1487504710.3799996 | 1413082168.0541 | 1469649223.1943746 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421622 | 38854 |
-| N | O | 74476023 | 111701708529.50996 | 106118209986.10472 | 110367023144.56622 | 25.502229680934594 | 38249.1238377803 | 0.049996589476752576 | 2920373 |
-| R | F | 37719753 | 56568041380.90001 | 53741292684.60399 | 55889619119.83194 | 25.50579361269077 | 38250.854626099666 | 0.05000940583012587 | 1478870 |
-+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
-Query 1 iteration 0 took 1956.1 ms
-Query 1 avg time: 1956.11 ms
-```
-
-# Benchmark Descriptions
-
-## `dfbench`
-
-The `dfbench` program contains subcommands to run various benchmarks.
-
-Full help can be found in the relevant sub command. For example to get help for tpch,
-run `cargo run --release --bin dfbench tpch --help`
+Full help for each benchmark can be found in the relevant sub
+command. For example to get help for tpch, run
 
 ```shell
 cargo run --release --bin dfbench --help
@@ -265,61 +248,52 @@ USAGE:
     dfbench <SUBCOMMAND>
 
 SUBCOMMANDS:
-    clickbench      Run the clickbench benchmark
-    help            Prints this message or the help of the given subcommand(s)
-    tpch            Run the tpch benchmark.
-    tpch-convert    Convert tpch .slt files to .parquet or .csv files
+    clickbench        Run the clickbench benchmark
+    help              Prints this message or the help of the given subcommand(s)
+    parquet-filter    Test performance of parquet filter pushdown
+    sort              Test performance of sorting large datasets
+    tpch              Run the tpch benchmark.
+    tpch-convert      Convert tpch .slt files to .parquet or .csv files
 
 ```
 
-## h2o benchmarks
+# Benchmarks
 
-```bash
-cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug
-```
+The output of `dfbench` help includes a description of each benchmark, which is reproduced here for convenience
 
-Example run:
+## ClickBench
 
-```
-Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
-Executing select id1, sum(v1) as v1 from x group by id1
-+-------+--------+
-| id1 | v1 |
-+-------+--------+
-| id063 | 199420 |
-| id094 | 200127 |
-| id044 | 198886 |
-...
-| id093 | 200132 |
-| id003 | 199047 |
-+-------+--------+
+The ClickBench[1] benchmarks are widely cited in the industry and
+focus on grouping / aggregation / filtering. This runner uses the
+scripts and queries from [2].
 
-h2o groupby query 1 took 1669 ms
-```
+[1]: https://github.com/ClickHouse/ClickBench
+[2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
 
-[1]: http://www.tpc.org/tpch/
-[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
+## Parquet Filter
 
-## Parquet benchmarks
+Test performance of parquet filter pushdown
 
-This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
-The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
+The queries are executed on a synthetic dataset generated during
+the benchmark execution and designed to simulate web server access
+logs.
 
-To run filter benchmarks, run:
+Example
 
-```base
-cargo run --release --bin parquet -- filter --path ./data --scale-factor 1.0
-```
+dfbench parquet-filter --path ./data --scale-factor 1.0
 
-This will generate the synthetic dataset at `./data/logs.parquet`. The size of the dataset can be controlled through the `size_factor`
+generates the synthetic dataset at `./data/logs.parquet`. The size
+of the dataset can be controlled through the `--scale-factor` option
 (with the default value of `1.0` generating a ~1GB parquet file).
 
-For each filter we will run the query using different `ParquetScanOption` settings.
+For each filter we will run the query using different
+`ParquetScanOption` settings.
 
-Example run:
+Example output:
 
 ```
-Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
+Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data",
+batch_size: 8192, scale_factor: 1.0 }
 Generated test dataset with 10699521 rows
 Executing with filter 'request_method = Utf8("GET")'
 Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
@@ -337,12 +311,56 @@ Iteration 2 returned 1781686 rows in 1947 ms
 ...
 ```
 
-Similarly, to run sorting benchmarks, run:
+## Sort
+Test performance of sorting large datasets
+
+This test sorts a synthetic dataset generated during the
+benchmark execution, designed to simulate sorting web server
+access logs. Such sorting is often done during data transformation
+steps.
+
+The tests sort the entire dataset using several different sort
+orders.
+
+## TPCH
+
+Run the tpch benchmark.
+
+This benchmark is derived from the [TPC-H][1] version
+[2.17.1]. The data and answers are generated using `tpch-gen` from
+[2].
+
+[1]: http://www.tpc.org/tpch/
+[2]: https://github.com/databricks/tpch-dbgen.git
+[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
+
+
+# Older Benchmarks
+
+## h2o benchmarks
+
+```bash
+cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug
+```
+
+Example run:
+
+```
+Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
+Executing select id1, sum(v1) as v1 from x group by id1
++-------+--------+
+| id1 | v1 |
++-------+--------+
+| id063 | 199420 |
+| id094 | 200127 |
+| id044 | 198886 |
+...
+| id093 | 200132 |
+| id003 | 199047 |
++-------+--------+
 
-```base
-cargo run --release --bin parquet -- sort --path ./data --scale-factor 1.0
+h2o groupby query 1 took 1669 ms
 ```
 
-This proceeds in the same way as the filter benchmarks: each sort expression
-combination will be run using the same set of `ParquetScanOption` as the
-filter benchmarks.
+[1]: http://www.tpc.org/tpch/
+[2]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
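
The Parquet Filter section of the README says each filter is run under several `ParquetScanOption` settings. A minimal sketch of that loop in plain Rust follows. It is not the benchmark's code: the three field names are taken from the example output, while `ScanOptions`, `scan_configurations`, `time_each`, and the particular configurations listed are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Stand-in for the scan options printed in the example output.
/// Field names come from that output; the type itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ScanOptions {
    pushdown_filters: bool,
    reorder_predicates: bool,
    enable_page_index: bool,
}

/// Illustrative configurations (an assumption, not the benchmark's actual list).
fn scan_configurations() -> Vec<ScanOptions> {
    vec![
        ScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false },
        ScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: true },
        ScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true },
    ]
}

/// Run `query` `iterations` times under every configuration and record
/// the wall-clock duration of each run.
fn time_each<F>(iterations: usize, mut query: F) -> Vec<(ScanOptions, Vec<Duration>)>
where
    F: FnMut(ScanOptions) -> usize,
{
    scan_configurations()
        .into_iter()
        .map(|opts| {
            let timings: Vec<Duration> = (0..iterations)
                .map(|_| {
                    let start = Instant::now();
                    let _rows = query(opts);
                    start.elapsed()
                })
                .collect();
            (opts, timings)
        })
        .collect()
}

fn main() {
    // Toy "query": count GET requests in an in-memory list, ignoring the options.
    let methods = ["GET", "POST", "GET", "PUT"];
    let results = time_each(3, |_opts| methods.iter().filter(|m| **m == "GET").count());
    for (opts, timings) in &results {
        println!("Using scan options {:?}", opts);
        for (i, t) in timings.iter().enumerate() {
            println!("Iteration {} took {:?}", i, t);
        }
    }
}
```

Keeping the configuration list separate from the timing loop means a new combination of settings can be added without touching the measurement code.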

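The Sort section of the README describes sorting a synthetic access-log dataset using several different sort orders. The sketch below shows what "several sort orders over the same rows" can look like with only the standard library. The `AccessLog` struct, its fields, and both sort orders are assumptions made for illustration; the real benchmark's schema and sort keys are not shown in this diff.

```rust
/// Hypothetical access-log row; field names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AccessLog {
    service: &'static str,
    status: u16,
    timestamp: u64,
}

/// A tiny in-memory stand-in for the generated dataset.
fn sample_logs() -> Vec<AccessLog> {
    vec![
        AccessLog { service: "backend", status: 500, timestamp: 3 },
        AccessLog { service: "frontend", status: 200, timestamp: 1 },
        AccessLog { service: "backend", status: 200, timestamp: 2 },
    ]
}

/// Sort order 1: a single key.
fn sort_by_timestamp(mut rows: Vec<AccessLog>) -> Vec<AccessLog> {
    rows.sort_by_key(|r| r.timestamp);
    rows
}

/// Sort order 2: a compound key, compared left to right.
fn sort_by_service_status_timestamp(mut rows: Vec<AccessLog>) -> Vec<AccessLog> {
    rows.sort_by_key(|r| (r.service, r.status, r.timestamp));
    rows
}

fn main() {
    println!("by timestamp:");
    for row in sort_by_timestamp(sample_logs()) {
        println!("  {:?}", row);
    }
    println!("by service, status, timestamp:");
    for row in sort_by_service_status_timestamp(sample_logs()) {
        println!("  {:?}", row);
    }
}
```
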
benchmarks/bench.sh

Lines changed: 1 addition & 0 deletions
@@ -182,6 +182,7 @@ main() {
     # navigate to the appropriate directory
     pushd "${DATAFUSION_DIR}/benchmarks" > /dev/null
     mkdir -p "${RESULTS_DIR}"
+    mkdir -p "${DATA_DIR}"
     case "$BENCHMARK" in
         all)
             run_tpch "1"

benchmarks/src/bin/dfbench.rs

Lines changed: 5 additions & 1 deletion
@@ -28,14 +28,16 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
 #[global_allocator]
 static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;
 
-use datafusion_benchmarks::{clickbench, tpch};
+use datafusion_benchmarks::{clickbench, parquet_filter, sort, tpch};
 
 #[derive(Debug, StructOpt)]
 #[structopt(about = "benchmark command")]
 enum Options {
     Tpch(tpch::RunOpt),
     TpchConvert(tpch::ConvertOpt),
     Clickbench(clickbench::RunOpt),
+    ParquetFilter(parquet_filter::RunOpt),
+    Sort(sort::RunOpt),
 }
 
 // Main benchmark runner entrypoint
@@ -47,5 +49,7 @@ pub async fn main() -> Result<()> {
         Options::Tpch(opt) => opt.run().await,
         Options::TpchConvert(opt) => opt.run().await,
         Options::Clickbench(opt) => opt.run().await,
+        Options::ParquetFilter(opt) => opt.run().await,
+        Options::Sort(opt) => opt.run().await,
     }
 }
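
The change to `dfbench.rs` adds two enum variants and two match arms, which is the whole wiring needed for a new subcommand. As a rough illustration of that enum-dispatch pattern without `structopt`, the sketch below maps the kebab-case names shown in the help output to variants by hand. `Subcommand`, `parse`, and the string-returning `run` are hypothetical stand-ins; the real code derives parsing with `StructOpt` and calls each benchmark's async `run()`.

```rust
/// Hypothetical stand-in for the `Options` enum in `dfbench.rs`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Subcommand {
    Tpch,
    TpchConvert,
    Clickbench,
    ParquetFilter,
    Sort,
}

impl Subcommand {
    /// Map a subcommand name, as listed in the help output, to a variant.
    fn parse(name: &str) -> Option<Subcommand> {
        match name {
            "tpch" => Some(Subcommand::Tpch),
            "tpch-convert" => Some(Subcommand::TpchConvert),
            "clickbench" => Some(Subcommand::Clickbench),
            "parquet-filter" => Some(Subcommand::ParquetFilter),
            "sort" => Some(Subcommand::Sort),
            _ => None,
        }
    }

    /// Placeholder for `opt.run().await`: returns a label instead of
    /// running a benchmark.
    fn run(self) -> String {
        format!("running {:?}", self)
    }
}

fn main() {
    // Default to "sort" when no argument is given, purely for demonstration.
    let name = std::env::args().nth(1).unwrap_or_else(|| "sort".to_string());
    match Subcommand::parse(&name) {
        Some(cmd) => println!("{}", cmd.run()),
        None => eprintln!("unknown subcommand: {}", name),
    }
}
```

Because the `match` in `parse` and the enum are the only places that list subcommands, adding a benchmark touches the same two spots the commit touches in `dfbench.rs`.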
