@@ -229,31 +229,14 @@ This will produce output like
229
229
└──────────────┴──────────────┴──────────────┴───────────────┘
230
230
```
231
231
232
- ### Expected output
232
+ # Benchmark Runner
233
233
234
- The result of query 1 should produce the following output when executed against the SF=1 dataset.
234
+ The ` dfbench ` program contains subcommands to run the various
235
+ benchmarks. When benchmarking, it should always be built in release
236
+ mode using ` --release ` .
235
237
236
- ```
237
- +--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
238
- | l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
239
- +--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
240
- | A | F | 37734107 | 56586554400.73001 | 53758257134.870026 | 55909065222.82768 | 25.522005853257337 | 38273.12973462168 | 0.049985295838396455 | 1478493 |
241
- | N | F | 991417 | 1487504710.3799996 | 1413082168.0541 | 1469649223.1943746 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421622 | 38854 |
242
- | N | O | 74476023 | 111701708529.50996 | 106118209986.10472 | 110367023144.56622 | 25.502229680934594 | 38249.1238377803 | 0.049996589476752576 | 2920373 |
243
- | R | F | 37719753 | 56568041380.90001 | 53741292684.60399 | 55889619119.83194 | 25.50579361269077 | 38250.854626099666 | 0.05000940583012587 | 1478870 |
244
- +--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
245
- Query 1 iteration 0 took 1956.1 ms
246
- Query 1 avg time: 1956.11 ms
247
- ```
248
-
249
- # Benchmark Descriptions
250
-
251
- ## ` dfbench `
252
-
253
- The ` dfbench ` program contains subcommands to run various benchmarks.
254
-
255
- Full help can be found in the relevant sub command. For example to get help for tpch,
256
- run ` cargo run --release --bin dfbench tpch --help `
238
+ Full help for each benchmark can be found in the relevant
239
+ subcommand, for example `cargo run --release --bin dfbench tpch --help`. To list all subcommands, run
257
240
258
241
``` shell
259
242
cargo run --release --bin dfbench --help
@@ -265,61 +248,52 @@ USAGE:
265
248
dfbench < SUBCOMMAND>
266
249
267
250
SUBCOMMANDS:
268
- clickbench Run the clickbench benchmark
269
- help Prints this message or the help of the given subcommand(s)
270
- tpch Run the tpch benchmark.
271
- tpch-convert Convert tpch .slt files to .parquet or .csv files
251
+ clickbench Run the clickbench benchmark
252
+ help Prints this message or the help of the given subcommand(s)
253
+ parquet-filter Test performance of parquet filter pushdown
254
+ sort Test performance of parquet filter pushdown
255
+ tpch Run the tpch benchmark.
256
+ tpch-convert Convert tpch .slt files to .parquet or .csv files
272
257
273
258
```
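
As a rough sketch of how the per-subcommand help could be collected, the loop below prints (but does not execute) one help command for each subcommand named in the listing above. The function name `print_help_cmds` is hypothetical, and the `--` separator is an assumption made so that cargo passes the remaining arguments through to `dfbench`.

```shell
#!/bin/sh
# Dry run: print, rather than execute, the help command for each subcommand.
# Subcommand names are copied from the `dfbench --help` listing above.
# `print_help_cmds` is a hypothetical helper, not part of dfbench.
print_help_cmds() {
    for sub in clickbench parquet-filter sort tpch tpch-convert; do
        # `--` separates cargo's own arguments from those passed to dfbench
        echo "cargo run --release --bin dfbench -- $sub --help"
    done
}

print_help_cmds
```

Piping the output to `sh` would run the commands for real, assuming the binary has been built from this repository.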
274
259
275
- ## h2o benchmarks
260
+ # Benchmarks
276
261
277
- ``` bash
278
- cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug
279
- ```
262
+ The output of `dfbench --help` includes a description of each benchmark, which is reproduced here for convenience.
280
263
281
- Example run:
264
+ ## ClickBench
282
265
283
- ```
284
- Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
285
- Executing select id1, sum(v1) as v1 from x group by id1
286
- +-------+--------+
287
- | id1 | v1 |
288
- +-------+--------+
289
- | id063 | 199420 |
290
- | id094 | 200127 |
291
- | id044 | 198886 |
292
- ...
293
- | id093 | 200132 |
294
- | id003 | 199047 |
295
- +-------+--------+
266
+ The ClickBench[ 1] benchmarks are widely cited in the industry and
267
+ focus on grouping / aggregation / filtering. This runner uses the
268
+ scripts and queries from [ 2] .
296
269
297
- h2o groupby query 1 took 1669 ms
298
- ```
270
+ [ 1 ] : https://github.com/ClickHouse/ClickBench
271
+ [ 2 ] : https://github.com/ClickHouse/ClickBench/tree/main/datafusion
299
272
300
- [ 1 ] : http://www.tpc.org/tpch/
301
- [ 2 ] : https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
273
+ ## Parquet Filter
302
274
303
- ## Parquet benchmarks
275
+ Test performance of parquet filter pushdown
304
276
305
- This is a set of benchmarks for testing and verifying performance of parquet filtering and sorting.
306
- The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
277
+ The queries are executed on a synthetic dataset generated during
278
+ the benchmark execution and designed to simulate web server access
279
+ logs.
307
280
308
- To run filter benchmarks, run:
281
+ For example, the command
309
282
310
- ``` base
311
- cargo run --release --bin parquet -- filter --path ./data --scale-factor 1.0
312
- ```
283
+ dfbench parquet-filter --path ./data --scale-factor 1.0
313
284
314
- This will generate the synthetic dataset at ` ./data/logs.parquet ` . The size of the dataset can be controlled through the ` size_factor `
285
+ generates the synthetic dataset at ` ./data/logs.parquet ` . The size
286
+ of the dataset can be controlled through the `--scale-factor` option
315
287
(with the default value of ` 1.0 ` generating a ~ 1GB parquet file).
316
288
317
- For each filter we will run the query using different ` ParquetScanOption ` settings.
289
+ For each filter we will run the query using different
290
+ ` ParquetScanOption ` settings.
318
291
319
- Example run :
292
+ Example output:
320
293
321
294
```
322
- Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
295
+ Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data",
296
+ batch_size: 8192, scale_factor: 1.0 }
323
297
Generated test dataset with 10699521 rows
324
298
Executing with filter 'request_method = Utf8("GET")'
325
299
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
@@ -337,12 +311,56 @@ Iteration 2 returned 1781686 rows in 1947 ms
337
311
...
338
312
```
339
313
340
- Similarly, to run sorting benchmarks, run:
314
+ ## Sort
315
+ Test performance of sorting large datasets
316
+
317
+ This test sorts a synthetic dataset generated during the
318
+ benchmark execution, designed to simulate sorting web server
319
+ access logs. Such sorting is often done during data transformation
320
+ steps.
321
+
322
+ The tests sort the entire dataset using several different sort
323
+ orders.
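
A minimal sketch of a sort invocation, by analogy with the parquet-filter example: it assumes that the `sort` subcommand accepts the same `--path` and `--scale-factor` options, which this section does not confirm (check `dfbench sort --help`). The helper `sort_bench_cmd` is hypothetical and only prints the command.

```shell
#!/bin/sh
# Hypothetical helper: build the command line for the sort benchmark.
# Assumption: `sort` takes --path and --scale-factor like `parquet-filter`.
sort_bench_cmd() {
    data_dir="$1"   # directory where the synthetic dataset is written
    scale="$2"      # scale factor controlling the dataset size
    echo "cargo run --release --bin dfbench -- sort --path $data_dir --scale-factor $scale"
}

# Print the command for the default-sized dataset
sort_bench_cmd ./data 1.0
```

Printing the command instead of running it keeps the sketch free of side effects, since the real run generates a large dataset.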
324
+
325
+ ## TPCH
326
+
327
+ Run the tpch benchmark.
328
+
329
+ This benchmark is derived from the [TPC-H][1] version
330
+ [ 2.17.1] . The data and answers are generated using ` tpch-gen ` from
331
+ [ 2] .
332
+
333
+ [ 1 ] : http://www.tpc.org/tpch/
334
+ [2]: https://github.com/databricks/tpch-dbgen.git
335
+ [ 2.17.1 ] : https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
336
+
337
+
338
+ # Older Benchmarks
339
+
340
+ ## h2o benchmarks
341
+
342
+ ``` bash
343
+ cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug
344
+ ```
345
+
346
+ Example run:
347
+
348
+ ```
349
+ Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
350
+ Executing select id1, sum(v1) as v1 from x group by id1
351
+ +-------+--------+
352
+ | id1 | v1 |
353
+ +-------+--------+
354
+ | id063 | 199420 |
355
+ | id094 | 200127 |
356
+ | id044 | 198886 |
357
+ ...
358
+ | id093 | 200132 |
359
+ | id003 | 199047 |
360
+ +-------+--------+
341
361
342
- ``` base
343
- cargo run --release --bin parquet -- sort --path ./data --scale-factor 1.0
362
+ h2o groupby query 1 took 1669 ms
344
363
```
345
364
346
- This proceeds in the same way as the filter benchmarks: each sort expression
347
- combination will be run using the same set of ` ParquetScanOption ` as the
348
- filter benchmarks.
365
+ [ 1 ] : http://www.tpc.org/tpch/
366
+ [ 2 ] : https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page