Add parquet filter and sort to bench.sh #6172
RESULTS_FILE="${RESULTS_DIR}/parquet.json"
echo "RESULTS_FILE: ${RESULTS_FILE}"
echo "Running parquet filter benchmark..."
$CARGO_COMMAND --bin parquet -- filter --path "${DATA_DIR}" --scale-factor 1.0 --iterations 5 -o ${RESULTS_FILE}
Why don't we apply scale-factor for tpch? Is it 1 by default?
This is even more confusing -- the scale factor for tpch is actually applied, but it is applied when we create the data.
The "parquet" benchmark actually doesn't use the tpch dataset at all, and instead generates its own data. However, it overloads the "scale factor" terminology to describe the relative sizes.
Makes things no worse! I think TPC-H scaling could be handled in a separate issue/PR.
RESULTS_FILE="${RESULTS_DIR}/parquet.json"
echo "RESULTS_FILE: ${RESULTS_FILE}"
echo "Running parquet filter benchmark..."
$CARGO_COMMAND --bin parquet -- filter --path "${DATA_DIR}" --scale-factor 1.0 --iterations 5 -o ${RESULTS_FILE}
It is somewhat ugly to have to run a second command (`parquet`) for different benchmarks -- I plan to combine them into a single benchmark runner over time.
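Until such a runner exists, the two invocations of the `parquet` binary could share one small helper. The sketch below is only an illustration: `run_parquet_bench`, the `DRY_RUN` switch, the default values, and the per-subcommand results file name are hypothetical and not part of this PR, and it assumes the `sort` subcommand accepts the same flags as `filter`.

```shell
#!/usr/bin/env bash
set -eu

# Assumed defaults for illustration; bench.sh defines its own values.
DATA_DIR="${DATA_DIR:-./data}"
RESULTS_DIR="${RESULTS_DIR:-./results}"
CARGO_COMMAND="${CARGO_COMMAND:-cargo run --release}"
# Hypothetical switch: DRY_RUN=1 (the default here) prints the command
# instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper, not part of bench.sh.
# $1 is the subcommand of the parquet binary: "filter" or "sort".
run_parquet_bench() {
    subcommand="$1"
    # Assumption: one results file per subcommand (the PR writes parquet.json).
    results_file="${RESULTS_DIR}/parquet_${subcommand}.json"
    if [ "${DRY_RUN}" = "1" ]; then
        echo "${CARGO_COMMAND} --bin parquet -- ${subcommand} --path ${DATA_DIR} --scale-factor 1.0 --iterations 5 -o ${results_file}"
    else
        # CARGO_COMMAND is intentionally unquoted so it splits into words.
        ${CARGO_COMMAND} --bin parquet -- "${subcommand}" --path "${DATA_DIR}" --scale-factor 1.0 --iterations 5 -o "${results_file}"
    fi
}

run_parquet_bench filter
run_parquet_bench sort
```

Setting `DRY_RUN=0` would execute the commands rather than print them.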
Which issue does this PR close?
Follow on to #6131
Rationale for this change
Add support for running the parquet filter and sort benchmarks to bench.sh
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
You can now run
and
to run benchmarks