Commit 85fe884
chore: Merge remote-tracking branch 'apache/main' into comet-parquet-exec - 20240121 (#1316)
* feat: support array_append (#1072)
* feat: support array_append
* formatted code
* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde
* remove unwrap
* Fix for Spark 3.3
* refactor array_append binary expression serde code
* Disabled array_append test for spark 4.0+
* chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063)
* docs: Update benchmarking.md (#1085)
* feat: Require offHeap memory to be enabled (always use unified memory) (#1062)
* Require offHeap memory
* remove unused import
* use off heap memory in stability tests
* reorder imports
* test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087)
* Add changelog for 0.4.0 (#1089)
* chore: Prepare for 0.5.0 development (#1090)
* Update version number for build
* update docs
* build: Skip installation of spark-integration and fuzz testing modules (#1091)
* Add hint for finding the GPG key to use when publishing to maven (#1093)
* docs: Update documentation for 0.4.0 release (#1096)
* update TPC-H results
* update Maven links
* update benchmarking guide and add TPC-DS results
* include q72
* fix: Unsigned type related bugs (#1095)
## Which issue does this PR close?
Closes #1067
## Rationale for this change
Bug fix. A few expressions were failing some unsigned type related tests
## What changes are included in this PR?
- For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
- `u64` becomes `Decimal(20, 0)` but there was a bug in `round()` (`>` vs `>=`)
## How are these changes tested?
Put back tests for unsigned types
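The unsigned-type mapping described above can be illustrated with a small self-contained sketch (hypothetical illustration, not Comet's actual code):

```rust
// Hypothetical illustration (not Comet's code) of why u64 maps to
// Decimal(20, 0) and why a plain signed reinterpretation is not enough.
fn main() {
    // u64::MAX = 18446744073709551615 has 20 decimal digits, so
    // Decimal(20, 0) is the narrowest decimal that holds every u64 exactly.
    assert_eq!(u64::MAX.to_string().len(), 20);

    // Reinterpreting the bits as i64 flips large values negative, which is
    // why the cast must widen (copy the full width) rather than reuse the
    // same-size signed type:
    assert_eq!(u64::MAX as i64, -1);

    println!("u64 requires Decimal(20, 0)");
}
```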
* chore: Include first ScanExec batch in metrics (#1105)
* include first batch in ScanExec metrics
* record row count metric
* fix regression
* chore: Improve CometScan metrics (#1100)
* Add native metrics for plan creation
* make messages consistent
* Include get_next_batch cost in metrics
* formatting
* fix double count of rows
* chore: Add custom metric for native shuffle fetching batches from JVM (#1108)
* feat: support array_insert (#1073)
* Part of the implementation of array_insert
* Missing methods
* Working version
* Reformat code
* Fix code-style
* Add comments about spark's implementation.
* Implement negative indices
+ fix tests for spark < 3.4
* Fix code-style
* Fix scalastyle
* Fix tests for spark < 3.4
* Fixes & tests
- added test for the negative index
- added test for the legacy spark mode
* Use assume(isSpark34Plus) in tests
* Test else-branch & improve coverage
* Update native/spark-expr/src/list.rs
Co-authored-by: Andy Grove <[email protected]>
* Fix fallback test
In one case there is a zero index and the test fails due to a Spark error
* Adjust the behaviour for the NULL case to match Spark
* Move the logic of type checking to the method
* Fix code-style
---------
Co-authored-by: Andy Grove <[email protected]>
* feat: enable decimal to decimal cast of different precision and scale (#1086)
* enable decimal to decimal cast of different precision and scale
* add more test cases for negative scale and higher precision
* add check for compatibility for decimal to decimal
* fix code style
* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala
Co-authored-by: Andy Grove <[email protected]>
* fix the nit in comment
---------
Co-authored-by: himadripal <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
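A decimal-to-decimal cast across precision and scale boils down to rescaling the decimal's unscaled integer value. A minimal sketch of that idea (assumed half-up rounding for non-negative values; not Comet's implementation):

```rust
// Hedged sketch (not Comet's code): rescale an unscaled i128 value when
// casting Decimal(p1, s1) -> Decimal(p2, s2). Precision overflow checking
// is omitted for brevity.
fn rescale(unscaled: i128, from_scale: u32, to_scale: u32) -> i128 {
    if to_scale >= from_scale {
        // Increasing scale: multiply by a power of ten (exact).
        unscaled * 10i128.pow(to_scale - from_scale)
    } else {
        // Decreasing scale: divide, rounding half up
        // (assumed rounding mode; shown for non-negative values only).
        let div = 10i128.pow(from_scale - to_scale);
        (unscaled + div / 2) / div
    }
}

fn main() {
    // 123.45 as Decimal(5, 2) -> Decimal(6, 3) is 123.450
    assert_eq!(rescale(12345, 2, 3), 123450);
    // 123.45 as Decimal(5, 2) -> Decimal(4, 1) rounds to 123.5
    assert_eq!(rescale(12345, 2, 1), 1235);
}
```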
* docs: fix readme FGPA/FPGA typo (#1117)
* fix: Use RDD partition index (#1112)
* fix: Use RDD partition index
* fix
* fix
* fix
* fix: Various metrics bug fixes and improvements (#1111)
* fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129)
* Use exact class comparison for parquet scan
* Add test
* Add comment
* fix: Fix metrics regressions (#1132)
* fix metrics issues
* clippy
* update tests
* docs: Add more technical detail and new diagram to Comet plugin overview (#1119)
* Add more technical detail and new diagram to Comet plugin overview
* update diagram
* add info on Arrow IPC
* update diagram
* update diagram
* update docs
* address feedback
* Stop passing Java config map into native createPlan (#1101)
* feat: Improve ScanExec native metrics (#1133)
* save
* remove shuffle jvm metric and update tuning guide
* docs
* add source for all ScanExecs
* address feedback
* address feedback
* chore: Remove unused StringView struct (#1143)
* Remove unused StringView struct
* remove more dead code
* docs: Add some documentation explaining how shuffle works (#1148)
* add some notes on shuffle
* reads
* improve docs
* test: enable more Spark 4.0 tests (#1145)
## Which issue does this PR close?
Part of #372 and #551
## Rationale for this change
To be ready for Spark 4.0
## What changes are included in this PR?
This PR enables more Spark 4.0 tests that were fixed by recent changes
## How are these changes tested?
tests enabled
* chore: Refactor cast to use SparkCastOptions param (#1146)
* Refactor cast to use SparkCastOptions param
* update tests
* update benches
* update benches
* update benches
* Enable more scenarios in CometExecBenchmark. (#1151)
* chore: Move more expressions from core crate to spark-expr crate (#1152)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* remove dead code (#1155)
* fix: Spark 4.0-preview1 SPARK-47120 (#1156)
## Which issue does this PR close?
Part of #372 and #551
## Rationale for this change
To be ready for Spark 4.0
## What changes are included in this PR?
This PR fixes the new test SPARK-47120 added in Spark 4.0
## How are these changes tested?
tests enabled
* chore: Move string kernels and expressions to spark-expr crate (#1164)
* Move string kernels and expressions to spark-expr crate
* remove unused hash kernel
* remove unused dependencies
* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)
* move CheckOverflow to spark-expr crate
* move NegativeExpr to spark-expr crate
* move UnboundColumn to spark-expr crate
* move ExpandExec from execution::datafusion::operators to execution::operators
* refactoring to remove datafusion subpackage
* update imports in benches
* fix
* fix
* chore: Add ignored tests for reading complex types from Parquet (#1167)
* Add ignored tests for reading structs from Parquet
* add basic map test
* add tests for Map and Array
* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)
* Add Spark-compatible SchemaAdapterFactory implementation
* remove prototype code
* fix
* refactor
* implement more cast logic
* implement more cast logic
* add basic test
* improve test
* cleanup
* fmt
* add support for casting unsigned int to signed int
* clippy
* address feedback
* fix test
* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)
* test: enabling Spark tests with offHeap requirement (#1177)
## Which issue does this PR close?
## Rationale for this change
After #1062 we have not been running Spark tests for native execution
## What changes are included in this PR?
Removed the off heap requirement for testing
## How are these changes tested?
Bringing back Spark tests for native execution
* feat: Improve shuffle metrics (second attempt) (#1175)
* improve shuffle metrics
* docs
* more metrics
* refactor
* address feedback
* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)
* add test
* fix
* fix
* fix
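For context on the stddev_pop fix above: with a streaming formulation such as Welford's algorithm, a single-value group yields a variance of 0.0 naturally, with no need for a special case. A minimal sketch (assumption: not Comet's actual implementation):

```rust
// Hedged sketch (not Comet's code): population standard deviation via
// Welford's online algorithm. For n == 1, m2 is 0.0, so the result falls
// out as 0.0 without being hard-coded.
fn stddev_pop(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let (mut mean, mut m2) = (0.0, 0.0);
    for (i, &x) in values.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (i as f64 + 1.0);
        m2 += delta * (x - mean);
    }
    Some((m2 / values.len() as f64).sqrt())
}

fn main() {
    assert_eq!(stddev_pop(&[5.0]), Some(0.0)); // single value: exactly 0.0
    assert_eq!(stddev_pop(&[1.0, 3.0]), Some(1.0));
    assert_eq!(stddev_pop(&[]), None);
}
```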
* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)
* Make shuffle compression codec and level configurable
* remove lz4 references
* docs
* update comment
* clippy
* fix benches
* clippy
* clippy
* disable test for miri
* remove lz4 reference from proto
* minor: move shuffle classes from common to spark (#1193)
* minor: refactor decodeBatches to make private in broadcast exchange (#1195)
* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)
* fix: fix missing explanation for then branch in case when (#1200)
* minor: remove unused source files (#1202)
* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* save
* save
* save
* remove unused imports
* clippy
* implement more hashers
* implement Hash and PartialEq
* implement Hash and PartialEq
* implement Hash and PartialEq
* benches
* fix ScalarUDFImpl.return_type failure
* exclude test from miri
* ignore correct test
* ignore another test
* remove miri checks
* use return_type_from_exprs
* Revert "use return_type_from_exprs"
This reverts commit febc1f1.
* use DF main branch
* hacky workaround for regression in ScalarUDFImpl.return_type
* fix repo url
* pin to revision
* bump to latest rev
* bump to latest DF rev
* bump DF to rev 9f530dd
* add Cargo.lock
* bump DF version
* no default features
* Revert "remove miri checks"
This reverts commit 4638fe3.
* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930
* update pin
* Update Cargo.toml
Bump to 44.0.0-rc2
* update cargo lock
* revert miri change
---------
Co-authored-by: Andrew Lamb <[email protected]>
* feat: add support for array_contains expression (#1163)
* feat: add support for array_contains expression
* test: add unit test for array_contains function
* Removes unnecessary case expression for handling null values
* update UT
Signed-off-by: Dharan Aditya <[email protected]>
* fix typo in UT
Signed-off-by: Dharan Aditya <[email protected]>
---------
Signed-off-by: Dharan Aditya <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
* feat: Add a `spark.comet.exec.memoryPool` configuration for experimenting with various datafusion memory pool setups. (#1021)
* feat: Reenable tests for filtered SMJ anti join (#1211)
* feat: reenable filtered SMJ Anti join tests
* feat: reenable filtered SMJ Anti join tests
* feat: reenable filtered SMJ Anti join tests
* feat: reenable filtered SMJ Anti join tests
* Add CoalesceBatchesExec around SMJ with join filter
* adding `CoalesceBatches`
* adding `CoalesceBatches`
* adding `CoalesceBatches`
* feat: reenable filtered SMJ Anti join tests
* feat: reenable filtered SMJ Anti join tests
---------
Co-authored-by: Andy Grove <[email protected]>
* chore: Add safety check to CometBuffer (#1050)
* chore: Add safety check to CometBuffer
* Add CometColumnarToRowExec
* fix
* fix
* more
* Update plan stability results
* fix
* fix
* fix
* Revert "fix"
This reverts commit 9bad173.
* Revert "Revert "fix""
This reverts commit d527ad1.
* fix BucketedReadWithoutHiveSupportSuite
* fix SparkPlanSuite
* remove unreachable code (#1213)
* test: Enable Comet by default except some tests in SparkSessionExtensionSuite (#1201)
## Which issue does this PR close?
Part of #1197
## Rationale for this change
Since `loadCometExtension` in the diffs was not using `isCometEnabled`, `SparkSessionExtensionSuite` was not using Comet. Once enabled, some test failures were discovered
## What changes are included in this PR?
`loadCometExtension` now uses `isCometEnabled`, which enables Comet by default
Temporarily ignore the failing tests in SparkSessionExtensionSuite
## How are these changes tested?
existing tests
* extract struct expressions to folders based on spark grouping (#1216)
* chore: extract static invoke expressions to folders based on spark grouping (#1217)
* extract static invoke expressions to folders based on spark grouping
* Update native/spark-expr/src/static_invoke/mod.rs
Co-authored-by: Andy Grove <[email protected]>
---------
Co-authored-by: Andy Grove <[email protected]>
* chore: Follow-on PR to fully enable onheap memory usage (#1210)
* Make datafusion's native memory pool configurable
* save
* fix
* Update memory calculation and add draft documentation
* ready for review
* ready for review
* address feedback
* Update docs/source/user-guide/tuning.md
Co-authored-by: Liang-Chi Hsieh <[email protected]>
* Update docs/source/user-guide/tuning.md
Co-authored-by: Kristin Cowalcijk <[email protected]>
* Update docs/source/user-guide/tuning.md
Co-authored-by: Liang-Chi Hsieh <[email protected]>
* Update docs/source/user-guide/tuning.md
Co-authored-by: Liang-Chi Hsieh <[email protected]>
* remove unused config
---------
Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
* feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support (#1192)
* Implement native decoding and decompression
* revert some variable renaming for smaller diff
* fix oom issues?
* make NativeBatchDecoderIterator more consistent with ArrowReaderIterator
* fix oom and prep for review
* format
* Add LZ4 support
* clippy, new benchmark
* rename metrics, clean up lz4 code
* update test
* Add support for snappy
* format
* change default back to lz4
* make metrics more accurate
* format
* clippy
* use faster unsafe version of lz4_flex
* Make compression codec configurable for columnar shuffle
* clippy
* fix bench
* fmt
* address feedback
* address feedback
* address feedback
* minor code simplification
* cargo fmt
* overflow check
* rename compression level config
* address feedback
* address feedback
* rename constant
* chore: extract agg_funcs expressions to folders based on spark grouping (#1224)
* extract agg_funcs expressions to folders based on spark grouping
* fix rebase
* extract datetime_funcs expressions to folders based on spark grouping (#1222)
Co-authored-by: Andy Grove <[email protected]>
* chore: use datafusion from crates.io (#1232)
* chore: extract strings file to `strings_func` like in spark grouping (#1215)
* chore: extract predicate_functions expressions to folders based on spark grouping (#1218)
* extract predicate_functions expressions to folders based on spark grouping
* code review changes
---------
Co-authored-by: Andy Grove <[email protected]>
* build(deps): bump protobuf version to 3.21.12 (#1234)
* extract json_funcs expressions to folders based on spark grouping (#1220)
Co-authored-by: Andy Grove <[email protected]>
* test: Enable shuffle by default in Spark tests (#1240)
## Which issue does this PR close?
## Rationale for this change
Because `isCometShuffleEnabled` is false by default, some tests were not reached
## What changes are included in this PR?
Removed `isCometShuffleEnabled` and updated spark test diff
## How are these changes tested?
existing test
* chore: extract hash_funcs expressions to folders based on spark grouping (#1221)
* extract hash_funcs expressions to folders based on spark grouping
* extract hash_funcs expressions to folders based on spark grouping
---------
Co-authored-by: Andy Grove <[email protected]>
* fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates (#1253)
* perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported (#1209)
* fix regression (#1259)
* feat: add support for array_remove expression (#1179)
* wip: array remove
* added comet expression test
* updated test cases
* fixed array_remove function for null values
* removed commented code
* remove unnecessary code
* updated the test for 'array_remove'
* added test for array_remove in case the input array is null
* wip: case array is empty
* removed test case for empty array
* fix: Fall back to Spark for distinct aggregates (#1262)
* fall back to Spark for distinct aggregates
* update expected plans for 3.4
* update expected plans for 3.5
* force build
* add comment
* feat: Implement custom RecordBatch serde for shuffle for improved performance (#1190)
* Implement faster encoder for shuffle blocks
* make code more concise
* enable fast encoding for columnar shuffle
* update benches
* test all int types
* test float
* remaining types
* add Snappy and Zstd(6) back to benchmark
* fix regression
* Update native/core/src/execution/shuffle/codec.rs
Co-authored-by: Liang-Chi Hsieh <[email protected]>
* address feedback
* support nullable flag
---------
Co-authored-by: Liang-Chi Hsieh <[email protected]>
* docs: Update TPC-H benchmark results (#1257)
* fix: disable initCap by default (#1276)
* fix: disable initCap by default
* Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
Co-authored-by: Andy Grove <[email protected]>
* address review comments
---------
Co-authored-by: Andy Grove <[email protected]>
* chore: Add changelog for 0.5.0 (#1278)
* Add changelog
* revert accidental change
* move 2 items to performance section
* update TPC-DS results for 0.5.0 (#1277)
* fix: cast timestamp to decimal is unsupported (#1281)
* fix: cast timestamp to decimal is unsupported
* fix style
* revert test name and mark as ignore
* add comment
* chore: Start 0.6.0 development (#1286)
* start 0.6.0 development
* update some docs
* Revert a change
* update CI
* docs: Fix links and provide complete benchmarking scripts (#1284)
* fix links and provide complete scripts
* fix path
* fix incorrect text
* feat: Add HasRowIdMapping interface (#1288)
---------
Signed-off-by: Dharan Aditya <[email protected]>
Co-authored-by: NoeB <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
Co-authored-by: Sem <[email protected]>
Co-authored-by: Himadri Pal <[email protected]>
Co-authored-by: himadripal <[email protected]>
Co-authored-by: gstvg <[email protected]>
Co-authored-by: Adam Binford <[email protected]>
Co-authored-by: Matt Butrovich <[email protected]>
Co-authored-by: Raz Luvaton <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Dharan Aditya <[email protected]>
Co-authored-by: Kristin Cowalcijk <[email protected]>
Co-authored-by: Oleks V <[email protected]>
Co-authored-by: Zhen Wang <[email protected]>
Co-authored-by: Jagdish Parihar <[email protected]>
1 parent facda24 · commit 85fe884
0 files changed (+0, −0)