Skip to content

Releases: apache/datafusion-comet

0.4.0

18 Nov 21:27
Compare
Choose a tag to compare
0.4.0 Pre-release
Pre-release

DataFusion Comet 0.4.0 Changelog

This release consists of 51 commits from 10 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: Use the number of rows from underlying arrays instead of logical row count from RecordBatch #972 (viirya)
  • fix: The spilled_bytes metric of CometSortExec should be size instead of time #984 (Kontinuation)
  • fix: Properly handle Java exceptions without error messages; fix loading of comet native library from java.library.path #982 (Kontinuation)
  • fix: Fallback to Spark if scan has meta columns #997 (viirya)
  • fix: Fallback to Spark if named_struct contains duplicate field names #1016 (viirya)
  • fix: Make comet-git-info.properties optional #1027 (andygrove)
  • fix: TopK operator should return correct results on dictionary column with nulls #1033 (viirya)
  • fix: need default value for getSizeAsMb(EXECUTOR_MEMORY.key) #1046 (neyama)

Performance related:

  • perf: Remove one redundant CopyExec for SMJ #962 (andygrove)
  • perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin #1007 (andygrove)
  • perf: Cache jstrings during metrics collection #1029 (mbutrovich)

Implemented enhancements:

  • feat: Support GetArrayStructFields expression #993 (Kimahriman)
  • feat: Implement bloom_filter_agg #987 (mbutrovich)
  • feat: Support more types with BloomFilterAgg #1039 (mbutrovich)
  • feat: Implement CAST from struct to string #1066 (andygrove)
  • feat: Use official DataFusion 43 release #1070 (andygrove)
  • feat: Implement CAST between struct types #1074 (andygrove)
  • feat: support array_append #1072 (NoeB)
  • feat: Require offHeap memory to be enabled (always use unified memory) #1062 (andygrove)

Documentation updates:

  • doc: add documentation interlinks #975 (comphead)
  • docs: Add IntelliJ documentation for generated source code #985 (mbutrovich)
  • docs: Update tuning guide #995 (andygrove)
  • docs: Various documentation improvements #1005 (andygrove)
  • docs: clarify that Maven central only has jars for Linux #1009 (andygrove)
  • doc: fix K8s links and doc #1058 (comphead)
  • docs: Update benchmarking.md #1085 (rluvaton-flarion)

Other:

  • chore: Generate changelog for 0.3.0 release #964 (andygrove)
  • chore: fix publish-to-maven script #966 (andygrove)
  • chore: Update benchmarks results based on 0.3.0-rc1 #969 (andygrove)
  • chore: update rem expression guide #976 (kazuyukitanimura)
  • chore: Enable additional CreateArray tests #928 (Kimahriman)
  • chore: fix compatibility guide #978 (kazuyukitanimura)
  • chore: Update for 0.3.0 release, prepare for 0.4.0 development #970 (andygrove)
  • chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled #991 (viirya)
  • chore: Make parquet reader options Comet options instead of Hadoop options #968 (parthchandra)
  • chore: remove legacy comet-spark-shell #1013 (andygrove)
  • chore: Reserve memory for native shuffle writer per partition #988 (viirya)
  • chore: Bump arrow-rs to 53.1.0 and datafusion #1001 (kazuyukitanimura)
  • chore: Revert "chore: Reserve memory for native shuffle writer per partition (#988)" #1020 (viirya)
  • minor: Remove hard-coded version number from Dockerfile #1025 (andygrove)
  • chore: Reserve memory for native shuffle writer per partition #1022 (viirya)
  • chore: Improve error handling when native lib fails to load #1000 (andygrove)
  • chore: Use twox-hash 2.0 xxhash64 oneshot api instead of custom implementation #1041 (NoeB)
  • chore: Refactor Arrow Array and Schema allocation in ColumnReader and MetadataColumnReader #1047 (viirya)
  • minor: Refactor binary expr serde to reduce code duplication #1053 (andygrove)
  • chore: Upgrade to DataFusion 43.0.0-rc1 #1057 (andygrove)
  • chore: Refactor UnaryExpr and MathExpr in protobuf #1056 (andygrove)
  • minor: use defaults instead of hard-coding values #1060 (andygrove)
  • minor: refactor UnaryExpr handling to make code more concise #1065 (andygrove)
  • chore: Refactor binary and math expression serde code #1069 (andygrove)
  • chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator #1063 (viirya)
  • test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config #1087 (viirya)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    19	Andy Grove
    13	Matt Butrovich
     8	Liang-Chi Hsieh
     3	KAZUYUKI TANIMURA
     2	Adam Binford
     2	Kristin Cowalcijk
     1	NoeB
     1	Oleks V
     1	Parth Chandra
     1	neyama

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.

0.3.0

28 Sep 14:55
Compare
Choose a tag to compare
0.3.0 Pre-release
Pre-release

DataFusion Comet 0.3.0 Changelog

This release consists of 57 commits from 12 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: Support type coercion for ScalarUDFs #865 (Kimahriman)
  • fix: CometTakeOrderedAndProjectExec native scan node should use child operator's output #896 (viirya)
  • fix: Fix various memory leaks problems #890 (Kontinuation)
  • fix: Add output to Comet operators equal and hashCode #902 (viirya)
  • fix: Fallback to Spark when cannot resolve AttributeReference #926 (viirya)
  • fix: Fix memory bloat caused by holding too many unclosed ArrowReaderIterators #929 (Kontinuation)
  • fix: Normalize NaN and zeros for floating number comparison #953 (viirya)
  • fix: window function range offset should be long instead of int #733 (huaxingao)
  • fix: CometScanExec on Spark 3.5.2 #915 (Kimahriman)
  • fix: div and rem by negative zero #960 (kazuyukitanimura)

Performance related:

  • perf: Optimize CometSparkToColumnar for columnar input #892 (mbutrovich)
  • perf: Fall back to Spark if query uses DPP with v1 data sources #897 (andygrove)
  • perf: Report accurate total time for scans #916 (andygrove)
  • perf: Add metric for time spent casting in native scan #919 (andygrove)
  • perf: Add criterion benchmark for aggregate expressions #948 (andygrove)
  • perf: Add metric for time spent in CometSparkToColumnarExec #931 (mbutrovich)
  • perf: Optimize decimal precision check in decimal aggregates (sum and avg) #952 (andygrove)

Implemented enhancements:

  • feat: Add config option to enable converting CSV to columnar #871 (andygrove)
  • feat: Implement basic version of string to float/double/decimal #870 (andygrove)
  • feat: Implement to_json for subset of types #805 (andygrove)
  • feat: Add ShuffleQueryStageExec to direct child node for CometBroadcastExchangeExec #880 (viirya)
  • feat: Support sort merge join with a join condition #553 (viirya)
  • feat: Array element extraction #899 (Kimahriman)
  • feat: date_add and date_sub functions #910 (mbutrovich)
  • feat: implement scripts for binary release build #932 (parthchandra)
  • feat: Publish artifacts to maven #946 (parthchandra)

Documentation updates:

  • doc: Documenting Helm chart for Comet Kube execution #874 (comphead)
  • doc: Update native code path in development #921 (viirya)
  • docs: Add more detailed architecture documentation #922 (andygrove)

Other:

  • chore: Update installation.md #869 (haoxins)
  • chore: Update versions to 0.3.0 / 0.3.0-SNAPSHOT #868 (andygrove)
  • chore: Add documentation on running benchmarks with Microk8s #848 (andygrove)
  • chore: Improve CometExchange metrics #873 (viirya)
  • chore: Add spilling metrics of SortMergeJoin #878 (viirya)
  • chore: change shuffle mode default from jvm to auto #877 (andygrove)
  • chore: Enable shuffle by default #881 (andygrove)
  • chore: print Comet native version to logs after Comet is initialized #900 (SemyonSinchenko)
  • chore: Revise batch pull approach to more follow C Data interface semantics #893 (viirya)
  • chore: Close dictionary provider when iterator is closed #904 (andygrove)
  • chore: Remove unused function #906 (viirya)
  • chore: Upgrade to latest DataFusion revision #909 (andygrove)
  • build: fix build #917 (andygrove)
  • chore: Revise array import to more follow C Data Interface semantics #905 (viirya)
  • chore: Address reviews #920 (viirya)
  • chore: Enable Comet shuffle for Spark core-1 test #924 (viirya)
  • build: Add maven-compiler-plugin for java cross-build #911 (viirya)
  • build: Disable upload-test-reports for macos-13 runner #933 (viirya)
  • minor: cast timestamp test #468 #923 (himadripal)
  • build: Set Java version arg for scala-maven-plugin #934 (viirya)
  • chore: Remove redundant RowToColumnar from CometShuffleExchangeExec for columnar shuffle #944 (viirya)
  • minor: rename CometMetricNode add to set and update documentation #940 (andygrove)
  • chore: Add config for enabling SMJ with join condition #937 (andygrove)
  • chore: Change maven group ID to org.apache.datafusion #941 (andygrove)
  • chore: Upgrade to DataFusion 42.0.0 #945 (andygrove)
  • build: Fix regression in jar packaging #950 (andygrove)
  • chore: Show reason for falling back to Spark when SMJ with join condition is not enabled #956 (andygrove)
  • chore: clarify tarball installation #959 (comphead)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    22	Andy Grove
    18	Liang-Chi Hsieh
     3	Adam Binford
     3	Matt Butrovich
     2	Kristin Cowalcijk
     2	Oleks V
     2	Parth Chandra
     1	Himadri Pal
     1	Huaxin Gao
     1	KAZUYUKI TANIMURA
     1	Semyon
     1	Xin Hao

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.

0.2.0

03 Sep 13:51
Compare
Choose a tag to compare
0.2.0 Pre-release
Pre-release

DataFusion Comet 0.2.0 Changelog

This release consists of 87 commits from 14 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: dictionary decimal vector optimization #705 (kazuyukitanimura)
  • fix: Unsupported window expression should fall back to Spark #710 (viirya)
  • fix: ReusedExchangeExec can be child operator of CometBroadcastExchangeExec #713 (viirya)
  • fix: Fallback to Spark for window expression with range frame #719 (viirya)
  • fix: Remove skip.surefire.tests mvn property #739 (wForget)
  • fix: subquery execution under CometTakeOrderedAndProjectExec should not fail #748 (viirya)
  • fix: skip negative scale checks for creating decimals #723 (kazuyukitanimura)
  • fix: Fallback to Spark for unsupported partitioning #759 (viirya)
  • fix: Unsupported types for SinglePartition should fallback to Spark #765 (viirya)
  • fix: unwrap dictionaries in CreateNamedStruct #754 (andygrove)
  • fix: Fallback to Spark for unsupported input besides ordering #768 (viirya)
  • fix: Native window operator should be CometUnaryExec #774 (viirya)
  • fix: Fallback to Spark when shuffling on struct with duplicate field name #776 (viirya)
  • fix: withInfo was overwriting information in some cases #780 (andygrove)
  • fix: Improve support for nested structs #800 (eejbyfeldt)
  • fix: Sort on single struct should fallback to Spark #811 (viirya)
  • fix: Check sort order of SortExec instead of child output #821 (viirya)
  • fix: Fix panic in avg aggregate and disable stddev by default #819 (andygrove)
  • fix: Supported nested types in HashJoin #735 (eejbyfeldt)

Performance related:

  • perf: Improve performance of CASE .. WHEN expressions #703 (andygrove)
  • perf: Optimize IfExpr by delegating to CaseExpr #681 (andygrove)
  • fix: optimize isNullAt #732 (kazuyukitanimura)
  • perf: decimal decode improvements #727 (parthchandra)
  • fix: Remove castting on decimals with a small precision to decimal256 #741 (kazuyukitanimura)
  • fix: optimize some bit functions #718 (kazuyukitanimura)
  • fix: Optimize getDecimal for small precision #758 (kazuyukitanimura)
  • perf: add metrics to CopyExec and ScanExec #778 (andygrove)
  • fix: Optimize decimal creation macros #764 (kazuyukitanimura)
  • perf: Improve count aggregate performance #784 (andygrove)
  • fix: Optimize read_side_padding #772 (kazuyukitanimura)
  • perf: Remove some redundant copying of batches #816 (andygrove)
  • perf: Remove redundant copying of batches after FilterExec #835 (andygrove)
  • fix: Optimize CheckOverflow #852 (kazuyukitanimura)
  • perf: Add benchmarks for Spark Scan + Comet Exec #863 (andygrove)

Implemented enhancements:

  • feat: Add support for time-zone, 3 & 5 digit years: Cast from string to timestamp. #704 (akhilss99)
  • feat: Support count AggregateUDF for window function #736 (huaxingao)
  • feat: Implement basic version of RLIKE #734 (andygrove)
  • feat: show executed native plan with metrics when in debug mode #746 (andygrove)
  • feat: Add GetStructField expression #731 (Kimahriman)
  • feat: Add config to enable native upper and lower string conversion #767 (andygrove)
  • feat: Improve native explain #795 (andygrove)
  • feat: Add support for null literal with struct type #797 (eejbyfeldt)
  • feat: Optimze CreateNamedStruct preserve dictionaries #789 (eejbyfeldt)
  • feat: CreateArray support #793 (Kimahriman)
  • feat: Add native thread configs #828 (viirya)
  • feat: Add specific configs for converting Spark Parquet and JSON data to Arrow #832 (andygrove)
  • feat: Support sum in window function #802 (huaxingao)
  • feat: Simplify configs for enabling/disabling operators #855 (andygrove)
  • feat: Enable clippy::clone_on_ref_ptr on proto and spark_exprs crates #859 (comphead)
  • feat: Enable clippy::clone_on_ref_ptr on core crate #860 (comphead)
  • feat: Use CometPlugin as main entrypoint #853 (andygrove)

Documentation updates:

  • doc: Update outdated spark.comet.columnar.shuffle.enabled configuration doc #738 (wForget)
  • docs: Add explicit configs for enabling operators #801 (andygrove)
  • doc: Document CometPlugin to start Comet in cluster mode #836 (comphead)

Other:

  • chore: Make rust clippy happy #701 (Xuanwo)
  • chore: Update version to 0.2.0 and add 0.1.0 changelog #696 (andygrove)
  • chore: Use rust-toolchain.toml for better toolchain support #699 (Xuanwo)
  • chore(native): Make sure all targets in workspace been covered by clippy #702 (Xuanwo)
  • Apache DataFusion Comet Logo #697 (aocsa)
  • chore: Add logo to rat exclude list #709 (andygrove)
  • chore: Use new logo in README and website #724 (andygrove)
  • build: Add Comet logo files into exclude list #726 (viirya)
  • chore: Remove TPC-DS benchmark results #728 (andygrove)
  • chore: make Cast's logic reusable for other projects #716 (Blizzara)
  • chore: move scalar_funcs into spark-expr #712 (Blizzara)
  • chore: Bump DataFusion to rev 35c2e7e #740 (andygrove)
  • chore: add more aggregate functions to benchmark test #706 (huaxingao)
  • chore: Add criterion benchmark for decimal_div #743 (andygrove)
  • build: Re-enable TPCDS q72 for broadcast and hash join configs #781 (viirya)
  • chore: bump DataFusion to rev f4e519f #783 (huaxingao)
  • chore: Upgrade to DataFusion rev bddb641 and disable "skip partial aggregates" feature [#788](https://github.com/apa...
Read more

0.1.0

03 Sep 13:50
Compare
Choose a tag to compare
0.1.0 Pre-release
Pre-release

DataFusion Comet 0.1.0 Changelog

This release consists of 343 commits from 41 contributors. See credits at the end of this changelog for more information.

Implemented enhancements:

  • feat: Add native shuffle and columnar shuffle #30 (viirya)
  • feat: Support Emit::First for SumDecimalGroupsAccumulator #47 (viirya)
  • feat: Nested map support for columnar shuffle #51 (viirya)
  • feat: Support Count(Distinct) and similar aggregation functions #42 (huaxingao)
  • feat: Upgrade to jni-rs 0.21 #50 (sunchao)
  • feat: Handle exception thrown from native side #61 (sunchao)
  • feat: Support InSet expression in Comet #59 (viirya)
  • feat: Add CometNativeException for exceptions thrown from the native side #62 (sunchao)
  • feat: Add cause to native exception #63 (viirya)
  • feat: Pull based native execution #69 (viirya)
  • feat: Add executeColumnarCollectIterator to CometExec to collect Comet operator result #71 (viirya)
  • feat: Add CometBroadcastExchangeExec to support broadcasting the result of Comet native operator #80 (viirya)
  • feat: Reduce memory consumption when writing sorted shuffle files #82 (sunchao)
  • feat: Add struct/map as unsupported map key/value for columnar shuffle #84 (viirya)
  • feat: Support multiple input sources for CometNativeExec #87 (viirya)
  • feat: Date and timestamp trunc with format array #94 (parthchandra)
  • feat: Support First/Last aggregate functions #97 (huaxingao)
  • feat: Add support of TakeOrderedAndProjectExec in Comet #88 (viirya)
  • feat: Support Binary in shuffle writer #106 (advancedxy)
  • feat: Add license header by spotless:apply automatically #110 (advancedxy)
  • feat: Add dictionary binary to shuffle writer #111 (viirya)
  • feat: Minimize number of connections used by parallel reader #126 (parthchandra)
  • feat: Support CollectLimit operator #100 (advancedxy)
  • feat: Enable min/max for boolean type #165 (huaxingao)
  • feat: Introduce CometTaskMemoryManager and native side memory pool #83 (sunchao)
  • feat: Fix old style names #201 (comphead)
  • feat: enable comet shuffle manager for comet shell #204 (zuston)
  • feat: Support bitwise aggregate functions #197 (huaxingao)
  • feat: Support BloomFilterMightContain expr #179 (advancedxy)
  • feat: Support sort merge join #178 (viirya)
  • feat: Support HashJoin operator #194 (viirya)
  • feat: Remove use of nightly int_roundings feature #228 (psvri)
  • feat: Support Broadcast HashJoin #211 (viirya)
  • feat: Enable Comet broadcast by default #213 (viirya)
  • feat: Add CometRowToColumnar operator #206 (advancedxy)
  • feat: Document the class path / classloader issue with the shuffle manager #256 (holdenk)
  • feat: Port Datafusion Covariance to Comet #234 (huaxingao)
  • feat: Add manual test to calculate spark builtin functions coverage #263 (comphead)
  • feat: Support ANSI mode in CAST from String to Bool #290 (andygrove)
  • feat: Add extended explain info to Comet plan #255 (parthchandra)
  • feat: Improve CometSortMergeJoin statistics #304 (planga82)
  • feat: Add compatibility guide #316 (andygrove)
  • feat: Improve CometHashJoin statistics #309 (planga82)
  • feat: Support Variance #297 (huaxingao)
  • feat: Support murmur3_hash and sha2 family hash functions #226 (advancedxy)
  • feat: Disable cast string to timestamp by default #337 (andygrove)
  • feat: Improve CometBroadcastHashJoin statistics #339 (planga82)
  • feat: Implement Spark-compatible CAST from string to integral types #307 (andygrove)
  • feat: Implement Spark-compatible CAST from string to timestamp types #335 (vaibhawvipul)
  • feat: Implement Spark-compatible CAST float/double to string #346 (mattharder91)
  • feat: Only allow incompatible cast expressions to run in comet if a config is enabled #362 (andygrove)
  • feat: Implement Spark-compatible CAST between integer types #340 (ganeshkumar269)
  • feat: Supports Stddev #348 (huaxingao)
  • feat: Improve cast compatibility tests and docs #379 (andygrove)
  • feat: Implement Spark-compatible CAST from non-integral numeric types to integral types #399 (rohitrastogi)
  • feat: Implement Spark unhex #342 (tshauck)
  • feat: Enable columnar shuffle by default #250 (viirya)
  • feat: Implement Spark-compatible CAST from floating-point/double to decimal #384 (vaibhawvipul)
  • feat: Add logging to explain reasons for Comet not being able to run a query stage natively #397 (andygrove)
  • feat: Add support for TryCast expression in Spark 3.2 and 3.3 #416 (vaibhawvipul)
  • feat: Supports UUID column #395 (huaxingao)
  • feat: correlation support #456 (huaxingao)
  • feat: Implement Spark-compatible CAST from String to Date #383 (vidyasankarv)
  • feat: Add COMET_SHUFFLE_MODE config to control Comet shuffle mode #460 (viirya)
  • feat: Add random row generator in data generator #451 (advancedxy)
  • feat: Add xxhash64 function support #424 (advancedxy)
  • feat: add hex scalar function #449 (tshauck)
  • feat: Add "Comet Fuzz" fuzz-testing utility #472 (andygrove)
  • feat: Use enum to represent CAST eval_mode in expr.proto #415 (prashantksharma)
  • feat: Implement ANSI support for UnaryMinus #471 (vaibhawvipul)
  • feat: Add specific fuzz tests for cast and try_cast and fix NPE found during fuzz testing #514 (andygrove)
  • feat: Add fuzz testing for arithmetic expressions #519 (andygr...
Read more