Commit
[SPARK-49193][SQL] Improve the performance of RowSetUtils.toColumnBasedSet

### What changes were proposed in this pull request?

Replace `while` loop with `foreach` in `RowSetUtils.toTColumn`.

### Why are the changes needed?

Improve the performance of `RowSetUtils.toColumnBasedSet`:

<img width="1196" alt="image" src="https://github.com/user-attachments/assets/f481de39-e0bf-41c5-8fee-09dc1a93c4e1">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

```scala
import org.apache.hive.service.rpc.thrift.TProtocolVersion
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select id, cast(id as string), cast(id as timestamp) from range(200000)")
val dataTypes = df.schema.fields.map(_.dataType)
val rows = df.collect().toList

val start1 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V11, HiveResult.getTimeFormatters)
val start2 = System.currentTimeMillis()
RowSetUtils.toTRowSet(1, rows, dataTypes, TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V5, HiveResult.getTimeFormatters)
val start3 = System.currentTimeMillis()

println(s"toColumnBasedSet time: ${start2 - start1}, toRowBasedSet time: ${start3 - start2}")
```

Before this PR:
```
toColumnBasedSet time: 17307, toRowBasedSet time: 71
```

After this PR:
```
toColumnBasedSet time: 128, toRowBasedSet time: 70
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47699 from wangyum/toColumnBasedSet.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 567d58c)
Signed-off-by: Kent Yao <yao@apache.org>
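
The commit message does not say why swapping `while` for `foreach` helps so much. One plausible reading, offered as an assumption rather than taken from the patch: the manual test passes `rows` as a `List` (`df.collect().toList`), and indexing into an immutable `List` walks the list from its head, so an index-driven `while` loop becomes quadratic while `foreach` stays a single pass. The sketch below illustrates that effect with hypothetical helpers (`sumByIndex`, `sumByForeach`, `timeMs`); it is not the Spark source and does not reproduce `RowSetUtils.toTColumn`.

```scala
// Illustrative only: hypothetical helpers, not code from Spark.
object ListTraversalSketch {

  // Index-driven loop. On an immutable List, xs(i) walks i nodes from the head,
  // so the whole loop costs O(n^2) element visits.
  def sumByIndex(xs: Seq[Long]): Long = {
    var total = 0L
    var i = 0
    val n = xs.length
    while (i < n) {
      total += xs(i)
      i += 1
    }
    total
  }

  // Single traversal. foreach visits each element once for any Seq implementation.
  def sumByForeach(xs: Seq[Long]): Long = {
    var total = 0L
    xs.foreach(x => total += x)
    total
  }

  // Runs body once and returns its result with the elapsed wall-clock milliseconds.
  def timeMs[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // A linked List, mirroring the `.toList` call in the manual test above.
    val rows: Seq[Long] = (0L until 50000L).toList
    val (a, indexedMs) = timeMs(sumByIndex(rows))
    val (b, foreachMs) = timeMs(sumByForeach(rows))
    assert(a == b) // both traversals compute the same result
    println(s"indexed while: $indexedMs ms, foreach: $foreachMs ms")
  }
}
```

Timings depend on the machine and JVM warm-up, so no numbers are claimed here. If this reading is right, the same index-driven loop over an `IndexedSeq` such as `Vector` or an `Array` would not show the quadratic slowdown.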