[SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained #37702

HyukjinKwon · 2022-08-29T06:21:25Z

What changes were proposed in this pull request?

This PR takes over #37444 and extends it to cover all examples in pyspark.sql.dataframe.

This PR proposes to improve the examples in pyspark.sql.dataframe by making each one self-contained and more realistic.
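As a rough illustration of what "self-contained" means here, the sketch below uses only Python's standard `doctest` module, so it runs without a Spark installation. The function `count_adults` and its sample rows are hypothetical and are not taken from this PR. The point is that the docstring example builds its own input instead of relying on a variable defined elsewhere.

```python
import doctest


def count_adults(rows):
    """Count rows whose age is at least 18 (hypothetical helper).

    Examples
    --------
    The example creates its own input, so it can be pasted into a fresh shell:

    >>> rows = [(14, "Tom"), (23, "Alice"), (16, "Bob")]
    >>> count_adults(rows)
    1
    """
    return sum(1 for age, _name in rows if age >= 18)


# Run the docstring example in a namespace that contains only the function
# itself. This mimics pasting the example into a new shell session: if the
# example depended on a variable defined elsewhere, it would fail here.
parser = doctest.DocTestParser()
example = parser.get_doctest(
    count_adults.__doc__, {"count_adults": count_adults}, "count_adults", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(example)
```

If the `rows = ...` line were dropped from the docstring, the run would report a failure, which is roughly the problem that self-contained examples avoid.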

Closes #37444

Why are the changes needed?

To make the documentation more readable and to let users copy and paste the examples directly into the PySpark shell.

Does this PR introduce any user-facing change?

Yes, documentation changes only.

How was this patch tested?

Manually ran each example.

HyukjinKwon · 2022-08-29T06:53:35Z

cc @zhengruifeng @viirya @xinrong-meng @Yikun @itholic @dcoliversun @khalidmammadov in case you find some time to take a look.

dcoliversun

Looks pretty good to me. Heavy workload. I made some small suggestions, please correct me if I'm wrong.

python/pyspark/sql/dataframe.py

Transurgeon · 2022-08-29T13:54:11Z

@HyukjinKwon, thanks for taking this over. I hope I helped you guys a bit at least.

HyukjinKwon · 2022-08-30T01:05:49Z

Thanks guys.

Merged to master.

HyukjinKwon · 2022-08-30T01:06:05Z

Sure, that was a big help @Transurgeon

### What changes were proposed in this pull request?
This PR upgrades Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.

In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark ([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays ([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable length vectors ([GH-37829](apache/arrow#37829)) and includes C++ libraries for MacOS AARCH 64 in Java-Jars ([GH-38076](apache/arrow#38076)).

The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types ([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder ([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class ([GH-37246](apache/arrow#37246)).

The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21 ([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages ([GH-37702](apache/arrow#37702)).

Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods ([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String` ([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data ([GH-37863](apache/arrow#37863)).

The full release notes are as follows:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Initial PR

1eaedf0

HyukjinKwon force-pushed the SPARK-40012 branch from 5dd95ca to 7401b75 Compare August 29, 2022 06:21

github-actions bot added CORE PYTHON SQL labels Aug 29, 2022

HyukjinKwon force-pushed the SPARK-40012 branch 2 times, most recently from 232867a to 84801b2 Compare August 29, 2022 06:53

dcoliversun reviewed Aug 29, 2022

Make pyspark.sql.dataframe examples self-contained

fadafbb

HyukjinKwon force-pushed the SPARK-40012 branch from 84801b2 to fadafbb Compare August 29, 2022 08:05

zhengruifeng approved these changes Aug 29, 2022

HyukjinKwon force-pushed the SPARK-40012 branch 2 times, most recently from 682e286 to 2385b21 Compare August 29, 2022 11:36

Address comments

7daae2c

HyukjinKwon force-pushed the SPARK-40012 branch from 2385b21 to 7daae2c Compare August 29, 2022 11:56

srowen approved these changes Aug 29, 2022

HyukjinKwon closed this in d5c1375 Aug 30, 2022

HyukjinKwon deleted the SPARK-40012 branch January 15, 2024 00:50
