[SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained #37702

Closed
HyukjinKwon
Member

What changes were proposed in this pull request?

This PR takes over #37444 and extends it to cover all examples in pyspark.sql.dataframe.

This PR proposes to improve the examples in pyspark.sql.dataframe by making each one self-contained and more realistic.

Closes #37444

Why are the changes needed?

To make the documentation more readable, and to make the examples possible to copy and paste directly into the PySpark shell.
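As an illustration of what "self-contained" means here, the following is a minimal stdlib-only sketch and is not taken from this PR. It uses plain Python lists as a stand-in for DataFrames, and the names `NOT_SELF_CONTAINED`, `SELF_CONTAINED`, and `count_failures` are hypothetical. Each example is run in an empty namespace, which roughly mimics pasting it into a fresh shell.

```python
import doctest

# Hypothetical "before" style: the example relies on a name (`data`) that
# was defined somewhere else, so it breaks when pasted into a fresh shell.
NOT_SELF_CONTAINED = """
>>> sorted(data)
[1, 2, 3]
"""

# Hypothetical "after" style: the example builds its own input first.
SELF_CONTAINED = """
>>> data = [3, 1, 2]
>>> sorted(data)
[1, 2, 3]
"""


def count_failures(example_text: str) -> int:
    """Run a doctest-style example in an empty namespace and count failures."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        example_text, globs={}, name="example", filename=None, lineno=0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the textual failure report; only the counts are needed here.
    result = runner.run(test, out=lambda text: None)
    return result.failed
```

Under these assumptions, `count_failures(SELF_CONTAINED)` reports no failures, while `count_failures(NOT_SELF_CONTAINED)` reports one, because `data` is undefined in the empty namespace.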

Does this PR introduce any user-facing change?

Yes, documentation changes only.

How was this patch tested?

Manually ran each example.

@HyukjinKwon
Member Author

cc @zhengruifeng @viirya @xinrong-meng @Yikun @itholic @dcoliversun @khalidmammadov in case you find some time to take a look.

Contributor

@dcoliversun left a comment

Looks pretty good to me. This was a heavy workload. I made some small suggestions; please correct me if I'm wrong.

@HyukjinKwon force-pushed the SPARK-40012 branch 2 times, most recently from 682e286 to 2385b21, on August 29, 2022 11:36
@Transurgeon
Contributor

@HyukjinKwon, thanks for taking this over. I hope I helped you guys a bit at least.

@HyukjinKwon
Member Author

Thanks guys.

Merged to master.

@HyukjinKwon
Member Author

Sure, that was a big help @Transurgeon

dongjoon-hyun pushed a commit that referenced this pull request Nov 4, 2023
### What changes were proposed in this pull request?
This PR upgrades Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.

In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark ([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays ([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable-length vectors ([GH-37829](apache/arrow#37829)) and includes C++ libraries for macOS AArch64 in the Java JARs ([GH-38076](apache/arrow#38076)).

The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types ([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder ([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class ([GH-37246](apache/arrow#37246)).

The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21 ([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages ([GH-37702](apache/arrow#37702)).

Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods ([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String` ([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data ([GH-37863](apache/arrow#37863)).

The full release notes are as follows:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@HyukjinKwon deleted the SPARK-40012 branch on January 15, 2024 00:50