
[SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1) #37444

Closed
wants to merge 8 commits

Conversation

Transurgeon
Contributor

What changes were proposed in this pull request?

This PR proposes to improve the examples in pyspark.sql.dataframe by making each one self-contained and more realistic.

Why are the changes needed?

To make the documentation more readable and easy to copy and paste directly into the PySpark shell.

Does this PR introduce any user-facing change?

Yes, documentation changes only.

How was this patch tested?

Built the documentation locally.
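
To illustrate what "self-contained" means here: each doctest now builds the DataFrame it operates on instead of relying on one defined elsewhere on the page, so the snippet can be pasted into a fresh shell as-is. A minimal sketch of the pattern, using data that appears later in this thread:

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.show()
+---+-----+
|age| name|
+---+-----+
| 14|  Tom|
| 23|Alice|
| 16|  Bob|
+---+-----+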

@Transurgeon
Contributor Author

I set the [WIP] tag because dataframe.py needs a lot of updates. I will add some additional changes to this PR in the coming days.

Please review and provide some feedback.
Thanks

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

@Transurgeon is this still WIP? If there are too many to fix, feel free to split into multiple PRs.

@Transurgeon
Contributor Author

@HyukjinKwon.

No, it's not WIP anymore. I wanted to get some feedback to see whether I was making good changes before continuing to work on it.

Should I remove the WIP tag?

>>> df.drop(df.age).collect()
[Row(name='Alice'), Row(name='Bob')]
[Row(name='Tom'), Row(name='Alice'), Row(name='Bob')]

>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
Contributor Author

I am not sure what these three inner joins do exactly; I don't see df2 instantiated anywhere.

What should I do with these three examples?

Member

I think it's showing a common pattern: joining two DataFrames on a key and then dropping the duplicate join key.
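
For what it's worth, a self-contained version of that pattern might look like the sketch below (df2's contents are assumed for illustration, since the original snippet never defines it, and row order from collect() may vary):

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df2 = spark.createDataFrame([(80, "Tom"), (85, "Bob")], ["height", "name"])
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=14, height=80, name='Tom'), Row(age=16, height=85, name='Bob')]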

@Transurgeon Transurgeon changed the title [SPARK-40012][PYTHON][DOCS][WIP] Make pyspark.sql.dataframe examples self-contained [SPARK-40012][PYTHON][DOCS] Make pyspark.sql.dataframe examples self-contained (Part 1) Aug 23, 2022
@Transurgeon
Contributor Author

@HyukjinKwon I made some additional changes. I think we can start by merging this PR, and then I will make another one for the rest of the changes.

I have a list of all the functions I changed in this PR; should I add it to the JIRA ticket to avoid duplicate work?

@HyukjinKwon
Member

I think you can reuse the same JIRA and make a follow-up.

>>> df
DataFrame[age: int, name: string]
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
... (16, "Bob")], ["age", "name"])
Member

ditto for indentation
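
Presumably this means the continuation line should be indented so the arguments line up, e.g. (a sketch of one common doctest style):

>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
...     (16, "Bob")], ["age", "name"])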

Fill all null values with 50 when the data type of the column is an integer

>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
... (None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
Member

indentation

@HyukjinKwon
Member

We should make the tests pass before merging this in (https://github.com/Transurgeon/spark/runs/7981691501).

cc @dcoliversun @khalidmammadov FYI, if you guys find some time to review and to work on the rest of the API.

>>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
Replace all instances of 'Alice' with 'A' and 'Bob' with 'B' in the name column

>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
Contributor

Why do we need to create a duplicate DataFrame here? If we don't need it, it's better to delete it.

>>> df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"),
...     (None, None, "Tom"), (None, None, None)], ["age", "height", "name"])
>>> df.show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  10|    80|Alice|
|   5|  null|  Bob|
|null|  null|  Tom|
|null|  null| null|
+----+------+-----+

>>> df.na.replace('Alice', None).show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
|   5|  null| Bob|
|null|  null| Tom|
|null|  null|null|
+----+------+----+

>>> df.show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  10|    80|Alice|
|   5|  null|  Bob|
|null|  null|  Tom|
|null|  null| null|
+----+------+-----+

>>> df.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|   A|
|   5|  null|   B|
|null|  null| Tom|
|null|  null|null|
+----+------+----+

@dcoliversun
Contributor

dcoliversun commented Aug 24, 2022

@Transurgeon Hi. It looks like CI is disabled in your fork repo. Maybe the docs can help you :)

@HyukjinKwon
Member

I'm going to take this over if the PR author stays inactive for a few more days; this is the last task left for the umbrella task.

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Qian.Sun <qian.sun2020@gmail.com>
@Transurgeon
Contributor Author

Hi Hyukjin and Oliver, thank you both for your feedback.

I have created a commit with all your suggestions and enabled all jobs to run in GitHub Actions for my fork.

I will make one last commit for further minor changes.

@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
new column names
new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Member

@Transurgeon mind running the ./dev/lint-python script and fixing the line length, etc.?

Contributor Author

Yes, will do. Sorry about that.

>>> df.schema
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
StructType([StructField('age', IntegerType(), True),
Member

Let's remove the trailing space at the end.

+---+-----+
only showing top 2 rows

Show DataFrame where the maximum number of characters is 3.
Member

:class:`DataFrame`
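
That is, the suggestion is presumably to use the Sphinx cross-reference, so the docstring line would read:

Show :class:`DataFrame` where the maximum number of characters is 3.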

Comment on lines 4216 to 4225
>>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"),
... (16, "Bob")], ["age", "name"])
>>> df.show()
+---+-----+
|age| name|
+---+-----+
| 14| Tom|
| 23|Alice|
| 16| Bob|
+---+-----+
Contributor

I think we don't need to create a new DataFrame here, since drop() doesn't remove the column in-place.

e.g.

>>> df.drop('age').show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
|  Bob|
+-----+

>>> df.drop(df.age).show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
|  Bob|
+-----+

@@ -4100,7 +4256,7 @@ def toDF(self, *cols: "ColumnOrName") -> "DataFrame":
Parameters
----------
cols : str
new column names
new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`
Contributor

Seems like it exceeds 100 characters, which violates the flake8 rule.

starting flake8 test...
flake8 checks failed:
./python/pyspark/sql/dataframe.py:4250:101: E501 line too long (128 > 100 characters)
        """
        Returns a best-effort snapshot of the files that compose this :class:`DataFrame`.
        This method simply asks each constituent BaseRelation for its respective files and
        takes the union of all results. Depending on the source relations, this may not find
        all input files. Duplicates are removed.

        new column names. The length of the list needs to be the same as the number of columns in the initial :class:`DataFrame`

        .. versionadded:: 3.1.0

        Returns
        -------
        list
            List of file paths.

        Examples
        --------
        >>> df = spark.read.load("examples/src/main/resources/people.json", format="json")
        >>> len(df.inputFiles())
        1
        """

                                                                                        ^
1     E501 line too long (128 > 100 characters)
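
One way to satisfy the 100-character limit would be to wrap the description onto an indented continuation line in the usual numpydoc style (a sketch, not necessarily the exact wording merged):

cols : str
    new column names. The length of the list needs to be the same as
    the number of columns in the initial :class:`DataFrame`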

Contributor

You can run dev/lint-python to check whether the static analysis passes.
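
For reference, from the Spark repository root that is just:

$ ./dev/lint-python

The script bundles the Python static checks (flake8 among them), so the E501 error above should reproduce locally.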

Contributor Author

done

StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
StructType([StructField('age', IntegerType(), True),
StructField('name', StringType(), True)])
Member

Please avoid using tabs.

@HyukjinKwon
Member

Hey, let's co-author this change. I will create another PR on top of this PR to speed things up.
