[SPARK-32714][PYTHON] Initial pyspark-stubs port. #29591

zero323 · 2020-08-31T06:17:21Z

What changes were proposed in this pull request?

This PR proposes migration of pyspark-stubs into Spark codebase.

Why are the changes needed?

Does this PR introduce any user-facing change?

Yes. This PR adds type annotations directly to Spark source.

This can affect how development tools behave for users who haven't used pyspark-stubs.
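As a rough illustration of what annotations make visible to tooling, the sketch below uses hypothetical stand-ins for `Row` and `DataFrame` (they are not the real PySpark classes) and reads the hints back with `typing.get_type_hints`. In an actual stub file the same signature would carry only `...` as its body.

```python
from typing import List, get_type_hints


class Row(tuple):
    """Hypothetical stand-in for pyspark.sql.Row, so the sketch runs without Spark."""


class DataFrame:
    """Hypothetical stand-in; only the shape of the signature matters here."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    # In a .pyi stub this would read: def take(self, num: int) -> List[Row]: ...
    def take(self, num: int) -> List[Row]:
        return self._rows[:num]


# Tools such as type checkers and IDEs consume this information statically;
# get_type_hints exposes the same data at runtime.
hints = get_type_hints(DataFrame.take)
```

With annotations like these in place, a type checker can report a call such as `df.take("3")` before the code runs, which is the kind of tool behaviour change the description refers to.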

How was this patch tested?

MyPy tests of the PySpark source

mypy --no-incremental --config python/mypy.ini python/pyspark

MyPy tests of Spark examples

MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming

Existing Flake8 linter
Existing unit tests

Tested against:

mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894
mypy==0.782
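The two MyPy invocations above could be wrapped in a small script along these lines. This is only a sketch: the helper names `mypy_invocations` and `run_all` are hypothetical, the arguments are copied verbatim from the commands quoted in this description, and it assumes `mypy` is on the `PATH` and the script is run from the Spark repository root.

```python
import os
import subprocess
from typing import Dict, List, Tuple

# Flags copied as-is from the commands quoted in the PR description.
MYPY_BASE = ["mypy", "--no-incremental", "--config", "python/mypy.ini"]

EXAMPLE_DIRS = [
    "examples/src/main/python/ml",
    "examples/src/main/python/sql",
    "examples/src/main/python/sql/streaming",
]


def mypy_invocations() -> List[Tuple[List[str], Dict[str, str]]]:
    """Return (argv, extra_env) pairs mirroring the two quoted commands."""
    return [
        (MYPY_BASE + ["python/pyspark"], {}),
        (MYPY_BASE + EXAMPLE_DIRS, {"MYPYPATH": "python/"}),
    ]


def run_all() -> int:
    """Run every invocation and return the highest exit code seen."""
    worst = 0
    for argv, extra_env in mypy_invocations():
        # Merge the extra variables into a copy of the current environment.
        result = subprocess.run(argv, env={**os.environ, **extra_env})
        worst = max(worst, result.returncode)
    return worst
```

Calling `run_all()` from the repository root would execute both checks in sequence; a non-zero return value indicates that at least one of them reported errors.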

SparkQA · 2020-08-31T06:24:42Z

Test build #128074 has finished for PR 29591 at commit 60b126e.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-08-31T06:43:12Z

Let's use SPARK-32714 for this initial port, and use SPARK-32681 as an umbrella ticket for related follow-up tickets.

python/pyspark/cloudpickle.pyi

SparkQA · 2020-08-31T13:35:15Z

Test build #128098 has finished for PR 29591 at commit e59b562.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-31T16:43:11Z

Test build #128106 has finished for PR 29591 at commit 4b9000e.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class MultilayerPerceptronClassificationSummary(_ClassificationSummary): ...
class MultilayerPerceptronClassificationTrainingSummary(
class PythonException(CapturedException): ...
class InheritableThread(threading.Thread):

SparkQA · 2020-08-31T17:06:36Z

Test build #128109 has finished for PR 29591 at commit a936e03.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-31T17:20:38Z

Test build #128111 has finished for PR 29591 at commit 0bc7183.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-01T06:02:35Z

Test build #128137 has finished for PR 29591 at commit df12013.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

zero323 · 2020-09-03T06:14:33Z

Update

At the moment I'm working on re-syncing pyspark-stubs to reflect changes introduced by SPARK-32719 and SPARK-32319.

SparkQA · 2020-09-06T21:05:00Z

Test build #128324 has finished for PR 29591 at commit 93fb711.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-06T21:28:15Z

Test build #128326 has finished for PR 29591 at commit b8d4876.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

zero323 · 2020-09-06T22:26:15Z

Update:

At the moment:

both MyPy and Flake8 tests pass with
flake8==3.8.3
flake8-pyi==20.5.0

and F401 (unused import) excludes on a few pyi files.

With flake8==3.7.* all tests pass with the excludes added to the tox file. In addition to the F401 violations, flake8 doesn't seem to understand specific type ignores.
With flake8==3.5.0 there are no file-specific ignores, so we get multiple failures. This could be addressed, for the time being, by excluding problematic files from either the mypy or the flake8 tests and adjusting inline ignores. However, that is a rather brute-force solution and would require some discussion about the priorities.

From the perspective of this PR, an ideal solution would be an update of test dependencies, but I am not sure if that's realistic at the moment (hate to ask, but do you have any thoughts about it @shaneknapp?).
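For readers unfamiliar with F401, the sketch below shows roughly what an unused-import check looks for. It is a deliberately simplified detector built on the standard `ast` module, not the implementation used by flake8 or pyflakes, and the stub text in `STUB_SOURCE` is a made-up example rather than a file from this PR.

```python
import ast
from typing import Dict, List

# Hypothetical stub content: `Any` is imported but never referenced.
STUB_SOURCE = '''
from typing import Any, List
from pyspark.sql.types import Row

def collect() -> List[Row]: ...
'''


def unused_imports(source: str) -> List[str]:
    """Return imported names that are never referenced as a plain name."""
    tree = ast.parse(source)

    # Record each name bound by an import statement and the line it is on.
    imported: Dict[str, int] = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound = (alias.asname or alias.name).split(".")[0]
                imported[bound] = node.lineno

    # Collect every identifier that appears anywhere else in the module.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

    return sorted(name for name in imported if name not in used)


print(unused_imports(STUB_SOURCE))
```

Stub files often keep imports only to re-export names, which is one plausible reason such files end up needing F401 exclusions; whether that is the cause for the specific files excluded here is not stated in the thread.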

SparkQA · 2020-09-06T23:13:58Z

Test build #128325 has finished for PR 29591 at commit 27409f2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-07T00:09:33Z

Test build #128327 has finished for PR 29591 at commit 1aede7c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-07T00:53:06Z

Test build #128328 has finished for PR 29591 at commit 601a577.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class _empty_cell_value(object):
skeleton_class = types.new_class(
enum_class = metacls.__new__(metacls, name, bases, classdict)
class CloudPickler(Pickler):
is_anyclass = issubclass(t, type)
except TypeError: # t is not a class (old Boost; see SF #502085)

examples/src/main/python/ml/estimator_transformer_param_example.py

dev/tox.ini

HyukjinKwon · 2020-09-07T00:58:32Z

@zero323, I usually prefer not to block something on an environment issue in Jenkins, so that such an issue can be handled with enough time - @shaneknapp is sort of busy at this moment IIRC. We could work around it for now, and file a separate JIRA for him about the dependency upgrade.

zero323 · 2020-09-07T06:12:17Z

@zero323, I usually prefer not to block something on an environment issue in Jenkins, so that such an issue can be handled with enough time - @shaneknapp is sort of busy at this moment IIRC. We could work around it for now, and file a separate JIRA for him about the dependency upgrade.

Agreed. I thought it was worth raising the question, as it seems like we'll need some changes to the environment anyway.

SparkQA · 2020-09-07T07:05:02Z

Test build #128338 has finished for PR 29591 at commit 8f9ef95.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-23T07:05:02Z

Test build #129005 has finished for PR 29591 at commit b9ac4f8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-09-23T07:21:30Z

retest this please

SparkQA · 2020-09-23T10:05:19Z

Test build #129017 has finished for PR 29591 at commit b9ac4f8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-09-23T11:29:27Z

I'm going to merge if there's no more comment tomorrow.

SparkQA · 2020-09-23T16:33:34Z

Test build #129033 has finished for PR 29591 at commit fab00f1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2020-09-23T18:15:19Z

LGTM pending passing both GHA and Jenkins.

SparkQA · 2020-09-23T23:58:02Z

Test build #129047 has finished for PR 29591 at commit fab00f1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-09-24T05:15:30Z

Merged to master.

HyukjinKwon · 2020-09-24T05:16:10Z

@zero323 mind working on the below ones?

writing the guidelines in the doc
removing non-API type hints

I think these two are pretty important followups to be done soon ..

zero323 · 2020-09-24T06:03:29Z

@zero323 mind working on the below ones?

On it.

zero323 · 2020-09-24T06:07:21Z

Thanks everyone!

HyukjinKwon · 2020-09-24T06:14:47Z

Thank you @zero323 for leading type hint support in PySpark.

### What changes were proposed in this pull request?

This PR:

- removes annotations for modules which are not part of the public API.
- removes `__init__.pyi` files, if no annotations, beyond exports, are present.

### Why are the changes needed?

Primarily to reduce maintenance overhead and as requested in the comments to #29591

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests and additional MyPy checks:

```
mypy --no-incremental --config python/mypy.ini python/pyspark
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```

Closes #29879 from zero323/SPARK-33002.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

rehevkor5 · 2023-02-17T19:07:01Z

python/pyspark/sql/dataframe.pyi

+    def take(self, num: int) -> List[Row]: ...
+    def tail(self, num: int) -> List[Row]: ...
+    def foreach(self, f: Callable[[Row], None]) -> None: ...
+    def foreachPartition(self, f: Callable[[Iterator[Row]], None]) -> None: ...


Shouldn't this be Iterable[Row] instead of Iterator[Row], to match https://github.com/apache/spark/pull/29591/files#diff-6349afe05d41878cc15995c96a14b011d6aef04b779e136f711eab989b71da6cR215 ?
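To make the distinction in this question concrete, here is a small sketch of how `Iterator` and `Iterable` differ for a callback parameter. `Row` is a hypothetical stand-in, and `run_partition` is an invented driver that hands the callback an iterator; it is not PySpark's `foreachPartition`, and the sketch does not settle which annotation is right for the real method.

```python
from collections.abc import Iterator as AbcIterator
from typing import Callable, Iterable, Iterator, List, NamedTuple


class Row(NamedTuple):
    """Hypothetical stand-in for pyspark.sql.Row, so the sketch runs without Spark."""

    id: int


def count_with_next(rows: Iterator[Row]) -> int:
    # Calls next() directly, so it genuinely needs an iterator.
    first = next(rows, None)
    return 0 if first is None else 1 + sum(1 for _ in rows)


def count_with_for(rows: Iterable[Row]) -> int:
    # Only loops over its argument, so any iterable is enough.
    return sum(1 for _ in rows)


def run_partition(f: Callable[[Iterator[Row]], int], partition: List[Row]) -> int:
    # Hypothetical driver: the callback always receives an iterator.
    return f(iter(partition))


rows = [Row(1), Row(2)]
results = (run_partition(count_with_next, rows), run_partition(count_with_for, rows))
list_is_iterator = isinstance(rows, AbcIterator)
```

Every iterator is also an iterable, and callable parameter types are contravariant, so both callbacks are accepted where `Callable[[Iterator[Row]], int]` is expected. If the parameter were declared as `Callable[[Iterable[Row]], int]` instead, a checker would reject `count_with_next`, because the caller would then be allowed to pass a plain list, which has no `__next__`.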

Kilo59 · 2023-02-25T15:57:38Z

Has anyone solved the problem of trying to type-check pyspark code without installing the 200+MB pyspark package?

That seems to be one massive downside of having pyspark provide its own stubs as opposed to them being part of type-shed.

probot-autolabeler bot added ML PYTHON labels Aug 31, 2020

HyukjinKwon changed the title ~~[SPARK-32681][PYTHON] Initial pyspark-stubs port.~~ [SPARK-32714][PYTHON] Initial pyspark-stubs port. Aug 31, 2020

HyukjinKwon reviewed Aug 31, 2020

python/pyspark/cloudpickle.pyi (outdated, resolved)

zero323 force-pushed the SPARK-32681 branch from 60b126e to e59b562 Compare August 31, 2020 13:27

probot-autolabeler bot added the BUILD label Aug 31, 2020

zero323 force-pushed the SPARK-32681 branch from f64ba9a to 4b9000e Compare August 31, 2020 16:41

zero323 mentioned this pull request Sep 3, 2020

Add missing methods zero323/pyspark-stubs#464

Closed

probot-autolabeler bot added EXAMPLES SQL labels Sep 6, 2020

zero323 force-pushed the SPARK-32681 branch from b8d4876 to 1aede7c Compare September 6, 2020 22:07

HyukjinKwon reviewed Sep 7, 2020

examples/src/main/python/ml/estimator_transformer_param_example.py (outdated, resolved)

HyukjinKwon reviewed Sep 7, 2020

dev/tox.ini (outdated, resolved)

zero323 added 9 commits September 23, 2020 07:43

Resync with pyspark-stubs

af4459f

Resync with pyspark-stubs

a8990ca

Resync with pyspark-stubs (drop long alias)

afe5222

Resync with pyspark-stubs (drop 'stubs for' comments)

6aaef20

Resync with pyspark-stubs (add allowMissingColumns to DataFrame.unionByName)

19bf189

Resync with pyspark-stubs (drop unused hasSummary and add leafCol)

601b99a

Resync with pyspark-stubs (drop __metaclass__ fields)

7d359a9

Resync with pyspark-stubs (add Column.withField)

53107a1

Drop unued typing.Type imports

b9ac4f8

zero323 force-pushed the SPARK-32681 branch from 9998d0a to b9ac4f8 Compare September 23, 2020 05:44

Resync with pyspark-stubs (revert blockify gmm)

fab00f1

HyukjinKwon closed this in 31a16fb Sep 24, 2020

zero323 deleted the SPARK-32681 branch September 24, 2020 06:07

zero323 mentioned this pull request Sep 26, 2020

[SPARK-33002][PYTHON] Remove non-API annotations. #29879

Closed

rehevkor5 reviewed Feb 17, 2023
