
Resolve pyspark / numpy conflicts #992

Merged
merged 3 commits on Jan 21, 2023

Conversation

loomlike
Collaborator

@loomlike loomlike commented Jan 18, 2023

Signed-off-by: Jun Ki Min 42475935+loomlike@users.noreply.github.com

Description

PySpark still relies on the old NumPy API, referring to np.bool, which has been deprecated (and removed in NumPy 1.24).
Because of that, calling sparkDF.toPandas() throws AttributeError: module 'numpy' has no attribute 'bool'.
We should upgrade the pyspark version once they release a new patch.
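For context, a minimal sketch of the failure mode and a temporary shim (spark_df is a hypothetical PySpark DataFrame; the shim is a stop-gap workaround, not the approach this PR takes):

```python
import numpy as np

# NumPy deprecated the np.bool alias in 1.20 and removed it in 1.24;
# affected pyspark releases still reference np.bool inside toPandas(),
# which raises: AttributeError: module 'numpy' has no attribute 'bool'.
# A temporary shim is to restore the alias before converting:
if not hasattr(np, "bool"):
    np.bool = bool  # re-create the removed alias for pyspark's benefit

# spark_df.toPandas() on a dataframe containing boolean-type columns
# would now succeed (spark_df is illustrative, not defined here).
```

NumPy 2.0 reintroduced np.bool (as an alias of np.bool_), so the hasattr guard keeps the shim from clobbering it on newer NumPy versions.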

How was this PR tested?

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to clarify your proposed changes.

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>
@loomlike loomlike added the safe to test Tag to execute build pipeline for a PR from forked repo label Jan 18, 2023
xiaoyongzhu
xiaoyongzhu previously approved these changes Jan 19, 2023
@blrchen
Collaborator

blrchen commented Jan 19, 2023

I have concerns about version pinning in Feathr. It's okay to pin versions in Registry since it's a standalone app, but Feathr is a library and is normally used in an environment alongside other Python packages. Introducing a version pin risks package installation errors or incompatibilities with other Python packages.

I have actually already run into issues when numpy was pinned earlier:

  • Nightly notebook test fails with error ImportError: this version of pandas is incompatible with numpy < 1.20.3
  • Installation issue on python 3.10 RuntimeWarning: NumPy 1.20.3 may not yet support Python 3.10

And it seems this is already fixed in Spark (apache/spark#37817), so perhaps we can just set a minimum version for pyspark instead?


@loomlike
Collaborator Author

@xiaoyongzhu @blrchen I verified that unless we explicitly call sparkDF.toPandas() on a dataframe that includes boolean-type features, we can avoid the pyspark bug. Once pyspark has a new release, let's change the pyspark dependency to >new_version to address this.

Until then, we can stick with the current version. I'll put a comment in our setup.py instead of pinning numpy.
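A hypothetical setup.py fragment sketching that approach (the package name, version bounds, and dependency list here are illustrative, not the project's actual setup.py):

```python
from setuptools import setup

setup(
    name="feathr",  # illustrative fragment only
    install_requires=[
        # NOTE: current pyspark releases still reference the removed
        # np.bool alias, so sparkDF.toPandas() on boolean-type columns
        # fails with numpy >= 1.24. Rather than pinning numpy here,
        # bump this lower bound once a patched pyspark release is out.
        "pyspark>=3.1.2",
        "numpy",  # intentionally left unpinned; see note above
    ],
)
```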

…e notebooks

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>
@loomlike loomlike changed the title Put numpy pinning back Resolve pyspark / numpy conflicts Jan 19, 2023
Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>
@xiaoyongzhu xiaoyongzhu merged commit f9cdccd into feathr-ai:main Jan 21, 2023