
[BUG] udf_test udf_cudf_test failed require_minimum_pandas_version check in spark 320+ #4378

Closed
pxLi opened this issue Dec 17, 2021 · 6 comments · Fixed by #4419 or #4433
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@pxLi
Collaborator

pxLi commented Dec 17, 2021

Describe the bug
It seems the pandas lib in Spark 3.2.0+ is incompatible with the one installed as cudf's dependency.

[2021-12-17T11:14:48.244Z] ==================================== ERRORS ====================================
[2021-12-17T11:14:48.244Z] ______________ ERROR collecting src/main/python/udf_cudf_test.py _______________
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_cudf_test.py:21: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:27: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     import pandas
[2021-12-17T11:14:48.244Z] ../../../spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py:31: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:35: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
[2021-12-17T11:14:48.244Z] E   AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] 
[2021-12-17T11:14:48.244Z] During handling of the above exception, another exception occurred:
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_cudf_test.py:24: in <module>
[2021-12-17T11:14:48.244Z]     raise AssertionError("incorrect pandas version during required testing " + str(e))
[2021-12-17T11:14:48.244Z] E   AssertionError: incorrect pandas version during required testing partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] _________________ ERROR collecting src/main/python/udf_test.py _________________
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_test.py:21: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:27: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     import pandas
[2021-12-17T11:14:48.244Z] ../../../spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py:31: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:35: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
[2021-12-17T11:14:48.244Z] E   AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] 
[2021-12-17T11:14:48.244Z] During handling of the above exception, another exception occurred:
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_test.py:24: in <module>
[2021-12-17T11:14:48.244Z]     raise AssertionError("incorrect pandas version during required testing " + str(e))
[2021-12-17T11:14:48.244Z] E   AssertionError: incorrect pandas version during required testing partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 17, 2021
@tgravescs
Collaborator

spark bumped up the minimum pandas version to 1.0.5 with apache/spark@3657703

That change went into 3.3.0, branch 3.2 does not have that change, so I'm curious why this started failing unless cudf changed their version but then I would expect it to fail other places. 3.2 shipped requiring pandas version 0.23.2 (https://github.com/apache/spark/blob/v3.2.0/python/pyspark/sql/pandas/utils.py#L23)

Cudf seems to require: pandas>=1.0,<1.4.0dev0 which hasn't changed recently.
https://github.com/rapidsai/cudf/blob/branch-22.02/conda/environments/cudf_dev_cuda11.5.yml#L19

From the Jenkinsfile for this build it looks like we are using the CUDA 11.0 and 11.2 images, which I don't think are supported anymore. I think we need to change to 11.5. @pxLi @NvTimLiu @GaryShen2008 could you take a look?
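The version pins discussed above can be sanity-checked with a small sketch. The `parse` helper below is a simplified, hypothetical stand-in for the `LooseVersion` comparison in pyspark's utils module, not code from either project:

```python
# Simplified version comparison: a hypothetical stand-in for the
# LooseVersion check in pyspark/sql/pandas/utils.py.
def parse(version):
    """Keep leading numeric dot-separated components: '1.4.0dev0' -> (1, 4)."""
    nums = []
    for part in version.split("."):
        if not part.isdigit():
            break
        nums.append(int(part))
    return tuple(nums)

spark_320_minimum = parse("0.23.2")  # minimum shipped with Spark 3.2.0
cudf_lower, cudf_upper = parse("1.0"), parse("1.4.0dev0")  # cudf's pin

# Any pandas version cudf accepts also satisfies Spark 3.2.0's minimum,
# so the version pins themselves cannot explain this failure.
print(cudf_lower >= spark_320_minimum)  # True
print(parse("1.3.5") < cudf_upper)      # True: 1.3.5 is inside cudf's range
```

This supports the point that the pins are compatible, so the failure has to come from somewhere else.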

@tgravescs tgravescs added the P0 Must have for release label Dec 20, 2021
@NvTimLiu NvTimLiu self-assigned this Dec 21, 2021
@NvTimLiu
Collaborator

I'll check this issue.

@NvTimLiu
Collaborator

It seems we're not importing the real pandas module when running the cudf-udf tests: there is a pandas directory at spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas under the PYTHONPATH environment variable: https://github.com/NVIDIA/spark-rapids/blob/branch-22.02/jenkins/spark-tests.sh#L109
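A minimal reproduction of that shadowing, using a hypothetical temp directory in place of Spark's python/pyspark path: a package directory named `pandas` that appears early on `PYTHONPATH` gets imported instead of the real pandas, and a self-import during its own initialization fails exactly like the CI log above.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Fake "pandas" package standing in for Spark's python/pyspark/pandas dir.
    pkg = Path(tmp) / "pandas"
    pkg.mkdir()
    (pkg / "__init__.py").write_text(
        "import pandas\n"       # resolves back to THIS half-initialized package
        "pandas.__version__\n"  # attribute never set -> AttributeError
    )
    env = {**os.environ, "PYTHONPATH": tmp}  # shadow dir early on the path
    result = subprocess.run(
        [sys.executable, "-c", "import pandas"],
        env=env, capture_output=True, text=True,
    )

print(result.returncode)  # non-zero: the import fails
print("has no attribute '__version__'" in result.stderr)  # True
```

The `import pandas` inside the fake package finds the partially initialized module already in `sys.modules`, which is the circular import the traceback complains about.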

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 21, 2021
@sameerz sameerz added this to the Dec 13 - Jan 7 milestone Dec 21, 2021
@NvTimLiu
Collaborator

> spark bumped up the minimum pandas version to 1.0.5 with apache/spark@3657703
>
> That change went into 3.3.0, branch 3.2 does not have that change, so I'm curious why this started failing unless cudf changed their version but then I would expect it to fail other places. 3.2 shipped requiring pandas version 0.23.2 (https://github.com/apache/spark/blob/v3.2.0/python/pyspark/sql/pandas/utils.py#L23)
>
> Cudf seems to require: pandas>=1.0,<1.4.0dev0 which hasn't changed recently. https://github.com/rapidsai/cudf/blob/branch-22.02/conda/environments/cudf_dev_cuda11.5.yml#L19
>
> From the jenkinsfile for this build it looks like we are using 11.0 and 11.2 cuda images, which I don't think are supported any more. I think we need to change to the 11.5. @pxLi @NvTimLiu @GaryShen2008 could you take a look ?

Official CUDA 11.5 docker images are not available yet (https://hub.docker.com/r/nvidia/cuda/tags?page=1&ordering=last_updated), so we are still using the CUDA 11.0/11.2 runtimes.

Will update to 11.5 once the official images are online @tgravescs

@NvTimLiu
Collaborator

NvTimLiu commented Dec 22, 2021

Reason for the failure:

  • There is a pandas Python package in the directory spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas, and our spark-tests.sh#L109 includes it via PYTHONPATH

  • This spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas directory causes the error below:

[root@0de42b9f44bd /]# python --version
Python 3.8.12
[root@0de42b9f44bd /]# export PYTHONPATH=/jars/spark-3.2.0-bin-hadoop3.2/python:/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/:/jars/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip
[root@0de42b9f44bd /]# python
>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py", line 31, in <module>
    require_minimum_pandas_version()
  File "/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/pandas/utils.py", line 35, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
  • There is no such directory python/pyspark/pandas in spark-3.1.x or earlier versions, so this issue only happens on spark-3.2.0 or later versions

To fix:

  • Put the conda site-packages path at the front of PYTHONPATH, so the real pandas is imported from conda instead of from the Spark 3.2.0+ binary path.
  [root@0de42b9f44bd]# export PYTHONPATH=/opt/conda/lib/python3.8/site-packages:/jars/spark-3.2.0-bin-hadoop3.2/python:/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/:/jars/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip
[root@0de42b9f44bd spark-rapids]# python
Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.3.5'
>>>
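The fix can be illustrated in isolation (with hypothetical stand-in directories, not the real conda or Spark paths): whichever `PYTHONPATH` entry first contains a package named `pandas` wins resolution, so listing the real site-packages directory first restores the correct import.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

def make_pkg(root, body):
    """Create a minimal package named 'pandas' under root; return root as str."""
    pkg = Path(root) / "pandas"
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text(body)
    return str(root)

results = {}
with tempfile.TemporaryDirectory() as tmp:
    # "good" mimics conda's real pandas; "bad" mimics Spark's pyspark/pandas dir.
    good = make_pkg(Path(tmp) / "site-packages", "__version__ = '1.3.5'\n")
    bad = make_pkg(Path(tmp) / "spark_python",
                   "raise ImportError('shadowing pyspark/pandas dir')\n")
    for label, order in [("conda_first", [good, bad]),
                         ("spark_first", [bad, good])]:
        env = {**os.environ, "PYTHONPATH": os.pathsep.join(order)}
        r = subprocess.run(
            [sys.executable, "-c", "import pandas; print(pandas.__version__)"],
            env=env, capture_output=True, text=True,
        )
        results[label] = (r.returncode, r.stdout.strip())

print(results["conda_first"])  # (0, '1.3.5'): the real-looking pandas wins
print(results["spark_first"][0] != 0)  # True: the shadow package wins and fails
```

This matches the transcript above: once the conda path precedes the Spark paths, `import pandas` resolves to the real package and reports its version.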

@NvTimLiu NvTimLiu linked a pull request Dec 23, 2021 that will close this issue
@NvTimLiu
Collaborator

Closing as #4419 has been merged.
