
[BUG] udf_test udf_cudf_test failed require_minimum_pandas_version check in spark 320+ #4378

Closed
pxLi opened this issue Dec 17, 2021 · 6 comments · Fixed by #4419 or #4433
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@pxLi
Collaborator

pxLi commented Dec 17, 2021

Describe the bug
It seems the pandas lib in Spark 3.2.0+ is incompatible with the one installed as cudf's dependency.

[2021-12-17T11:14:48.244Z] ==================================== ERRORS ====================================
[2021-12-17T11:14:48.244Z] ______________ ERROR collecting src/main/python/udf_cudf_test.py _______________
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_cudf_test.py:21: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:27: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     import pandas
[2021-12-17T11:14:48.244Z] ../../../spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py:31: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:35: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
[2021-12-17T11:14:48.244Z] E   AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] 
[2021-12-17T11:14:48.244Z] During handling of the above exception, another exception occurred:
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_cudf_test.py:24: in <module>
[2021-12-17T11:14:48.244Z]     raise AssertionError("incorrect pandas version during required testing " + str(e))
[2021-12-17T11:14:48.244Z] E   AssertionError: incorrect pandas version during required testing partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] _________________ ERROR collecting src/main/python/udf_test.py _________________
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_test.py:21: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:27: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     import pandas
[2021-12-17T11:14:48.244Z] ../../../spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py:31: in <module>
[2021-12-17T11:14:48.244Z]     require_minimum_pandas_version()
[2021-12-17T11:14:48.244Z] /home/jenkins/agent/workspace/jenkins-rapids_cudf_udf-dev-github-33-cuda11.2/jars/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py:35: in require_minimum_pandas_version
[2021-12-17T11:14:48.244Z]     if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
[2021-12-17T11:14:48.244Z] E   AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
[2021-12-17T11:14:48.244Z] 
[2021-12-17T11:14:48.244Z] During handling of the above exception, another exception occurred:
[2021-12-17T11:14:48.244Z] ../../src/main/python/udf_test.py:24: in <module>
[2021-12-17T11:14:48.244Z]     raise AssertionError("incorrect pandas version during required testing " + str(e))
[2021-12-17T11:14:48.244Z] E   AssertionError: incorrect pandas version during required testing partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 17, 2021
@tgravescs
Collaborator

spark bumped up the minimum pandas version to 1.0.5 with apache/spark@3657703

That change went into 3.3.0, branch 3.2 does not have that change, so I'm curious why this started failing unless cudf changed their version but then I would expect it to fail other places. 3.2 shipped requiring pandas version 0.23.2 (https://github.com/apache/spark/blob/v3.2.0/python/pyspark/sql/pandas/utils.py#L23)

Cudf seems to require: pandas>=1.0,<1.4.0dev0 which hasn't changed recently.
https://github.com/rapidsai/cudf/blob/branch-22.02/conda/environments/cudf_dev_cuda11.5.yml#L19

From the Jenkinsfile for this build it looks like we are using the CUDA 11.0 and 11.2 images, which I don't think are supported anymore. I think we need to change to 11.5. @pxLi @NvTimLiu @GaryShen2008 could you take a look?
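The version pins discussed above can be sanity-checked with a small sketch. The `parse` helper below is a simplified, hypothetical stand-in for the `LooseVersion` comparison in pyspark's utils module, not code from either project:

```python
# Simplified version comparison: a hypothetical stand-in for the
# LooseVersion check in pyspark/sql/pandas/utils.py.
def parse(version):
    """Keep leading numeric dot-separated components: '1.4.0dev0' -> (1, 4)."""
    nums = []
    for part in version.split("."):
        if not part.isdigit():
            break
        nums.append(int(part))
    return tuple(nums)

spark_320_minimum = parse("0.23.2")  # minimum shipped with Spark 3.2.0
cudf_lower, cudf_upper = parse("1.0"), parse("1.4.0dev0")  # cudf's pin

# Any pandas version cudf accepts also satisfies Spark 3.2.0's minimum,
# so the version pins themselves cannot explain this failure.
print(cudf_lower >= spark_320_minimum)  # True
print(parse("1.3.5") < cudf_upper)      # True: 1.3.5 is inside cudf's range
```

This supports the point that the pins are compatible, so the failure has to come from somewhere else.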

@tgravescs tgravescs added the P0 Must have for release label Dec 20, 2021
@NvTimLiu NvTimLiu self-assigned this Dec 21, 2021
@NvTimLiu
Collaborator

I'll check this issue.

@NvTimLiu
Collaborator

It seems we're not importing the real pandas module when running the cudf-udf tests: there is a pandas directory at spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas under the PYTHONPATH environment variable: https://github.com/NVIDIA/spark-rapids/blob/branch-22.02/jenkins/spark-tests.sh#L109
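A minimal reproduction of that shadowing, using a hypothetical temp directory in place of Spark's python/pyspark path: a package directory named `pandas` that appears early on `PYTHONPATH` gets imported instead of the real pandas, and a self-import during its own initialization fails exactly like the CI log above.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Fake "pandas" package standing in for Spark's python/pyspark/pandas dir.
    pkg = Path(tmp) / "pandas"
    pkg.mkdir()
    (pkg / "__init__.py").write_text(
        "import pandas\n"       # resolves back to THIS half-initialized package
        "pandas.__version__\n"  # attribute never set -> AttributeError
    )
    env = {**os.environ, "PYTHONPATH": tmp}  # shadow dir early on the path
    result = subprocess.run(
        [sys.executable, "-c", "import pandas"],
        env=env, capture_output=True, text=True,
    )

print(result.returncode)  # non-zero: the import fails
print("has no attribute '__version__'" in result.stderr)  # True
```

The `import pandas` inside the fake package finds the partially initialized module already in `sys.modules`, which is the circular import the traceback complains about.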

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 21, 2021
@sameerz sameerz added this to the Dec 13 - Jan 7 milestone Dec 21, 2021
@NvTimLiu
Collaborator

> spark bumped up the minimum pandas version to 1.0.5 with apache/spark@3657703
>
> That change went into 3.3.0, branch 3.2 does not have that change, so I'm curious why this started failing unless cudf changed their version but then I would expect it to fail other places. 3.2 shipped requiring pandas version 0.23.2 (https://github.com/apache/spark/blob/v3.2.0/python/pyspark/sql/pandas/utils.py#L23)
>
> Cudf seems to require: pandas>=1.0,<1.4.0dev0 which hasn't changed recently. https://github.com/rapidsai/cudf/blob/branch-22.02/conda/environments/cudf_dev_cuda11.5.yml#L19
>
> From the jenkinsfile for this build it looks like we are using 11.0 and 11.2 cuda images, which I don't think are supported any more. I think we need to change to the 11.5. @pxLi @NvTimLiu @GaryShen2008 could you take a look ?

Official CUDA 11.5 docker images are not available yet (https://hub.docker.com/r/nvidia/cuda/tags?page=1&ordering=last_updated), so we are still using the CUDA 11.0/11.2 runtimes.

Will update to 11.5 once the official images are online @tgravescs

@NvTimLiu
Collaborator

NvTimLiu commented Dec 22, 2021

Reason for the failure:

  • There is a pandas Python package in the directory spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas, and our spark-tests.sh#L109 includes it via PYTHONPATH

  • This spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas directory causes the error below:

[root@0de42b9f44bd /]# python --version
Python 3.8.12
[root@0de42b9f44bd /]# export PYTHONPATH=/jars/spark-3.2.0-bin-hadoop3.2/python:/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/:/jars/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip
[root@0de42b9f44bd /]# python
>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/pandas/__init__.py", line 31, in <module>
    require_minimum_pandas_version()
  File "/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/pandas/utils.py", line 35, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
  • There is no such directory python/pyspark/pandas in spark-3.1.x or earlier versions, so this issue only happens on spark-3.2.0 or later versions

To fix:

  • Put the conda site-packages path at the front of PYTHONPATH, so the real pandas is imported from conda instead of from the Spark 3.2.0+ binary path.
  [root@0de42b9f44bd]# export PYTHONPATH=/opt/conda/lib/python3.8/site-packages:/jars/spark-3.2.0-bin-hadoop3.2/python:/jars/spark-3.2.0-bin-hadoop3.2/python/pyspark/:/jars/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip
[root@0de42b9f44bd spark-rapids]# python
Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'1.3.5'
>>>
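The fix can be illustrated in isolation (with hypothetical stand-in directories, not the real conda or Spark paths): whichever `PYTHONPATH` entry first contains a package named `pandas` wins resolution, so listing the real site-packages directory first restores the correct import.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

def make_pkg(root, body):
    """Create a minimal package named 'pandas' under root; return root as str."""
    pkg = Path(root) / "pandas"
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text(body)
    return str(root)

results = {}
with tempfile.TemporaryDirectory() as tmp:
    # "good" mimics conda's real pandas; "bad" mimics Spark's pyspark/pandas dir.
    good = make_pkg(Path(tmp) / "site-packages", "__version__ = '1.3.5'\n")
    bad = make_pkg(Path(tmp) / "spark_python",
                   "raise ImportError('shadowing pyspark/pandas dir')\n")
    for label, order in [("conda_first", [good, bad]),
                         ("spark_first", [bad, good])]:
        env = {**os.environ, "PYTHONPATH": os.pathsep.join(order)}
        r = subprocess.run(
            [sys.executable, "-c", "import pandas; print(pandas.__version__)"],
            env=env, capture_output=True, text=True,
        )
        results[label] = (r.returncode, r.stdout.strip())

print(results["conda_first"])  # (0, '1.3.5'): the real-looking pandas wins
print(results["spark_first"][0] != 0)  # True: the shadow package wins and fails
```

This matches the transcript above: once the conda path precedes the Spark paths, `import pandas` resolves to the real package and reports its version.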

@NvTimLiu NvTimLiu linked a pull request Dec 23, 2021 that will close this issue
@NvTimLiu
Collaborator

Closing as #4419 has been merged.
