[Python][CI] Tests involving fastparquet are never run #37853

Open
jorisvandenbossche opened this issue Sep 25, 2023 · 3 comments

@jorisvandenbossche
Member

jorisvandenbossche commented Sep 25, 2023

We have a fastparquet pytest marker for tests that require fastparquet, and two tests use it. However, a search of our code base suggests that none of our CI test builds install fastparquet, so these tests are never run.

The two tests are:

  • test_fastparquet_cross_compatibility in the parquet tests, added in https://issues.apache.org/jira/browse/ARROW-6683 (I know that pandas has similar cross compat tests)
  • test_fastparquet_read_with_hdfs in test_hdfs.py, which ensures fastparquet can use our HDFS filesystem. I think this is something that fsspec / fastparquet can test themselves, and it is also part of our legacy HDFS tests, which we will remove together with the legacy HDFS bindings.

Given the above, we could also consider removing the tests altogether (although adding fastparquet to one of the nightly builds should also be easy).
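As an illustration of how a build could notice that marker-gated tests are being silently skipped, here is a minimal sketch using only the standard library. The helper name `missing_optional_deps` is hypothetical and is not part of the pyarrow test infrastructure; the idea is simply that a job meant to cover fastparquet could fail early when the package is not importable.

```python
import importlib.util


def missing_optional_deps(required):
    """Return the top-level package names in `required` that are not importable.

    Hypothetical helper: a CI job that is supposed to exercise optional
    integrations could call this up front and fail instead of skipping.
    """
    return [name for name in required if importlib.util.find_spec(name) is None]


# Example: in a job that is expected to run the fastparquet-marked tests,
# a non-empty result would mean those tests are going to be skipped.
missing = missing_optional_deps(["fastparquet"])
if missing:
    print("optional test dependencies not installed:", ", ".join(missing))
```

Whether such a check belongs in a nightly build or in the conda recipe tests is a separate decision; the sketch only shows the detection step.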

@jorisvandenbossche
Member Author

jorisvandenbossche commented Sep 25, 2023

Context: the parquet cross-compat test appears to be failing when enabled in the conda recipe tests: #37624 (comment)

And I can actually reproduce this locally, both by running our test and with the following small snippet:

In [1]: df = pd.DataFrame({"col": [True, False, True]})

In [2]: df.to_parquet("test_bool.parquet", engine="pyarrow")

In [3]: pd.read_parquet("test_bool.parquet", engine="pyarrow")
Out[3]: 
     col
0   True
1  False
2   True

In [4]: pd.read_parquet("test_bool.parquet", engine="fastparquet")
Out[4]: 
     col
0  False
1  False
2  False

In [5]: pa.__version__
Out[5]: '14.0.0.dev183+g7b14b2b27.d20230925'

In [6]: import fastparquet; fastparquet.__version__
Out[6]: '2023.8.0'

Update: the above is with the 14 dev version of pyarrow. When writing the file with 13.0 or older, the roundtrip is actually perfectly fine. So this might be related to the change to use RLE by default for boolean values: #36955

@jorisvandenbossche
Member Author

jorisvandenbossche commented Sep 25, 2023

Yes, so it's the different encoding that makes fastparquet fail:

In [5]: df = pd.DataFrame({"col": [True, False, True]})

In [8]: df.to_parquet("test_bool_pa14_plain.parquet", engine="pyarrow", column_encoding={"col": "PLAIN"}, use_dictionary=False)

In [9]: df.to_parquet("test_bool_pa14_rle.parquet", engine="pyarrow", column_encoding={"col": "RLE"}, use_dictionary=False)

In [10]: pd.read_parquet("test_bool_pa14_plain.parquet", engine="fastparquet")
Out[10]: 
     col
0   True
1  False
2   True

In [11]: pd.read_parquet("test_bool_pa14_rle.parquet", engine="fastparquet")
Out[11]: 
     col
0  False
1  False
2  False
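To make the difference between the two files more concrete, the following pure-Python sketch encodes the same three booleans in the Parquet PLAIN layout (bit-packed, least significant bit first) and in the RLE / bit-packed hybrid layout with bit width 1 and a 4-byte little-endian length prefix. It follows my reading of the Parquet encoding description and uses only RLE runs; a real writer may emit bit-packed runs instead, so the bytes pyarrow produces can differ. All function names are made up for illustration, and the sketch does not attempt to model what fastparquet does internally.

```python
import struct


def _varint(n):
    """Encode a non-negative int as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def _read_varint(data, pos):
    """Decode a varint starting at `pos`; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_plain_bools(values):
    """PLAIN booleans: one bit per value, least significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def decode_plain_bools(data, count):
    return [bool((data[i // 8] >> (i % 8)) & 1) for i in range(count)]


def encode_rle_bools(values):
    """Hybrid encoding, bit width 1, RLE runs only, length-prefixed."""
    body = bytearray()
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        body += _varint((j - i) << 1)  # low bit 0 marks an RLE run
        body.append(1 if values[i] else 0)
        i = j
    return struct.pack("<I", len(body)) + bytes(body)


def decode_rle_bools(data, count):
    """Decode both RLE runs and bit-packed runs (bit width 1)."""
    (length,) = struct.unpack_from("<I", data, 0)
    pos, end = 4, 4 + length
    values = []
    while pos < end and len(values) < count:
        header, pos = _read_varint(data, pos)
        if header & 1:  # bit-packed run: (header >> 1) groups of 8 values
            groups = header >> 1
            values.extend(decode_plain_bools(data[pos:pos + groups], groups * 8))
            pos += groups
        else:  # RLE run: one value repeated (header >> 1) times
            values.extend([bool(data[pos])] * (header >> 1))
            pos += 1
    return values[:count]


values = [True, False, True]
plain = encode_plain_bools(values)
rle = encode_rle_bools(values)
print(plain.hex())  # 05
print(rle.hex())    # 06000000020102000201
```

The two byte strings share nothing beyond the values they represent, so a reader has to dispatch on the encoding recorded in the page metadata to get the booleans back.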

This is getting a bit off-topic for this issue, but maybe that's a good argument to actually run those tests on our CI: then we would have noticed this compat issue earlier.

Opened dask/fastparquet#884 on the fastparquet side

@pitrou
Member

pitrou commented Sep 16, 2024

Do we still care about this or should we close this issue?
