[Python][CI] Tests involving fastparquet are never run #37853

jorisvandenbossche · 2023-09-25T09:05:01Z

We have a fastparquet pytest marker for tests that require fastparquet, and we have two such tests, but from a search of our code base, fastparquet does not appear to be installed in any of our CI test builds.

The two tests are:

test_fastparquet_cross_compatibility in the parquet tests, added in https://issues.apache.org/jira/browse/ARROW-6683 (I know that pandas has similar cross compat tests)
test_fastparquet_read_with_hdfs in test_hdfs.py: ensuring fastparquet can use our HDFS filesystem -> I think this is something that fsspec / fastparquet can test themselves, and this test is also part of our legacy HDFS tests, which we will remove together with the legacy HDFS bindings

Given the above, we could also consider removing the tests altogether (although adding fastparquet to one of the nightly builds should also be easy)
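For context, a marker for an optional dependency usually skips its tests when the package cannot be imported, so a CI image without fastparquet reports those tests as skipped, not failed. The following is a minimal sketch of that mechanism using only the standard library. The function names and the marker-to-module mapping are hypothetical, and this is not the project's actual conftest.py.

```python
import importlib.util


def optional_dependency_available(name):
    """Return True if the top-level module `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def plan_marked_tests(marker_to_module):
    """Map each marker to 'run' or 'skip' depending on module availability.

    `marker_to_module` is a hypothetical mapping from a pytest marker name
    to the module that the marked tests need.
    """
    return {
        marker: "run" if optional_dependency_available(module) else "skip"
        for marker, module in marker_to_module.items()
    }


# On a machine without fastparquet, the "fastparquet" entry comes out as "skip".
plan = plan_marked_tests({"fastparquet": "fastparquet", "json_stdlib": "json"})
print(plan)
```

Running a check like this in a given CI image would show whether the marked tests can ever execute there.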

jorisvandenbossche · 2023-09-25T10:38:22Z

Context: the parquet cross-compat test appears to be failing when enabled in the conda recipe tests: #37624 (comment)

And I can actually reproduce this locally, both by running our test and with the following small snippet:

In [1]: df = pd.DataFrame({"col": [True, False, True]})

In [2]: df.to_parquet("test_bool.parquet", engine="pyarrow")

In [3]: pd.read_parquet("test_bool.parquet", engine="pyarrow")
Out[3]: 
     col
0   True
1  False
2   True

In [4]: pd.read_parquet("test_bool.parquet", engine="fastparquet")
Out[4]: 
     col
0  False
1  False
2  False

In [5]: pa.__version__
Out[5]: '14.0.0.dev183+g7b14b2b27.d20230925'

In [6]: import fastparquet; fastparquet.__version__
Out[6]: '2023.8.0'

Update: the above is with the 14 dev version of pyarrow. When writing the file with 13.0 or older, the roundtrip works fine. So this might be related to the change to use RLE by default for boolean values: #36955

jorisvandenbossche · 2023-09-25T10:48:59Z

Yes, so it's the different encoding that makes fastparquet fail:

In [5]: df = pd.DataFrame({"col": [True, False, True]})

In [8]: df.to_parquet("test_bool_pa14_plain.parquet", engine="pyarrow", column_encoding={"col": "PLAIN"}, use_dictionary=False)

In [9]: df.to_parquet("test_bool_pa14_rle.parquet", engine="pyarrow", column_encoding={"col": "RLE"}, use_dictionary=False)

In [10]: pd.read_parquet("test_bool_pa14_plain.parquet", engine="fastparquet")
Out[10]: 
     col
0   True
1  False
2   True

In [11]: pd.read_parquet("test_bool_pa14_rle.parquet", engine="fastparquet")
Out[11]: 
     col
0  False
1  False
2  False

This is getting a bit off-topic for this issue, but maybe that's a good argument to actually run those tests on our CI: then we would have noticed this compat issue earlier.

Opened dask/fastparquet#884 on the fastparquet side
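To illustrate why the encoding matters, here is a rough pure-Python sketch of how the same three booleans are laid out under PLAIN and under the RLE/bit-packed hybrid, following my reading of the Parquet format specification. The helper names are hypothetical, the RLE encoder only emits RLE runs (a real writer may choose bit-packed runs), and it makes no claim about what fastparquet does internally.

```python
import struct
from itertools import groupby


def encode_plain_bools(values):
    """PLAIN booleans: one bit per value, packed LSB first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def _varint(n):
    """Unsigned LEB128 varint, used for run headers."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_rle_bools(values):
    """RLE/bit-packed hybrid with bit width 1, using only RLE runs.

    Each run is varint(run_length << 1) followed by the repeated value in
    one byte. The payload is prefixed with its length as a 4-byte
    little-endian integer.
    """
    payload = bytearray()
    for value, run in groupby(values):
        run_length = sum(1 for _ in run)
        payload += _varint(run_length << 1)
        payload.append(1 if value else 0)
    return struct.pack("<I", len(payload)) + bytes(payload)


def decode_plain_bools(data, count):
    """Read `count` LSB-first bits back into a list of bools."""
    return [bool(data[i // 8] >> (i % 8) & 1) for i in range(count)]


values = [True, False, True]
plain = encode_plain_bools(values)
rle = encode_rle_bools(values)
print(plain.hex())  # 05
print(rle.hex())    # 06000000020102000201
```

The two byte strings have nothing in common, so a reader has to check the page's declared encoding before decoding boolean values.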

pitrou · 2024-09-16T08:48:35Z

Do we still care about this or should we close this issue?

jorisvandenbossche added Component: Python Component: Continuous Integration labels Sep 25, 2023

jorisvandenbossche mentioned this issue Sep 25, 2023

GH-37621: [Packaging][Conda] Sync conda recipes with feedstocks #37624

Merged

jorisvandenbossche mentioned this issue Sep 29, 2023

GH-36882: [C++][Parquet] Default RLE for bool values in the parquet version 2.x #36955

Merged
