dtype_backend overrides categories #2700

antbz · 2024-03-04T14:57:30Z

Describe the bug

I noticed that when using wr.athena.read_sql_query, the categories parameter was not having any effect on the returned pandas dataframe.

After some investigation I realised that in pa.Table.to_pandas the categorical conversion happens before types_mapper is applied, so in effect the categorical columns are always converted back to string.

With the types_mapper kwarg removed, the categorical types are processed correctly.

How to Reproduce

import awswrangler as wr

data = wr.athena.read_sql_query(
    sql="SELECT id, options FROM my_table",
    database="my-database",
    categories=["options"],
)
print(data.dtypes)

The options column is returned as a string column instead of a categorical one.

Expected behavior

The columns specified in categories should be returned with the pandas category dtype (pd.CategoricalDtype).

Your project

No response

Screenshots

No response

OS

MacOS 14.3.1

Python version

3.9.18

AWS SDK for pandas version

3.6.0

Additional context

PyArrow is version 15.0.0

jaidisido · 2024-03-04T15:15:56Z

The wr.athena.read_sql_query API has a pyarrow_additional_kwargs argument which is forwarded to the to_pandas method. If nothing is supplied, some sane defaults are applied.

If you wish to override these defaults, to remove types_mapper for example, you can do something along the lines of:

data = wr.athena.read_sql_query(
    sql="SELECT id, options FROM my_table",
    database="my-database",
    categories=["options"],
    pyarrow_additional_kwargs={'types_mapper': None},
)

antbz · 2024-03-04T15:37:35Z

@jaidisido While that is a nice suggestion, it also does not work because of how _fetch_parquet_result works internally. When you specify pyarrow_additional_kwargs, the categories are never added to the kwargs that are actually passed on to pyarrow:

aws-sdk-pandas/awswrangler/athena/_read.py

Lines 150 to 153 in 4816e5e

    
if not pyarrow_additional_kwargs:
    pyarrow_additional_kwargs = {}
    if categories:
        pyarrow_additional_kwargs["categories"] = categories

For it to work correctly you need to pass categories as additional kwargs as well:

data = wr.athena.read_sql_query(
    sql="SELECT id, options FROM my_table",
    database="my-database",
    pyarrow_additional_kwargs={'types_mapper': None, 'categories': ['options']},
)

I'm not sure if the behaviour in _fetch_parquet_result is intentional or not, but as it stands, the categories parameter effectively does not do what it is supposed to. We should either document this better or find a way to make it compatible by default.

jaidisido · 2024-03-04T15:48:07Z

I can't think of a reason why it's set up that way, so I believe it's just badly indented. #2701 should fix that.
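To make the indentation problem concrete, here is a small sketch of the merging logic before and after dedenting the categories check. The function names are hypothetical and this is not the actual diff in #2701, only an illustration of the behaviour discussed above. The second version also copies the dict, which is a choice made for this sketch so that the caller's kwargs are not mutated.

```python
from typing import Any, Dict, List, Optional


def merge_kwargs_nested(
    categories: Optional[List[str]],
    pyarrow_additional_kwargs: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    # Mirrors the snippet quoted earlier in this thread: categories are only
    # added when no additional kwargs were supplied by the caller.
    if not pyarrow_additional_kwargs:
        pyarrow_additional_kwargs = {}
        if categories:
            pyarrow_additional_kwargs["categories"] = categories
    return pyarrow_additional_kwargs


def merge_kwargs_dedented(
    categories: Optional[List[str]],
    pyarrow_additional_kwargs: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    # Hypothetical corrected version: the categories check runs regardless of
    # whether the caller supplied additional kwargs.
    merged = dict(pyarrow_additional_kwargs or {})
    if categories:
        merged["categories"] = categories
    return merged


user_kwargs = {"types_mapper": None}
print(merge_kwargs_nested(["options"], user_kwargs))
print(merge_kwargs_dedented(["options"], user_kwargs))
```

With the nested version, supplying any non-empty pyarrow_additional_kwargs silently drops categories, which is why the earlier workaround had to pass categories inside the kwargs dict as well.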

antbz added the bug label Mar 4, 2024

jaidisido mentioned this issue Mar 4, 2024

fix: indent categories in pyarrow_additional_kwargs correctly #2701

Merged

jaidisido linked a pull request Mar 4, 2024 that will close this issue

fix: indent categories in pyarrow_additional_kwargs correctly #2701

Merged

jaidisido closed this as completed in #2701 Mar 5, 2024
