dtype_backend overrides categories #2700

Closed
antbz opened this issue Mar 4, 2024 · 3 comments · Fixed by #2701
Labels
bug Something isn't working

Comments

antbz commented Mar 4, 2024

Describe the bug

I noticed that when using wr.athena.read_sql_query, the categories parameter was not having any effect on the returned pandas dataframe.

I did some investigation and realised that in the pa.Table.to_pandas method, the categorical conversion happens before the types_mapper is taken into account, so in effect, the categorical columns are always being converted back to string.

Removing the types_mapper kwarg, the categorical types are processed correctly.

How to Reproduce

import awswrangler as wr

data = wr.athena.read_sql_query(
    sql="SELECT id, options FROM my_table",
    database="my-database",
    categories=["options"],
)
print(data.dtypes)

The options column comes back as a string column instead of a categorical one.

Expected behavior

The columns listed in categories should be returned with a pandas categorical dtype.

Your project

No response

Screenshots

No response

OS

MacOS 14.3.1

Python version

3.9.18

AWS SDK for pandas version

3.6.0

Additional context

PyArrow is version 15.0.0

antbz added the bug label Mar 4, 2024

jaidisido (Contributor) commented Mar 4, 2024

The wr.athena.read_sql_query API has a pyarrow_additional_kwargs argument which is forwarded to the to_pandas method. If nothing is supplied, some sane defaults are applied.

If you wish to override these defaults, to remove types_mapper for example, you can do something along the lines of:

data = wr.athena.read_sql_query(
    sql="SELECT id, options FROM my_table",
    database="my-database",
    categories=["options"],
    pyarrow_additional_kwargs={'types_mapper': None},
)

antbz (Author) commented Mar 4, 2024

@jaidisido While that is a nice suggestion, it does not work either, because of how _fetch_parquet_result works internally. When you specify pyarrow_additional_kwargs, the categories are never added to the kwargs that are actually passed on to pyarrow:

if not pyarrow_additional_kwargs:
    pyarrow_additional_kwargs = {}
    if categories:
        pyarrow_additional_kwargs["categories"] = categories

For it to work correctly you need to pass categories as additional kwargs as well:

data = wr.athena.read_sql_query(
    sql="SELECT id, options FROM my_table",
    database="my-database",
    pyarrow_additional_kwargs={'types_mapper': None, 'categories': ['options']},
)

I'm not sure if the behaviour in _fetch_parquet_result is intentional or not, but as it stands, the categories parameter effectively does not do what it is supposed to. We should either document this better or find a way to make it compatible by default.

jaidisido (Contributor) commented

I can't think of a reason why it's set up that way, so I believe it's just badly indented. #2701 should fix that.
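To make the indentation point concrete, here is a minimal sketch contrasting the nested logic described in the thread with a de-indented variant. The function names `kwargs_nested` and `kwargs_flat` are hypothetical helpers written for this illustration; this is not the actual code of _fetch_parquet_result, nor necessarily the change made in #2701.

```python
from typing import Any, Dict, List, Optional


def kwargs_nested(
    categories: Optional[List[str]], extra: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Hypothetical helper: categories are only added when no kwargs were given."""
    if not extra:
        extra = {}
        if categories:
            extra["categories"] = categories
    return extra


def kwargs_flat(
    categories: Optional[List[str]], extra: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Hypothetical helper: categories are added whether or not kwargs were given."""
    extra = dict(extra) if extra else {}  # copy so the caller's dict is untouched
    if categories:
        extra["categories"] = categories
    return extra


user_kwargs = {"types_mapper": None}
print(kwargs_nested(["options"], user_kwargs))  # {'types_mapper': None}
print(kwargs_flat(["options"], user_kwargs))  # {'types_mapper': None, 'categories': ['options']}
```

With the flat variant, passing categories together with pyarrow_additional_kwargs={'types_mapper': None} would forward both settings, which is the combination the workaround above had to spell out by hand.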
