read_parquet_table fails when chunked is true for partitioned table #627

Closed
bentkibler opened this issue Apr 1, 2021 · 3 comments · Fixed by #631
Labels: bug (Something isn't working), minor release (Will be addressed in the next minor release), ready to release
Milestone: 2.7.0
@bentkibler

If I try to read a partitioned table in batch chunks using s3.read_parquet_table(chunked=True), I get the following exception:

<ipython-input-3-6b4a443538f2> in test_table(dbname, tablename, use_threads, chunked, partition_filter, columns)
      3     tot_rows = 0
      4     tot_files = 0
----> 5     dfs = wr.s3.read_parquet_table(boto3_session=session, 
      6                                    database=dbname,
      7                                    table=tablename,

~/.conda/envs/awswrangler/lib/python3.9/site-packages/awswrangler/_config.py in wrapper(*args_raw, **kwargs)
    415                 del args[name]
    416                 args = {**args, **keywords}
--> 417         return function(**args)
    418 
    419     wrapper.__doc__ = _inject_config_doc(doc=function.__doc__, available_configs=available_configs)

~/.conda/envs/awswrangler/lib/python3.9/site-packages/awswrangler/s3/_read_parquet.py in read_parquet_table(table, database, filename_suffix, filename_ignore_suffix, catalog_id, partition_filter, columns, validate_schema, categories, safe, map_types, chunked, use_threads, boto3_session, s3_additional_kwargs)
    774     except KeyError as ex:
    775         raise exceptions.InvalidTable(f"Missing s3 location for {database}.{table}.") from ex
--> 776     return _data_types.cast_pandas_with_athena_types(
    777         df=read_parquet(
    778             path=path,

~/.conda/envs/awswrangler/lib/python3.9/site-packages/awswrangler/_data_types.py in cast_pandas_with_athena_types(df, dtype)
    597     for col, athena_type in dtype.items():
    598         if (
--> 599             (col in df.columns)
    600             and (athena_type.startswith("array") is False)
    601             and (athena_type.startswith("struct") is False)

AttributeError: 'generator' object has no attribute 'columns'
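The root cause can be reproduced in isolation with plain pandas: a generator object exposes none of a DataFrame's attributes, so any code that touches `.columns` on it raises this AttributeError. The function below (`fake_read_parquet`) is an illustrative stand-in, not awswrangler's actual internals:

```python
import pandas as pd

def fake_read_parquet(chunked: bool):
    # Stand-in for read_parquet: with chunked=True it returns a generator
    # of DataFrames instead of a single concatenated DataFrame.
    frames = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]
    if chunked:
        return (f for f in frames)
    return pd.concat(frames, ignore_index=True)

df = fake_read_parquet(chunked=False)
gen = fake_read_parquet(chunked=True)
print(hasattr(df, "columns"))   # True
print(hasattr(gen, "columns"))  # False: gen.columns raises AttributeError
```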

The exception is thrown at line 599 of cast_pandas_with_athena_types because that function expects an actual dataframe, not a generator of dataframes. It seems the calling function, read_parquet_table(), is where a fix is needed: when chunked=True, it should call cast_pandas_with_athena_types on each dataframe yielded by the generator from read_parquet(), rather than on the generator object itself.
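A minimal sketch of that fix pattern, using pandas-only stand-ins (`cast_with_types`, `read_chunks`, and `read_table` are hypothetical names for illustration, not awswrangler's actual internals):

```python
import pandas as pd

def cast_with_types(df: pd.DataFrame, dtype: dict) -> pd.DataFrame:
    # Stand-in for cast_pandas_with_athena_types: needs a real DataFrame
    # so that df.columns is available.
    for col, athena_type in dtype.items():
        if col in df.columns and athena_type == "bigint":
            df[col] = df[col].astype("int64")
    return df

def read_chunks():
    # Stand-in for read_parquet(..., chunked=True), which yields DataFrames.
    yield pd.DataFrame({"a": ["1", "2"]})
    yield pd.DataFrame({"a": ["3"]})

def read_table(chunked: bool):
    dtype = {"a": "bigint"}
    if chunked:
        # Fix: map the cast over each yielded DataFrame...
        return (cast_with_types(df, dtype) for df in read_chunks())
    # ...instead of passing the generator itself to the cast function.
    return cast_with_types(pd.concat(read_chunks(), ignore_index=True), dtype)
```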

My environment is Python 3.9, awswrangler 2.6.0, on Amazon Linux 2.

@maxispeicher
Contributor

I've created a fix for reading a parquet table in chunked mode. If you find the time, you can check whether it works for you 🙂

pip uninstall awswrangler -y
pip install git+https://github.com/maxispeicher/aws-data-wrangler.git@read_pq_table_chunked

@bentkibler
Author

I tried that fix in my environment, and it works as expected now. Thanks for the quick turnaround!

@jaidisido jaidisido added bug Something isn't working minor release Will be addressed in the next minor release labels Apr 6, 2021
@jaidisido jaidisido added this to the 2.7.0 milestone Apr 6, 2021
@jaidisido jaidisido linked a pull request Apr 6, 2021 that will close this issue
@jaidisido
Contributor

Covered in the 2.7.0 release.
