read_parquet_table fails when chunked is true for partitioned table #627

Closed
bentkibler opened this issue Apr 1, 2021 · 3 comments · Fixed by #631
Labels: bug (Something isn't working), minor release (Will be addressed in the next minor release), ready to release
Milestone: 2.7.0
@bentkibler

If I try to read a partitioned table in batch chunks using s3.read_parquet_table(chunked=True), I get the following exception:

<ipython-input-3-6b4a443538f2> in test_table(dbname, tablename, use_threads, chunked, partition_filter, columns)
      3     tot_rows = 0
      4     tot_files = 0
----> 5     dfs = wr.s3.read_parquet_table(boto3_session=session, 
      6                                    database=dbname,
      7                                    table=tablename,

~/.conda/envs/awswrangler/lib/python3.9/site-packages/awswrangler/_config.py in wrapper(*args_raw, **kwargs)
    415                 del args[name]
    416                 args = {**args, **keywords}
--> 417         return function(**args)
    418 
    419     wrapper.__doc__ = _inject_config_doc(doc=function.__doc__, available_configs=available_configs)

~/.conda/envs/awswrangler/lib/python3.9/site-packages/awswrangler/s3/_read_parquet.py in read_parquet_table(table, database, filename_suffix, filename_ignore_suffix, catalog_id, partition_filter, columns, validate_schema, categories, safe, map_types, chunked, use_threads, boto3_session, s3_additional_kwargs)
    774     except KeyError as ex:
    775         raise exceptions.InvalidTable(f"Missing s3 location for {database}.{table}.") from ex
--> 776     return _data_types.cast_pandas_with_athena_types(
    777         df=read_parquet(
    778             path=path,

~/.conda/envs/awswrangler/lib/python3.9/site-packages/awswrangler/_data_types.py in cast_pandas_with_athena_types(df, dtype)
    597     for col, athena_type in dtype.items():
    598         if (
--> 599             (col in df.columns)
    600             and (athena_type.startswith("array") is False)
    601             and (athena_type.startswith("struct") is False)

AttributeError: 'generator' object has no attribute 'columns'
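The root cause can be reproduced in isolation with plain pandas: a generator object exposes none of a DataFrame's attributes, so any code that touches `.columns` on it raises this AttributeError. The function below (`fake_read_parquet`) is an illustrative stand-in, not awswrangler's actual internals:

```python
import pandas as pd

def fake_read_parquet(chunked: bool):
    # Stand-in for read_parquet: with chunked=True it returns a generator
    # of DataFrames instead of a single concatenated DataFrame.
    frames = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]
    if chunked:
        return (f for f in frames)
    return pd.concat(frames, ignore_index=True)

df = fake_read_parquet(chunked=False)
gen = fake_read_parquet(chunked=True)
print(hasattr(df, "columns"))   # True
print(hasattr(gen, "columns"))  # False: gen.columns raises AttributeError
```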

The exception is thrown at line 599 of cast_pandas_with_athena_types because that function expects an actual dataframe, not a generator of dataframes. It seems the calling function, read_parquet_table(), is where a fix is needed: when chunked=True, it should call cast_pandas_with_athena_types on each dataframe yielded by the generator from read_parquet(), rather than on the generator object itself.
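A minimal sketch of that fix pattern, using pandas-only stand-ins (`cast_with_types`, `read_chunks`, and `read_table` are hypothetical names for illustration, not awswrangler's actual internals):

```python
import pandas as pd

def cast_with_types(df: pd.DataFrame, dtype: dict) -> pd.DataFrame:
    # Stand-in for cast_pandas_with_athena_types: needs a real DataFrame
    # so that df.columns is available.
    for col, athena_type in dtype.items():
        if col in df.columns and athena_type == "bigint":
            df[col] = df[col].astype("int64")
    return df

def read_chunks():
    # Stand-in for read_parquet(..., chunked=True), which yields DataFrames.
    yield pd.DataFrame({"a": ["1", "2"]})
    yield pd.DataFrame({"a": ["3"]})

def read_table(chunked: bool):
    dtype = {"a": "bigint"}
    if chunked:
        # Fix: map the cast over each yielded DataFrame...
        return (cast_with_types(df, dtype) for df in read_chunks())
    # ...instead of passing the generator itself to the cast function.
    return cast_with_types(pd.concat(read_chunks(), ignore_index=True), dtype)
```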

My environment is Python 3.9, awswrangler 2.6.0, on Amazon Linux 2.

@maxispeicher
Contributor

I've created a fix for reading a parquet table in chunked mode. If you find the time, you can check whether it works for you 🙂

pip uninstall awswrangler -y
pip install git+https://github.com/maxispeicher/aws-data-wrangler.git@read_pq_table_chunked

@bentkibler
Author

I tried that fix in my environment, and it works as expected now. Thanks for the quick turnaround!

@jaidisido jaidisido added bug Something isn't working minor release Will be addressed in the next minor release labels Apr 6, 2021
@jaidisido jaidisido added this to the 2.7.0 milestone Apr 6, 2021
@jaidisido jaidisido linked a pull request Apr 6, 2021 that will close this issue
@jaidisido
Contributor

Covered in the 2.7.0 release.
