read_parquet_metadata and store_parquet_metadata result inconsistent order of columns, causing glue table to modify multiple times #449

tuannguyen0901 · 2020-11-14T22:25:28Z

Describe the bug
read_parquet_metadata and store_parquet_metadata read Parquet files and infer the data structure (store_parquet_metadata also stores it in a Glue table).
However, both functions return the columns in an inconsistent order, which makes it hard to parameterize Athena queries.

To Reproduce
Read the same Parquet file multiple times: the results differ from run to run, which changes the Glue table each time.
1st run:
{'Op': 'string', 's3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ...}
2nd run: the 'Op' column has moved to the end.
{'s3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ..., 'Op': 'string'}
3rd run: the 'Op' column has moved back to its original position.
{'Op': 'string', 's3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ...}
And so on.

The Parquet file itself has the column order of the 1st result.
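A side note on detecting this kind of drift: in Python, two dicts compare equal even when their keys are in a different order, so a plain `==` between two runs would not flag the problem described above. The following is a minimal sketch, assuming the schemas are plain `dict` objects like the ones printed in the report; `same_column_order` is a hypothetical helper, not part of awswrangler, and the schemas are shortened to three columns for illustration.

```python
from typing import Dict


def same_column_order(first: Dict[str, str], second: Dict[str, str]) -> bool:
    """Return True only if both schemas list the same columns in the same order."""
    # Comparing the item lists makes the comparison order-sensitive.
    return list(first.items()) == list(second.items())


# Shortened versions of the 1st and 2nd run from the report (illustrative only).
run_1 = {"Op": "string", "s3_timestamp": "string", "id": "string"}
run_2 = {"s3_timestamp": "string", "id": "string", "Op": "string"}

print(run_1 == run_2)                   # True: dict equality ignores key order
print(same_column_order(run_1, run_2))  # False: the column order differs
```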

igorborgest · 2020-11-21T14:31:48Z

Hi @tuannguyen0901, thanks for reaching out!

I ran a lot of tests here, and this situation should only occur if you have mixed schemas (different column order) between different files in your dataset.

Unfortunately, this situation is not rare in practice, so in the commit referenced on this issue I've added a small change that ensures determinism in the read schema. It will always keep the first schema detected in the first file reached when following the key (path) alphanumeric order (the natural S3 object order).
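The idea can be illustrated with a small, self-contained sketch. This is not the code from the commit; it only assumes that each file's schema is available as a `dict` keyed by the object's path, and that a plain string sort is an acceptable stand-in for the S3 key order. The function name `merge_schemas` and the bucket paths are hypothetical.

```python
from typing import Dict, Mapping


def merge_schemas(schemas_by_key: Mapping[str, Dict[str, str]]) -> Dict[str, str]:
    """Merge per-file schemas deterministically.

    Files are visited in sorted key order, and each column keeps the
    position and type from the first file in which it appears.
    """
    merged: Dict[str, str] = {}
    for key in sorted(schemas_by_key):
        for column, dtype in schemas_by_key[key].items():
            # setdefault never moves or overwrites an existing column.
            merged.setdefault(column, dtype)
    return merged


# Hypothetical dataset: one file has the 'Op' column first, the other lacks it.
files = {
    "s3://bucket/table/b.parquet": {"s3_timestamp": "string", "id": "string"},
    "s3://bucket/table/a.parquet": {"Op": "string", "s3_timestamp": "string", "id": "string"},
}

print(list(merge_schemas(files)))  # ['Op', 's3_timestamp', 'id']
```

Because the keys are sorted before merging, the result does not depend on the order in which the files happen to be listed or read.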

Would you mind installing it directly from our repository and checking that everything works for your use case before the official release?
The idea is to publish it in version 1.10.1 next Wednesday.

pip install git+https://github.com/awslabs/aws-data-wrangler.git --no-use-pep517

nguyentrantuan · 2020-11-21T15:02:09Z

Thanks for fixing this @igorborgest. I checked and everything is fine now. The data structure is returned in the correct order on every run.
You are correct, there are Parquet files with and without the 'Op' column. In the files that have it, 'Op' is in the 1st position.

igorborgest · 2020-11-26T01:44:01Z

Released in version 1.10.1.

tuannguyen0901 added the bug label Nov 14, 2020

igorborgest self-assigned this Nov 21, 2020

igorborgest added the WIP label Nov 21, 2020

igorborgest modified the milestones: 1.11.0, 1.10.1 Nov 21, 2020

igorborgest added a commit that referenced this issue Nov 21, 2020

Forcing read_parquet_metadat determinism #449

af9ccf6

igorborgest added the ready to release label Nov 21, 2020

igorborgest removed the WIP label Nov 21, 2020

igorborgest closed this as completed Nov 26, 2020
