read_parquet_metadata and store_parquet_metadata result inconsistent order of columns, causing glue table to modify multiple times #449

Closed
tuannguyen0901 opened this issue Nov 14, 2020 · 3 comments
Labels: bug, ready to release

@tuannguyen0901

Describe the bug
read_parquet_metadata and store_parquet_metadata read Parquet files and infer their data structure (store_parquet_metadata also stores it in a Glue table).
However, both functions return the columns in an inconsistent order, which makes it hard to parameterize Athena queries.

To Reproduce
Read the same Parquet file multiple times: the result differs from run to run, which changes the Glue table each time.
1st run:
{'Op': 'string', 's3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ....}
2nd run: the 'Op' column has moved to the end.
{'s3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ..., 'Op': 'string'}
3rd run: the 'Op' column has moved back to its original position.
{'Op': 'string', 's3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ....}
And so on.

The Parquet file itself has the column order shown in the 1st result.
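
One way to check for the behaviour described in this report is to read the schema several times and compare the column order from each run. The sketch below is only an illustration: `column_orders` and `is_order_stable` are hypothetical helper names, and the reader is passed in as a callable so the check does not depend on any particular library call. The S3 path in the usage comment is a placeholder.

```python
from typing import Callable, Dict, List


def column_orders(read_schema: Callable[[], Dict[str, str]], runs: int = 3) -> List[List[str]]:
    """Call `read_schema` `runs` times and record the column order of each result."""
    return [list(read_schema().keys()) for _ in range(runs)]


def is_order_stable(orders: List[List[str]]) -> bool:
    """Return True when every run produced the same column order as the first run."""
    return all(order == orders[0] for order in orders)


# Possible usage with awswrangler (placeholder path, requires AWS credentials):
#   import awswrangler as wr
#   reader = lambda: wr.s3.read_parquet_metadata(path="s3://my-bucket/my-prefix/", dataset=True)[0]
#   print(is_order_stable(column_orders(reader)))
```

Comparing lists of keys rather than the dicts themselves matters here, because two Python dicts with the same items compare equal regardless of insertion order.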

@tuannguyen0901 tuannguyen0901 added the bug Something isn't working label Nov 14, 2020
@tuannguyen0901 tuannguyen0901 changed the title from "read_parquet_metadata and store_parquet_metadata result inconsistent order of columns, causing glue table to change order" to "read_parquet_metadata and store_parquet_metadata result inconsistent order of columns, causing glue table to modify multiple times" Nov 14, 2020
@igorborgest igorborgest self-assigned this Nov 21, 2020
@igorborgest igorborgest added the WIP Work in progress label Nov 21, 2020
@igorborgest igorborgest modified the milestones: 1.11.0, 1.10.1 Nov 21, 2020
@igorborgest
Contributor

Hi @tuannguyen0901, thanks for reaching out!

I ran a lot of tests here, and this should only occur if your dataset has mixed schemas (a different column order) across its files.

Unfortunately, this situation is not rare in practice. So, in the commit above, I've added a small change that makes the inferred schema deterministic: it always keeps the schema from the first file reached when following the alphanumeric order of the keys (paths), which is the natural S3 object order.

Would you mind installing it directly from our repository and checking that everything works for your use case before the official release?
The plan is to publish it in version 1.10.1 next Wednesday.

pip install git+https://github.com/awslabs/aws-data-wrangler.git --no-use-pep517
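
The determinism rule described in this comment can be illustrated with a minimal sketch. This is not the library's actual implementation: `merge_schemas` is a hypothetical function, the per-file schemas are assumed to be already available as plain dicts keyed by S3 path, and keeping the first-seen type when two files disagree is an assumption made only for this example.

```python
from typing import Dict


def merge_schemas(schemas_by_key: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Merge per-file schemas into one, visiting files in sorted key (path) order.

    A column keeps the position and type it had in the first file that contains it,
    so the result does not depend on the order in which files were read.
    """
    merged: Dict[str, str] = {}
    for key in sorted(schemas_by_key):  # fixed, reproducible visiting order
        for column, dtype in schemas_by_key[key].items():
            merged.setdefault(column, dtype)  # first occurrence wins (assumption)
    return merged


# Two files with different column sets, supplied in arbitrary order.
schemas = {
    "prefix/b.parquet": {"s3_timestamp": "string", "id": "string"},
    "prefix/a.parquet": {"Op": "string", "s3_timestamp": "string", "id": "string"},
}
print(list(merge_schemas(schemas)))
```

Because the files are always visited in sorted key order, the same set of files yields the same column order on every run, whichever file happened to be read first.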

@nguyentrantuan

Thanks for fixing this, @igorborgest. I checked and everything is fine now. The data structure comes back in the correct order on every run.
You are correct: some Parquet files have the 'Op' column and some do not. In the files that have it, 'Op' is in the first position.

@igorborgest igorborgest removed the WIP Work in progress label Nov 21, 2020
@igorborgest
Contributor

Released in version 1.10.1.
