Describe the bug
read_parquet_metadata and store_parquet_metadata read parquet files and infer the data structure (store_parquet_metadata also stores it in a Glue table).
However, both functions return the columns in an inconsistent order, which makes it hard to parameterize Athena queries.
To Reproduce
Read the same parquet file multiple times. The results are different each time, which changes the Glue table.
1st run:
{'Op': 'string', 's3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ...}
2nd run: the 'Op' column has moved to the end.
{'s3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ..., 'Op': 'string'}
3rd run: the 'Op' column has moved back to its original position.
{'Op': 'string', 's3_timestamp': 'string', 'id': 'string', 'name': 'string', 'date_entered': 'timestamp', ...}
And so on...
The parquet file itself has the column order of the 1st result.
tuannguyen0901
changed the title
read_parquet_metadata and store_parquet_metadata result inconsistent order of columns, causing glue table to change order
read_parquet_metadata and store_parquet_metadata result inconsistent order of columns, causing glue table to modify multiple times
Nov 14, 2020
I did a lot of tests here, and this situation should only occur if you have mixed schemas (different column orders) between different files in your dataset.
Unfortunately, this situation is not rare out there. So in the commit above I've added a small change that ensures determinism in the read schema. It will always keep the first schema detected in the first file reached following the key (path) alphanumeric order (the natural S3 object order).
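For readers who want to see the idea in isolation, here is a minimal sketch of a deterministic schema merge. It is not the actual code from the commit. The function name `merge_schemas`, the example paths, and the choice to append unseen columns from later files (and to keep the first type seen for a column) are all assumptions made for illustration.

```python
from typing import Dict, Mapping


def merge_schemas(schemas_by_path: Mapping[str, Mapping[str, str]]) -> Dict[str, str]:
    """Merge per-file schemas into one schema with a reproducible column order.

    Hypothetical helper, not part of any library: files are visited in sorted
    key (path) order, so the first file's column order always wins. Columns
    that only appear in later files are appended in the order encountered.
    """
    merged: Dict[str, str] = {}
    for path in sorted(schemas_by_path):
        for column, dtype in schemas_by_path[path].items():
            # Assumption: keep the first type seen for a column.
            merged.setdefault(column, dtype)
    return merged


# Hypothetical dataset: one file has the 'Op' column, the other does not.
schemas = {
    "s3://bucket/table/part-2.parquet": {"s3_timestamp": "string", "id": "string", "name": "string"},
    "s3://bucket/table/part-1.parquet": {"Op": "string", "s3_timestamp": "string", "id": "string"},
}

print(list(merge_schemas(schemas)))
```

Because the paths are sorted before merging, the result does not depend on the order in which the files happened to be listed or read.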
Would you mind installing it directly from our repository and checking that everything is fine for your use case before the official release?
The idea is to publish it in version 1.10.1 next Wednesday.
Thanks for fixing this @igorborgest. I checked and everything is fine now. The data structure is returned in the correct order on every run.
You are correct, there are parquet files with and without the 'Op' column. In the files that have it, 'Op' is in the 1st position.