Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SELECT ... ORDER BY query fails on data with int64 timestamp and timezone field #959

Closed
Tracked by #8282 ...
sergiimk opened this issue Aug 30, 2021 · 7 comments
Closed
Tracked by #8282 ...

Comments

@sergiimk
Copy link
Contributor

Describe the bug
When trying to query a Parquet file produced by Apache Flink I get an error:

ArrowError(InvalidArgumentError("column types must match schema types, expected Timestamp(Millisecond, Some(\"UTC\")) but found Timestamp(Millisecond, None) at column index 0"))

Output of Java parquet-schema:

message Row {
  optional int64 system_time (TIMESTAMP(MILLIS,true));
  optional int64 reported_date (TIMESTAMP(MILLIS,true));
  optional binary province (STRING);
  optional int64 total_daily;
}

To Reproduce
Download and extract the sample data: data.tar.gz.

Run:

use datafusion::arrow::util::pretty::print_batches;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_parquet("test", "flink.parquet")?;
    let df = ctx.table("test")?;

    //let df = ctx.sql("select * from test")?;
    let df = ctx.sql("select * from test order by reported_date desc")?;

    let records = df.collect().await?;
    print_batches(&records)?;
    Ok(())
}

Note that simple select works fine, but ORDER BY fails.

Expected behavior
Query executes without errors.

@sergiimk sergiimk added the bug Something isn't working label Aug 30, 2021
@jorgecarleitao
Copy link
Member

This is due to a limitation in arrow-rs that does not support timestamps with timezones: every timestamp that enters arrow-rs and is transformed is outputted as a timestamp without timezone. Because Timestamp(a,Some(b)) != Timestamp(a,None), the RecordBatch refuses to accept the transformed columns (as it expects one with timestamp).

A solution in DataFusion is to apply an extra cast after every operator from arrow to the expected schema, thereby converting the array back to the expected logical type (until the next operator hits). This implies not using operators that return RecordBatch such as batch_filter, etc. (since afaik they also error in arrow-rs).

@sergiimk
Copy link
Contributor Author

Thanks for the prompt reply @jorgecarleitao.

I followed your advise of re-casting and this query works:

SELECT
  CAST(system_time as TIMESTAMP) as system_time, 
  CAST(reported_date as TIMESTAMP) as reported_date,
  province, 
  total_daily
FROM test
ORDER BY reported_date

Linking the corresponding arrow issue: apache/arrow-rs#393

I wonder if the impact of not having timezone support in Arrow can be minimized somehow, e.g. by treating Timestamp(a, None) as equal to Timestamp(a, Some("UTC")).

When reading parquet files produced by Spark the timestamps are being encoded as plain INT96 with no logical types, so it seems DataFusion already defaults to UTC?

@alamb
Copy link
Contributor

alamb commented Aug 31, 2021

@alamb alamb changed the title SELECT ... ORDER BY query fails on data with int64 timestamp field SELECT ... ORDER BY query fails on data with int64 timestamp and timezone field Oct 2, 2021
@sergiimk
Copy link
Contributor Author

Update: Latest datafusion no longer errors out on timestamps with time zones, but when printing out the column values it displays something like:

2022-12-20T00:17:00.207 (Unknown Time Zone 'UTC')

@alamb
Copy link
Contributor

alamb commented Dec 20, 2022

Thanks @sergiimk -- I wonder if we need to update the arrow pretty printing code to handle that better 🤔

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 1, 2024

Seems to be working as expected now on latest main.

Via datafusion-cli:

❯ select * from '/home/jeffrey/Downloads/flink.parquet' order by reported_date desc limit 10;
+--------------------------+----------------------+----------+-------------+
| system_time              | reported_date        | province | total_daily |
+--------------------------+----------------------+----------+-------------+
| 2021-08-30T20:38:07.488Z | 2021-08-25T00:00:00Z | ON       | 807         |
| 2021-08-30T20:38:07.488Z | 2021-08-25T00:00:00Z | BC       | 719         |
| 2021-08-30T20:38:07.488Z | 2021-08-24T00:00:00Z | ON       | 634         |
| 2021-08-30T20:38:07.488Z | 2021-08-24T00:00:00Z | BC       | 708         |
| 2021-08-30T20:38:07.488Z | 2021-08-23T00:00:00Z | ON       | 509         |
| 2021-08-30T20:38:07.488Z | 2021-08-23T00:00:00Z | BC       | 559         |
| 2021-08-30T20:38:07.488Z | 2021-08-22T00:00:00Z | BC       | 465         |
| 2021-08-30T20:38:07.488Z | 2021-08-22T00:00:00Z | ON       | 489         |
| 2021-08-30T20:38:07.488Z | 2021-08-21T00:00:00Z | BC       | 563         |
| 2021-08-30T20:38:07.488Z | 2021-08-21T00:00:00Z | ON       | 681         |
+--------------------------+----------------------+----------+-------------+
10 rows in set. Query took 0.008 seconds.

❯

Can see it properly displays timezone with Z now

cc @alamb @sergiimk

@alamb alamb removed the bug Something isn't working label Jan 1, 2024
@alamb
Copy link
Contributor

alamb commented Jan 1, 2024

Thanks @Jefffrey

I also verified the query runs correctly using an explicitly created external table as well

❯ create external table test stored as parquet location 'flink.parquet';
0 rows in set. Query took 0.003 seconds.

❯ select * from test order by reported_date desc;
+--------------------------+----------------------+----------+-------------+
| system_time              | reported_date        | province | total_daily |
+--------------------------+----------------------+----------+-------------+
| 2021-08-30T20:38:07.488Z | 2021-08-25T00:00:00Z | ON       | 807         |
| 2021-08-30T20:38:07.488Z | 2021-08-25T00:00:00Z | BC       | 719         |
| 2021-08-30T20:38:07.488Z | 2021-08-24T00:00:00Z | BC       | 708         |
| 2021-08-30T20:38:07.488Z | 2021-08-24T00:00:00Z | ON       | 634         |
| 2021-08-30T20:38:07.488Z | 2021-08-23T00:00:00Z | ON       | 509         |
| 2021-08-30T20:38:07.488Z | 2021-08-23T00:00:00Z | BC       | 559         |
| 2021-08-30T20:38:07.488Z | 2021-08-22T00:00:00Z | ON       | 489         |
| 2021-08-30T20:38:07.488Z | 2021-08-22T00:00:00Z | BC       | 465         |
| 2021-08-30T20:38:07.488Z | 2021-08-21T00:00:00Z | ON       | 681         |
| 2021-08-30T20:38:07.488Z | 2021-08-21T00:00:00Z | BC       | 563         |
| 2021-08-30T20:38:07.488Z | 2021-08-20T00:00:00Z | BC       | 696         |
| 2021-08-30T20:38:07.488Z | 2021-08-20T00:00:00Z | ON       | 710         |
| 2021-08-30T20:38:07.488Z | 2021-08-19T00:00:00Z | ON       | 706         |
| 2021-08-30T20:38:07.488Z | 2021-08-19T00:00:00Z | BC       | 672         |
| 2021-08-30T20:38:07.488Z | 2021-08-18T00:00:00Z | ON       | 651         |
| 2021-08-30T20:38:07.488Z | 2021-08-18T00:00:00Z | BC       | 774         |
| 2021-08-30T20:38:07.488Z | 2021-08-17T00:00:00Z | BC       | 622         |
| 2021-08-30T20:38:07.488Z | 2021-08-17T00:00:00Z | ON       | 511         |
| 2021-08-30T20:38:07.488Z | 2021-08-16T00:00:00Z | ON       | 434         |
| 2021-08-30T20:38:07.488Z | 2021-08-16T00:00:00Z | BC       | 388         |
| 2021-08-30T20:38:07.488Z | 2021-08-15T00:00:00Z | ON       | 450         |
| 2021-08-30T20:38:07.488Z | 2021-08-15T00:00:00Z | BC       | 452         |
| 2021-08-30T20:38:07.488Z | 2021-08-14T00:00:00Z | BC       | 427         |
| 2021-08-30T20:38:07.488Z | 2021-08-14T00:00:00Z | ON       | 506         |
| 2021-08-30T20:38:07.488Z | 2021-08-13T00:00:00Z | ON       | 544         |
| 2021-08-30T20:38:07.488Z | 2021-08-13T00:00:00Z | BC       | 546         |
| 2021-08-30T20:38:07.488Z | 2021-08-12T00:00:00Z | BC       | 704         |
| 2021-08-30T20:38:07.488Z | 2021-08-12T00:00:00Z | ON       | 530         |
| 2021-08-30T20:38:07.488Z | 2021-08-11T00:00:00Z | BC       | 539         |
| 2021-08-30T20:38:07.488Z | 2021-08-11T00:00:00Z | ON       | 535         |
| 2021-08-30T20:38:07.488Z | 2021-08-10T00:00:00Z | ON       | 386         |
| 2021-08-30T20:38:07.488Z | 2021-08-10T00:00:00Z | BC       | 540         |
| 2021-08-30T20:38:07.488Z | 2021-08-09T00:00:00Z | BC       | 360         |
| 2021-08-30T20:38:07.488Z | 2021-08-09T00:00:00Z | ON       | 335         |
| 2021-08-30T20:38:07.488Z | 2021-08-08T00:00:00Z | BC       | 298         |
| 2021-08-30T20:38:07.488Z | 2021-08-08T00:00:00Z | ON       | 337         |
| 2021-08-30T20:38:07.488Z | 2021-08-07T00:00:00Z | BC       | 390         |
| 2021-08-30T20:38:07.488Z | 2021-08-07T00:00:00Z | ON       | 343         |
| 2021-08-30T20:38:07.488Z | 2021-08-06T00:00:00Z | BC       | 428         |
| 2021-08-30T20:38:07.488Z | 2021-08-06T00:00:00Z | ON       | 446         |
| .                                                                        |
| .                                                                        |
| .                                                                        |
+--------------------------+----------------------+----------+-------------+
1095 rows in set (40 shown). Query took 0.004 seconds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants