
Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark #7958

Closed
comphead opened this issue Oct 27, 2023 · 17 comments · Fixed by #8369
@comphead
Contributor

comphead commented Oct 27, 2023

Describe the bug

DF reads the Parquet timestamp datatype as nanoseconds from the Parquet file, whereas DuckDB and Spark treat it as seconds.

To Reproduce

Create a Parquet file with the timestamp value -62125747200 and read it back.

DuckDB and Spark read the value correctly:

0001-04-25 00:00:00

but DF reads the timestamp as nanoseconds and returns the wrong answer:

❯ select * from test;
+-------------------------------+
| a                             |
+-------------------------------+
| 1754-12-22T22:43:41.128654848 |
+-------------------------------+
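
(Editor's note: the unexpected value above is numerically consistent with a signed 64-bit overflow. Converting -62125747200 seconds to nanoseconds requires roughly -6.2e19, which does not fit in an i64, and the wrapped value lands exactly on the timestamp DataFusion prints. A quick illustration in Python; this is only a sketch of the arithmetic, not the actual DataFusion code path:)

from datetime import datetime, timedelta

def wrap_i64(x: int) -> int:
    """Simulate two's-complement wraparound of a signed 64-bit integer."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

nanos = wrap_i64(-62_125_747_200 * 10**9)              # -6785514978871345152
print(datetime(1970, 1, 1) + timedelta(microseconds=nanos // 1_000))
# 1754-12-22 22:43:41.128654  (the reported output, truncated to microseconds)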

Expected behavior

The behavior should be the same across engines.

Additional context

No response

@comphead comphead added the bug label Oct 27, 2023
@comphead
Contributor Author

@waitingkuo let me know your thoughts

@alamb
Contributor

alamb commented Oct 27, 2023

create a parquet file with timestamp value -62125747200 and read it back

How does one do this? Can you provide an example of such a file or the commands needed to create such a file?

@comphead
Contributor Author

ts.snappy.zip
Attaching a Parquet file. I created it in Spark:

spark.sql("select cast(-62125747200 as timestamp)").write.parquet("out")

@waitingkuo
Contributor

@alamb @comphead
I'm not familiar with the Parquet parser; I assume this conversion is handled by timestamp casting rather than the to_timestamp function.
The behaviors of casting and to_timestamp have been different since #7844 was merged:

select to_timestamp(1), arrow_typeof(to_timestamp(1));
+------------------------+--------------------------------------+
| to_timestamp(Int64(1)) | arrow_typeof(to_timestamp(Int64(1))) |
+------------------------+--------------------------------------+
| 1970-01-01T00:00:01    | Timestamp(Second, None)              |
+------------------------+--------------------------------------+
1 row in set. Query took 0.004 seconds.

❯ select 1::timestamp, arrow_typeof(1::timestamp);
+-------------------------------+-----------------------------+
| Int64(1)                      | arrow_typeof(Int64(1))      |
+-------------------------------+-----------------------------+
| 1970-01-01T00:00:00.000000001 | Timestamp(Nanosecond, None) |
+-------------------------------+-----------------------------+

@comphead
Contributor Author

comphead commented Oct 29, 2023

@waitingkuo thanks

to_timestamp(1) is consistent with PG, DuckDB, Spark.

The 1::timestamp syntax isn't recognized by any of the engines above; is it supposed to behave the same as to_timestamp(1)?

@waitingkuo
Contributor

Timestamp in SQL maps to Timestamp(Nanosecond, tz) for now.
https://github.com/apache/arrow-datafusion/blob/9b45967edc6dba312ea223464dad3e66604d2095/datafusion/sql/src/planner.rs#L346

So far, 1::timestamp in DataFusion treats the integer 1 as 1 ns, which is the same as pyarrow:

In [8]: pa.array([1]).cast(pa.timestamp('ns'))
Out[8]: 
<pyarrow.lib.TimestampArray object at 0x11969d2e0>
[
  1970-01-01 00:00:00.000000001
]
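
(Editor's note: for comparison, casting the same integer to a seconds-unit timestamp interprets it as 1 second, matching what to_timestamp does; a quick check in the same, hypothetical, pyarrow session:)

In [9]: pa.array([1]).cast(pa.timestamp('s'))
Out[9]:
<pyarrow.lib.TimestampArray object at 0x...>
[
  1970-01-01 00:00:01
]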

@alamb @avantgardnerio @liukun4515 any thoughts about this?

@avantgardnerio
Contributor

Parquet has a proper Timestamp field, correct? https://learn.microsoft.com/en-us/common-data-model/sdk/parquet-to-cdm-datatype

And that datatype supports a precision field which specifies millis/micros/nanos?

So I presume this error only happens when reading an INT field and casting? If so, I'd 100% be on board with copying the behavior of these other well-known databases. Folks who stored their data properly as timestamps shouldn't be affected. It would be a breaking change for those who just stored an INT, but I think now is the time to make that change since datetime support is relatively new... it will only get harder to change going forward.
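
(Editor's note: the modern Parquet TIMESTAMP logical type does carry the unit, so compliant readers don't have to guess. A minimal sketch with pyarrow, using a hypothetical file name proper_ts.parquet; the point is that the column is written as INT64 annotated with a TIMESTAMP logical type rather than as INT96 or a bare integer:)

import pyarrow as pa
import pyarrow.parquet as pq

# A column stored as a proper Parquet timestamp (INT64 + TIMESTAMP logical type)
table = pa.table({"a": pa.array([0], type=pa.timestamp("us"))})
pq.write_table(table, "proper_ts.parquet")

# The Parquet schema records the physical type and the time unit
print(pq.ParquetFile("proper_ts.parquet").schema)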

@alamb
Contributor

alamb commented Oct 30, 2023

If so, I'd 100% be on board copying the behavior of these other well known databases

I agree

I took a look at what ts.snappy.parquet contains:

$ parquet-tools schema -d ts.snappy.parquet
message spark_schema {
  required int96 a;
}

creator: parquet-mr version 1.10.99.7.1.7.2000-305 (build eeabcd207c4c506ebd915865772cadb9bac25837)
extra: org.apache.spark.version = 2.4.7
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"timestamp","nullable":false,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
a: REQUIRED INT96 R:0 D:0

row group 1: RC:1 TS:44 OFFSET:4
--------------------------------------------------------------------------------
a:  INT96 SNAPPY DO:0 FPO:4 SZ:48/44/0.92 VC:1 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0x0000000000000000C4441A00, max: 0x0000000000000000C4441A00, num_nulls: 0]

It seems to use a different type and has extra metadata that is not present in an equivalent file created by DataFusion:

❯ select to_timestamp_seconds(-62125747200);
+-------------------------------------------+
| to_timestamp_seconds(Int64(-62125747200)) |
+-------------------------------------------+
| 0001-04-25T00:00:00                       |
+-------------------------------------------+
1 row in set. Query took 0.001 seconds.

❯ copy (select to_timestamp_seconds(-62125747200) as "a") to 'ts-df.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row in set. Query took 0.024 seconds.

The field is not read as a timestamp at all 🤔

❯ select * from 'ts-df.parquet';
+--------------+
| a            |
+--------------+
| -62125747200 |
+--------------+
1 row in set. Query took 0.004 seconds.

And the metadata / type information is different from Spark's:

$ parquet-tools schema -d ts-df.parquet
message arrow_schema {
  required int64 a;
}

creator: datafusion version 32.0.0

file schema: arrow_schema
--------------------------------------------------------------------------------
a: REQUIRED INT64 R:0 D:0

row group 1: RC:1 TS:63 OFFSET:4
--------------------------------------------------------------------------------
a:  INT64 ZSTD DO:4 FPO:35 SZ:81/63/0.78 VC:1 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[min: -62125747200, max: -62125747200, num_nulls not defined]

@alamb alamb changed the title to_timestamp() wrong value reading from parquet Wrong timestamp type read while from parquet file created by spark Oct 30, 2023
@comphead
Contributor Author

Looks like the analysis shows DF has a bunch of issues with the timestamp type:

  • select cast(1 as timestamp); should return 1970-01-01T00:00:01, not 1970-01-01T00:00:00.000000001
  • 1::timestamp should be consistent with to_timestamp(1), which works correctly now
  • reading the INT96 timestamp type is not consistent with major engines (this ticket)
  • writing timestamps via COPY doesn't generate all the necessary Parquet metadata

@comphead comphead self-assigned this Nov 1, 2023
@comphead
Contributor Author

comphead commented Nov 2, 2023

arrow-rs treats the INT96 Parquet type as Timestamp(Nanosecond):
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/primitive.rs#L97

There is an interesting explanation from Snowflake of the same issue:
https://community.snowflake.com/s/article/TIMESTAMP-function-returns-wrong-date-time-value-from-Parquet-file

Key takeaways

  • The INT96 Parquet type is deprecated: https://issues.apache.org/jira/browse/PARQUET-323
  • INT96 is only used to represent nanosecond timestamps
  • Apache projects like Hive and Spark still treat the first 16 bytes incorrectly; they return what users expect to see, but it is in fact incorrect.

That is the reason for the difference. However, DuckDB also behaves like Spark. To provide compatibility we may want to introduce a config param in DF and treat INT96 the way Spark does.
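
(Editor's note: a minimal decoding sketch, assuming the legacy INT96 layout used by Impala/Hive/Spark: the first 8 bytes are nanoseconds within the day and the last 4 bytes are the Julian day, both little-endian and read here as unsigned; the signedness question in the issue title concerns how these fields are interpreted. Applied to the statistics value from the ts.snappy.parquet dump above, it reproduces the value Spark and DuckDB report:)

from datetime import datetime, timedelta

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588   # Julian day number of 1970-01-01

def decode_int96(raw: bytes) -> datetime:
    """Decode a 12-byte legacy INT96 timestamp (unsigned little-endian fields)."""
    nanos_of_day = int.from_bytes(raw[:8], "little")
    julian_day = int.from_bytes(raw[8:], "little")
    return datetime(1970, 1, 1) + timedelta(
        days=julian_day - JULIAN_DAY_OF_UNIX_EPOCH,
        microseconds=nanos_of_day // 1_000,   # datetime has microsecond resolution
    )

# min/max statistic from the parquet-tools output above
print(decode_int96(bytes.fromhex("0000000000000000C4441A00")))
# 0001-04-25 00:00:00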

What are your thoughts? @alamb @waitingkuo @tustvold @viirya

@tustvold
Contributor

tustvold commented Nov 2, 2023

I agree with the conclusion in the Snowflake document: we should follow the specification and be consistent with other compliant implementations.

As an aside, this type has been deprecated for almost a decade; why is Spark still using it...

@viirya
Member

viirya commented Nov 6, 2023

Spark community tried (https://issues.apache.org/jira/browse/SPARK-27528) to change default Parquet timestamp type to TIMESTAMP_MICROS but the change was reverted back to INT96 later for ecosystem compatibility (https://issues.apache.org/jira/browse/SPARK-31639).

@alamb
Contributor

alamb commented Nov 20, 2023

Added to list on #7958

@edmondop
Contributor

Can I give a shot at this?

@tustvold
Contributor

I think it would be worthwhile to write up a description of any proposed change first. It isn't entirely clear to me how one handles this correctly, or whether we should simply follow Snowflake's example and close this as "won't fix".

Perhaps @comphead might be able to help out here?

@tustvold tustvold changed the title Wrong timestamp type read while from parquet file created by spark Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark Nov 22, 2023
@comphead
Contributor Author

Thanks @edmondop, please hold off on this.
I'm fixing some timestamp literal issues, and we will have a discussion about timestamp precision in DF, so this ticket will depend on the outcome of those.
