createdTime lost in metadata after schema change #2925

Closed
crpcrp opened this issue Oct 6, 2024 · 2 comments · Fixed by #2926
Labels
binding/rust Issues for the Rust crate bug Something isn't working


crpcrp commented Oct 6, 2024

Environment

Delta-rs version: 0.20.1

Binding: Python

Environment:

  • Cloud provider:
  • OS: Windows
  • Other:

Bug

What happened:
When writing with mode='append' and schema_mode='merge', the createdTime field in the metadata becomes null if the append changes the schema.

What you expected to happen:
The createdTime should be kept as its original value even after schema changes.

How to reproduce it:

import pandas as pd
from deltalake import write_deltalake


data1 = {
    "a": [1, 2],
    "b": [10, 20]
}

data2 = {
    "a": [3, 4],
    "c": ['X', 'Y']
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

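# First write creates the table with columns a and b.
write_deltalake('my_table', df1, mode='append', schema_mode='merge')
# Second write adds column c, which triggers the schema merge.
write_deltalake('my_table', df2, mode='append', schema_mode='merge')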

More details:

{
    "metaData": {
        "id": "326d7d04-a989-48ea-99e6-590eb8d66946",
        "name": null,
        "description": null,
        "format": {
            "provider": "parquet",
            "options": {}
        },
        "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"a\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"b\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}",
        "partitionColumns": [],
        "createdTime": null,
        "configuration": {}
    }
}
{
    "add": {
        "path": "part-00001-026f1f36-2361-4a5f-a115-b2f00ec7962d-c000.snappy.parquet",
        "partitionValues": {},
        "size": 1038,
        "modificationTime": 1728224675640,
        "dataChange": true,
        "stats": "{\"numRecords\":2,\"minValues\":{\"a\":3,\"c\":\"X\"},\"maxValues\":{\"a\":4,\"c\":\"Y\"},\"nullCount\":{\"a\":0,\"b\":2,\"c\":0}}",
        "tags": null,
        "deletionVector": null,
        "baseRowId": null,
        "defaultRowCommitVersion": null,
        "clusteringProvider": null
    }
}
{
    "commitInfo": {
        "timestamp": 1728224675640,
        "operation": "WRITE",
        "operationParameters": {
            "mode": "Append"
        },
        "clientVersion": "delta-rs.0.20.1",
        "operationMetrics": {
            "execution_time_ms": 2,
            "num_added_files": 1,
            "num_added_rows": 2,
            "num_partitions": 0,
            "num_removed_files": 0
        }
    }
}
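
To check which commits carry a null createdTime, the log can be scanned directly. The following is a minimal sketch using only the standard library; the helper name `metadata_created_times` is hypothetical, and it assumes the usual layout of newline-delimited JSON commit files under `_delta_log`, as in the dump above.

```python
import json
from pathlib import Path


def metadata_created_times(table_path):
    """Return (commit file name, createdTime) for each metaData action in the log.

    Hypothetical helper: assumes commits are newline-delimited JSON files
    stored under <table_path>/_delta_log.
    """
    log_dir = Path(table_path) / "_delta_log"
    results = []
    # Commit files are named by zero-padded version, so sorting by name
    # also sorts by table version.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "metaData" in action:
                # A missing key and an explicit JSON null both come back as None.
                results.append((commit.name, action["metaData"].get("createdTime")))
    return results
```

Calling `metadata_created_times('my_table')` after the two writes above should list one entry per metaData action, which makes it easy to see whether the second commit wrote a null.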
@crpcrp crpcrp added the bug Something isn't working label Oct 6, 2024
@rtyler rtyler added the binding/rust Issues for the Rust crate label Oct 7, 2024
rtyler (Member) commented Oct 7, 2024

The createdTime should be kept as its original value even after schema changes.

As I mentioned in #2925, the createdTime is optional and should not be identical after the merge operation. The protocol states that the createdTime should be the time when the action itself is created, not a timestamp that corresponds to table creation.

I think it's worthwhile to ensure that the new metadata action contains an updated timestamp, even though it's not required by the protocol.

I am curious @crpcrp what you're using this field for 🤔

rtyler added a commit to ion-elgreco/delta-rs that referenced this issue Oct 7, 2024
crpcrp (Author) commented Oct 7, 2024

I thought it referred to the table creation, but thanks for the protocol link.

This issue came up when I used an external table in BigQuery pointing to a Delta Lake table on GCS. After the schema changed, BigQuery failed to parse the Delta log, because it expects a long data type in the createdTime metadata field.

I've seen that the createdTime is optional, but have not found how to 'force' it to be filled.
