Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot read a streaming Delta table with a watermark #457

Closed
langelgjm opened this issue Oct 11, 2021 · 3 comments
Closed

Cannot read a streaming Delta table with a watermark #457

langelgjm opened this issue Oct 11, 2021 · 3 comments
Labels
binding/rust Issues for the Rust crate bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@langelgjm
Copy link
Contributor

langelgjm commented Oct 11, 2021

Environment

Delta-rs version:
rust-v0.4.0, rust-v0.4.1 (head)

Binding:
rust

Environment:

  • Cloud provider: AWS
  • OS: Linux / OS X
  • Other:

Bug

What happened: Attempting to use delta-rs to read a streaming Delta table with a watermark fails with "Error: Failed to apply transaction log: Invalid JSON in log record"

What you expected to happen: delta-rs should be able to successfully read a streaming Delta table with a watermark

How to reproduce it:

Create a streaming Delta table in Spark with a watermark:

from pyspark.sql import functions as F

# create a dataframe
columns = ["foo", "bar"]
data = [(1, 2), (3, 4), (5, 6)]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF()
df = df.withColumn("ts", F.current_timestamp())

# write batch Delta dataframe
df.write.format("delta").save("/example_batch")

# read batch Delta dataframe as streaming, apply a watermark, and write back as streaming
dfs = spark.readStream.format("delta").load("/example_batch").withWatermark("ts", "1 hour")
dfs.writeStream.format("delta").option("checkpointLocation", "/example_streaming/_checkpoint").start("example_streaming")

Then, attempt to read the streaming Delta table with, e.g., delta-inspect

$ delta-inspect info /example_streaming
Error: Failed to apply transaction log: Invalid JSON in log record

Caused by:
    0: Invalid JSON in log record
    1: invalid type: integer `3600000`, expected a string at line 1 column 235

More details:

The error occurs because the streaming Delta table has a metaData transaction whose schemaString looks like this:

{
  "type": "struct",
  "fields": [
    {
      "name": "_1",
      "type": "long",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "_1",
      "type": "long",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "ts",
      "type": "timestamp",
      "nullable": true,
      "metadata": {
        "spark.watermarkDelayMs": 3600000
      }
    }
}

However, delta-rs defines SchemaField::metadata as type HashMap<String, String> (see here), so attempts to deserialize the third field above fail due to the key spark.watermarkDelayMs having a numeric value instead of a string.

I was able to work around this issue by simply disabling deserialization of the schemaString (not needed for my purposes), but a real fix would require changing the definition of SchemaField.

@langelgjm langelgjm added the bug Something isn't working label Oct 11, 2021
@mosyp
Copy link
Contributor

mosyp commented Oct 11, 2021

@langelgjm thank you for the submission.

@houqp I don't see a use of field's metadata anywhere in the code base, so we can define the metadata as HashMap<String, Value> or even Value, e.g. the json value

@houqp
Copy link
Member

houqp commented Oct 12, 2021

@mosyp you are correct that schema metadata is not being used at the moment, based on https://github.com/delta-io/delta/blob/master/PROTOCOL.md#struct-field, I think it's best if we serialize them as serde_json::Map or HashMap<String, Value>.

@houqp houqp added binding/rust Issues for the Rust crate good first issue Good for newcomers help wanted Extra attention is needed labels Oct 12, 2021
d-bhutada added a commit to d-bhutada/delta-rs that referenced this issue Dec 28, 2021
bug fix for issue 457, 524. This will enable to read metadata containing a numerical value
For more detail, please refer 
delta-io#524
delta-io#457
@langelgjm
Copy link
Contributor Author

Fixed by #531

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants