
[SUPPORT] Hudi 0.12.1 with Spark Structured Streaming: reading clustering replacecommit metadata (an Avro file) fails with Unrecognized token 'Obj^A^B^Vavro' #7375

Closed
aizain opened this issue Dec 4, 2022 · 9 comments · Fixed by #7389
Labels
priority:major (degraded perf; unable to move forward; potential bugs), streaming

Comments

@aizain
Contributor

aizain commented Dec 4, 2022

Describe the problem you faced

When I enable async clustering, Hudi writes xxx.replacecommit.requested as an Avro file, but the canSkipBatch function reads that file with a JSON reader and throws Unrecognized token 'Obj^A^B^Vavro'.

How can I fix it? I deleted the file, but the same thing happens on the next replacecommit.
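
For illustration, a minimal sketch (the table path is hypothetical) of why the JSON parse fails: the requested replacecommit file written for clustering is an Avro container, so its first bytes are the Avro magic header, not JSON:

// Hypothetical path; point it at the .hoodie directory of your table.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("/warehouse/history/.hoodie/20221204152715580.replacecommit.requested")
val fs = path.getFileSystem(new Configuration())
val in = fs.open(path)
val header = new Array[Byte](4)
in.readFully(header)
in.close()

// Prints "4f 62 6a 01": the bytes 'O','b','j',0x01 of the Avro container
// magic. Feeding these bytes to a JSON parser yields exactly
// "Unrecognized token 'Obj...'", the error reported above.
println(header.map(b => f"$b%02x").mkString(" "))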

To Reproduce

Steps to reproduce the behavior:

  1. Run a Spark Structured Streaming job with the Hudi sink, using the snippet and write options below:

spark.sql(conf.getSql)
  .na.fill("")
  .writeStream
  .format("hudi")
  .options(conf.getHudiConf)
  .option("checkpointLocation", conf.getCheckpointPath)
  .trigger(conf.getTrigger)
  .outputMode(OutputMode.Append())
  .start(conf.getOutputPath(conf.getHudiTableName))

hoodie.table.name=history
hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.datasource.write.operation=insert
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.precombine.field=version
hoodie.datasource.write.partitionpath.field=partition
hoodie.cleaner.commits.retained=3
hoodie.clustering.async.enabled=true
hoodie.clean.async=true
hoodie.parquet.max.file.size=268435456
hoodie.metrics.on=true

  2. When clustering runs, a replacecommit instant is requested (20221204152715580__replacecommit__REQUESTED); the .replacecommit.requested file on the timeline (20221204150150328.replacecommit.requested) is an Avro file.


  3. Keep the stream running.
  4. HoodieStreamingSink.addBatch calls canSkipBatch, which calls CommitUtils.getLatestCommitMetadataWithValidCheckpointInfo.
  5. The following error is thrown:

Caused by: org.apache.hudi.exception.HoodieIOException: Failed to parse HoodieCommitMetadata for [==>20221204152715580__replacecommit__REQUESTED]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Obj^A^B^Vavro': was expecting ('true', 'false' or 'null')

Expected behavior

Environment Description

  • Hudi version : 0.12.1

  • Spark version : 2.4.3.2

  • Hive version : not used

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : HDFS

  • Running on Docker? (yes/no) : no


Stacktrace

[stacktrace attached as screenshots]

@aizain
Contributor Author

aizain commented Dec 4, 2022

Writing through the streaming sink fails with the error above (stacktrace screenshots attached). When I use foreachBatch and write each micro-batch with the batch sink instead, the write succeeds.
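
For reference, a sketch of that workaround, reusing the hypothetical conf helpers from the snippet above; foreachBatch routes every micro-batch through the batch write path, so HoodieStreamingSink.canSkipBatch never runs:

import org.apache.spark.sql.{DataFrame, SaveMode}

spark.sql(conf.getSql)
  .na.fill("")
  .writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Each micro-batch uses the batch DataSource write, sidestepping the
    // failing timeline read in HoodieStreamingSink.canSkipBatch.
    batchDf.write
      .format("hudi")
      .options(conf.getHudiConf)
      .mode(SaveMode.Append)
      .save(conf.getOutputPath(conf.getHudiTableName))
  }
  .option("checkpointLocation", conf.getCheckpointPath)
  .trigger(conf.getTrigger)
  .start()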

@xushiyan
Member

xushiyan commented Dec 6, 2022

@aizain I took a closer look at the code and I think it is a bug: a non-completed commit instant should not be used for reading checkpoint info. The code should filter the timeline for completed instants only. cc @codope
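
For context, a sketch of the filtering this implies, using Hudi's timeline API as of 0.12.x (the base path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hudi.common.table.HoodieTableMetaClient

val metaClient = HoodieTableMetaClient.builder()
  .setConf(new Configuration())
  .setBasePath("/warehouse/history")
  .build()

// filterCompletedInstants() drops REQUESTED/INFLIGHT instants, such as the
// Avro-serialized xxx.replacecommit.requested above, so only completed
// instants (whose metadata is JSON) are considered for checkpoint lookup.
val completed = metaClient.getActiveTimeline
  .getCommitsTimeline
  .filterCompletedInstants()

val latest = completed.lastInstant() // org.apache.hudi.common.util.Option[HoodieInstant]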

@aizain
Contributor Author

aizain commented Dec 13, 2022

thanks~

@xccui
Member

xccui commented Dec 21, 2022

Built with the latest version on master (fd62a14) but still encountered the same issue (with Flink r/w).

@xccui
Member

xccui commented Dec 21, 2022

Stacktrace:

Caused by: org.apache.hudi.exception.HoodieException: java.io.IOException: unable to read commit metadata
	at org.apache.hudi.sink.partitioner.profile.WriteProfiles.getCommitMetadata(WriteProfiles.java:240)
	at org.apache.hudi.source.IncrementalInputSplits.lambda$inputSplits$4(IncrementalInputSplits.java:314)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at org.apache.hudi.source.IncrementalInputSplits.inputSplits(IncrementalInputSplits.java:314)
	at org.apache.hudi.source.StreamReadMonitoringFunction.monitorDirAndForwardSplits(StreamReadMonitoringFunction.java:206)
	at org.apache.hudi.source.StreamReadMonitoringFunction.run(StreamReadMonitoringFunction.java:179)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:332)
Caused by: java.io.IOException: unable to read commit metadata
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromBytes(HoodieCommitMetadata.java:496)
	at org.apache.hudi.common.table.timeline.TimelineUtils.getCommitMetadata(TimelineUtils.java:227)
	at org.apache.hudi.sink.partitioner.profile.WriteProfiles.getCommitMetadata(WriteProfiles.java:236)
	... 14 more
Caused by: org.apache.hudi.com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Obj���avro': was expecting ('true', 'false' or 'null')
 at [Source: UNKNOWN; line: 1, column: 11]
	at org.apache.hudi.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
	at org.apache.hudi.com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
	at org.apache.hudi.com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2462)
	at org.apache.hudi.com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1621)
	at org.apache.hudi.com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:689)
	at org.apache.hudi.com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3776)
	at org.apache.hudi.com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3721)
	at org.apache.hudi.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromJsonString(HoodieCommitMetadata.java:240)
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromBytes(HoodieCommitMetadata.java:494)
	... 16 more

@danny0405, could you also take a look? Thanks!

@danny0405
Contributor

Yeah, it's a bug that was introduced in #7296; I'll file a fix soon ~

@danny0405
Contributor

A fix PR has been filed here: #7540

@CTTY
Contributor

CTTY commented Sep 13, 2023

This issue regressed in 0.13.1+; a new PR is posted to fix it: #9711

@sdudi-te

sdudi-te commented Jul 3, 2024

Is there a possible workaround for this? In other words, how do we recover from this situation?

We are using Spark Structured Streaming on Kafka and writing the output to Hudi (v0.13.1) on S3.

After deleting the partial commit file (as a workaround) or moving back to a previous checkpoint, we observe that even though the streaming job progresses with updated offsets, no data is ever written to Hudi.
