
[SUPPORT] Hudi 0.12.1 with Spark Structured Streaming: reading clustering replacecommit metadata (an Avro file) fails with Unrecognized token 'Obj^A^B^Vavro' #7375

Closed
aizain opened this issue Dec 4, 2022 · 9 comments · Fixed by #7389
Labels
priority:major (degraded perf; unable to move forward; potential bugs), streaming

Comments

@aizain
Contributor

aizain commented Dec 4, 2022

Describe the problem you faced

When I enable async clustering, Hudi writes xxx.replacecommit.requested as an Avro file, but the canSkipBatch function reads that file with a JSON reader and throws Unrecognized token 'Obj^A^B^Vavro'.

How can I fix it? I deleted the file, but the same thing happens on the next replacecommit.
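
For illustration, a minimal sketch (the table path is hypothetical) of why the JSON parse fails: the requested replacecommit file written for clustering is an Avro container, so its first bytes are the Avro magic header, not JSON:

// Hypothetical path; point it at the .hoodie directory of your table.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("/warehouse/history/.hoodie/20221204152715580.replacecommit.requested")
val fs = path.getFileSystem(new Configuration())
val in = fs.open(path)
val header = new Array[Byte](4)
in.readFully(header)
in.close()

// Prints "4f 62 6a 01": the bytes 'O','b','j',0x01 of the Avro container
// magic. Feeding these bytes to a JSON parser yields exactly
// "Unrecognized token 'Obj...'", the error reported above.
println(header.map(b => f"$b%02x").mkString(" "))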

To Reproduce

Steps to reproduce the behavior:

  1. Run a Spark Structured Streaming job with the Hudi sink, using the snippet and write options below:

spark.sql(conf.getSql)
  .na.fill("")
  .writeStream
  .format("hudi")
  .options(conf.getHudiConf)
  .option("checkpointLocation", conf.getCheckpointPath)
  .trigger(conf.getTrigger)
  .outputMode(OutputMode.Append())
  .start(conf.getOutputPath(conf.getHudiTableName))

hoodie.table.name=history
hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.datasource.write.operation=insert
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.precombine.field=version
hoodie.datasource.write.partitionpath.field=partition
hoodie.cleaner.commits.retained=3
hoodie.clustering.async.enabled=true
hoodie.clean.async=true
hoodie.parquet.max.file.size=268435456
hoodie.metrics.on=true

  2. When clustering runs, a replacecommit instant is requested (20221204152715580__replacecommit__REQUESTED); the .replacecommit.requested file on the timeline (20221204150150328.replacecommit.requested) is an Avro file.


  3. Keep the stream running.
  4. HoodieStreamingSink.addBatch calls canSkipBatch, which calls CommitUtils.getLatestCommitMetadataWithValidCheckpointInfo.
  5. The following error is thrown:

Caused by: org.apache.hudi.exception.HoodieIOException: Failed to parse HoodieCommitMetadata for [==>20221204152715580__replacecommit__REQUESTED]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Obj^A^B^Vavro': was expecting ('true', 'false' or 'null')

Expected behavior

Environment Description

  • Hudi version : 0.12.1

  • Spark version : 2.4.3.2

  • Hive version : not used

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : HDFS

  • Running on Docker? (yes/no) : no


Stacktrace

[stacktrace attached as screenshots]

@aizain
Contributor Author

aizain commented Dec 4, 2022

Writing through the streaming sink fails with the error above (stacktrace screenshots attached). When I use foreachBatch and write each micro-batch with the batch sink instead, the write succeeds.
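
For reference, a sketch of that workaround, reusing the hypothetical conf helpers from the snippet above; foreachBatch routes every micro-batch through the batch write path, so HoodieStreamingSink.canSkipBatch never runs:

import org.apache.spark.sql.{DataFrame, SaveMode}

spark.sql(conf.getSql)
  .na.fill("")
  .writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Each micro-batch uses the batch DataSource write, sidestepping the
    // failing timeline read in HoodieStreamingSink.canSkipBatch.
    batchDf.write
      .format("hudi")
      .options(conf.getHudiConf)
      .mode(SaveMode.Append)
      .save(conf.getOutputPath(conf.getHudiTableName))
  }
  .option("checkpointLocation", conf.getCheckpointPath)
  .trigger(conf.getTrigger)
  .start()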

@xushiyan
Member

xushiyan commented Dec 6, 2022

@aizain I took a closer look at the code and I think it is a bug: a non-completed commit instant should not be used for reading checkpoint info. The code should filter the timeline for completed instants only. cc @codope
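
For context, a sketch of the filtering this implies, using Hudi's timeline API as of 0.12.x (the base path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hudi.common.table.HoodieTableMetaClient

val metaClient = HoodieTableMetaClient.builder()
  .setConf(new Configuration())
  .setBasePath("/warehouse/history")
  .build()

// filterCompletedInstants() drops REQUESTED/INFLIGHT instants, such as the
// Avro-serialized xxx.replacecommit.requested above, so only completed
// instants (whose metadata is JSON) are considered for checkpoint lookup.
val completed = metaClient.getActiveTimeline
  .getCommitsTimeline
  .filterCompletedInstants()

val latest = completed.lastInstant() // org.apache.hudi.common.util.Option[HoodieInstant]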

@aizain
Contributor Author

aizain commented Dec 13, 2022

thanks~

@xccui
Member

xccui commented Dec 21, 2022

Built with the latest version on master (fd62a14) but still encountered the same issue (with Flink r/w).

@xccui
Member

xccui commented Dec 21, 2022

Stacktrace:

Caused by: org.apache.hudi.exception.HoodieException: java.io.IOException: unable to read commit metadata
	at org.apache.hudi.sink.partitioner.profile.WriteProfiles.getCommitMetadata(WriteProfiles.java:240)
	at org.apache.hudi.source.IncrementalInputSplits.lambda$inputSplits$4(IncrementalInputSplits.java:314)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at org.apache.hudi.source.IncrementalInputSplits.inputSplits(IncrementalInputSplits.java:314)
	at org.apache.hudi.source.StreamReadMonitoringFunction.monitorDirAndForwardSplits(StreamReadMonitoringFunction.java:206)
	at org.apache.hudi.source.StreamReadMonitoringFunction.run(StreamReadMonitoringFunction.java:179)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:332)
Caused by: java.io.IOException: unable to read commit metadata
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromBytes(HoodieCommitMetadata.java:496)
	at org.apache.hudi.common.table.timeline.TimelineUtils.getCommitMetadata(TimelineUtils.java:227)
	at org.apache.hudi.sink.partitioner.profile.WriteProfiles.getCommitMetadata(WriteProfiles.java:236)
	... 14 more
Caused by: org.apache.hudi.com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Obj���avro': was expecting ('true', 'false' or 'null')
 at [Source: UNKNOWN; line: 1, column: 11]
	at org.apache.hudi.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
	at org.apache.hudi.com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
	at org.apache.hudi.com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2462)
	at org.apache.hudi.com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1621)
	at org.apache.hudi.com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:689)
	at org.apache.hudi.com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3776)
	at org.apache.hudi.com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3721)
	at org.apache.hudi.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromJsonString(HoodieCommitMetadata.java:240)
	at org.apache.hudi.common.model.HoodieCommitMetadata.fromBytes(HoodieCommitMetadata.java:494)
	... 16 more

@danny0405, could you also take a look? Thanks!

@danny0405
Contributor

Yeah, it's a bug that was introduced in #7296; I'll file a fix soon ~

@danny0405
Contributor

A fix PR has been filed here: #7540

@CTTY
Contributor

CTTY commented Sep 13, 2023

This issue regressed in 0.13.1+; a new PR is posted to fix it: #9711

@sdudi-te

sdudi-te commented Jul 3, 2024

Is there a possible workaround for this? In other words, how do we recover from this situation?

We are using Spark Structured Streaming on Kafka and writing the output to Hudi (v0.13.1) on S3.

After deleting the partial commit file (as a workaround) or moving back to a previous checkpoint, we observe that even though the streaming job progresses with updated offsets, no data is ever written to Hudi.
