-- sets up the result mode to tableau to show the results directly in the CLI
set execution.result-mode=tableau;

CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'schema://base-path',
  'table.type' = 'MERGE_ON_READ'  -- this creates a MERGE_ON_READ table, by default it is COPY_ON_WRITE
);
-- insert data using values
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP'1970-01-01 00:00:01','par1'),
('id2','Stephen',33,TIMESTAMP'1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP'1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP'1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP'1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP'1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP'1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP'1970-01-01 00:00:08','par4');
-- a record with the same primary key is treated as an update
insert into t1 values
('id1','Danny',27,TIMESTAMP'1970-01-01 00:00:01','par1');
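Since uuid is the record key (the connector's default record key field, see the write options below), this statement overwrites the earlier id1 row instead of adding a new one. A quick check in the tableau result mode configured above:

-- id1 should now be returned with age 27
select * from t1;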
Streaming Query
Data can be consumed as a stream starting from a given commit timestamp.
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'oss://vvr-daily/hudi/t1',
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',            -- this option enables the streaming read
  'read.streaming.start-commit' = '20210316134557',  -- specifies the start commit instant time
  'read.streaming.check-interval' = '4'         -- specifies the check interval for finding new source commits, default 60s
);

-- Then query the table in stream mode
select * from t1;
Set execution.checkpointing.interval = 150000 (150000 ms = 2.5 minutes); configuring this parameter is what enables checkpointing. The related state and checkpoint options are:

state.backend (String, default: none): The state backend to be used to store state. We recommend RocksDB: state.backend: rocksdb
state.backend.rocksdb.localdir (String, default: none): The local directory (on the TaskManager) where RocksDB puts its files.
state.checkpoints.dir (String, default: none): The default directory used for storing the data files and metadata of checkpoints, in a Flink-supported filesystem. The storage path must be accessible from all participating processes/nodes (i.e. all TaskManagers and JobManagers), e.g. an HDFS or OSS path.
state.backend.incremental (Boolean, default: false): Whether the state backend should create incremental checkpoints, if possible. For an incremental checkpoint, only a diff from the previous checkpoint is stored rather than the complete checkpoint state. Recommended to turn on when the state backend is RocksDB.
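Put together, a minimal flink-conf.yaml sketch with these settings could look like the following; the RocksDB local directory and the checkpoint directory are placeholders, not values taken from this article:

execution.checkpointing.interval: 150000ms
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.localdir: /tmp/flink/rocksdb       # placeholder local path on each TaskManager
state.checkpoints.dir: hdfs:///flink/checkpoints         # placeholder; must be reachable from all TaskManagers and JobManagers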
-- hms mode template
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'oss://vvr-daily/hudi/t1',
  'table.type' = 'COPY_ON_WRITE',     -- if MERGE_ON_READ, the Hive query has no output until the parquet files are generated
  'hive_sync.enable' = 'true',        -- required, enables Hive synchronization
  'hive_sync.mode' = 'hms',           -- required, sets the Hive sync mode to hms, default is jdbc
  'hive_sync.metastore.uris' = 'thrift://ip:9083'  -- required, the metastore port configured in hive-site.xml
);
-- jdbc mode template
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'oss://vvr-daily/hudi/t1',
  'table.type' = 'COPY_ON_WRITE',     -- if MERGE_ON_READ, the Hive query has no output until the parquet files are generated
  'hive_sync.enable' = 'true',        -- required, enables Hive synchronization
  'hive_sync.mode' = 'jdbc',          -- required, sets the Hive sync mode to jdbc, default is jdbc
  'hive_sync.metastore.uris' = 'thrift://ip:9083',  -- required, the metastore port configured in hive-site.xml
  'hive_sync.jdbc_url' = 'jdbc:hive2://ip:10000',   -- required, the HiveServer2 port
  'hive_sync.table' = 't1',           -- required, Hive table name
  'hive_sync.db' = 'testDB',          -- required, Hive database name
  'hive_sync.username' = 'root',      -- required, HMS username
  'hive_sync.password' = 'your password'  -- required, HMS password
);
Flag indicating whether to ignore any non-exception error (e.g. a writestatus error) within a checkpoint batch. Defaults to true (favoring streaming progress over data integrity).
hoodie.datasource.write.recordkey.field (required: no, default: uuid): Record key field. Value to be used as the recordKey component of HoodieKey. The actual value is obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g. a.b.c
hoodie.datasource.write.keygenerator.class (required: no, default: SimpleAvroKeyGenerator.class): Key generator class that extracts the record key out of the incoming record.
write.tasks (required: no, default: 4): Parallelism of the tasks that do the actual write.
write.batch.size (required: no, default: 128): Batch buffer size in MB used to flush data into the underlying filesystem.
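For orientation only, a sketch of how these write options appear in a Flink SQL table definition; it reuses the table path from the earlier examples, and the values simply repeat the defaults listed above:

CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'oss://vvr-daily/hudi/t1',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'uuid',  -- explicit record key (same as the default)
  'write.tasks' = '4',                                 -- write parallelism
  'write.batch.size' = '128'                           -- per-bucket flush threshold, in MB
);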
// deleteDF is a dataframe containing just the records to be deleted
deleteDF
  .write().format("org.apache.hudi")
  // Add Hudi options like record-key, partition-path and others as needed for your setup
  // (record_key, partition_key, precombine field & the usual params)
  .option(...)
  // EmptyHoodieRecordPayload makes Hudi treat the incoming records as deletes
  .option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), "org.apache.hudi.EmptyHoodieRecordPayload")
/** Write buffer as buckets for a checkpoint. The key is bucket ID. */
private transient Map<String, DataBucket> buckets;

/** Config options. */
private final Configuration config;

/** Id of current subtask. */
private int taskID;

/** Write client. */
private transient HoodieFlinkWriteClient writeClient;

/** The write function. */
private transient BiFunction<List<HoodieRecord>, String, List<WriteStatus>> writeFunction;

/** The REQUESTED instant we write the data. */
private volatile String currentInstant;

/** Gateway to send operator events to the operator coordinator. */
private transient OperatorEventGateway eventGateway;

/** Commit action type. */
private transient String actionType;

/** Total size tracer (tracks the buffered data size). */
private transient TotalSizeTracer tracer;

/**
 * Flag saying whether the write task is waiting for the checkpoint success notification
 * after it finished a checkpoint.
 *
 * <p>The flag is needed because the write task does not block during the waiting time interval,
 * some data buckets still flush out with old instant time. There are two cases that the flush may produce
 * corrupted files if the old instant is committed successfully:
 * 1) the write handle was writing data but interrupted, left a corrupted parquet file;
 * 2) the write handle finished the write but was not closed, left an empty parquet file.
 *
 * <p>To solve this, when the flag is set to true, we block the data flushing and thus the #processElement method;
 * the flag is reset to false if the task receives the checkpoint success event or the latest inflight instant
 * time changed (the last instant committed successfully).
 */
private volatile boolean confirming = false;

/** List state of the write metadata events. */
private transient ListState<WriteMetadataEvent> writeMetadataState;

/** Write status list for the current checkpoint. */
private List<WriteStatus> writeStatuses;
public void initializeState(FunctionInitializationContext context) throws Exception {
  this.taskID = getRuntimeContext().getIndexOfThisSubtask();
  // create the Hudi write client
  this.writeClient = StreamerUtil.createWriteClient(this.config, getRuntimeContext());
  // read the action type from the WITH options
  this.actionType = CommitUtils.getCommitActionType(
      WriteOperationType.fromValue(config.getString(FlinkOptions.OPERATION)),
      HoodieTableType.valueOf(config.getString(FlinkOptions.TABLE_TYPE)));
  this.writeStatuses = new ArrayList<>();
  // operator state for the write metadata
  this.writeMetadataState = context.getOperatorStateStore().getListState(
      new ListStateDescriptor<>(
          "write-metadata-state",
          TypeInformation.of(WriteMetadataEvent.class)
      ));
  this.currentInstant = this.writeClient.getLastPendingInstant(this.actionType);
  if (context.isRestored()) {
    restoreWriteMetadata();
  } else {
    sendBootstrapEvent();
  }
  // blocks flushing until the coordinator starts a new instant
  this.confirming = true;
}
// WriteMetadataEvent
public class WriteMetadataEvent implements OperatorEvent {
  private static final long serialVersionUID = 1L;
  public static final String BOOTSTRAP_INSTANT = "";

  private List<WriteStatus> writeStatuses;
  private int taskID;
  // the instant time
  private String instantTime;
  // whether this is the last batch
  private boolean lastBatch;

  /**
   * Flag saying whether the event comes from the end of input, e.g. the source
   * is bounded, there are two cases in which this flag should be set to true:
   * 1. batch execution mode
   * 2. bounded stream source such as VALUES
   */
  private boolean endInput;

  /**
   * Flag saying whether the event comes from bootstrap of a write function.
   */
  private boolean bootstrap;
}
// restore the write metadata
private void restoreWriteMetadata() throws Exception {
  String lastInflight = this.writeClient.getLastPendingInstant(this.actionType);
  boolean eventSent = false;
  for (WriteMetadataEvent event : this.writeMetadataState.get()) {
    if (Objects.equals(lastInflight, event.getInstantTime())) {
      // The checkpoint succeeded but the meta was not committed,
      // re-commit the inflight instant
      this.eventGateway.sendEventToCoordinator(event);
      LOG.info("Send uncommitted write metadata event to coordinator, task[{}].", taskID);
      eventSent = true;
    }
  }
  if (!eventSent) {
    sendBootstrapEvent();
  }
}
private void sendBootstrapEvent() {
  // send an empty bootstrap event
  this.eventGateway.sendEventToCoordinator(WriteMetadataEvent.emptyBootstrap(taskID));
  LOG.info("Send bootstrap write metadata event to coordinator, task[{}].", taskID);
}
snapshotState
public void snapshotState(FunctionSnapshotContext functionSnapshotContext) throws Exception {
  // Based on the fact that the coordinator starts the checkpoint first,
  // it would check the validity.
  // Wait for the buffer data to flush out and request a new instant.
  flushRemaining(false);
  // reload the write metadata state
  reloadWriteMetaState();
}
// endInput marks whether the input has reached its end (bounded source)
private void flushRemaining(boolean endInput) {
  // hasData() == this.buckets.size() > 0
  //     && this.buckets.values().stream().anyMatch(bucket -> bucket.records.size() > 0)
  // obtain the current instant
  this.currentInstant = instantToWrite(hasData());
  if (this.currentInstant == null) {
    // in case there are empty checkpoints that have no input data
    throw new HoodieException("No inflight instant when flushing data!");
  }
  final List<WriteStatus> writeStatus;
  if (buckets.size() > 0) {
    writeStatus = new ArrayList<>();
    this.buckets.values()
        // The records are partitioned by the bucket ID and each batch sent to
        // the writer belongs to one bucket.
        .forEach(bucket -> {
          List<HoodieRecord> records = bucket.writeBuffer();
          if (records.size() > 0) {
            if (config.getBoolean(FlinkOptions.INSERT_DROP_DUPS)) {
              // deduplicate the records
              records = FlinkWriteHelper.newInstance().deduplicateRecords(records, (HoodieIndex) null, -1);
            }
            // pre-write: set up before flush, patches the first record with the correct partition path and fileID
            bucket.preWrite(records);
            // write the data
            writeStatus.addAll(writeFunction.apply(records, currentInstant));
            records.clear();
            bucket.reset();
          }
        });
  } else {
    LOG.info("No data to write in subtask [{}] for instant [{}]", taskID, currentInstant);
    writeStatus = Collections.emptyList();
  }
  // build the WriteMetadataEvent
  final WriteMetadataEvent event = WriteMetadataEvent.builder()
      .taskID(taskID)
      .instantTime(currentInstant)
      .writeStatus(writeStatus)
      .lastBatch(true)
      .endInput(endInput)
      .build();
  // send the event to the coordinator
  this.eventGateway.sendEventToCoordinator(event);
  this.buckets.clear();
  this.tracer.reset();
  this.writeClient.cleanHandles();
  // record the write statuses of this checkpoint
  this.writeStatuses.addAll(writeStatus);
  // blocks flushing until the coordinator starts a new instant
  this.confirming = true;
}
/**
 * Reload the write metadata state as the current checkpoint.
 */
private void reloadWriteMetaState() throws Exception {
  // clear the writeMetadataState
  this.writeMetadataState.clear();
  WriteMetadataEvent event = WriteMetadataEvent.builder()
      .taskID(taskID)
      .instantTime(currentInstant)
      .writeStatus(new ArrayList<>(writeStatuses))
      .bootstrap(true)
      .build();
  this.writeMetadataState.add(event);
  writeStatuses.clear();
}
bufferRecord
private void bufferRecord(HoodieRecord<?> value) {
  // compute the bucket ID from the record: {partition path}_{fileID}
  final String bucketID = getBucketID(value);
  // get (or create) the corresponding bucket
  DataBucket bucket = this.buckets.computeIfAbsent(bucketID,
      k -> new DataBucket(this.config.getDouble(FlinkOptions.WRITE_BATCH_SIZE), value));
  // wrap the hoodie record
  final DataItem item = DataItem.fromHoodieRecord(value);
  // check whether the bucket needs to be flushed, i.e. it exceeds WRITE_BATCH_SIZE
  boolean flushBucket = bucket.detector.detect(item);
  // check whether the whole buffer needs to be flushed, i.e. the total buffer size exceeds maxBufferSize
  boolean flushBuffer = this.tracer.trace(bucket.detector.lastRecordSize);
  if (flushBucket) {
    if (flushBucket(bucket)) {
      // reduce the tracked total buffer size
      this.tracer.countDown(bucket.detector.totalSize);
      // reset the bucket size
      bucket.reset();
    }
  } else if (flushBuffer) {
    // find the max size bucket and flush it out
    List<DataBucket> sortedBuckets = this.buckets.values().stream()
        .sorted((b1, b2) -> Long.compare(b2.detector.totalSize, b1.detector.totalSize))
        .collect(Collectors.toList());
    final DataBucket bucketToFlush = sortedBuckets.get(0);
    if (flushBucket(bucketToFlush)) {
      this.tracer.countDown(bucketToFlush.detector.totalSize);
      bucketToFlush.reset();
    } else {
      LOG.warn("The buffer size hits the threshold {}, but still flush the max size data bucket failed!", this.tracer.maxBufferSize);
    }
  }
  // a single bucket buffers at most `FlinkOptions#WRITE_BATCH_SIZE` of data,
  // and the whole buffer is bounded by `FlinkOptions#WRITE_TASK_MAX_SIZE`
  bucket.records.add(item);
}
// flush one bucket
private boolean flushBucket(DataBucket bucket) {
  String instant = instantToWrite(true);
  if (instant == null) {
    // in case there are empty checkpoints that have no input data
    LOG.info("No inflight instant when flushing data, skip.");
    return false;
  }
  List<HoodieRecord> records = bucket.writeBuffer();
  ValidationUtils.checkState(records.size() > 0, "Data bucket to flush has no buffering records");
  if (config.getBoolean(FlinkOptions.INSERT_DROP_DUPS)) {
    // deduplicate the hoodie records
    records = FlinkWriteHelper.newInstance().deduplicateRecords(records, (HoodieIndex) null, -1);
  }
  // pre-write: set up before flush, patches the first record with the correct partition path and fileID
  bucket.preWrite(records);
  // write the records
  final List<WriteStatus> writeStatus = new ArrayList<>(writeFunction.apply(records, instant));
  records.clear();
  // send the event to the coordinator
  final WriteMetadataEvent event = WriteMetadataEvent.builder()
      .taskID(taskID)
      .instantTime(instant) // the write instant may shift but the event still uses the currentInstant.
      .writeStatus(writeStatus)
      .lastBatch(false)
      .endInput(false)
      .build();
  this.eventGateway.sendEventToCoordinator(event);
  writeStatuses.addAll(writeStatus);
  return true;
}
public abstract class BucketAssigners {

  private BucketAssigners() {
  }

  public static BucketAssigner create(
      int taskID,
      int maxParallelism,
      int numTasks,
      boolean ignoreSmallFiles,
      HoodieTableType tableType,
      HoodieFlinkEngineContext context,
      HoodieWriteConfig config) {
    // whether the table writes delta files (MERGE_ON_READ)
    boolean delta = tableType.equals(HoodieTableType.MERGE_ON_READ);
    // the write profile
    WriteProfile writeProfile = WriteProfiles.singleton(ignoreSmallFiles, delta, config, context);
    // create the bucket assigner
    return new BucketAssigner(taskID, maxParallelism, numTasks, writeProfile, config);
  }
}
// the bucket assigner
public class BucketAssigner implements AutoCloseable {
  private static final Logger LOG = LogManager.getLogger(BucketAssigner.class);
  private final int taskID;
  private final int maxParallelism;
  // number of subtasks
  private final int numTasks;
  // the bucket type for each bucket
  private final HashMap<String, BucketInfo> bucketInfoMap;
  protected final HoodieWriteConfig config;
  private final WriteProfile writeProfile;
  // small file assignment per partition
  private final Map<String, SmallFileAssign> smallFileAssignMap;
  /**
   * Bucket ID (partition + fileId) -> new file assign state.
   */
  private final Map<String, NewFileAssignState> newFileAssignStates;
  /**
   * Num of accumulated successful checkpoints, used for cleaning the new file assign state.
   */
  private int accCkp = 0;

  public BucketAssigner(
      int taskID,
      int maxParallelism,
      int numTasks,
      WriteProfile profile,
      HoodieWriteConfig config) {
    this.taskID = taskID;
    this.maxParallelism = maxParallelism;
    this.numTasks = numTasks;
    this.config = config;
    this.writeProfile = profile;
    this.bucketInfoMap = new HashMap<>();
    this.smallFileAssignMap = new HashMap<>();
    this.newFileAssignStates = new HashMap<>();
  }
  // reset the bucket-type map
  public void reset() {
    bucketInfoMap.clear();
  }

  public BucketInfo addUpdate(String partitionPath, String fileIdHint) {
    final String key = StreamerUtil.generateBucketKey(partitionPath, fileIdHint);
    // create an UPDATE bucket if the bucket map does not contain it yet
    if (!bucketInfoMap.containsKey(key)) {
      BucketInfo bucketInfo = new BucketInfo(BucketType.UPDATE, fileIdHint, partitionPath);
      bucketInfoMap.put(key, bucketInfo);
    }
    // reuse the existing one
    return bucketInfoMap.get(key);
  }
  public BucketInfo addInsert(String partitionPath) {
    // check whether there is a small-file bucket for this partition
    SmallFileAssign smallFileAssign = getSmallFileAssign(partitionPath);

    // first try packing this into one of the small files
    if (smallFileAssign != null && smallFileAssign.assign()) {
      return new BucketInfo(BucketType.UPDATE, smallFileAssign.getFileId(), partitionPath);
    }

    // if we have anything more, create new insert buckets, like normal
    if (newFileAssignStates.containsKey(partitionPath)) {
      NewFileAssignState newFileAssignState = newFileAssignStates.get(partitionPath);
      if (newFileAssignState.canAssign()) {
        newFileAssignState.assign();
        final String key = StreamerUtil.generateBucketKey(partitionPath, newFileAssignState.fileId);
        if (bucketInfoMap.containsKey(key)) {
          // the newFileAssignStates is cleaned asynchronously when received the checkpoint success notification,
          // the records processed within the time range:
          // (start checkpoint, checkpoint success(and instant committed))
          // should still be assigned to the small buckets of last checkpoint instead of new one.
          // the bucketInfoMap is cleaned when checkpoint starts.
          // A promotion: when the HoodieRecord can record whether it is an UPDATE or INSERT,
          // we can always return an UPDATE BucketInfo here, and there is no need to record the
          // UPDATE bucket through calling #addUpdate.
          return bucketInfoMap.get(key);
        }
        return new BucketInfo(BucketType.UPDATE, newFileAssignState.fileId, partitionPath);
      }
    }
    BucketInfo bucketInfo = new BucketInfo(BucketType.INSERT, createFileIdOfThisTask(), partitionPath);
    final String key = StreamerUtil.generateBucketKey(partitionPath, bucketInfo.getFileIdPrefix());
    bucketInfoMap.put(key, bucketInfo);
    NewFileAssignState newFileAssignState = new NewFileAssignState(bucketInfo.getFileIdPrefix(), writeProfile.getRecordsPerBucket());
    newFileAssignState.assign();
    newFileAssignStates.put(partitionPath, newFileAssignState);
    return bucketInfo;
  }
  private synchronized SmallFileAssign getSmallFileAssign(String partitionPath) {
    if (smallFileAssignMap.containsKey(partitionPath)) {
      return smallFileAssignMap.get(partitionPath);
    }
    List<SmallFile> smallFiles = smallFilesOfThisTask(writeProfile.getSmallFiles(partitionPath));
    if (smallFiles.size() > 0) {
      LOG.info("For partitionPath : " + partitionPath + " Small Files => " + smallFiles);
      SmallFileAssignState[] states = smallFiles.stream()
          .map(smallFile -> new SmallFileAssignState(config.getParquetMaxFileSize(), smallFile, writeProfile.getAvgSize()))
          .toArray(SmallFileAssignState[]::new);
      SmallFileAssign assign = new SmallFileAssign(states);
      smallFileAssignMap.put(partitionPath, assign);
      return assign;
    }
    smallFileAssignMap.put(partitionPath, null);
    return null;
  }
  /**
   * Refresh the table state like TableFileSystemView and HoodieTimeline.
   */
  public synchronized void reload(long checkpointId) {
    this.accCkp += 1;
    if (this.accCkp > 1) {
      // do not clean the new file assignment state for the first checkpoint,
      // this #reload calling is triggered by checkpoint success event, the coordinator
      // also relies on the checkpoint success event to commit the inflight instant,
      // and very possibly this component receives the notification before the coordinator,
      // if we do the cleaning, the records processed within the time range:
      // (start checkpoint, checkpoint success(and instant committed))
      // would be assigned to a fresh new data bucket which is not the right behavior.
      this.newFileAssignStates.clear();
      this.accCkp = 0;
    }
    this.smallFileAssignMap.clear();
    this.writeProfile.reload(checkpointId);
  }
  private boolean fileIdOfThisTask(String fileId) {
    // the file id can shuffle to this task
    return KeyGroupRangeAssignment.assignKeyToParallelOperator(fileId, maxParallelism, numTasks) == taskID;
  }

  @VisibleForTesting
  public String createFileIdOfThisTask() {
    String newFileIdPfx = FSUtils.createNewFileIdPfx();
    while (!fileIdOfThisTask(newFileIdPfx)) {
      newFileIdPfx = FSUtils.createNewFileIdPfx();
    }
    return newFileIdPfx;
  }

  @VisibleForTesting
  public List<SmallFile> smallFilesOfThisTask(List<SmallFile> smallFiles) {
    // computes the small files to write inserts for this task.
    return smallFiles.stream()
        .filter(smallFile -> fileIdOfThisTask(smallFile.location.getFileId()))
        .collect(Collectors.toList());
  }

  public void close() {
    reset();
    WriteProfiles.clean(config.getBasePath());
  }
  /**
   * Assigns the record to one of the small files under one partition.
   *
   * <p>The tool is initialized with an array of {@link SmallFileAssignState}s.
   * A pointer points to the current small file we are ready to assign,
   * if the current small file can not be assigned anymore (full assigned), the pointer
   * moves to the next small file.
   * <pre>
   *  | ->
   *  V
   *  | smallFile_1 | smallFile_2 | smallFile_3 | ... | smallFile_N |
   * </pre>
   *
   * <p>If all the small files are full assigned, a flag {@code noSpace} is marked to true, and
   * we can return early for future checks.
   */
  private static class SmallFileAssign {
    final SmallFileAssignState[] states;
    int assignIdx = 0;
    boolean noSpace = false;

    SmallFileAssign(SmallFileAssignState[] states) {
      this.states = states;
    }

    public boolean assign() {
      if (noSpace) {
        return false;
      }
      SmallFileAssignState state = states[assignIdx];
      while (!state.canAssign()) {
        assignIdx += 1;
        if (assignIdx >= states.length) {
          noSpace = true;
          return false;
        }
        // move to next slot if possible
        state = states[assignIdx];
      }
      state.assign();
      return true;
    }

    public String getFileId() {
      return states[assignIdx].fileId;
    }
  }
  /**
   * Candidate bucket state for a small file. It records the total number of records
   * that the bucket can append and the current number of assigned records.
   */
  private static class SmallFileAssignState {
    long assigned;
    long totalUnassigned;
    final String fileId;

    SmallFileAssignState(long parquetMaxFileSize, SmallFile smallFile, long averageRecordSize) {
      this.assigned = 0;
      this.totalUnassigned = (parquetMaxFileSize - smallFile.sizeBytes) / averageRecordSize;
      this.fileId = smallFile.location.getFileId();
    }

    public boolean canAssign() {
      return this.totalUnassigned > 0 && this.totalUnassigned > this.assigned;
    }

    /**
     * Remember to invoke {@link #canAssign()} first.
     */
    public void assign() {
      Preconditions.checkState(canAssign(),
          "Can not assign insert to small file: assigned => "
              + this.assigned + " totalUnassigned => " + this.totalUnassigned);
      this.assigned++;
    }
  }
  /**
   * Candidate bucket state for a new file. It records the total number of records
   * that the bucket can append and the current number of assigned records.
   */
  private static class NewFileAssignState {
    long assigned;
    long totalUnassigned;
    final String fileId;

    NewFileAssignState(String fileId, long insertRecordsPerBucket) {
      this.fileId = fileId;
      this.assigned = 0;
      this.totalUnassigned = insertRecordsPerBucket;
    }

    public boolean canAssign() {
      return this.totalUnassigned > 0 && this.totalUnassigned > this.assigned;
    }

    /**
     * Remember to invoke {@link #canAssign()} first.
     */
    public void assign() {
      Preconditions.checkState(canAssign(),
          "Can not assign insert to new file: assigned => "
              + this.assigned + " totalUnassigned => " + this.totalUnassigned);
      this.assigned++;
    }
  }
}
Pipelines
Constructing the write pipeline:
public static DataStream<Object> hoodieStreamWrite(Configuration conf, int defaultParallelism, DataStream<HoodieRecord> dataStream) {
  WriteOperatorFactory<HoodieRecord> operatorFactory = StreamWriteOperator.getFactory(conf);
  return dataStream
      // Key-by record key, to avoid multiple subtasks writing to a bucket at the same time
      .keyBy(HoodieRecord::getRecordKey)
      // assign the buckets
      .transform(
          "bucket_assigner",
          TypeInformation.of(HoodieRecord.class),
          new KeyedProcessOperator<>(new BucketAssignFunction<>(conf)))
      .uid("uid_bucket_assigner_" + conf.getString(FlinkOptions.TABLE_NAME))
      .setParallelism(conf.getOptional(FlinkOptions.BUCKET_ASSIGN_TASKS).orElse(defaultParallelism))
      // shuffle by fileId (bucket id)
      .keyBy(record -> record.getCurrentLocation().getFileId())
      .transform("hoodie_stream_write", TypeInformation.of(Object.class), operatorFactory)
      .uid("uid_hoodie_stream_write" + conf.getString(FlinkOptions.TABLE_NAME))
      .setParallelism(conf.getInteger(FlinkOptions.WRITE_TASKS));
}
// the write pipeline
pipeline = Pipelines.hoodieStreamWrite(conf, parallelism, hoodieRecordDataStream);
Bootstrap
Used for full + incremental reads to build the Flink index state: the index of the historical data is rebuilt into Flink state, so that the subsequent real-time upsert writes can keep record uniqueness for the incremental data as well.
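As a sketch only: with the Hudi Flink connector this bootstrap is typically switched on through a table option (the option name index.bootstrap.enabled below is my assumption, not something stated in this article), so that the existing table's index is loaded into Flink state before incremental upserts start:

CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'oss://vvr-daily/hudi/t1',
  'table.type' = 'MERGE_ON_READ',
  'index.bootstrap.enabled' = 'true'  -- assumed option name: load the existing table index into Flink state on startup
);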
public class CompactionPlanOperator extends AbstractStreamOperator<CompactionPlanEvent>
    implements OneInputStreamOperator<Object, CompactionPlanEvent> {

  /** The Hudi configuration. */
  private final Configuration conf;

  /** Meta Client. */
  @SuppressWarnings("rawtypes")
  private transient HoodieFlinkTable table;

  public CompactionPlanOperator(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public void open() throws Exception {
    super.open();
    this.table = FlinkTables.createTable(conf, getRuntimeContext());
    // when starting up, rolls back all the inflight compaction instants if there exist any,
    // these instants are in priority for scheduling task because the compaction instants are
    // scheduled from earliest (FIFO sequence).
    CompactionUtil.rollbackCompaction(table);
  }
  @Override
  public void processElement(StreamRecord<Object> streamRecord) {
    // no operation
  }

  @Override
  public void notifyCheckpointComplete(long checkpointId) {
    try {
      table.getMetaClient().reloadActiveTimeline();
      // There is no good way to infer when the compaction task for an instant crushed
      // or is still undergoing. So we use a configured timeout threshold to control the rollback:
      // {@code FlinkOptions.COMPACTION_TIMEOUT_SECONDS},
      // when the earliest inflight instant has timed out, assumes it has failed
      // already and just rolls it back.
      // comment out: do we really need the timeout rollback ?
      // CompactionUtil.rollbackEarliestCompaction(table, conf);
      scheduleCompaction(table, checkpointId);
    } catch (Throwable throwable) {
      // make it fail-safe
      LOG.error("Error while scheduling compaction plan for checkpoint: " + checkpointId, throwable);
    }
  }
  private void scheduleCompaction(HoodieFlinkTable<?> table, long checkpointId) throws IOException {
    // the last instant takes the highest priority.
    Option<HoodieInstant> firstRequested = table.getActiveTimeline().filterPendingCompactionTimeline()
        .filter(instant -> instant.getState() == HoodieInstant.State.REQUESTED).firstInstant();
    if (!firstRequested.isPresent()) {
      // do nothing.
      LOG.info("No compaction plan for checkpoint " + checkpointId);
      return;
    }
    String compactionInstantTime = firstRequested.get().getTimestamp();
    // generate compaction plan
    // should support configurable commit metadata
    HoodieCompactionPlan compactionPlan = CompactionUtils.getCompactionPlan(
        table.getMetaClient(), compactionInstantTime);
    if (compactionPlan == null || (compactionPlan.getOperations() == null)
        || (compactionPlan.getOperations().isEmpty())) {
      // do nothing.
      LOG.info("Empty compaction plan for instant " + compactionInstantTime);
    } else {
      HoodieInstant instant = HoodieTimeline.getCompactionRequestedInstant(compactionInstantTime);
      // Mark instant as compaction inflight
      table.getActiveTimeline().transitionCompactionRequestedToInflight(instant);
      table.getMetaClient().reloadActiveTimeline();
      List<CompactionOperation> operations = compactionPlan.getOperations().stream()
          .map(CompactionOperation::convertFromAvroRecordInstance).collect(toList());
      LOG.info("Execute compaction plan for instant {} as {} file groups", compactionInstantTime, operations.size());
      for (CompactionOperation operation : operations) {
        output.collect(new StreamRecord<>(new CompactionPlanEvent(compactionInstantTime, operation)));
      }
    }
  }

  @VisibleForTesting
  public void setOutput(Output<StreamRecord<CompactionPlanEvent>> output) {
    this.output = output;
  }
}
// converts the incoming MergeOnReadInputSplit elements into RowData
public class StreamReadOperator extends AbstractStreamOperator<RowData>
    implements OneInputStreamOperator<MergeOnReadInputSplit, RowData> {

  private static final Logger LOG = LoggerFactory.getLogger(StreamReadOperator.class);

  // number of records consumed per mini batch
  private static final int MINI_BATCH_SIZE = 1000;

  // It's the same thread that runs this operator and the checkpoint actions. Use this executor
  // to schedule only one split for subsequent reading, so that a new checkpoint can be triggered
  // without blocking for a long time to exhaust all the scheduled split reading tasks.
  private final MailboxExecutor executor;

  // reads the hoodie data and log files
  private MergeOnReadInputFormat format;

  // the source context
  private transient SourceFunction.SourceContext<RowData> sourceContext;

  private transient ListState<MergeOnReadInputSplit> inputSplitsState;

  private transient Queue<MergeOnReadInputSplit> splits;

  // Set to RUNNING when there are read tasks in the queue; set to IDLE when there are no more files to read.
  private transient volatile SplitState currentSplitState;

  // reads the input splits with the MOR input format; the processing-time service is used for scheduling
  private StreamReadOperator(MergeOnReadInputFormat format, ProcessingTimeService timeService,
      MailboxExecutor mailboxExecutor) {
    this.format = Preconditions.checkNotNull(format, "The InputFormat should not be null.");
    this.processingTimeService = timeService;
    this.executor = Preconditions.checkNotNull(mailboxExecutor, "The mailboxExecutor should not be null.");
  }
  @Override
  public void initializeState(StateInitializationContext context) throws Exception {
    super.initializeState(context);
    // operator state for the pending splits
    // TODO Replace Java serialization with Avro approach to keep state compatibility.
    inputSplitsState = context.getOperatorStateStore().getListState(
        new ListStateDescriptor<>("splits", new JavaSerializer<>()));

    // Initialize the current split state to IDLE.
    currentSplitState = SplitState.IDLE;

    // a blocking deque that serves the splits restored from the state backend
    // Recover splits state from flink state backend if possible.
    splits = new LinkedBlockingDeque<>();
    if (context.isRestored()) {
      int subtaskIdx = getRuntimeContext().getIndexOfThisSubtask();
      LOG.info("Restoring state for operator {} (task ID: {}).", getClass().getSimpleName(), subtaskIdx);
      for (MergeOnReadInputSplit split : inputSplitsState.get()) {
        // put the restored splits back into the queue
        splits.add(split);
      }
    }

    // obtain the source context
    this.sourceContext = StreamSourceContexts.getSourceContext(
        getOperatorConfig().getTimeCharacteristic(),
        getProcessingTimeService(),
        new Object(), // no actual locking needed
        getContainingTask().getStreamStatusMaintainer(),
        output,
        getRuntimeContext().getExecutionConfig().getAutoWatermarkInterval(),
        -1);

    // Enqueue to process the recovered input splits.
    enqueueProcessSplits();
  }
  @Override
  public void snapshotState(StateSnapshotContext context) throws Exception {
    super.snapshotState(context);
    inputSplitsState.clear();
    inputSplitsState.addAll(new ArrayList<>(splits));
  }

  @Override
  public void processElement(StreamRecord<MergeOnReadInputSplit> element) {
    // add the split to the deque
    splits.add(element.getValue());
    // process the split
    enqueueProcessSplits();
  }
  private void enqueueProcessSplits() {
    // if currently idle and the split queue is not empty, process the queued splits
    if (currentSplitState == SplitState.IDLE && !splits.isEmpty()) {
      currentSplitState = SplitState.RUNNING;
      executor.execute(this::processSplits, "process input split");
    }
  }

  private void processSplits() throws IOException {
    // peek the head split without removing it
    MergeOnReadInputSplit split = splits.peek();
    // if there is no split, set the current split state to idle
    if (split == null) {
      currentSplitState = SplitState.IDLE;
      return;
    }
    // 1. open a fresh new input split and start reading as mini-batch
    // 2. if the input split has remaining records to read, switches to another runnable to handle
    // 3. if the input split reads to the end, close the format and remove the split from the queue #splits
    // 4. for each runnable, reads at most #MINI_BATCH_SIZE number of records
    if (format.isClosed()) {
      // This log is important to indicate the consuming process,
      // there is only one log message for one data bucket.
      LOG.info("Processing input split : {}", split);
      format.open(split);
    }
    try {
      // consume a mini batch of data
      consumeAsMiniBatch(split);
    } finally {
      currentSplitState = SplitState.IDLE;
    }
    // Re-schedule to process the next split.
    enqueueProcessSplits();
  }
  /**
   * Consumes at most {@link #MINI_BATCH_SIZE} number of records
   * for the given input split {@code split}.
   *
   * <p>Note: close the input format and remove the input split from the queue {@link #splits}
   * if the split reads to the end.
   *
   * @param split The input split
   */
  private void consumeAsMiniBatch(MergeOnReadInputSplit split) throws IOException {
    for (int i = 0; i < MINI_BATCH_SIZE; i++) {
      if (!format.reachedEnd()) {
        // not exhausted yet: emit the next record downstream
        sourceContext.collect(format.nextRecord(null));
        // increase the consumption counter
        split.consume();
      } else {
        // close the input format
        format.close();
        // remove the fully consumed split from the queue
        splits.poll();
        break;
      }
    }
  }
  @Override
  public void processWatermark(Watermark mark) {
    // we do nothing because we emit our own watermarks if needed.
  }

  @Override
  public void dispose() throws Exception {
    super.dispose();
    if (format != null) {
      format.close();
      format.closeInputFormat();
      format = null;
    }
    sourceContext = null;
  }

  @Override
  public void close() throws Exception {
    super.close();
    output.close();
    if (sourceContext != null) {
      sourceContext.emitWatermark(Watermark.MAX_WATERMARK);
      sourceContext.close();
      sourceContext = null;
    }
  }

  public static OneInputStreamOperatorFactory<MergeOnReadInputSplit, RowData> factory(MergeOnReadInputFormat format) {
    return new OperatorFactory(format);
  }
  private enum SplitState {
    IDLE, RUNNING
  }

  private static class OperatorFactory extends AbstractStreamOperatorFactory<RowData>
      implements YieldingOperatorFactory<RowData>, OneInputStreamOperatorFactory<MergeOnReadInputSplit, RowData> {

    private final MergeOnReadInputFormat format;

    private transient MailboxExecutor mailboxExecutor;

    private OperatorFactory(MergeOnReadInputFormat format) {
      this.format = format;
    }

    @Override
    public void setMailboxExecutor(MailboxExecutor mailboxExecutor) {
      this.mailboxExecutor = mailboxExecutor;
    }

    @SuppressWarnings("unchecked")
    @Override
    public <O extends StreamOperator<RowData>> O createStreamOperator(StreamOperatorParameters<RowData> parameters) {
      StreamReadOperator operator = new StreamReadOperator(format, processingTimeService, mailboxExecutor);
      operator.setup(parameters.getContainingTask(), parameters.getStreamConfig(), parameters.getOutput());
      return (O) operator;
    }

    @Override
    public Class<? extends StreamOperator> getStreamOperatorClass(ClassLoader classLoader) {
      return StreamReadOperator.class;
    }
  }
}
// the Flink configuration
private final Configuration conf;

// the Hadoop configuration
private transient org.apache.hadoop.conf.Configuration hadoopConf;

// the table state
private final MergeOnReadTableState tableState;

/** Iterator used to read the data. */
private transient RecordIterator iterator;

// for project push down
/** Full table names. */
private final List<String> fieldNames;

/** Full field data types. */
private final List<DataType> fieldTypes;

/** Default partition name when the field value is null. */
private final String defaultPartName;

/** Required field positions. */
private final int[] requiredPos;

// for limit push down
/** Limit for the reader, -1 when the reading is not limited. */
private final long limit;

/** Recording the current read count for limit check. */
private long currentReadCount = 0;

/**
 * Flag saying whether to emit the delete records. In streaming read mode, downstream
 * operators need the DELETE messages to retract the legacy accumulator.
 */
private boolean emitDelete;

/**
 * Flag saying whether the input format has been closed.
 */
private boolean closed = true;