-
Notifications
You must be signed in to change notification settings - Fork 751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GOBBLIN-1930] Improve Multi-active related logs and metrics #3800
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work--I HIGHLY support this finer-grained logging
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/api/MysqlMultiActiveLeaseArbiter.java
Show resolved
Hide resolved
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/api/MysqlMultiActiveLeaseArbiter.java
Outdated
Show resolved
Hide resolved
public static final String GOBBLIN_DAG_ACTION_STORE_MONITOR_UNEXPECTED_ERRORS = ServiceMetricNames.GOBBLIN_SERVICE_PREFIX + ".dagActionStoreMonitor.unexpected.errors"; | ||
public static final String | ||
GOBBLIN_DAG_ACTION_STORE_PRODUCE_TO_CONSUME_DELAY_MILLIS = ServiceMetricNames.GOBBLIN_SERVICE_PREFIX + ".dagActionStoreMonitor.produce.to.consume.delay"; | ||
public static final String DAG_ACTION_STORE_MONITOR_PREFIX = "dagActionStoreMonitor"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is no longer a prefix... but why anyway do you prefer to repeat so many times SMNames.GOBBLIN_SERVICE_PREFIX + "."
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good pt, updated to have the prefix contain SMNames.GOBBLIN_SERVICE_PREFIX + "."
instead
log.warn("SpecCompiler failed to reach healthy state before compilation of flowId {}. Exception: ", flowId, e); | ||
log.warn("SpecCompiler failed to reach healthy state before compilation of flowId {} due to exception {}", flowId, | ||
e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there's the possibility I may be misremembering... and pursuing this needlessly... but my expectation is that a stacktrace would only be written when calling this form:
Logger::warn(String, Throwable)
if you call the form:
Logger::warn(String, Object, Object) // aka. Logger::warn(String, Object...)
are you certain it will print the ST, when the last arg is Throwable
and there's no corresponding {}
remaining for it in the initial String
arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two {}
and the last one is for the throwable so it will print the getMessage
. I updated all the methods to use Logger::warn(String, Throwable)
thought to get full stack trace.
@@ -43,6 +43,11 @@ public class ServiceMetricNames { | |||
public static final String FLOW_TRIGGER_HANDLER_JOB_DOES_NOT_EXIST_COUNT = GOBBLIN_SERVICE_PREFIX + "." + FLOW_TRIGGER_HANDLER_PREFIX + ".jobDoesNotExistInScheduler"; | |||
public static final String FLOW_TRIGGER_HANDLER_FAILED_TO_SET_REMINDER_COUNT = GOBBLIN_SERVICE_PREFIX + "." + FLOW_TRIGGER_HANDLER_PREFIX + ".failedToSetReminderCount"; | |||
|
|||
// Dag Action Handling Related Metrics | |||
public static final String DAG_ACTION_HANDLING_PREFIX = "dagActionHandling"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering... couldn't this be a dagManager.
metric? dagManager.failedLaunchEventsOn...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to be clear that this is a failure that occurs related to handling all dagAction
related code changes and easily find them when they may originate from dagActionStoreMonitor
, dagManager
, or other locations. We also don't use a dagManager
prefix for other dagManager
metrics for some reason
// We only expect INSERT and DELETE operations done to this table. INSERTs correspond to any type of | ||
// {@link DagActionStore.FlowActionType} flow requests that have to be processed. DELETEs require no action. | ||
try { | ||
if (operation.equals("INSERT")) { | ||
if (dagActionType.equals(DagActionStore.FlowActionType.RESUME)) { | ||
log.info("Received insert dag action and about to send resume flow request"); | ||
log.info("Received insert dag action and about to send resume flow request for: {}", dagAction); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: too conversational. how about:
log.info("DagAction change ({}): {}", operation, dagAction)
(i.e. won't the resume/kill/launch
be logged as dagAction.dagActionType
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about unifying all the logs like this as well, let me make the change. I will add a bit more context in message but use this format.
@@ -43,6 +43,12 @@ public class ServiceMetricNames { | |||
public static final String FLOW_TRIGGER_HANDLER_JOB_DOES_NOT_EXIST_COUNT = GOBBLIN_SERVICE_PREFIX + "." + FLOW_TRIGGER_HANDLER_PREFIX + ".jobDoesNotExistInScheduler"; | |||
public static final String FLOW_TRIGGER_HANDLER_FAILED_TO_SET_REMINDER_COUNT = GOBBLIN_SERVICE_PREFIX + "." + FLOW_TRIGGER_HANDLER_PREFIX + ".failedToSetReminderCount"; | |||
|
|||
// DagManager Related Metrics | |||
public static final String DAG_MANAGER_HANDLING_PREFIX = GOBBLIN_SERVICE_PREFIX + ".dagManagerHandling"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall, this is minor... not wanting to press too hard here.
just observing that DagManagerHandling
, not being an actual class name, seems a synonym for DagMgrWorking
, DagMgrInAction
, DagMgrDoingStuff
, which is to say it's vague. even more, DM "doing stuff" is implied simply by the prefix DagMgr
.
so if there is something you want to differentiate here, that's fine... but do make sure what that is is clearly stated, so a future maintainer knows whether the metric they wish to add belongs under DagMgr
or your new prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I see what you're saying. I can't find a better way to differntiate this sub-class of dagManager
code issues so I'll leave it as dagManager
to avoid any confusion.
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java
Outdated
Show resolved
Hide resolved
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java
Outdated
Show resolved
Hide resolved
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java
Outdated
Show resolved
Hide resolved
...lin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/Orchestrator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add changes to make the flow execution ID based off the even timestamp so that flows can be tracked from scheduling -> completion? It becomes very hard to debug as we will need to create an additional logical mapping of intended executions to running flow executions
Codecov Report
@@ Coverage Diff @@
## master #3800 +/- ##
============================================
- Coverage 47.37% 47.35% -0.03%
- Complexity 10973 10975 +2
============================================
Files 2152 2152
Lines 85154 85176 +22
Branches 9470 9472 +2
============================================
- Hits 40341 40333 -8
- Misses 41155 41187 +32
+ Partials 3658 3656 -2
... and 5 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
this.dagActionStore.get().addDagAction(flowAction.getFlowGroup(), flowAction.getFlowName(), flowAction.getFlowExecutionId(), flowAction.getFlowActionType()); | ||
// Replace flow execution id with trigger timestamp to easily track the flow | ||
this.dagActionStore.get().addDagAction(flowAction.getFlowGroup(), flowAction.getFlowName(), String.valueOf(leaseStatus.getEventTimestamp()), flowAction.getFlowActionType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems confusing that LeaseObtainedStatus.getFlowAction().getFlowExecutionId()
returns a different value than what we're storing in the DagAction.getFlowExecutionId()
. e.g. why not unify them by encapsulating the switcheroo to LeaseObtainedStatus.getEventTimestamp()
inside the LeaseObtainedStatus.getFlowAction()
call?
then, in that location I suggest a code comment explaining why it's helpful and safe to replace the original value w/ this one.
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/api/MysqlMultiActiveLeaseArbiter.java
Outdated
Show resolved
Hide resolved
// Utilize db timestamp for reminder | ||
return new LeasedToAnotherStatus(flowAction, dbEventTimestamp.getTime(), | ||
return new LeasedToAnotherStatus(updatedFlowAction, dbEventTimestamp.getTime(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
each of the flowExecId updates uses the timestamp value that is also passed as the second arg to the LeasedToAnotherStatus
or LeaseObtainedStatus
ctor. seems potentially unnecessary for that object still to maintain both--do we want it to?
e.g. it could always implement:
long getEventTimestamp() {
return this.dagAction.getFlowExecutionId();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
DagActionStore.DagAction updatedFlowAction = DagActionStore.DagAction.updateFlowExecutionId(flowAction, | ||
leaseObtainedStatus.getEventTimeMillis()); | ||
scheduleReminderForEvent(jobProps, | ||
new MultiActiveLeaseArbiter.LeasedToAnotherStatus(flowAction, leaseObtainedStatus.getEventTimestamp(), 0L), | ||
new MultiActiveLeaseArbiter.LeasedToAnotherStatus(updatedFlowAction, 0L), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this if we are using the event timetime millisecond, which should be synchronized across all hosts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we do, the timestamp within leaseObtainedStatus
is the agreed upon time that is synchronized across all hosts while the following param, eventTimeMillis
(later called) triggerEventTimeMillis
is local to the host and is only used for logging purposes to show us that we switch from local trigger to the synchronized trigger. Here we are just changing where it's being stored not the fact that we do update it.
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/api/MultiActiveLeaseArbiter.java
Outdated
Show resolved
Hide resolved
/** | ||
* Replace flow execution id with agreed upon event time to easily track the flow | ||
*/ | ||
public static DagActionStore.DagAction updateFlowExecutionId(DagActionStore.DagAction flowAction, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would better belong as a method of the DagAction
class defined on line 39
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is included as a method of the class, it's static because it returns a new dagAction obj
`Flow action`'s flow execution id is the timestamp the lease is associated with, which may be a different timestamp | ||
for the same flow action corresponding to the same instance of the event or a distinct one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
somewhat confusing, "which may be a different timestamp for the same..." - reword?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tried to reword lmk what u think
* [GOBBLIN-1921] Properly handle reminder events (apache#3790) * Add millisecond level precision to timestamp cols & proper timezone conversion - existing tests pass with minor modifications * Handle reminder events properly * Fix compilation errors & add isReminder flag * Add unit tests * Address review comments * Add newline to address comment * Include reminder/original tag in logging * Clarify timezone issues in comment --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1924] Reminder event flag true (apache#3795) * Set reminder event flag to true for reminders * Update unit tests * remove unused variable --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1923] Add retention for lease arbiter table (apache#3792) * Add retention for lease arbiter table * Replace blocking thread with scheduled thread pool executor * Make Calendar instance thread-safe * Rename variables, make values more clear * Update timestamp related cols --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * Change debug statements to info temporarily to debug (apache#3796) Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1926] Fix Reminder Event Epsilon Comparison (apache#3797) * Fix Reminder Event Epsilon Comparison * Add TODO comment --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1930] Improve Multi-active related logs and metrics (apache#3800) * Improve Multi-active related logs and metrics * Add more metrics and logs around forwarding dag action to DagManager * Improve logs in response to review comments * Replace flow execution id with trigger timestamp from multi-active * Update flow action execution id within lease arbiter * Fix test & make Lease Statuses more lean * Update javadoc --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * add dataset root in some common type of datasets * [GOBBLIN-1927] Add topic validation support in KafkaSource, and add TopicNameValidator (apache#3793) * * Add generic topic validation support * Add the first validator TopicNameValidator into the validator chain, as a refactor of existing codes * Refine to address comments * Refine --------- Co-authored-by: Tao Qin <tqin@linkedin.com> * [GOBBLIN-1931] Refactor dag action updating method & add clarifying comment (apache#3801) * Refactor dag action updating method & add clarifying comment * Log filtered out duplicate messages * logs and metrics for missing messages from change monitor * Only add gobblin.service prefix for dagActionStoreChangeMonitor --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1934] Monitor High Level Consumer queue size (apache#3805) * Emit metrics to monitor high level consumer queue size * Empty commit to trigger tests * Use BlockingQueue.size() func instead of atomic integer array * Remove unused import & add DagActionChangeMonitor prefix to metric * Refactor to avoid repeating code * Make protected variables private where possible * Fix white space --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1935] Skip null dag action types unable to be processed (apache#3807) * Skip over null dag actions from malformed messages * Add new metric for skipped messages --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1922]Add function in Kafka Source to recompute workUnits for filtered partitions (apache#3798) * add function in Kafka Source to recompute workUnits for filtered partitions * address comments * set default min container value to 1 * add condition when create empty wu * update the condition * Expose functions to fetch record partitionColumn value (apache#3810) * [GOBBLIN-1938] preserve x bit in manifest file based copy (apache#3804) * preserve x bit in manifest file based copy * fix project structure preventing running unit tests from intellij * fix unit test * [GOBBLIN-1919] Simplify a few elements of MR-related job exec before reusing code in Temporal-based execution (apache#3784) * Simplify a few elements of MR-related job exec before reusing code in Temporal-based execution * Add JSON-ification to several foundational config-state representations, plus encapsulated convience method `JobState.getJobIdFromProps` * Update javadoc comments * Encapsulate check for whether a path has the extension of a multi-work-unit * [GOBBLIN-1939] Bump AWS version to use a compatible version of Jackson with Gobblin (apache#3809) * Bump AWS version to use a compatible version of jackson with Gobblin * use shared aws version * [GOBBLIN-1937] Quantify Missed Work Completed by Reminders (apache#3808) * Quantify Missed Work Completed by Reminders Also fix bug to filter out heartbeat events before extracting field * Refactor changeMonitorUtils & add delimiter to metrics prefix * Re-order params to group similar ones --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * GOBBLIN-1933]Change the logic in completeness verifier to support multi reference tier (apache#3806) * address comments * use connectionmanager when httpclient is not cloesable * [GOBBLIN-1933] Change the logic in completeness verifier to support multi reference tier * add uite test * fix typo * change the javadoc * change the javadoc --------- Co-authored-by: Zihan Li <zihli@zihli-mn2.linkedin.biz> * [GOBBLIN-1943] Use AWS version 1.12.261 to fix a security vulnerability in the previous version (apache#3813) * [GOBBLIN-1941] Develop Temporal abstractions, including `Workload` for workflows of unbounded size through sub-workflow nesting (apache#3811) * Define `Workload` abstraction for Temporal workflows of unbounded size through sub-workflow nesting * Adjust Gobblin-Temporal configurability for consistency and abstraction * Define `WorkerConfig`, to pass the `TemporalWorker`'s configuration to the workflows and activities it hosts * Improve javadoc * Javadoc fixup * Minor changes * Update per review suggestions * Insert pause, to spread the load on the temporal server, before launch of each child workflow that may have direct leaves of its own * Appease findbugs by having `SeqSliceBackedWorkSpan::next` throw `NoSuchElementException` * Add comment * [GOBBLIN-1944] Add gobblin-temporal load generator for a single subsuming super-workflow with a configurable number of activities nested beneath (apache#3815) * Add gobblin-temporal load generator for a single subsuming super-workflow with a configurable number of activities nested beneath * Update per findbugs advice * Improve processing of int props * [GOBBLIN-1945] Implement Distributed Data Movement (DDM) Gobblin-on-Temporal `WorkUnit` evaluation (apache#3816) * Implement Distributed Data Movement (DDM) Gobblin-on-Temporal `WorkUnit` evaluation * Adjust work unit processing tuning for start-to-close timeout and nested execution branching * Rework `ProcessWorkUnitImpl` and fix `FileSystem` misuse; plus convenience abstractions to load `FileSystem`, `JobState`, and `StateStore<TaskState>` * Fix `FileSystem` resource lifecycle, uniquely name each workflow, and drastically reduce worker concurrent task execution * Heed findbugs advice * prep before commit * Improve processing of required props * Update comment in response to PR feedback * [GOBBLIN-1942] Create MySQL util class for re-usable methods and setup MysqlDagActio… (apache#3812) * Create MySQL util class for re-usable methods and setup MysqlDagActionStore retention * Add a java doc * Address review comments * Close scheduled executors on shutdown & clarify naming and comments * Remove extra period making config key invalid * implement Closeable * Use try with resources --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [Hotfix][GOBBLIN-1949] add option to detect malformed orc during commit (apache#3818) * add option to detect malformed ORC during commit phase * better logging * address comment * catch more generic exception * validate ORC file after close * move validate in between close and commit * syntax * whitespace * update log * [GOBBLIN-1948] Use same flowExecutionId across participants (apache#3819) * Use same flowExecutionId across participants * Set config field as well in new FlowSpec * Use gobblin util to create config * Rename function and move to util --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * Allow extension of functions in GobblinMCEPublisher and customization of fileList file metrics are calculated for (apache#3820) * [GOBBLIN-1951] Emit GTE when deleting corrupted ORC files (apache#3821) * [GOBBLIN-1951] Emit GTE when deleting corrupted ORC files This commit adds ORC file validation during the commit phase and deletes corrupted files. It also includes a test for ORC file validation. * Linter fixes * Add framework and unit tests for DagActionStoreChangeMonitor (apache#3817) * Add framework and unit tests for DagActionStoreChangeMonitor * Add more test cases and validation * Add header for new file * Move FlowSpec static function to Utils class * Remove unused import * Fix compile error * Fix unit tests --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1952] Make jobname shortening in GaaS more aggressive (apache#3822) * Make jobname shortening in GaaS more aggressive * Change long name prefix to flowgroup * Make KafkaTopicGroupingWorkUnitPacker pack with desired num of container (apache#3814) * Make KafkaTopicGroupingWorkUnitPacker pack with desired num of container * update comment * [GOBBLIN-1953] Add an exception message to orc writer validation GTE (apache#3826) * Fix FlowSpec Updating Function (apache#3823) * Fix FlowSpec Updating Function * makes Config object with FlowSpec mutable * adds unit test to ensure flow compiles after updating FlowSpec * ensure DagManager resilient to exceptions on leadership change * Only update Properties obj not Config to avoid GC overhead * Address findbugs error * Avoid updating or creating new FlowSpec objects by passing flowExecutionId directly to metadata * Remove changes that are not needed anymore * Add TODO to handle failed DagManager leadership change * Overload function and add more documentation --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * Emit metric to tune LeaseArbiter Linger metric (apache#3824) * Monitor number of failed persisting leases to tune linger * Increase default linger and epsilon values * Add metric for lease persisting success * Rename metrics --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1956]Make Kafka streaming pipeline be able to config the max poll records during runtime (apache#3827) * address comments * use connectionmanager when httpclient is not cloesable * add uite test * fix typo * [GOBBLIN-1956] Make Kafka streaming pipeline be able to config the max poll records during runtime * small refractor --------- Co-authored-by: Zihan Li <zihli@zihli-mn2.linkedin.biz> * Add semantics for failure on partial success (apache#3831) * Consistly handle Rest.li /flowexecutions KILL and RESUME actions (apache#3830) * [GOBBLIN-1957] GobblinOrcwriter improvements for large records (apache#3828) * WIP * Optimization to limit batchsize based on large record sizes * Address review * Use DB-qualified table ID as `IcebergTable` dataset descriptor (apache#3834) * [GOBBLIN-1961] Allow `IcebergDatasetFinder` to use separate names for source vs. destination-side DB and table (apache#3835) * Allow `IcebergDatasetFinder` to use separate names for source vs. destination-side DB and table * Adjust Mockito.verify to pass test * Prevent NPE in `FlowCompilationValidationHelper.validateAndHandleConcurrentExecution` (apache#3836) * Prevent NPE in `FlowCompilationValidationHelper.validateAndHandleConcurrentExecution` * improved `MultiHopFlowCompiler` javadoc * Delete Launch Action Events After Processing (apache#3837) * Delete launch action event after persisting * Fix default value for flowExecutionId retrieval from metadata map * Address review comments and add unit test * Code clean up --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1960] Emit audit count after commit in IcebergMetadataWriter (apache#3833) * Emit audit count after commit in IcebergMetadataWriter * Unit tests by extracting to a post commit * Emit audit count first * find bugs complaint * [GOBBLIN-1967] Add external data node for generic ingress/egress on GaaS (apache#3838) * Add external data node for generic ingress/egress on GaaS * Address reviews and cleanup * Use URI representation for external dataset descriptor node * Fix error message in containing check * Address review * [GOBBLIN-1971] Allow `IcebergCatalog` to specify the `DatasetDescriptor` name for the `IcebergTable`s it creates (apache#3842) * Allow `IcebergCatalog` to specify the `DatasetDescriptor` name for the `IcebergTable`s it creates * small method javadoc * [GOBBLIN-1970] Consolidate processing dag actions to one code path (apache#3841) * Consolidate processing dag actions to one code path * Delete dag action in failure cases too * Distinguish metrics for startup * Refactor to avoid duplicated code and create static metrics proxy class * Remove DagManager checks that don't apply on startup * Add test to check kill/resume dag action removal after processing * Remove unused import * Initialize metrics proxy with Null Pattern --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1972] Fix `CopyDataPublisher` to avoid committing post-publish WUs before they've actually run (apache#3844) * Fix `CopyDataPublisher` to avoid committing post-publish WUs before they've actually run * fixup findbugsMain * [GOBBLIN-1975] Keep job.name configuration immutable if specified on GaaS (apache#3847) * Revert "[GOBBLIN-1952] Make jobname shortening in GaaS more aggressive (apache#3822)" This reverts commit 5619a0a. * use configuration to keep specified jobname if enabled * Cleanup * [GOBBLIN-1974] Ensure Adhoc Flows can be Executed in Multi-active Scheduler state (apache#3846) * Ensure Adhoc Flows can be Executed in Multi-active Scheduler state * Only delete spec for adhoc flows & always after orchestration * Delete adhoc flows when dagManager is not present as well * Fix flaky test for scheduler * Add clarifying comment about failure recovery * Re-ordered private method * Move private methods again * Enforce sequential ordering of unit tests to make more reliable --------- Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> * [GOBBLIN-1973] Change Manifest distcp logic to compare permissions of source and dest files even when source is older (apache#3845) * change should copy logic * Add tests, address review * Fix checkstyle * Remove unused imports * [GOBBLIN-1976] Allow an `IcebergCatalog` to override the `DatasetDescriptor` platform name for the `IcebergTable`s it creates (apache#3848) * Allow an `IcebergCatalog` to override the `DatasetDescriptor` platform for the `IcebergTable`s it creates * fixup javadoc * Log when `PasswordManager` fails to load any master password (apache#3849) * [GOBBLIN-1968] Temporal commit step integration (apache#3829) Add commit step to Gobblin temporal workflow for job publish * Add codeql analysis * Make gradle specific * Add codeql as part of build script * Initialize codeql * Use separate workflow for codeql instead with custom build function as autobuild seems to not work * Add jdk jar for global dependencies script --------- Co-authored-by: umustafi <umust77@gmail.com> Co-authored-by: Urmi Mustafi <umustafi@linkedin.com> Co-authored-by: Arjun <abora@linkedin.com> Co-authored-by: Tao Qin <35046097+wsarecv@users.noreply.github.com> Co-authored-by: Tao Qin <tqin@linkedin.com> Co-authored-by: Hanghang Nate Liu <nate.hanghang.liu@gmail.com> Co-authored-by: Andy Jiang <andy.jiang99@outlook.com> Co-authored-by: Kip Kohn <ckohn@linkedin.com> Co-authored-by: Zihan Li <zihli@linkedin.com> Co-authored-by: Zihan Li <zihli@zihli-mn2.linkedin.biz> Co-authored-by: Matthew Ho <mho@linkedin.com>
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Here are some details about my PR, including screenshots (if applicable):
Improve logging and metrics around multi-active launch flow event handling to identify any missing events between the
MysqlMultiActiveLeaseArbiter
committing the launch event to thedagActionStore
and theDagActionMonitor
receiving events for processing. We want to be able to distinguish between the following cases ofevents that are never received by the
DagActionMonitor
events incorrectly filtered out by the
DagActionMonitor
any failed submissions of dags to the
DagManager
either upon leader change or from theDagActionChangeMonitor
Tests
Commits