Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix history replayer #990

Merged
merged 3 commits into from
Jan 4, 2023

Conversation

Quinn-With-Two-Ns
Copy link
Contributor

@Quinn-With-Two-Ns Quinn-With-Two-Ns commented Dec 30, 2022

Fix a few bugs causing false positives and false negatives when replay histories

  • Remove protobuf json binary comparisons because they are not deterministic

  • Check history during a replay even if the workflow is complete

  • Remove dead code around strictMode

Note on backwards compatibility: This PR will cause some histories to fail the replayer that didn't use to, but they always should have so I don't consider this a breaking change, just a bug. It could also cause some histories to pass that used to fail if the non determinism was from the workflow execution complete payload. I tried hard to preserve the error message of any workflow that would have failed before

resolves: #983, #876 and #978

@@ -1197,8 +1203,16 @@ func skipDeterministicCheckForEvent(e *historypb.HistoryEvent) bool {
if markerName == versionMarkerName || markerName == mutableSideEffectMarkerName {
return true
}
// case enumspb.EVENT_TYPE_WORKFLOW_TASK_STARTED:
// return true
case enumspb.EVENT_TYPE_WORKFLOW_EXECUTION_COMPLETED:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These result of a workflow execution is handled after we check the history so these will never be in the replayed history. The replayer already checks for for completed and continue as new outside the state machine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, so before this was never called for these events anyways because we never did deterministic checks at the completed stage? How does "activity task timed out" play into this? I can't think of any harm in skipping more so this seems safe to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we would never check these before.

How does "activity task timed out" play into this

Oh that shouldn't, should be workflow task time out

@@ -1273,7 +1287,7 @@ func lastPartOfName(name string) string {
return name[lastDotIdx+1:]
}

func isCommandMatchEvent(d *commandpb.Command, e *historypb.HistoryEvent, strictMode bool) bool {
func isCommandMatchEvent(d *commandpb.Command, e *historypb.HistoryEvent) bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strictMode flag was never used and wouldn't work if the go version was changed

@@ -1308,49 +1300,6 @@ func (s *internalWorkerTestSuite) TestReplayWorkflowHistory_CancelWorkflowWhileS
require.NoError(s.T(), err)
}

func (s *internalWorkerTestSuite) TestReplayWorkflowHistory_LocalActivity_Result_Mismatch() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A different payload will no longer cause the history to be non deterministic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, so we are removing functionality? Can you give some background here?

@Quinn-With-Two-Ns Quinn-With-Two-Ns marked this pull request as ready for review December 30, 2022 23:21
@Quinn-With-Two-Ns Quinn-With-Two-Ns requested a review from a team as a code owner December 30, 2022 23:21
"workflowTaskTimeout": "1s",
"identity": "97228@samar-C02XG22GJGH6@"
}
"events": [
Copy link
Contributor Author

@Quinn-With-Two-Ns Quinn-With-Two-Ns Jan 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regenerated this history because it failed the replayer for two reasons

  1. The activities name are wrong, this just seems like an issue with the history

  2. The history was missing multiple events, I think because the history came from a WF that timed out. Unclear if that was intentional. Even if it was intentional it is not obvious this history should pass the replayer since the replayed history will not match the history file.

@@ -877,7 +883,7 @@ ProcessEvents:
if err != nil {
return nil, err
}
if w.isWorkflowCompleted {
if w.isWorkflowCompleted && !shouldForceReplayCheck() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the harm of breaking out of the events here always? Why do we need to continue to process events after workflow completed? Wouldn't the presence/absence of a "completed command" in history vs not be enough trigger the non-determinism check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what you mean by a "completed command" can you elaborate?

The problem with breaking out is described here #983, basically if we break we skip populating the replayCommands list and that is a important part of checking if a replay was deterministic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, ok. I was expecting that if we skip populating replay commands then this workflow would add a workflow completed command and inside the replayer that would mismatch with the next event not being a complete command. But I think this works too. And we're pretty sure it's isolated only to the replayer.

But if this change is only for replayer and only applies to a completed workflow, I think we're good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The break would also causes false negatives if activities are removed from a workflow because we would break before processing the event. I made a test for this called TestBadEmptyWorkflow that is an extreme case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting that if we skip populating replay commands then this workflow would add a workflow completed command and inside the replayer that would mismatch with the next event not being a complete command

Ah here is the thing, where would it mismatch with the next event? In the replayer (I assume you mean WorkflowReplayer ) we only check the last event in the history is a workflow completed event

@@ -1197,8 +1203,16 @@ func skipDeterministicCheckForEvent(e *historypb.HistoryEvent) bool {
if markerName == versionMarkerName || markerName == mutableSideEffectMarkerName {
return true
}
// case enumspb.EVENT_TYPE_WORKFLOW_TASK_STARTED:
// return true
case enumspb.EVENT_TYPE_WORKFLOW_EXECUTION_COMPLETED:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, so before this was never called for these events anyways because we never did deterministic checks at the completed stage? How does "activity task timed out" play into this? I can't think of any harm in skipping more so this seems safe to me.

@@ -1286,20 +1289,12 @@ func (aw *WorkflowReplayer) replayWorkflowHistory(logger log.Logger, service wor
for _, d := range completeReq.Commands {
if d.GetCommandType() == enumspb.COMMAND_TYPE_CONTINUE_AS_NEW_WORKFLOW_EXECUTION {
if last.GetEventType() == enumspb.EVENT_TYPE_WORKFLOW_EXECUTION_CONTINUED_AS_NEW {
inputA := d.GetContinueAsNewWorkflowExecutionCommandAttributes().GetInput()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand this removal? Before we didn't return if not equal, now we always do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it based on the conversation we had here #978 we said

"we need to stop using payload comparison anyways. I don't believe we need to check input/output data"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, I may have been thinking more activities and child workflows and such inside the workflow replay check instead of the replayer. But this might be ok. Let me bring it up in Slack to get others' opinions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can try to keep this, but we can't leave it in its current form because the binary comparison is not reliable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conferred, good to remove!

@@ -1308,49 +1300,6 @@ func (s *internalWorkerTestSuite) TestReplayWorkflowHistory_CancelWorkflowWhileS
require.NoError(s.T(), err)
}

func (s *internalWorkerTestSuite) TestReplayWorkflowHistory_LocalActivity_Result_Mismatch() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, so we are removing functionality? Can you give some background here?

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I can't find anything that would break workflows today, only replayers. That's the biggest concern to check for I think.

shouldForceReplayCheck := func() bool {
isInReplayer := IsReplayNamespace(w.wth.namespace)
// If we are in the replayer we should always check the history replay, even if the workflow is completed
// Skip if the workflow paniced to avoid potentially breaking old histories
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Skip if the workflow paniced to avoid potentially breaking old histories
// Skip if the workflow panicked to avoid potentially breaking old histories

Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. 🙏 for the day we can move this to be on top of core.

 * Remove protobuf binary comparisons because they are not deterministic

 * Check histroy during a replay even if the workflow is complete

 * Remove dead code around strictMode
change to EVENT_TYPE_WORKFLOW_EXECUTION_TIMED_OUT
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replaying non deterministic workflows that use local activities can give false negatives
3 participants