-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix history replayer #990
Fix history replayer #990
Conversation
923a027
to
64701b9
Compare
@@ -1197,8 +1203,16 @@ func skipDeterministicCheckForEvent(e *historypb.HistoryEvent) bool { | |||
if markerName == versionMarkerName || markerName == mutableSideEffectMarkerName { | |||
return true | |||
} | |||
// case enumspb.EVENT_TYPE_WORKFLOW_TASK_STARTED: | |||
// return true | |||
case enumspb.EVENT_TYPE_WORKFLOW_EXECUTION_COMPLETED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These result of a workflow execution is handled after we check the history so these will never be in the replayed history. The replayer already checks for for completed and continue as new outside the state machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, so before this was never called for these events anyways because we never did deterministic checks at the completed stage? How does "activity task timed out" play into this? I can't think of any harm in skipping more so this seems safe to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we would never check these before.
How does "activity task timed out" play into this
Oh that shouldn't, should be workflow task time out
@@ -1273,7 +1287,7 @@ func lastPartOfName(name string) string { | |||
return name[lastDotIdx+1:] | |||
} | |||
|
|||
func isCommandMatchEvent(d *commandpb.Command, e *historypb.HistoryEvent, strictMode bool) bool { | |||
func isCommandMatchEvent(d *commandpb.Command, e *historypb.HistoryEvent) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This strictMode
flag was never used and wouldn't work if the go version was changed
@@ -1308,49 +1300,6 @@ func (s *internalWorkerTestSuite) TestReplayWorkflowHistory_CancelWorkflowWhileS | |||
require.NoError(s.T(), err) | |||
} | |||
|
|||
func (s *internalWorkerTestSuite) TestReplayWorkflowHistory_LocalActivity_Result_Mismatch() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A different payload will no longer cause the history to be non deterministic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, so we are removing functionality? Can you give some background here?
"workflowTaskTimeout": "1s", | ||
"identity": "97228@samar-C02XG22GJGH6@" | ||
} | ||
"events": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regenerated this history because it failed the replayer for two reasons
-
The activities
name
are wrong, this just seems like an issue with the history -
The history was missing multiple events, I think because the history came from a WF that timed out. Unclear if that was intentional. Even if it was intentional it is not obvious this history should pass the replayer since the replayed history will not match the history file.
@@ -877,7 +883,7 @@ ProcessEvents: | |||
if err != nil { | |||
return nil, err | |||
} | |||
if w.isWorkflowCompleted { | |||
if w.isWorkflowCompleted && !shouldForceReplayCheck() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the harm of breaking out of the events here always? Why do we need to continue to process events after workflow completed? Wouldn't the presence/absence of a "completed command" in history vs not be enough trigger the non-determinism check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what you mean by a "completed command" can you elaborate?
The problem with breaking out is described here #983, basically if we break we skip populating the replayCommands
list and that is a important part of checking if a replay was deterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, ok. I was expecting that if we skip populating replay commands then this workflow would add a workflow completed command and inside the replayer that would mismatch with the next event not being a complete command. But I think this works too. And we're pretty sure it's isolated only to the replayer.
But if this change is only for replayer and only applies to a completed workflow, I think we're good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The break would also causes false negatives if activities are removed from a workflow because we would break before processing the event. I made a test for this called TestBadEmptyWorkflow
that is an extreme case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting that if we skip populating replay commands then this workflow would add a workflow completed command and inside the replayer that would mismatch with the next event not being a complete command
Ah here is the thing, where would it mismatch with the next event? In the replayer (I assume you mean WorkflowReplayer
) we only check the last event in the history is a workflow completed event
@@ -1197,8 +1203,16 @@ func skipDeterministicCheckForEvent(e *historypb.HistoryEvent) bool { | |||
if markerName == versionMarkerName || markerName == mutableSideEffectMarkerName { | |||
return true | |||
} | |||
// case enumspb.EVENT_TYPE_WORKFLOW_TASK_STARTED: | |||
// return true | |||
case enumspb.EVENT_TYPE_WORKFLOW_EXECUTION_COMPLETED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, so before this was never called for these events anyways because we never did deterministic checks at the completed stage? How does "activity task timed out" play into this? I can't think of any harm in skipping more so this seems safe to me.
@@ -1286,20 +1289,12 @@ func (aw *WorkflowReplayer) replayWorkflowHistory(logger log.Logger, service wor | |||
for _, d := range completeReq.Commands { | |||
if d.GetCommandType() == enumspb.COMMAND_TYPE_CONTINUE_AS_NEW_WORKFLOW_EXECUTION { | |||
if last.GetEventType() == enumspb.EVENT_TYPE_WORKFLOW_EXECUTION_CONTINUED_AS_NEW { | |||
inputA := d.GetContinueAsNewWorkflowExecutionCommandAttributes().GetInput() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me understand this removal? Before we didn't return if not equal, now we always do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it based on the conversation we had here #978 we said
"we need to stop using payload comparison anyways. I don't believe we need to check input/output data"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, I may have been thinking more activities and child workflows and such inside the workflow replay check instead of the replayer. But this might be ok. Let me bring it up in Slack to get others' opinions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can try to keep this, but we can't leave it in its current form because the binary comparison is not reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conferred, good to remove!
@@ -1308,49 +1300,6 @@ func (s *internalWorkerTestSuite) TestReplayWorkflowHistory_CancelWorkflowWhileS | |||
require.NoError(s.T(), err) | |||
} | |||
|
|||
func (s *internalWorkerTestSuite) TestReplayWorkflowHistory_LocalActivity_Result_Mismatch() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, so we are removing functionality? Can you give some background here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I can't find anything that would break workflows today, only replayers. That's the biggest concern to check for I think.
internal/internal_task_handlers.go
Outdated
shouldForceReplayCheck := func() bool { | ||
isInReplayer := IsReplayNamespace(w.wth.namespace) | ||
// If we are in the replayer we should always check the history replay, even if the workflow is completed | ||
// Skip if the workflow paniced to avoid potentially breaking old histories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Skip if the workflow paniced to avoid potentially breaking old histories | |
// Skip if the workflow panicked to avoid potentially breaking old histories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me. 🙏 for the day we can move this to be on top of core.
* Remove protobuf binary comparisons because they are not deterministic * Check histroy during a replay even if the workflow is complete * Remove dead code around strictMode
change to EVENT_TYPE_WORKFLOW_EXECUTION_TIMED_OUT
a00a0b8
to
4bcd0d3
Compare
Fix a few bugs causing false positives and false negatives when replay histories
Remove protobuf json binary comparisons because they are not deterministic
Check history during a replay even if the workflow is complete
Remove dead code around strictMode
Note on backwards compatibility: This PR will cause some histories to fail the replayer that didn't use to, but they always should have so I don't consider this a breaking change, just a bug. It could also cause some histories to pass that used to fail if the non determinism was from the workflow execution complete payload. I tried hard to preserve the error message of any workflow that would have failed before
resolves: #983, #876 and #978