-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework cancellation counting #1181
Rework cancellation counting #1181
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me and is definitely cleaner than before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all the tests pass, I support any fixing of counters in any way. Will hold off approval until debug print is removed.
var uncountedCommands int64 | ||
for curr := h.orderedCommands.Front(); curr != nil; { | ||
d := curr.Value.(commandStateMachine) | ||
command := d.getCommand() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish this call were createCommand()
. It was unclear to me what was happening here until I understood that's what this is.
d05d212
to
c2c865b
Compare
@cretz done |
c2c865b
to
329a90e
Compare
Summary
The Go SDK notoriously tries to count commands and this has lead to all sorts of issues in the Go SDK with the counter getting out of sync. At some point we introduced
workflowExecutionIsCancelling
to try to correct some counting issues when the workflow execution was cancelled. The reason workflow cancellation was messing with the counting is because when the workflow was cancelled and the cancel event was processed by the WFT state machine all the state-machines that were waiting on that context were generating their respective cancellation command. All these cancel commands were being generated before the Go SDK had processed the WFT started event for this WFT. This is problematic because the WFT started event sets the counter toworkflowTaskStartedEventID +2
so all those command went uncounted. The solution was to count these cancelled command separately and add them to the total.The Go SDK approach to cancellation also causes issues when the workflow is cancelled in the same WFT as other events that may resolve other state machines like
TimerCompleted
since the cancellation happens before those event are processed and state machines are resolved.Even with the 2+ years of work on this there are still issues with how this correction is implemented #1176.
I explored changing how cancellation is processed so it is delayed until the WFT state machine processes the WFT started event. In the Go SDK the only safe place to generate commands is after a WFT started event is processed. Unfortunately that is a breaking change because if the workflow is cancelled in the same WFT as a event that resolves a state machine is sent the order they are processed matter.
Since we can't change how cancellation is processed I think the next best solution is base the correction on the unsent commands generated before the WFT started event was processed, rather then keep a separate counter.
This should not be a breaking change to any workflow that worked before this change, but I welcome any new tests to prove this
There is a more history here:
#726
There was also a test in this issue (that we couldn't add due to CPU usage?) that I ran as well