Canceling a context with multiple child contexts can be non deterministic #1208

Quinn-With-Two-Ns · 2023-08-24T15:41:36Z

Expected Behavior

Child context cancellation propagates in a deterministic order

Actual Behavior

Child context cancellation does not propagates in a deterministic order.

When a parent context cancels its' children it loops through a map. Map order in Go is not deterministic so the order child contexts are canceled is not deterministic.

Steps to Reproduce the Problem

func TestContextCancellationOrderDeterminism(t *testing.T) {
	/*
		Previously, child-contexts were stored in a map, preventing deterministic order when propagating cancellation.
		The order of branches being selected in this test was random, both for the first event and in following ones.

		In principle this should be fine, but it's possible for the effects of cancellation to trigger a selector's
		future-done callback, which currently records the *real-time*-first event as the branch to unblock, rather than
		doing something more safe by design (e.g. choosing based on state when the selector's goroutine is unblocked).

		Unfortunately, we cannot change the selector's behavior without introducing non-backwards-compatible changes to
		currently-working workflows.

		So the workaround for now is to maintain child-context order, so they are canceled in a consistent order.
		As this order was not controlled before, and Go does a pretty good job at randomizing map iteration order,
		converting non-determinism to determinism should be strictly no worse for backwards compatibility, and it
		fixes the issue for future executions.
	*/
	check := func(t *testing.T, separateStart, separateSelect bool) {
		env := newTestWorkflowEnv(t)
		act := func(ctx context.Context) error {
			return nil // will be mocked
		}
		wf := func(ctx Context) ([]int, error) {
			ctx, cancel := WithCancel(ctx)
			Go(ctx, func(ctx Context) {
				_ = Sleep(ctx, time.Minute)
				cancel()
			})

			// start some activities, which will not complete before the timeout cancels them
			ctx = WithActivityOptions(ctx, ActivityOptions{
				TaskQueue:              "",
				ScheduleToCloseTimeout: time.Hour,
				ScheduleToStartTimeout: time.Hour,
				StartToCloseTimeout:    time.Hour,
			})
			s := NewSelector(ctx)
			var result []int
			for i := 0; i < 10; i++ {
				i := i
				// need a child context, a future alone is not enough as it does not become a child
				cctx, ccancel := WithCancel(ctx)

				s.AddFuture(ExecuteActivity(cctx, act), func(f Future) {
					ccancel() // TODO: is this necessary to prevent leaks?  if it is, how can we make it not?
					err := f.Get(ctx, nil)
					if err == nil || !IsCanceledError(err) {
						// fail the test, this should not happen - activities must be canceled or it's not valid.
						t.Errorf("activity completion or failure for some reason other than cancel: %v", err)
					}
					result = append(result, i)
				})

				if separateStart {
					// yield so they are submitted one at a time, in case that matters
					_ = Sleep(ctx, time.Second)
				}
			}
			for i := 0; i < 10; i++ {
				if separateSelect {
					// yield so they are selected one at a time, in case that matters
					_ = Sleep(ctx, time.Second)
				}
				s.Select(ctx)
			}

			return result, nil
		}
		env.RegisterWorkflow(wf)
		env.RegisterActivity(act)

		// activities must not complete in time
		env.OnActivity(act, mock.Anything).After(5 * time.Minute).Return(nil)

		env.ExecuteWorkflow(wf)
		require.NoError(t, env.GetWorkflowError())
		var result []int
		require.NoError(t, env.GetWorkflowResult(&result))
		require.NotEmpty(t, result)
		assert.Equal(t, 0, result[0], "first activity to be created should be the first one canceled")
		assert.Equal(t, []int{1, 2, 3, 4, 5, 6, 7, 8, 9}, result[1:], "other activities should finish in a consistent (but undefined) order")
	}

	type variant struct {
		name           string
		separateStart  bool
		separateSelect bool
	}
	// all variants expose this behavior, but being a bit more exhaustive in the face
	// of decision-scheduling differences seems good.
	for _, test := range []variant{
		{"many in one decision", false, false},
		{"many started at once, selected slowly", false, true},
		{"started slowly, selected quickly", true, false},
		{"started and selected slowly", true, true},
	} {
		t.Run(test.name, func(t *testing.T) {
			check(t, test.separateStart, test.separateSelect)
		})
	}
}

Note: this test came from cadence go client where I noticed they had fixed this bug.

The text was updated successfully, but these errors were encountered:

throwaway58383958484 · 2024-03-04T16:33:24Z

Could this be fixed now using the new helper? #1340

Quinn-With-Two-Ns · 2024-03-04T18:55:52Z

No since the keys of the map cannot be sorted. Likely we would just copy what Cadence did to resolve this problem cadence-workflow/cadence-go-client@dcaec77

Quinn-With-Two-Ns added the potential-bug label Aug 24, 2023

dhiaayachi mentioned this issue Sep 18, 2024

Expose FirstRunID in WorkflowExecutionInfo under DescribeWorkflowExecutionResponse dhiaayachi/temporal#36

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Canceling a context with multiple child contexts can be non deterministic #1208

Canceling a context with multiple child contexts can be non deterministic #1208

Quinn-With-Two-Ns commented Aug 24, 2023 •

edited by sync-by-unito bot

Loading

throwaway58383958484 commented Mar 4, 2024

Quinn-With-Two-Ns commented Mar 4, 2024

Canceling a context with multiple child contexts can be non deterministic #1208

Canceling a context with multiple child contexts can be non deterministic #1208

Comments

Quinn-With-Two-Ns commented Aug 24, 2023 • edited by sync-by-unito bot Loading

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

throwaway58383958484 commented Mar 4, 2024

Quinn-With-Two-Ns commented Mar 4, 2024

Quinn-With-Two-Ns commented Aug 24, 2023 •

edited by sync-by-unito bot

Loading