Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canceling a context with multiple child contexts can be non deterministic #1208

Open
Quinn-With-Two-Ns opened this issue Aug 24, 2023 · 2 comments

Comments

@Quinn-With-Two-Ns
Copy link
Contributor

Quinn-With-Two-Ns commented Aug 24, 2023

Expected Behavior

Child context cancellation propagates in a deterministic order

Actual Behavior

Child context cancellation does not propagates in a deterministic order.

When a parent context cancels its' children it loops through a map. Map order in Go is not deterministic so the order child contexts are canceled is not deterministic.

Steps to Reproduce the Problem

func TestContextCancellationOrderDeterminism(t *testing.T) {
	/*
		Previously, child-contexts were stored in a map, preventing deterministic order when propagating cancellation.
		The order of branches being selected in this test was random, both for the first event and in following ones.

		In principle this should be fine, but it's possible for the effects of cancellation to trigger a selector's
		future-done callback, which currently records the *real-time*-first event as the branch to unblock, rather than
		doing something more safe by design (e.g. choosing based on state when the selector's goroutine is unblocked).

		Unfortunately, we cannot change the selector's behavior without introducing non-backwards-compatible changes to
		currently-working workflows.

		So the workaround for now is to maintain child-context order, so they are canceled in a consistent order.
		As this order was not controlled before, and Go does a pretty good job at randomizing map iteration order,
		converting non-determinism to determinism should be strictly no worse for backwards compatibility, and it
		fixes the issue for future executions.
	*/
	check := func(t *testing.T, separateStart, separateSelect bool) {
		env := newTestWorkflowEnv(t)
		act := func(ctx context.Context) error {
			return nil // will be mocked
		}
		wf := func(ctx Context) ([]int, error) {
			ctx, cancel := WithCancel(ctx)
			Go(ctx, func(ctx Context) {
				_ = Sleep(ctx, time.Minute)
				cancel()
			})

			// start some activities, which will not complete before the timeout cancels them
			ctx = WithActivityOptions(ctx, ActivityOptions{
				TaskQueue:              "",
				ScheduleToCloseTimeout: time.Hour,
				ScheduleToStartTimeout: time.Hour,
				StartToCloseTimeout:    time.Hour,
			})
			s := NewSelector(ctx)
			var result []int
			for i := 0; i < 10; i++ {
				i := i
				// need a child context, a future alone is not enough as it does not become a child
				cctx, ccancel := WithCancel(ctx)

				s.AddFuture(ExecuteActivity(cctx, act), func(f Future) {
					ccancel() // TODO: is this necessary to prevent leaks?  if it is, how can we make it not?
					err := f.Get(ctx, nil)
					if err == nil || !IsCanceledError(err) {
						// fail the test, this should not happen - activities must be canceled or it's not valid.
						t.Errorf("activity completion or failure for some reason other than cancel: %v", err)
					}
					result = append(result, i)
				})

				if separateStart {
					// yield so they are submitted one at a time, in case that matters
					_ = Sleep(ctx, time.Second)
				}
			}
			for i := 0; i < 10; i++ {
				if separateSelect {
					// yield so they are selected one at a time, in case that matters
					_ = Sleep(ctx, time.Second)
				}
				s.Select(ctx)
			}

			return result, nil
		}
		env.RegisterWorkflow(wf)
		env.RegisterActivity(act)

		// activities must not complete in time
		env.OnActivity(act, mock.Anything).After(5 * time.Minute).Return(nil)

		env.ExecuteWorkflow(wf)
		require.NoError(t, env.GetWorkflowError())
		var result []int
		require.NoError(t, env.GetWorkflowResult(&result))
		require.NotEmpty(t, result)
		assert.Equal(t, 0, result[0], "first activity to be created should be the first one canceled")
		assert.Equal(t, []int{1, 2, 3, 4, 5, 6, 7, 8, 9}, result[1:], "other activities should finish in a consistent (but undefined) order")
	}

	type variant struct {
		name           string
		separateStart  bool
		separateSelect bool
	}
	// all variants expose this behavior, but being a bit more exhaustive in the face
	// of decision-scheduling differences seems good.
	for _, test := range []variant{
		{"many in one decision", false, false},
		{"many started at once, selected slowly", false, true},
		{"started slowly, selected quickly", true, false},
		{"started and selected slowly", true, true},
	} {
		t.Run(test.name, func(t *testing.T) {
			check(t, test.separateStart, test.separateSelect)
		})
	}
}

Note: this test came from cadence go client where I noticed they had fixed this bug.

@throwaway58383958484
Copy link

Could this be fixed now using the new helper? #1340

@Quinn-With-Two-Ns
Copy link
Contributor Author

No since the keys of the map cannot be sorted. Likely we would just copy what Cadence did to resolve this problem cadence-workflow/cadence-go-client@dcaec77

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants