runtime: fix five bugs #354

jgraettinger · 2022-01-31T14:15:03Z

See individual commits for fixes to #350 & #352. No behavior or documentation changes.

See also the related gazette fix: gazette/core#315

Testing:

runtime: failed shard teardown wedged on BuildHints #352 is a challenging, timing-dependent reproduction, but I'm pretty confident this is the underlying cause. New unit tests are added.
runtime: leaking goroutines as capture terms restart #350 and consumer: ABA concurrency between shard watchdog & transition handler gazette/core#314 were manually confirmed by running a catalog that mixed a polled & streaming capture, along with a materialization, and exercising:
- Multiple timed polling intervals.
- Multiple deployments of the same catalog tasks, such that their specs were updated.
- Multiple unassignments of the running materialization to re-create the conditions of assigning to the same reactor & slot, to confirm the "delete and re-create" semantics were honored.

protocols: materialize client ensures op futures are resolved, even if StartCommit fails to transfer ownership of the futures to the client's readLoop. A failure to resolve such a future _seems_ fine, but causes deadlock in certain conditions. Also tighten up some usage-related errors to be panics instead, to simplify the overall error flow, and handle errors from publishing materialization stats. Fixes #352
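A minimal sketch of the "always resolve the future" idea, for illustration only. The `opFuture` type, the `startCommit` helper, and the `handoff` channel are hypothetical stand-ins and not the actual materialize client API: the point is just that whoever fails to hand a future off must resolve it, or its waiters block forever.

```go
package main

import (
	"errors"
	"fmt"
)

// opFuture is a hypothetical stand-in for an async operation future.
type opFuture struct {
	done chan struct{}
	err  error
}

func newOpFuture() *opFuture { return &opFuture{done: make(chan struct{})} }

// resolve records the outcome and unblocks all waiters.
// It must be called exactly once per future.
func (f *opFuture) resolve(err error) {
	f.err = err
	close(f.done)
}

// wait blocks until the future is resolved and returns its outcome.
func (f *opFuture) wait() error {
	<-f.done
	return f.err
}

// startCommit (hypothetical) hands |f| to a read loop via |handoff|.
// If |send| fails, ownership never transfers, so startCommit must
// resolve |f| itself rather than leave its waiters blocked.
func startCommit(f *opFuture, send func() error, handoff chan<- *opFuture) error {
	if err := send(); err != nil {
		f.resolve(err)
		return err
	}
	handoff <- f // The read loop now owns |f| and is responsible for resolving it.
	return nil
}

// demoFailedCommit returns the error observed by a waiter after a failed commit.
func demoFailedCommit() string {
	f := newOpFuture()
	handoff := make(chan *opFuture, 1)
	_ = startCommit(f, func() error { return errors.New("send failed") }, handoff)

	if err := f.wait(); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	fmt.Println(demoFailedCommit())
}
```

If the `f.resolve(err)` call in the failure branch is dropped, `f.wait()` in the demo never returns, which is the shape of deadlock described above.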

Before, capture tasks spawned their own ShardSpec watch loops, and we were leaking these loops across connector invocations. Fix that (by only spawning new watch loops as required), but also centralize the creation of these watch loops and introduce a term context which is parented by the shard's context. Update shuffle.ReadBuilder to take a |doneCh|, which is simply the term's Context.Done(). Capture tasks also directly use the term context when starting up connector RPCs. Fixes #350
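The context relationship described here can be sketched roughly as follows. This is an illustration under assumptions, not the runtime's code: `watchLoop` is a hypothetical stand-in for a ShardSpec watch loop, and the only thing taken from the commit message is that the term context is a child of the shard's context and that |doneCh| is the term's `Context.Done()`.

```go
package main

import (
	"context"
	"fmt"
)

// watchLoop is a hypothetical stand-in for a ShardSpec watch loop.
// It exits as soon as |doneCh| closes, and signals that on |exited|.
func watchLoop(doneCh <-chan struct{}, exited chan<- struct{}) {
	<-doneCh
	close(exited)
}

// termCancelStopsLoop cancels only the term context, waits for the
// loop to exit, and reports whether the shard context is still live.
func termCancelStopsLoop() bool {
	shardCtx, cancelShard := context.WithCancel(context.Background())
	defer cancelShard()

	// The term context is parented by the shard's context.
	termCtx, cancelTerm := context.WithCancel(shardCtx)
	exited := make(chan struct{})
	go watchLoop(termCtx.Done(), exited)

	cancelTerm() // End of term: the loop must not outlive it.
	<-exited
	return shardCtx.Err() == nil
}

// shardCancelEndsTerm cancels the shard context and reports whether
// the child term context was cancelled along with it.
func shardCancelEndsTerm() bool {
	shardCtx, cancelShard := context.WithCancel(context.Background())
	termCtx, cancelTerm := context.WithCancel(shardCtx)
	defer cancelTerm()

	cancelShard()
	<-termCtx.Done()
	return termCtx.Err() == context.Canceled
}

func main() {
	fmt.Println(termCancelStopsLoop(), shardCancelEndsTerm())
}
```

Ending a term stops that term's loop without touching the shard, while tearing down the shard ends any term still running beneath it.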

Fixes gazette/core#314

protocols/capture: PullClient must CloseSend, to notify the server that it will no longer send acknowledgements. This isn't typically required with true gRPC streams, but it's good practice and it *does* cause a wedged goroutine with our in-process adapter. Issue #350

If a connector error occurs, immediately surface it into the read channel (causing the shard to fail), rather than squelching it and waiting out the capture polling interval. The prior logic pre-dates the FAILED shard watchdog, and we now want to consolidate error handling to simply let the shard crash and allow the watchdog to sort out retries. Now, we'll only wait out a polling interval if the connector exits with EOF and the term context is not cancelled. This means that a new activation of the task will immediately re-start the polling interval, which is generally what the user wants & expects. Tested locally, by running both streaming and polled captures, manually un-assigning failed and non-failed shards, and coercing captures to fail with driver errors. Issue #350
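One way to read the decision described above, as a rough sketch. The `afterConnectorExit` function and the action names are hypothetical, and treating "EOF with a cancelled term" as "start the next term right away" is an interpretation of the commit message rather than something confirmed by the code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// action names are hypothetical labels for what happens next.
type action string

const (
	failShard     action = "fail-shard"      // Surface the error and let the watchdog retry.
	waitInterval  action = "wait-interval"   // Clean exit: wait out the polling interval.
	startNextTerm action = "start-next-term" // Term was cancelled: don't wait.
)

// afterConnectorExit sketches the choice made once a connector returns.
func afterConnectorExit(termCtx context.Context, err error) action {
	switch {
	case err != nil && !errors.Is(err, io.EOF):
		return failShard
	case termCtx.Err() != nil:
		return startNextTerm
	default:
		return waitInterval
	}
}

// demoActions runs the three cases: a driver error, a clean EOF on a
// live term, and a clean EOF on a cancelled term.
func demoActions() []action {
	live := context.Background()
	cancelled, cancel := context.WithCancel(context.Background())
	cancel()

	return []action{
		afterConnectorExit(live, errors.New("driver error")),
		afterConnectorExit(live, io.EOF),
		afterConnectorExit(cancelled, io.EOF),
	}
}

func main() {
	fmt.Println(demoActions())
}
```

Note there is no "log the error and sleep" branch: under this reading, any non-EOF error goes straight to failing the shard.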

jgraettinger · 2022-02-03T17:40:37Z

FYI added two additional patches for new bugs discovered and discussed on #350

willdonnelly

LGTM

jgraettinger force-pushed the johnny/bug-fixes branch 2 times, most recently from 567395e to 276b829 Compare February 1, 2022 23:18

jgraettinger added 2 commits February 1, 2022 19:18

jgraettinger force-pushed the johnny/bug-fixes branch from 276b829 to e25bb05 Compare February 2, 2022 14:21

jgraettinger changed the title ~~protocols: materialize client ensures op futures are resolved~~ runtime: fix two bugs Feb 2, 2022

go.mod: bump gazette pin to bring in shard creation race fix

904fd86

Fixes gazette/core#314

jgraettinger force-pushed the johnny/bug-fixes branch from 6548c2b to 904fd86 Compare February 2, 2022 22:56

jgraettinger changed the title ~~runtime: fix two bugs~~ runtime: fix three bugs Feb 2, 2022

jgraettinger marked this pull request as ready for review February 2, 2022 23:00

jgraettinger requested a review from willdonnelly February 2, 2022 23:00

jgraettinger added 2 commits February 3, 2022 12:38

protocols/capture: PullClient must CloseSend

a5cc8dd

To notify the server that it will no longer send acknowledgements. This isn't typically required with true gRPC streams, but it's good practice and it *does* cause a wedged goroutine with our in-process adapter. Issue #350
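To illustrate why a missing CloseSend can wedge a goroutine when the "stream" is really an in-process channel, here is a small sketch. The `inProcStream` type and `serve` function are hypothetical and much simpler than the real adapter; only the `CloseSend` name mirrors the gRPC client stream method.

```go
package main

import "fmt"

// inProcStream is a hypothetical in-process stand-in for the
// client-to-server half of a stream carrying acknowledgements.
type inProcStream struct {
	acks chan struct{}
}

// Send delivers one acknowledgement to the server side.
func (s *inProcStream) Send() { s.acks <- struct{}{} }

// CloseSend tells the server side that no more acknowledgements will arrive.
func (s *inProcStream) CloseSend() { close(s.acks) }

// serve counts acknowledgements until the client closes its send
// side, and then reports the total on |result|.
func serve(s *inProcStream, result chan<- int) {
	n := 0
	for range s.acks {
		n++
	}
	result <- n
}

// runClient sends |numAcks| acknowledgements, closes the send side,
// and returns the count seen by the server goroutine.
func runClient(numAcks int) int {
	s := &inProcStream{acks: make(chan struct{})}
	result := make(chan int, 1)
	go serve(s, result)

	for i := 0; i < numAcks; i++ {
		s.Send()
	}
	s.CloseSend() // Without this, serve blocks forever in its range loop.
	return <-result
}

func main() {
	fmt.Println(runClient(3))
}
```

With a real network stream, tearing down the connection eventually unblocks the server's read. A bare channel has no such fallback, so the explicit close is the only signal the reader gets.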

jgraettinger changed the title ~~runtime: fix three bugs~~ runtime: fix five bugs Feb 3, 2022

willdonnelly approved these changes Feb 3, 2022

jgraettinger merged commit eff30b5 into master Feb 3, 2022

jgraettinger deleted the johnny/bug-fixes branch February 3, 2022 22:53

oliviamiannone added the docs complete / NA No (more) doc work related to this PR label Feb 7, 2022
