Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: fix five bugs #354

Merged
merged 5 commits into from
Feb 3, 2022
Merged

runtime: fix five bugs #354

merged 5 commits into from
Feb 3, 2022

Conversation

jgraettinger
Copy link
Member

@jgraettinger jgraettinger commented Jan 31, 2022

See individual commits for fixes to #350 & #352. No behavior or documentation changes.

See also the related gazette fix: gazette/core#315

Testing:


This change is Reviewable

@jgraettinger jgraettinger force-pushed the johnny/bug-fixes branch 2 times, most recently from 567395e to 276b829 Compare February 1, 2022 23:18
If StartCommit fails to transfer ownership of the futures to the
client's readLoop.

A failure to resolve this future _seems_ fine, but causes deadlock in
certain conditions.

Also tighten up some usage-related errors to be panics instead,
to simplify the overall error flow.

Also handle errors from publishing materialization stats.

Fixes #352
Before, capture tasks spawned their own ShardSpec watch loops,
and we were leaking these loops across connector invocations.

Fix that (by only spawning new watch loops as required),
but also centralize the creation of these watch loops and introduce a
term context which is parented by the shard's context.

Update shuffle.ReadBuilder to take a |doneCh|, which is simply the
term's Context.Done(). Capture tasks also directly use the term context
when starting up connector RPCs.

Fixes #350
@jgraettinger jgraettinger changed the title protocols: materialize client ensures op futures are resolved runtime: fix two bugs Feb 2, 2022
@jgraettinger jgraettinger changed the title runtime: fix two bugs runtime: fix three bugs Feb 2, 2022
@jgraettinger jgraettinger marked this pull request as ready for review February 2, 2022 23:00
To notify the server that it will no longer send acknowledgements.
This isn't typically required with true gRPC streams, but it's good
practice and it *does* cause a wedged go-routine with our in-process
adapter.

Issue #350
If a connector error occurrs, immediately surface it into the read
channel (causing the shard to fail), rather than squelching it and
waiting out the capture polling interval. The prior logic pre-dates the
FAILED shard watchdog, and we now want to consolidate error handling to
simply let the shard crash and allow the watchdog to sort out retries.

Now, we'll only wait out a polling interval if the connector exits with
EOF and the term context is not cancelled. This means that a new
activation of the task will immediately re-start the polling interval,
which is generally what the user wants & expects.

Tested locally, by running both streaming and polled captures, manually
un-assigning failed and non-failed shards, and coercing captures to fail
with driver errors.

Issue #350
@jgraettinger
Copy link
Member Author

FYI added two additional patches for new bugs discovered and discussed on #350

@jgraettinger jgraettinger changed the title runtime: fix three bugs runtime: fix five bugs Feb 3, 2022
Copy link
Member

@willdonnelly willdonnelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jgraettinger jgraettinger merged commit eff30b5 into master Feb 3, 2022
@jgraettinger jgraettinger deleted the johnny/bug-fixes branch February 3, 2022 22:53
@oliviamiannone oliviamiannone added the docs complete / NA No (more) doc work related to this PR label Feb 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs complete / NA No (more) doc work related to this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants