Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@apollo/gateway: Make ApolloGateway.stop() reliable and required #452

Merged
merged 2 commits into from
Feb 9, 2021

Conversation

glasser
Copy link
Member

@glasser glasser commented Feb 4, 2021

The (not particularly documented) ApolloGateway.stop() method didn't reliably
stop the gateway from polling. All it did was cancel a timeout. But if it is
called while pollServices is in the middle of running, it would do nothing, and
then pollServices would carry on and set another timeout.

Better semantics would be for stop() to reliably stop polling: allow the current
poll to complete, ensure that there will be no more polls, and then (async)
return. This PR implements those semantics, by implementing an explicit state
machine for ApolloGateway's polling.

One reason that these bugs were able to survive is that calling stop() is often
unnecessary. In apollographql/apollo-server#3105 we
chose to unref the polling timeout to allow the Node process to exit if it's
the only thing left on the event loop, instead of encouraging users of
ApolloGateway to be responsible and call stop() themselves. While this may
be reasonable when the gateway's lifecycle is merely a schema polling timer, we
may have future changes to the gateway where proper lifecycle handling is more
important. So this PR also moves away from the world where it's fine to not
bother to explicitly stop the gateway.

That said, in the common case, we don't want users to have to write gateway
stopping code. It makes more sense for stopping the ApolloGateway to be the
responsibility of ApolloServer.stop(), just as ApolloServer.stop() can
trigger events in plugins. So in the recently-released Apollo Server v2.20,
ApolloServer.stop() calls ApolloGateway.stop(). This should mean that in
most cases, the missing unref shouldn't keep the process running, as long as
you've run ApolloServer.stop() (and if you don't, it's likely that other parts
of the server are keeping your process running).

What if you're still running an old version of Apollo Server? For a bit of
backwards compatibility, ApolloGateway detects if it appears to be connected
to Apollo Server older than v2.18. If so, it continues to do the old
unref(). If you're using v2.18 or v2.19, the upgrade to v2.20 should be pretty
painless (it's mostly just minor bugfixes). If you're using ApolloGateway on its
own for some reason, and this change causes your processes to hang on shutdown,
adding a stop() call should be pretty straightforward. (If we learn that this
change really is devastating, we can always go back to an unconditional
timer.unref() later.)

Fixes apollographql/apollo-server#4428.

@glasser glasser requested a review from trevor-scheer February 4, 2021 06:21
@glasser
Copy link
Member Author

glasser commented Feb 4, 2021

Before merging this I should implement the Apollo Server change described in it and release v2.20. It's ready for review but I'll mark as draft for this reason.

@glasser glasser marked this pull request as draft February 4, 2021 06:22
@@ -243,6 +262,11 @@ export class ApolloGateway implements GraphQLService {
apollo?: ApolloConfig;
engine?: GraphQLServiceEngineConfig;
}) {
if (this.state.phase !== 'initialized') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I be able to call load again after awaiting the stop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably disregard, I can tell below it's explicitly written with the intention of not allowing that.

this.pollServices();
if (this.state.phase === 'polling') {
// If we weren't stopped, we should transition back to the initial 'loading' state and trigger
// another call to itself. (Do that in a setImmediate to avoid unbounded stack sizes.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! TIL

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I actually don't know offhand if you get unbounded stack sizes from recursion in async functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I checked and you do.)

glasser added a commit to apollographql/apollo-server that referenced this pull request Feb 8, 2021
Because AS is what invokes ApolloGateway.load, it should be its responsibility
to invoke the matching stop method.

apollographql/federation#452 (aimed at @apollo/gateway
0.23) will change ApolloGateway to no longer "unref" its polling timer: ie, it
makes calling `stop()` actually part of its expected API instead of something
you can ignore if you feel like it without affecting program shutdown. It has
a bit of a hack to still unref the timer if it looks like you're using an
old (pre-2.18) version of Apollo Server, but this PR (which will be released in
v2.20.0) will make ApolloServer stop the gateway for you.

Part of fixing #4428.
glasser added a commit to apollographql/apollo-server that referenced this pull request Feb 8, 2021
Because AS is what invokes ApolloGateway.load, it should be its responsibility
to invoke the matching stop method.

apollographql/federation#452 (aimed at @apollo/gateway
0.23) will change ApolloGateway to no longer "unref" its polling timer: ie, it
makes calling `stop()` actually part of its expected API instead of something
you can ignore if you feel like it without affecting program shutdown. It has
a bit of a hack to still unref the timer if it looks like you're using an
old (pre-2.18) version of Apollo Server, but this PR (which will be released in
v2.20.0) will make ApolloServer stop the gateway for you.

Part of fixing #4428.

Also simplify typings of toDispose set. 
`() => ValueOrPromise<void>` is a weird type because `void` means "I'm not going
to look at the return value" which is sorta incompatible with "but I need to see
if it's a Promise or not". So just make these all async (changing a couple
implementations by adding `async`).
The (not particularly documented) ApolloGateway.stop() method didn't reliably
stop the gateway from polling. All it did was cancel a timeout. But if it is
called while pollServices is in the middle of running, it would do nothing, and
then pollServices would carry on and set another timeout.

Better semantics would be for stop() to reliably stop polling: allow the current
poll to complete, ensure that there will be no more polls, and then (async)
return. This PR implements those semantics, by implementing an explicit state
machine for ApolloGateway's polling.

One reason that these bugs were able to survive is that calling stop() is often
unnecessary. In apollographql/apollo-server#3105 we
chose to `unref` the polling timeout to allow the Node process to exit if it's
the only thing left on the event loop, instead of encouraging users of
`ApolloGateway` to be responsible and call `stop()` themselves. While this may
be reasonable when the gateway's lifecycle is merely a schema polling timer, we
may have future changes to the gateway where proper lifecycle handling is more
important. So this PR also moves away from the world where it's fine to not
bother to explicitly stop the gateway.

That said, in the common case, we don't want users to have to write gateway
stopping code. It makes more sense for stopping the `ApolloGateway` to be the
responsibility of `ApolloServer.stop()`, just as `ApolloServer.stop()` can
trigger events in plugins. So in the recently-released Apollo Server v2.20,
`ApolloServer.stop()` calls `ApolloGateway.stop()`. This should mean that in
most cases, the missing `unref` shouldn't keep the process running, as long as
you've run `ApolloServer.stop()` (and if you don't, it's likely that other parts
of the server are keeping your process running).

What if you're still running an old version of Apollo Server? For a bit of
backwards compatibility, `ApolloGateway` detects if it appears to be connected
to Apollo Server older than v2.18. If so, it continues to do the old
unref().  If you're using v2.18 or v2.19, the upgrade to v2.20 should be pretty
painless (it's mostly just minor bugfixes). If you're using ApolloGateway on its
own for some reason, and this change causes your processes to hang on shutdown,
adding a `stop()` call should be pretty straightforward. (If we learn that this
change really is devastating, we can always go back to an unconditional
timer.unref() later.)

Fixes #4428.
@glasser glasser force-pushed the glasser/effective-stop branch from c4fe737 to d82e916 Compare February 9, 2021 23:09
@glasser glasser marked this pull request as ready for review February 9, 2021 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ApolloGateway.stop() is not reliable
2 participants