Server race condition: connection ack + first subscribe #501

Closed
thegedge opened this issue Sep 20, 2023 · 6 comments · Fixed by #506 · May be fixed by a11ywatch/core#429
Labels: bug (Something isn't working), released (Has been released and published)

Comments

@thegedge

What we've observed is that we periodically get a [Network] undefined error client-side, which we noticed was tied to a close code of 4401. That close code implies the first subscribe message is being sent before the connection ack. After scattering many logs around, we observed the client receiving the ack and only ever sending the first subscribe after that point.

On the server, we see the subscribe happen before ctx.acknowledged is set to true here:

graphql-ws/src/server.ts

Lines 616 to 632 in 50d5a51

await socket.send(
  stringifyMessage<MessageType.ConnectionAck>(
    isObject(permittedOrPayload)
      ? {
          type: MessageType.ConnectionAck,
          payload: permittedOrPayload,
        }
      : {
          type: MessageType.ConnectionAck,
          // payload is completely absent if not provided
        },
    replacer,
  ),
);
// @ts-expect-error: I can write
ctx.acknowledged = true;

I think busy event loops are to blame here. Do you think it's possible we've done something wrong (see below for more context)? We've resolved the issue locally by setting ctx.acknowledged = true in our onConnect handler, but I figure there's something we can do in graphql-ws itself.
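
For reference, a minimal sketch of that local workaround (assuming a `schema` variable is in scope; the cast is needed because `acknowledged` is readonly on the public `Context` type, so this is not a supported API, just what worked for us):

```ts
import { makeServer } from 'graphql-ws';

const server = makeServer({
  schema,
  onConnect(ctx) {
    // Workaround: mark the connection as acknowledged before the (possibly
    // slow) ConnectionAck send resolves, so a Subscribe that races it isn't
    // rejected with a 4401 close. `acknowledged` is readonly on the public
    // Context type, hence the cast.
    (ctx as { acknowledged: boolean }).acknowledged = true;
    return true;
  },
});
```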

Some ideas I had:

  1. Eagerly set ctx.acknowledged = true and roll that back if the send fails. I think in this instance a misbehaving client could potentially send a subscribe without receiving the ack.
  2. Make acknowledged be a promise, and await it for relevant messages. As a protective measure, we could either put a timeout around that await or throw if a second message comes in that needs to await the promise (rough sketch after this list).
  3. Do something like setTimeout(..., 0) or process.nextTick with the assumption that one event loop tick is enough to resolve this race condition (uncertain if it will be, but happy to test that out).
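
Purely to illustrate the shape of idea 2, a hypothetical sketch (none of these names exist in graphql-ws):

```ts
// Hypothetical sketch of idea 2: acknowledgement becomes a promise that
// message handling can await, with a timeout as the protective measure.
let markAcknowledged!: () => void;
const acknowledged = new Promise<void>((resolve) => {
  markAcknowledged = resolve;
});

// ...call markAcknowledged() right after the ConnectionAck send resolves.

// ...and when a Subscribe arrives before the ack has been sent:
function waitForAck(timeoutMs = 3_000): Promise<void> {
  return Promise.race([
    acknowledged,
    new Promise<void>((_, reject) =>
      setTimeout(
        () => reject(new Error('connection was never acknowledged')),
        timeoutMs,
      ),
    ),
  ]);
}
```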

Debug Information
We're using graphql-ws in tandem with @urql/core to power one of our clients, and we're also using it to power our server (version 5.14.0).

We call makeServer with custom onConnect and onSubscribe handlers. We provide server.opened with a custom socket, but it's primarily just a thin wrapper around a ws WebSocket. It looks something like this:

{
  send: async (data) => {
    await context.with(traceContext, async () => { // otel tracing
      await new Promise<void>((resolve, reject) => {
        if (wsSocket.readyState !== wsSocket.OPEN) return reject("socket not ready");

        try {
          wsSocket.send(data, (err) => {
            if (err) {
              reject(err);
            } else {
              resolve();
            }
          });
        } catch (error) {
          reject(error);
        }
      });
    });
  },
}

After that, we just load test our server and it's pretty easy to reproduce the aforementioned issue.

@enisdenjo
Owner

Hey there, thanks for reporting! The way I see it, the only possible solution inside graphql-ws is to eagerly set the acknowledged flag; reverting wouldn't be necessary, since if .send fails the connection will be closed anyway.

While I think about the implications of this change, a potential mitigation for you would be to perform the telemetry tracing in a setTimeout yourself and resolve the .send immediately.
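
One way to read that mitigation, sketched against the wrapper shown earlier (it reuses the `wsSocket`, `context`, and `traceContext` names from that snippet, and it trades away waiting for the flush, so send errors would have to surface elsewhere, e.g. via the close handler):

```ts
{
  send: async (data) => {
    if (wsSocket.readyState !== wsSocket.OPEN) throw new Error("socket not ready");

    // Hand the frame to ws and resolve right away...
    wsSocket.send(data, (err) => {
      // ...and do the otel tracing on a later tick, off the path the server
      // awaits before marking the connection acknowledged.
      setTimeout(() => {
        context.with(traceContext, () => {
          if (err) {
            // record/log the send error on the span here
          }
          // record the span for this send here
        });
      }, 0);
    });
  },
}
```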

@thegedge
Author

Yeah, I was trying to think of the implications and I wasn't 100% sure either. I think it's okay, but haven't thought through this super deeply. I'll share some of my thoughts though, in case it's helpful:

  1. It just feels like the protocol is inherently race-y.
    1. The ACK only goes one way. When you look at something like TCP, there's an ACK going each way (well, an ACK and a SYN-ACK) during the connection init process. This is intentional, to properly set up a duplex stream.
    2. If we compute some random information and send it with the ack, we could store that before the send and have the client send it with its messages. We could then check that to determine whether the client truly received the ACK (see the sketch after this list).
  2. When send resolves, I don't think there's any guarantee it was truly received by the other side, just that the message was flushed/written.
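
Purely to illustrate point 1.2 with a hypothetical sketch; nothing here is part of the graphql-ws protocol, and the names are made up:

```ts
import { randomUUID } from 'node:crypto';

// Hypothetical illustration only; not part of the graphql-ws protocol.
const pendingAcks = new Set<string>();

// Server, before sending ConnectionAck: remember a nonce and include it in
// the ack payload so the client can echo it back with later messages.
function buildAckPayload() {
  const nonce = randomUUID();
  pendingAcks.add(nonce);
  return { ackNonce: nonce };
}

// Server, on a later message: an echoed nonce proves the client actually
// received the ack, regardless of when our own send promise resolved.
function clientHasAcked(echoedNonce: unknown): boolean {
  return typeof echoedNonce === 'string' && pendingAcks.has(echoedNonce);
}
```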

I'll think about this some more, but I'd be more than happy to send a PR for any solution we think is the answer!

@enisdenjo enisdenjo added the bug Something isn't working label Sep 28, 2023
enisdenjo added a commit that referenced this issue Sep 28, 2023
enisdenjo pushed a commit that referenced this issue Sep 28, 2023
## [5.14.1](v5.14.0...v5.14.1) (2023-09-28)

### Bug Fixes

* **server:** Acknowledge connection before notifying the client to avoid race conditions with slow sends ([#506](#506)) ([8cb82bd](8cb82bd)), closes [#501](#501)
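
Relative to the server.ts snippet quoted earlier, the fix boils down to flipping the flag before the potentially slow send; a rough reconstruction from the commit message (not the exact diff):

```ts
// Mark the connection acknowledged *before* flushing the ack, so a
// Subscribe that races a slow send is no longer rejected with 4401.
// @ts-expect-error: I can write
ctx.acknowledged = true;
await socket.send(
  stringifyMessage<MessageType.ConnectionAck>(
    isObject(permittedOrPayload)
      ? { type: MessageType.ConnectionAck, payload: permittedOrPayload }
      : { type: MessageType.ConnectionAck },
    replacer,
  ),
);
```
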
@enisdenjo
Owner

🎉 This issue has been resolved in version 5.14.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@enisdenjo enisdenjo added the released Has been released and published label Sep 28, 2023
@enisdenjo
Owner

Hey @thegedge, sorry for the delay.

This race condition has been fixed with #506 (I've even managed to add a proper test) and released with v5.14.1. I'd appreciate it if you could give it a spin and report whether the issue persists.

@thegedge
Author

🙇 No worries about the delay, appreciate you fixing this!

I'm quite confident it will work, because it's basically the same as what I did locally, except that I set acknowledged at the end of my custom onConnect (there's only synchronous code between that and where you now set ctx.acknowledged). I'll give it a spin as soon as I get a chance though, just to be sure!

@thegedge
Author

Sorry for the super long delay in getting back to you, @enisdenjo. I just did some load testing locally and saw no issues. Again, I appreciate the quick response and fix <3
