Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

socket-mode: Handling WS errors during handshake #2099

Merged
merged 5 commits into from
Dec 9, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 7 additions & 8 deletions packages/socket-mode/src/SlackWebSocket.ts
Original file line number Diff line number Diff line change
Expand Up @@ -111,29 +111,28 @@
agent: this.options.httpAgent,
};

const ws = new WebSocket(this.options.url, options);
this.websocket = new WebSocket(this.options.url, options);

Check warning on line 114 in packages/socket-mode/src/SlackWebSocket.ts

View check run for this annotation

Codecov / codecov/patch

packages/socket-mode/src/SlackWebSocket.ts#L114

Added line #L114 was not covered by tests

ws.addEventListener('open', (_event) => {
this.websocket.addEventListener('open', (_event) => {

Check warning on line 116 in packages/socket-mode/src/SlackWebSocket.ts

View check run for this annotation

Codecov / codecov/patch

packages/socket-mode/src/SlackWebSocket.ts#L116

Added line #L116 was not covered by tests
this.logger.debug('WebSocket open event received (connection established)!');
this.websocket = ws;
Copy link
Contributor Author

@filmaj filmaj Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not tracking the current websocket instance was a problem in the case the web socket server returns an error. Since only once the open event was emitted would we keep track of the websocket (this line), any errors during the websocket handshake would call into disconnect which, crucially, only terminates the underlying network resources for the tracked this.websocket.

That was the underlying issue that caused the "doubling up" of reconnection attempts (see "Issue 1" described by @gdhardy1 in this comment): each time an error was encountered during websocket handshake, there would be an 'untracked' ws instance still executing reconnect attempts in addition to a new one being spawned. Essentially websocket-connection-bombing ourselves.

If you remove the changes in this file, you will see the integration test I added at the end of this PR fail.

Copy link

@gdhardy1 gdhardy1 Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for finding this. I was able to patch this in and test it while we await a full fix. It stabilized things tremendously.

this.monitorPingToSlack();
});
ws.addEventListener('error', (event) => {
this.websocket.addEventListener('error', (event) => {

Check warning on line 120 in packages/socket-mode/src/SlackWebSocket.ts

View check run for this annotation

Codecov / codecov/patch

packages/socket-mode/src/SlackWebSocket.ts#L120

Added line #L120 was not covered by tests
this.logger.error(`WebSocket error occurred: ${event.message}`);
this.disconnect();
this.options.client.emit('error', websocketErrorWithOriginal(event.error));
});
ws.on('message', (msg, isBinary) => {
this.websocket.on('message', (msg, isBinary) => {

Check warning on line 125 in packages/socket-mode/src/SlackWebSocket.ts

View check run for this annotation

Codecov / codecov/patch

packages/socket-mode/src/SlackWebSocket.ts#L125

Added line #L125 was not covered by tests
this.options.client.emit('ws_message', msg, isBinary);
});
ws.on('close', (code: number, data: Buffer) => {
this.websocket.on('close', (code: number, data: Buffer) => {

Check warning on line 128 in packages/socket-mode/src/SlackWebSocket.ts

View check run for this annotation

Codecov / codecov/patch

packages/socket-mode/src/SlackWebSocket.ts#L128

Added line #L128 was not covered by tests
this.logger.debug(`WebSocket close frame received (code: ${code}, reason: ${data.toString()})`);
this.closeFrameReceived = true;
this.disconnect();
});

// Confirm WebSocket connection is still active
ws.on('ping', (data) => {
this.websocket.on('ping', (data) => {

Check warning on line 135 in packages/socket-mode/src/SlackWebSocket.ts

View check run for this annotation

Codecov / codecov/patch

packages/socket-mode/src/SlackWebSocket.ts#L135

Added line #L135 was not covered by tests
// Note that ws' `autoPong` option is true by default, so no need to respond to ping.
// see https://github.com/websockets/ws/blob/2aa0405a5e96754b296fef6bd6ebdfb2f11967fc/doc/ws.md#new-websocketaddress-protocols-options
if (this.options.pingPongLoggingEnabled) {
Expand All @@ -142,7 +141,7 @@
this.monitorPingFromSlack();
});

ws.on('pong', (data) => {
this.websocket.on('pong', (data) => {

Check warning on line 144 in packages/socket-mode/src/SlackWebSocket.ts

View check run for this annotation

Codecov / codecov/patch

packages/socket-mode/src/SlackWebSocket.ts#L144

Added line #L144 was not covered by tests
if (this.options.pingPongLoggingEnabled) {
this.logger.debug(`WebSocket received pong from Slack server (data: ${data.toString()})`);
}
Expand Down
10 changes: 7 additions & 3 deletions packages/socket-mode/src/SocketModeClient.ts
Original file line number Diff line number Diff line change
Expand Up @@ -229,9 +229,13 @@
this.logger.debug(`Before trying to reconnect, this client will wait for ${msBeforeRetry} milliseconds`);
return new Promise((res, _rej) => {
setTimeout(() => {
this.logger.debug('Continuing with reconnect...');
this.emit(State.Reconnecting);
cb.apply(this).then(res);
if (this.shuttingDown) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While not a problem brought up by users, I noticed a problem here when writing integration tests:

  • if you turn on the socket mode client and it is seeing errors from the web socket server, and
  • in between reconnect attempts you call disconnect() to stop the client,

.. then the client will keep trying to reconnect forever via this part of the code. Adding this guard here just in case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes for an awesome catch 🙏 ✨ Thanks for guarding against it!

this.logger.debug('Client shutting down, will not attempt reconnect.');
} else {
this.logger.debug('Continuing with reconnect...');
this.emit(State.Reconnecting);
cb.apply(this).then(res);
}

Check warning on line 238 in packages/socket-mode/src/SocketModeClient.ts

View check run for this annotation

Codecov / codecov/patch

packages/socket-mode/src/SocketModeClient.ts#L232-L238

Added lines #L232 - L238 were not covered by tests
}, msBeforeRetry);
});
}
Expand Down
5 changes: 1 addition & 4 deletions packages/socket-mode/src/errors.ts
Original file line number Diff line number Diff line change
Expand Up @@ -61,10 +61,7 @@ function errorWithCode(error: Error, code: ErrorCode): CodedError {
* A factory to create SMWebsocketError objects.
*/
export function websocketErrorWithOriginal(original: Error): SMWebsocketError {
const error = errorWithCode(
new Error(`Failed to send message on websocket: ${original.message}`),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed "Failed to send message" as part of this error, as this error can be raised unrelated to message-sending. In this PR, I noticed that websocket handshake errors also used this code path, which made the error message misleading.

ErrorCode.WebsocketError,
) as Partial<SMWebsocketError>;
const error = errorWithCode(new Error(original.message), ErrorCode.WebsocketError) as Partial<SMWebsocketError>;
error.original = original;
return error as SMWebsocketError;
}
Expand Down
50 changes: 46 additions & 4 deletions packages/socket-mode/test/integration.spec.js
Original file line number Diff line number Diff line change
Expand Up @@ -42,9 +42,9 @@ describe('Integration tests with a WebSocket server', () => {
});
});
afterEach(async () => {
server.close();
if (server) server.close();
server = null;
wss.close();
if (wss) wss.close();
wss = null;
exposed_ws_connection = null;
if (client) {
Expand Down Expand Up @@ -90,7 +90,7 @@ describe('Integration tests with a WebSocket server', () => {
await client.disconnect();
});
});
describe('catastrophic server behaviour', () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥!

describe('`apps.connections.open` API failure modes', () => {
beforeEach(() => {
client = new SocketModeClient({
appToken: 'whatever',
Expand Down Expand Up @@ -130,7 +130,7 @@ describe('Integration tests with a WebSocket server', () => {
await client.disconnect();
});
});
describe('failure modes / unexpected messages sent to client', () => {
describe('unexpected socket messages sent to client', () => {
const debugLoggerSpy = sinon.stub(); // add the following to expose further logging: .callsFake(console.log);
const noop = () => {};
beforeEach(() => {
Expand All @@ -139,6 +139,7 @@ describe('Integration tests with a WebSocket server', () => {
clientOptions: {
slackApiUrl: `http://localhost:${HTTP_PORT}/`,
},
logLevel: 'debug',
logger: {
debug: debugLoggerSpy,
info: noop,
Expand Down Expand Up @@ -168,6 +169,47 @@ describe('Integration tests with a WebSocket server', () => {
assert.isTrue(debugLoggerSpy.calledWith(sinon.match('Unable to parse an incoming WebSocket message')));
await client.disconnect();
});
it('should maintain one serial reconnection attempt if WSS server sends unexpected HTTP response during handshake, like a 409', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test testing for the exponential reconnection doubling spiral bug.

// override socket mode client instance with lower client ping timeout, which controls reconnection rate
client = new SocketModeClient({
appToken: 'whatever',
clientOptions: {
slackApiUrl: `http://localhost:${HTTP_PORT}/`,
},
clientPingTimeout: 20,
logLevel: 'debug',
});
// shut down the default mock WS server used in these tests as we will customize its behaviour in this test
wss.close();
wss = null;
// custom HTTP server that blows up during initial WS handshake
const badServer = createServer((_req, res) => {
res.writeHead(409, { 'content-type': 'application/json' });
res.end(
JSON.stringify({
message: 'Unexpected server response: 409',
}),
);
});
badServer.listen(WSS_PORT);
let closed = 0;
client.on('close', () => {
closed++;
});
// do not use await here, since `start()` won't return until the connection is established. we are intentionally testing the connection establishment failure, so that will never finish. so, let's run this in a rogue "thread"!
client.start();
await sleep(50);
// after 50ms, with a timeout of 20ms, we would expect 2 retries by this point.
assert.equal(closed, 2, 'unexpected number of times `close` event was raised during reconnection!');
await client.disconnect();
await new Promise((res, rej) => {
// shut down the bad server
badServer.close((err) => {
if (err) rej(err);
else res();
});
});
});
});
describe('lifecycle events', () => {
beforeEach(() => {
Expand Down
Loading