-
Notifications
You must be signed in to change notification settings - Fork 663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
socket-mode: Handling WS errors during handshake #2099
Changes from all commits
9132ac1
4b4dd72
6f918b0
2fecb13
1c035e3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -111,29 +111,28 @@ export class SlackWebSocket { | |
agent: this.options.httpAgent, | ||
}; | ||
|
||
const ws = new WebSocket(this.options.url, options); | ||
this.websocket = new WebSocket(this.options.url, options); | ||
|
||
ws.addEventListener('open', (_event) => { | ||
this.websocket.addEventListener('open', (_event) => { | ||
this.logger.debug('WebSocket open event received (connection established)!'); | ||
this.websocket = ws; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not tracking the current websocket instance was a problem in the case the web socket server returns an error. Since only once the That was the underlying issue that caused the "doubling up" of reconnection attempts (see "Issue 1" described by @gdhardy1 in this comment): each time an error was encountered during websocket handshake, there would be an 'untracked' If you remove the changes in this file, you will see the integration test I added at the end of this PR fail. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for finding this. I was able to patch this in and test it while we await a full fix. It stabilized things tremendously. |
||
this.monitorPingToSlack(); | ||
}); | ||
ws.addEventListener('error', (event) => { | ||
this.websocket.addEventListener('error', (event) => { | ||
this.logger.error(`WebSocket error occurred: ${event.message}`); | ||
this.disconnect(); | ||
this.options.client.emit('error', websocketErrorWithOriginal(event.error)); | ||
}); | ||
ws.on('message', (msg, isBinary) => { | ||
this.websocket.on('message', (msg, isBinary) => { | ||
this.options.client.emit('ws_message', msg, isBinary); | ||
}); | ||
ws.on('close', (code: number, data: Buffer) => { | ||
this.websocket.on('close', (code: number, data: Buffer) => { | ||
this.logger.debug(`WebSocket close frame received (code: ${code}, reason: ${data.toString()})`); | ||
this.closeFrameReceived = true; | ||
this.disconnect(); | ||
}); | ||
|
||
// Confirm WebSocket connection is still active | ||
ws.on('ping', (data) => { | ||
this.websocket.on('ping', (data) => { | ||
// Note that ws' `autoPong` option is true by default, so no need to respond to ping. | ||
// see https://github.com/websockets/ws/blob/2aa0405a5e96754b296fef6bd6ebdfb2f11967fc/doc/ws.md#new-websocketaddress-protocols-options | ||
if (this.options.pingPongLoggingEnabled) { | ||
|
@@ -142,7 +141,7 @@ export class SlackWebSocket { | |
this.monitorPingFromSlack(); | ||
}); | ||
|
||
ws.on('pong', (data) => { | ||
this.websocket.on('pong', (data) => { | ||
if (this.options.pingPongLoggingEnabled) { | ||
this.logger.debug(`WebSocket received pong from Slack server (data: ${data.toString()})`); | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -229,9 +229,13 @@ | |
this.logger.debug(`Before trying to reconnect, this client will wait for ${msBeforeRetry} milliseconds`); | ||
return new Promise((res, _rej) => { | ||
setTimeout(() => { | ||
this.logger.debug('Continuing with reconnect...'); | ||
this.emit(State.Reconnecting); | ||
cb.apply(this).then(res); | ||
if (this.shuttingDown) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. While not a problem brought up by users, I noticed a problem here when writing integration tests:
.. then the client will keep trying to reconnect forever via this part of the code. Adding this guard here just in case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
this.logger.debug('Client shutting down, will not attempt reconnect.'); | ||
} else { | ||
this.logger.debug('Continuing with reconnect...'); | ||
this.emit(State.Reconnecting); | ||
cb.apply(this).then(res); | ||
} | ||
}, msBeforeRetry); | ||
}); | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -61,10 +61,7 @@ function errorWithCode(error: Error, code: ErrorCode): CodedError { | |
* A factory to create SMWebsocketError objects. | ||
*/ | ||
export function websocketErrorWithOriginal(original: Error): SMWebsocketError { | ||
const error = errorWithCode( | ||
new Error(`Failed to send message on websocket: ${original.message}`), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removed "Failed to send message" as part of this error, as this error can be raised unrelated to message-sending. In this PR, I noticed that websocket handshake errors also used this code path, which made the error message misleading. |
||
ErrorCode.WebsocketError, | ||
) as Partial<SMWebsocketError>; | ||
const error = errorWithCode(new Error(original.message), ErrorCode.WebsocketError) as Partial<SMWebsocketError>; | ||
error.original = original; | ||
return error as SMWebsocketError; | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -42,9 +42,9 @@ describe('Integration tests with a WebSocket server', () => { | |
}); | ||
}); | ||
afterEach(async () => { | ||
server.close(); | ||
if (server) server.close(); | ||
server = null; | ||
wss.close(); | ||
if (wss) wss.close(); | ||
wss = null; | ||
exposed_ws_connection = null; | ||
if (client) { | ||
|
@@ -90,7 +90,7 @@ describe('Integration tests with a WebSocket server', () => { | |
await client.disconnect(); | ||
}); | ||
}); | ||
describe('catastrophic server behaviour', () => { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🔥! |
||
describe('`apps.connections.open` API failure modes', () => { | ||
beforeEach(() => { | ||
client = new SocketModeClient({ | ||
appToken: 'whatever', | ||
|
@@ -130,7 +130,7 @@ describe('Integration tests with a WebSocket server', () => { | |
await client.disconnect(); | ||
}); | ||
}); | ||
describe('failure modes / unexpected messages sent to client', () => { | ||
describe('unexpected socket messages sent to client', () => { | ||
const debugLoggerSpy = sinon.stub(); // add the following to expose further logging: .callsFake(console.log); | ||
const noop = () => {}; | ||
beforeEach(() => { | ||
|
@@ -139,6 +139,7 @@ describe('Integration tests with a WebSocket server', () => { | |
clientOptions: { | ||
slackApiUrl: `http://localhost:${HTTP_PORT}/`, | ||
}, | ||
logLevel: 'debug', | ||
logger: { | ||
debug: debugLoggerSpy, | ||
info: noop, | ||
|
@@ -168,6 +169,52 @@ describe('Integration tests with a WebSocket server', () => { | |
assert.isTrue(debugLoggerSpy.calledWith(sinon.match('Unable to parse an incoming WebSocket message'))); | ||
await client.disconnect(); | ||
}); | ||
it('should maintain one serial reconnection attempt if WSS server sends unexpected HTTP response during handshake, like a 409', async () => { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the test testing for the exponential reconnection doubling spiral bug. |
||
// test for https://github.com/slackapi/node-slack-sdk/issues/2094 | ||
// override socket mode client instance with lower client ping timeout, which controls reconnection rate | ||
client = new SocketModeClient({ | ||
appToken: 'whatever', | ||
clientOptions: { | ||
slackApiUrl: `http://localhost:${HTTP_PORT}/`, | ||
}, | ||
clientPingTimeout: 20, // controls reconnection rate | ||
logLevel: 'debug', | ||
}); | ||
// shut down the default mock WS server used in these tests as we will customize its behaviour in this test | ||
wss.close(); | ||
wss = null; | ||
// custom HTTP server that blows up during initial WS handshake | ||
const badServer = createServer((_req, res) => { | ||
res.writeHead(409, { 'content-type': 'application/json' }); | ||
res.end( | ||
JSON.stringify({ | ||
message: 'Unexpected server response: 409', | ||
}), | ||
); | ||
}); | ||
badServer.listen(WSS_PORT); | ||
let closed = 0; | ||
// the `close` event is raised every time the websocket server returns an error, so let's keep track of how often this event is emited and use that to infer correct reconnection attempt counts / behaviour | ||
client.on('close', () => { | ||
closed++; | ||
}); | ||
// do not use await here, since `start()` won't return until the connection is established. we are intentionally testing connection establishment failure, so that will never finish. so, let's run this in a rogue "thread", e.g. fire off an async method and let it do its thing! | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is awesome 👏 |
||
client.start(); | ||
await sleep(50); | ||
// after 50ms, with a timeout of 20ms, we would expect 2 retries. | ||
// crucially, the bug reported in https://github.com/slackapi/node-slack-sdk/issues/2094 shows that on every reconnection attempt, we spawn _another_ websocket instance, which attempts to reconnect forever and is never cleaned up. | ||
// effectively: with each reconnection attempt, we double the number of websockets, eventually causing crashes / out-of-memory issues / rate-limiting from Slack APIs. | ||
// with the bug not fixed, this assertion fails as `close` event was emitted 4 times! if we waited another 20ms, we would see this event count double again (8), and so on. | ||
assert.equal(closed, 2, 'unexpected number of times `close` event was raised during reconnection!'); | ||
await client.disconnect(); | ||
await new Promise((res, rej) => { | ||
// shut down the bad server | ||
badServer.close((err) => { | ||
if (err) rej(err); | ||
else res(); | ||
}); | ||
}); | ||
}); | ||
}); | ||
describe('lifecycle events', () => { | ||
beforeEach(() => { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superb test setups all throughout 👏 This adds so much confidence around the changes.
With some of the retry timing we might have to lookout for flakiness of tests due to the CI runners - nothing a rerun might not solve, but with the matrix setups I think our chances of this happening increase eversomuch 👾