Skip to content

Commit

Permalink
Handle supervisor exits more gracefully (#5062)
Browse files Browse the repository at this point in the history
This change causes Positron to handle an abrupt or unexpected exit of
the supervisor process more gracefully. Before the change, the IDE
basically became unresponsive. After the change, the situation is still
Not Great, but it's recoverable; it's about the same as having all your
interpreters crash at once. You'll see some error popups, but you can
get back to a working state.

Because Positron has open websockets for all the active sessions, the
best clue that the supervisor process has been terminated is that these
sockets mysteriously all disconnect. So the main approach here is to
fire disconnect events from the sessions; if we get one of these for a
session that hasn't exited, we check the process table to see what's
going on.

There are a couple of significant caveats:

- This approach only works when the supervisor process is local. We will
probably want to implement a mechanism that works for remote supervisors
too, if we add support for those.
- This approach won't help if the supervisor process is alive but
unresponsive. Though that state hasn't occurred in nature so far, we
will eventually want to guard against it.

Addresses #5037.

### QA Notes

We don't expect supervisor exits to happen to anyone during the normal
course of events, and this change does not attempt to make a supervisor
exit a pleasant experience.

This change also has a drive-by fix that addresses an issue running
notebook kernels under the supervisor.

---------

Signed-off-by: Jonathan <jonathan@posit.co>
Co-authored-by: sharon <sharon-wang@users.noreply.github.com>
  • Loading branch information
jmcphers and sharon-wang authored Oct 21, 2024
1 parent 3054948 commit f0a0049
Show file tree
Hide file tree
Showing 2 changed files with 155 additions and 9 deletions.
121 changes: 116 additions & 5 deletions extensions/kallichore-adapter/src/KallichoreAdapterApi.ts
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ export class KCApi implements KallichoreAdapterApi {

/** A barrier that opens when the Kallichore server has successfully started;
* used to hold operations until we're online */
private readonly _started: Barrier = new Barrier();
private _started: Barrier = new Barrier();

/**
* If we're currently starting, this is the promise that resolves when the
Expand Down Expand Up @@ -211,9 +211,10 @@ export class KCApi implements KallichoreAdapterApi {
this._log.info(`Kallichore ${status.body.version} server online with ${status.body.sessions} sessions`);
break;
} catch (err) {
// ECONNREFUSED is a normal condition; the server isn't ready
// yet. Keep trying until we hit the retry limit, about 2
// seconds from the time we got a process ID established.
// ECONNREFUSED is a normal condition during startup; the server
// isn't ready yet. Keep trying until we hit the retry limit,
// about 2 seconds from the time we got a process ID
// established.
if (err.code === 'ECONNREFUSED') {
if (retry < 19) {
// Wait a bit and try again
Expand Down Expand Up @@ -308,14 +309,119 @@ export class KCApi implements KallichoreAdapterApi {
this._log.info(`Creating session: ${JSON.stringify(sessionMetadata)}`);

// Create the session on the server
await session.create(kernel);
try {
await session.create(kernel);
} catch (err) {
// If the connection was refused, check the server status; this
// suggests that the server may have exited
if (err.code === 'ECONNREFUSED') {
this._log.warn(`Connection refused while attempting to create session; checking server status`);
await this.testServerExited();
}

// Rethrow the error for the caller to handle
throw err;
}

// Save the session now that it has been created on the server
this.addDisconnectHandler(session);
this._sessions.push(session);

return session;
}

/**
* Adds a disconnect handler to a session that will check the server status
* when the session's websocket disconnects.
*
* @param session The session to add the disconnect handler to
*/
private addDisconnectHandler(session: KallichoreSession) {
session.disconnected.event(async (state: positron.RuntimeState) => {
if (state !== positron.RuntimeState.Exited) {
// The websocket disconnected while the session was still
// running. This could signal a problem with the supervisor; we
// should see if it's still running.
this._log.info(`Session '${session.metadata.sessionName}' disconnected while in state '${state}'. This is unexpected; checking server status.`);
await this.testServerExited();
}
});
}

/**
* Tests the server after a session disconnects, or after an RPC fails with
* ECONNREFUSED, to see if it is still running. If it isn't, marks all
* sessions as exited and restarts the server.
*
* Consider: This only tests the server's local process ID, not the server
* itself. We can't use this technique on a remote server, and it doesn't
* handle the case where the server process is running but it's become
* unresponsive.
*
* @returns A promise that resolves when the server has been confirmed to be
* running or has been restarted.
*/
private async testServerExited() {
// If we're currently starting, it doesn't make sense to test the
// server status since we're already in the process of starting it.
if (this._starting) {
return this._starting.promise;
}

// Load the server state so we can check the process ID
const serverState =
this._context.workspaceState.get<KallichoreServerState>(KALLICHORE_STATE_KEY);

// If there's no server state, return as we can't check its status
if (!serverState) {
this._log.warn(`No Kallichore server state found; cannot test server process`);
return;
}

// Test the process ID to see if the server is still running.
// If we have no process ID, we can't check the server status, so we
// presume it's running to be safe.
let serverRunning = true;
if (serverState.server_pid) {
try {
process.kill(serverState.server_pid, 0);
this._log.info(`Kallichore server PID ${serverState.server_pid} is still running`);
} catch (err) {
this._log.warn(`Kallichore server PID ${serverState.server_pid} is not running`);
serverRunning = false;
}
}

// The server is still running; nothing to do
if (serverRunning) {
return;
}

// Clean up the state so we don't try to reconnect to a server that
// isn't running.
this._context.workspaceState.update(KALLICHORE_STATE_KEY, undefined);

// We need to mark all sessions as exited since (at least right now)
// they cannot live without the supervisor.
for (const session of this._sessions) {
session.markExited(1, positron.RuntimeExitReason.Error);
}

// Reset the start barrier and start the server again.
this._started = new Barrier();
try {
// Start the server again
await this.ensureStarted();

vscode.window.showInformationMessage(
vscode.l10n.t('The process supervising the interpreters has exited unexpectedly and was automatically restarted. You may need to start your interpreter again.'));

} catch (err) {
vscode.window.showInformationMessage(
vscode.l10n.t('The process supervising the interpreters has exited unexpectedly and could not automatically restarted: ' + err));
}
}

/**
* Restores (reconnects to) an already running session on the Kallichore
* server.
Expand Down Expand Up @@ -353,9 +459,14 @@ export class KCApi implements KallichoreAdapterApi {
session.restore(kcSession);
} catch (err) {
this._log.error(`Failed to restore session ${sessionMetadata.sessionId}: ${JSON.stringify(err)}`);
if (err.code === 'ECONNREFUSED') {
this._log.warn(`Connection refused while attempting to restore session; checking server status`);
await this.testServerExited();
}
reject(err);
}
// Save the session
this.addDisconnectHandler(session);
this._sessions.push(session);
resolve(session);
}).catch((err) => {
Expand Down
43 changes: 39 additions & 4 deletions extensions/kallichore-adapter/src/KallichoreSession.ts
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,9 @@ export class KallichoreSession implements JupyterLanguageRuntimeSession {
/** Emitter for runtime exit events */
private readonly _exit: vscode.EventEmitter<positron.LanguageRuntimeExit>;

/** Emitter for disconnection events */
readonly disconnected: vscode.EventEmitter<positron.RuntimeState>;

/** Barrier: opens when the session has been established on Kallichore */
private readonly _established: Barrier = new Barrier();

Expand Down Expand Up @@ -130,6 +133,12 @@ export class KallichoreSession implements JupyterLanguageRuntimeSession {
// Create event emitters
this._state = new vscode.EventEmitter<positron.RuntimeState>();
this._exit = new vscode.EventEmitter<positron.LanguageRuntimeExit>();
this.disconnected = new vscode.EventEmitter<positron.RuntimeState>();

// Ensure the emitters are disposed when the session is disposed
this._disposables.push(this._state);
this._disposables.push(this._exit);
this._disposables.push(this.disconnected);

this.onDidReceiveRuntimeMessage = this._messages.event;

Expand Down Expand Up @@ -168,10 +177,15 @@ export class KallichoreSession implements JupyterLanguageRuntimeSession {
// otherwise the home directory
let workingDir = vscode.workspace.workspaceFolders?.[0].uri.fsPath || os.homedir();

// If we have a notebook URI, use its directory as the working directory
// instead
// If we have a notebook URI, use its parent directory as the working
// directory instead. Note that not all notebooks have valid on-disk
// URIs since they may be transient or not yet saved; for these, we fall
// back to the workspace root or home directory.
if (this.metadata.notebookUri?.fsPath) {
workingDir = this.metadata.notebookUri.fsPath;
const notebookPath = this.metadata.notebookUri.fsPath;
if (fs.existsSync(notebookPath)) {
workingDir = path.dirname(notebookPath);
}
}

// Form the command-line arguments to the kernel process
Expand Down Expand Up @@ -810,7 +824,9 @@ export class KallichoreSession implements JupyterLanguageRuntimeSession {
}
};

this._socket.ws.onclose = (_evt: any) => {
this._socket.ws.onclose = (evt: any) => {
this.log(`Websocket closed with kernel in status ${this._runtimeState}: ${JSON.stringify(evt)}`, vscode.LogLevel.Info);
this.disconnected.fire(this._runtimeState);
// When the socket is closed, reset the connected barrier and
// clear the websocket instance.
this._connected = new Barrier();
Expand Down Expand Up @@ -961,6 +977,13 @@ export class KallichoreSession implements JupyterLanguageRuntimeSession {
}
}

/**
* Gets the current runtime state of the kernel.
*/
get runtimeState(): positron.RuntimeState {
return this._runtimeState;
}

private onStateChange(newState: positron.RuntimeState) {
// If the kernel is ready, open the ready barrier
if (newState === positron.RuntimeState.Ready) {
Expand Down Expand Up @@ -988,6 +1011,18 @@ export class KallichoreSession implements JupyterLanguageRuntimeSession {
this._state.fire(newState);
}

/**
* Marks the kernel as exited.
*
* @param exitCode The exit code
* @param reason The reason for the exit
*/
markExited(exitCode: number, reason: positron.RuntimeExitReason) {
this._exitReason = reason;
this.onStateChange(positron.RuntimeState.Exited);
this.onExited(exitCode);
}

private onExited(exitCode: number) {
if (this._restarting) {
// If we're restarting, wait for the kernel to start up again
Expand Down

0 comments on commit f0a0049

Please sign in to comment.