fix(core): Consider timeout in shutdown an error #8050

Merged · 1 commit · Dec 18, 2023
17 changes: 17 additions & 0 deletions packages/cli/src/commands/BaseCommand.ts
@@ -40,6 +40,12 @@ export abstract class BaseCommand extends Command {

protected isShuttingDown = false;

/**
* How long to wait for graceful shutdown before force killing the process.
* Subclasses can override this value.
*/
protected gracefulShutdownTimeoutInS: number = 30;

async init(): Promise<void> {
await initErrorHandling();
initExpressionEvaluator();
@@ -309,9 +315,20 @@ export abstract class BaseCommand extends Command {
return;
}

const forceShutdownTimer = setTimeout(async () => {
// In case that something goes wrong with shutdown we
// kill after timeout no matter what
console.log(`process exited after ${this.gracefulShutdownTimeoutInS}s`);
const errorMsg = `Shutdown timed out after ${this.gracefulShutdownTimeoutInS} seconds`;
await this.exitWithCrash(errorMsg, new Error(errorMsg));
}, this.gracefulShutdownTimeoutInS * 1000);

this.logger.info(`Received ${signal}. Shutting down...`);
this.isShuttingDown = true;

await this.stopProcess();

clearTimeout(forceShutdownTimer);
};
}
}
7 changes: 0 additions & 7 deletions packages/cli/src/commands/start.ts
@@ -103,13 +103,6 @@ export class Start extends BaseCommand {

await this.externalHooks.run('n8n.stop', []);

setTimeout(async () => {
// In case that something goes wrong with shutdown we
// kill after max. 30 seconds no matter what
console.log('process exited after 30s');
await this.exitSuccessFully();
}, 30000);

// Shut down License manager to unclaim any floating entitlements
// Note: While this saves a new license cert to DB, the previous entitlements are still kept in memory so that the shutdown process can complete
await Container.get(License).shutdown();
6 changes: 0 additions & 6 deletions packages/cli/src/commands/webhook.ts
@@ -41,12 +41,6 @@ export class Webhook extends BaseCommand {
try {
await this.externalHooks.run('n8n.stop', []);

setTimeout(async () => {
// In case that something goes wrong with shutdown we
// kill after max. 30 seconds no matter what
await this.exitSuccessFully();
}, 30000);

// Wait for active workflow executions to finish
const activeExecutionsInstance = Container.get(ActiveExecutions);
let executingWorkflows = activeExecutionsInstance.getActiveExecutions();
13 changes: 3 additions & 10 deletions packages/cli/src/commands/worker.ts
@@ -81,21 +81,13 @@ export class Worker extends BaseCommand {
try {
await this.externalHooks.run('n8n.stop', []);

const maxStopTime = config.getEnv('queue.bull.gracefulShutdownTimeout') * 1000;

const stopTime = new Date().getTime() + maxStopTime;

setTimeout(async () => {
// In case that something goes wrong with shutdown we
// kill after max. 30 seconds no matter what
await this.exitSuccessFully();
}, maxStopTime);
const hardStopTime = Date.now() + this.gracefulShutdownTimeoutInS;

// Wait for active workflow executions to finish
let count = 0;
while (Object.keys(Worker.runningJobs).length !== 0) {
if (count++ % 4 === 0) {
const waitLeft = Math.ceil((stopTime - new Date().getTime()) / 1000);
const waitLeft = Math.ceil((hardStopTime - Date.now()) / 1000);
this.logger.info(
`Waiting for ${
Object.keys(Worker.runningJobs).length
@@ -272,6 +264,7 @@
}

async init() {
this.gracefulShutdownTimeoutInS = config.getEnv('queue.bull.gracefulShutdownTimeout');
await this.initCrashJournal();

this.logger.debug('Starting n8n worker...');
2 changes: 1 addition & 1 deletion packages/cli/src/config/schema.ts
@@ -439,7 +439,7 @@ export const schema = {
env: 'QUEUE_RECOVERY_INTERVAL',
},
gracefulShutdownTimeout: {
doc: 'How long should n8n wait for running executions before exiting worker process',

Member:
Should we move this to a generic config variable instead of having a worker-specific one? It would be a breaking change, but we can add it to BREAKING-CHANGES.md. Since it will affect a very small minority of users, I think it should be okay.
@krynble WDYT?


Contributor:
I was thinking whether it would make sense to have different timeouts by process type. Say 10 seconds for webhooks, 1 minute for mains, and 2 minutes for workers, depending on the use case. Is that overkill?

If we use this approach, we could have one "global default" value of 30 seconds, but the timers could be set separately.

Do we have any specific goals we're trying to achieve with this change, other than turning this exit into a crash instead of a success? I'm fine with moving to a single global timeout as long as we continue supporting the old value for some time until users have a chance to change it, while sending a warning to the console.

I think relying solely on the breaking changes log isn't ideal.
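
(For illustration: with the field this PR adds to BaseCommand, a per-process default would just be an override in the relevant command. A minimal sketch, with a hypothetical 10-second value for webhooks — this is not something the PR changes:)

```ts
import { BaseCommand } from './BaseCommand';

// Hypothetical per-process override (not part of this PR): webhook
// processes would get a shorter shutdown window than the 30s default
// inherited from BaseCommand.
export class Webhook extends BaseCommand {
	protected gracefulShutdownTimeoutInS = 10;
	// ...rest of the command unchanged
}
```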


Contributor Author @tomi (Dec 18, 2023):
There are a couple of reasons behind this change. The first is that, conceptually, something timing out is an error. The second is that on a successful exit we close down the DB connection gracefully. On an exit timeout we'd rather not do that, since it would wait for any active connections to close and could possibly block the exit. I updated the PR description with better reasoning.

IMO it would make sense to have a more generic config for this. I don't think we have to create separate vars for separate process types, as they are separate deployments anyway and have different env vars. You can then define a different exit timeout value for each separately.

I can create a separate PR for adding a more generic config variable.
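
(For readers outside the thread, a rough sketch of the two exit paths being contrasted here — simplified, assumed implementations, not the actual BaseCommand code:)

```ts
// Simplified, assumed shapes of the two exit paths discussed above.
// The real methods live in BaseCommand and differ in detail.

async function exitSuccessFully(db: { close(): Promise<void> }): Promise<void> {
	// Graceful path: wait for the DB connection to close before exiting.
	await db.close();
	process.exit(0);
}

async function exitWithCrash(message: string, error: Error): Promise<void> {
	// Timeout/crash path: report the error and exit immediately, without
	// waiting on connections that might block the shutdown.
	console.error(message, error);
	process.exit(1);
}
```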


Contributor:
Understood, that makes sense. Thanks for clarifying.

About the generic variable, I strongly suggest we continue supporting the existing name to avoid breaking existing deployments, while warning users that the current environment variable has been deprecated and should simply be renamed, similar to how we handled deprecating MySQL.
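
(A minimal sketch of what that backwards-compatible fallback could look like — the generic `N8N_GRACEFUL_SHUTDOWN_TIMEOUT` name is hypothetical here and is not added by this PR:)

```ts
// Hypothetical follow-up: prefer a generic variable while still honoring
// the old worker-specific one, with a deprecation warning.
function resolveShutdownTimeoutInS(): number {
	const generic = process.env.N8N_GRACEFUL_SHUTDOWN_TIMEOUT;
	const legacy = process.env.QUEUE_WORKER_TIMEOUT;

	if (generic !== undefined) return Number(generic);

	if (legacy !== undefined) {
		console.warn(
			'QUEUE_WORKER_TIMEOUT is deprecated; please rename it to N8N_GRACEFUL_SHUTDOWN_TIMEOUT',
		);
		return Number(legacy);
	}

	return 30; // default, in seconds
}
```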

doc: 'How long should n8n wait for running executions before exiting worker process (seconds)',
format: Number,
default: 30,
env: 'QUEUE_WORKER_TIMEOUT',