pm2 reload downtime in cluster mode #3143

Open · brendonboshell opened this issue Sep 10, 2017 · 17 comments

@brendonboshell commented Sep 10, 2017

Please see the issue I reported previously, "pm2 reload causing connection timeout and downtime": using pm2 reload causes connection errors and downtime when running a process in cluster mode. Since I have been stuck running 1.0.1 in production for some time, I was motivated to track down the cause of this issue. It still appears in the latest version, 2.6.1.

Using git bisect, I have tracked this issue down to commit d0a3f49, "(god)(stopProcessId) refactor: now it only kill process without disconnecting in cluster mode".

To reproduce

  1. Create a basic server, server.js:

var http = require("http"),
    app = require("express")();

// respond 404 to every request (Express 4 deprecates res.send(404)
// in favour of res.sendStatus(404), but the behaviour is the same)
app.use("/", function (req, res) {
  return res.send(404);
});

var server = http.createServer(app);
server.listen(4000, function () {
});
  2. Run ./bin/pm2 --no-daemon on master/2.6.1.

  3. Run ../../pm2/bin/pm2 start server.js -i 2 --name api

  4. Run ab -n 100000 -c 1 http://127.0.0.1:4000/v1/

  5. While ab is running, run pm2 reload api twenty times, two seconds apart (the original chained the commands into a single && one-liner):

for i in $(seq 1 20); do ../../pm2/bin/pm2 reload api && sleep 2; done

  6. Observe the following output:

Benchmarking 127.0.0.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 1407 requests completed
  7. Run git revert d0a3f49 and observe that ab completes without error.

I have also reproduced this issue with a small script:

var Promise = require("bluebird"),
    request = require("request");

// poll the server every 5 ms; log each response and exit on the
// first error or non-404 status (i.e. the first dropped request)
setInterval(function () {
  var startDate = new Date();
  Promise.promisify(request)({
    uri: "http://127.0.0.1:4000/v1/",
    timeout: 10000,
    forever: false
  }).then(function (res) {
    console.log(new Date(), res.statusCode, (new Date() - startDate));

    if (res.statusCode !== 404) {
      process.exit(0);
    }
  }).catch(function (err) {
    console.log(new Date(), err, (new Date() - startDate));
    process.exit(0);
  });
}, 5);

Solution

It appears that reverting d0a3f49 solves the issue, but I am not sure what the motivation for that change was. I have been running 2.6.1 in production for about a week and, since I regularly use pm2 reload, I have noticed a number of connection errors in my nginx logs. I suspect this issue is related to the above.

Update (10 Sep)

I downgraded to PM2 2.1.1 overnight on my production machine. See this chart from my status server for the past 24 hours: I use pm2 reload every hour, and you can clearly see frequent downtime while running PM2 2.6.1 and none while running PM2 2.1.1.

[Chart: PM2 2.1.1 vs PM2 2.6.1]

@vmarchaud (Contributor) commented Sep 10, 2017

Could you use this snippet (from the docs):

var http = require("http"),
    app = require("express")();

app.use("/", function (req, res) {
  return res.send(404);
});

var server = http.createServer(app);
server.listen(4000, function () {
  // tell PM2 the app is ready to accept connections (used with --wait-ready)
  process.send('ready');
});

// on reload, PM2 sends SIGINT: stop accepting new connections,
// finish in-flight requests, then exit
process.on('SIGINT', function() {
  server.close(function(err) {
    process.exit(err ? 1 : 0);
  });
});

And then start it with pm2 start server.js -i 2 --name api --wait-ready
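For reference, the same behaviour can be configured in an ecosystem file (a sketch based on PM2's documented graceful start/shutdown options; the timeout values here are illustrative, not from this thread):

// ecosystem.config.js (sketch; wait_ready, listen_timeout and kill_timeout are documented PM2 options)
module.exports = {
  apps: [{
    name: 'api',
    script: 'server.js',
    exec_mode: 'cluster',
    instances: 2,
    wait_ready: true,       // wait for process.send('ready') instead of assuming the app is up
    listen_timeout: 10000,  // ms to wait for the ready signal before giving up
    kill_timeout: 5000      // ms between SIGINT and SIGKILL when stopping the old instance
  }]
};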

@brendonboshell (Author)

I have tried with the above snippet and experience the same issue.

@laurentdebricon commented Nov 6, 2017

Hello, I have the same bug. It used to work, but now with node v8.5.0 and pm2 2.7.2, graceful reload is not doing its job: my processes are reloaded at the same time.

// simulate a slow startup: signal ready only after 5 seconds
setTimeout(function() {
	console.log('ready ' + new Date());
	process.send('ready');
}, 5000);

// log SIGINT but never exit, so PM2 has to wait for its kill timeout
process.on('SIGINT', function() {
	console.log('i quit ' + new Date());
});

pm2 start test.js -i 2 --wait-ready --listen-timeout 15000

I downgraded to pm2@2.1.1. It works if i > 2; with i = 2, both instances are rebooted at the same time.

@guyellis commented Nov 11, 2017

I was looking at some of the code in the commit that @brendonboshell referenced, and noticed that the cb(...) call in this snippet:

if (!proc.process.pid) {
  console.error('app=%s id=%d does not have a pid', proc.pm2_env.name, proc.pm2_env.pm_id);
  proc.pm2_env.status = cst.STOPPED_STATUS;
  return cb(null, { error : true, message : 'could not kill process w/o pid'});
}

passes what looks like an error object as the second parameter. Is that correct? We usually pass the error object as the first parameter of a callback, and that pattern is followed elsewhere in this file. But it might be that this isn't a "true" error condition, which is why it's signaled as a property of the result and passed in the second parameter.
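For illustration (my sketch, not PM2 code), the difference between the two conventions:

// Node's usual error-first convention: the failure goes in the first
// argument, so any caller that checks `if (err)` will see it
cb(new Error('could not kill process w/o pid'));

// What the snippet above does instead: a null error plus a result object
// carrying an `error` flag, which a caller will treat as success unless
// it also inspects the result
cb(null, { error: true, message: 'could not kill process w/o pid' });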

@guyellis

I can't get version 2.1.1 to reload with zero downtime in my environment using the same approach that @brendonboshell has used with ab.

@guyellis

@laurentdebricon on ver 2.1.1, I just added an extra process with pm2 scale myapp +1 and then did a gracefulReload; it looks like the first two are restarted at the same time and the 3rd later. I still can't get Apache's ab to hammer it without lost connections. You can see the first two started 53s ago and the 3rd 31s ago.

$ pm2 gracefulReload myapp
Restarts are now immutable, to update environment or conf use --update-env
[PM2] Applying action softReloadProcessId on app [myapp](ids: 0,1,2)
[PM2] [myapp](0) ✓
[PM2] [myapp](1) ✓
[PM2] [myapp](2) ✓
┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬─────┬────────────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu │ mem        │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼─────┼────────────┼──────────┤
│ myapp    │ 0  │ cluster │ 27844 │ online │ 3       │ 53s    │ 0%  │ 145.0 MB   │ disabled │
│ myapp    │ 1  │ cluster │ 27850 │ online │ 3       │ 53s    │ 0%  │ 143.3 MB   │ disabled │
│ myapp    │ 2  │ cluster │ 27973 │ online │ 2       │ 31s    │ 0%  │ 144.2 MB   │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴─────┴────────────┴──────────┘

@guyellis

It just seems to be those first two. If I scale out to 10 instances and reload, the first two are reloaded at the same time and the rest are staggered.

┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬─────┬────────────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu │ mem        │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼─────┼────────────┼──────────┤
│ myapp    │ 0  │ cluster │ 14228 │ online │ 5       │ 103s   │ 0%  │ 101.2 MB   │ disabled │
│ myapp    │ 1  │ cluster │ 14234 │ online │ 5       │ 103s   │ 0%  │ 99.2 MB    │ disabled │
│ myapp    │ 2  │ cluster │ 14315 │ online │ 3       │ 95s    │ 0%  │ 97.8 MB    │ disabled │
│ myapp    │ 3  │ cluster │ 14284 │ online │ 1       │ 99s    │ 0%  │ 99.6 MB    │ disabled │
│ myapp    │ 4  │ cluster │ 14343 │ online │ 2       │ 94s    │ 0%  │ 98.9 MB    │ disabled │
│ myapp    │ 5  │ cluster │ 14552 │ online │ 2       │ 75s    │ 0%  │ 99.8 MB    │ disabled │
│ myapp    │ 6  │ cluster │ 14598 │ online │ 2       │ 72s    │ 0%  │ 106.3 MB   │ disabled │
│ myapp    │ 7  │ cluster │ 14679 │ online │ 1       │ 61s    │ 0%  │ 142.2 MB   │ disabled │
│ myapp    │ 8  │ cluster │ 14716 │ online │ 1       │ 59s    │ 0%  │ 142.4 MB   │ disabled │
│ myapp    │ 9  │ cluster │ 14929 │ online │ 1       │ 41s    │ 0%  │ 141.9 MB   │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴─────┴────────────┴──────────┘

@djbobbydrake

Bump

@wirtsi commented Sep 19, 2018

Can confirm: 3.1.2 loses requests on reload; switching to 2.1.1 fixes the issue.

@curtisbelt commented Nov 29, 2018

@brendonboshell @laurentdebricon

I am running 3.2.2 and have fixed this problem for myself by using pm2 reload appname --parallel 1. The issue is that pm2 performs the reload concurrently in pairs, which is bad for anyone running only 2 instances. I created a suggestion issue to adjust how that works (#4047); I hope this solves your problems as well!
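Applied to the repro app from earlier in this thread, that would be, e.g.:

pm2 reload api --parallel 1   # reload one instance at a time instead of the default two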

@curtisbelt

When using an ecosystem.json file, I found that --parallel no longer works, and it is not configurable in the JSON file either. I was able to work around this with the PM2_CONCURRENT_ACTIONS environment variable, like so:

#!/bin/bash
cd /path/to/appname
PM2_CONCURRENT_ACTIONS=1 pm2 reload ecosystem.json

Source: pm2/constants.js, lines 88 to 93 at 9178610:

// Concurrent actions when doing start/restart/reload
CONCURRENT_ACTIONS : (function() {
  var concurrent_actions = parseInt(process.env.PM2_CONCURRENT_ACTIONS) || 2;
  debug('Using %d parallelism (CONCURRENT_ACTIONS)', concurrent_actions);
  return concurrent_actions;
})(),

@harshmandan

I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file. PM2 is sitting behind an NGINX reverse proxy.

Most of the time when I run pm2 reload myproc, a hit on the API returns 502; only occasionally does it reload gracefully and without any downtime. What should I change?

@jeffreytkj

> I'm still experiencing downtime when using pm2 reload {processname}. […]

I am facing the same issue as you. Do you have any solution for it?

@ibraah88

> I'm still experiencing downtime when using pm2 reload {processname}. […]

I am facing the same issue. Any solution?

@nursultan156

> I'm still experiencing downtime when using pm2 reload {processname}. […]

+1, same issue: pm2 5.1.2 in cluster mode, a Node.js app with a JSON config in the app root.

@strokirk

Can confirm: even the "graceful restart" that is supposed to happen via max_memory_restart doesn't work well in v5.3.0. Downgrading to v2.1 removes over 80% of the "connection closed before message completed" errors I otherwise see.
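For context (my illustration, not part of the comment above), max_memory_restart is set in an ecosystem file like this:

// restart the process once its memory exceeds the threshold (value illustrative)
max_memory_restart: '300M'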

@hiepnsx commented Feb 28, 2024

+1, same issue with pm2 (5.3.0), Next.js and NGINX.
