pm2 reload downtime in cluster mode #3143

Open · brendonboshell opened this issue Sep 10, 2017 · 17 comments

@brendonboshell commented Sep 10, 2017

Please see the issue I reported previously, "pm2 reload causing connection timeout and downtime": using pm2 reload causes connection errors and downtime when running a process in cluster mode. Since I have been stuck running 1.0.1 in production for some time, I was motivated to track down the cause of this issue. It still appears in the latest version, 2.6.1.

Using git bisect, I have tracked this issue down to commit d0a3f49, "(god)(stopProcessId) refactor: now it only kill process without disconnecting in cluster mode".

To reproduce

  1. Create a basic server, server.js:

var http = require("http"),
    app = require("express")();

// respond 404 to every request (Express 4 deprecates res.send(404)
// in favour of res.sendStatus(404), but the behaviour is the same)
app.use("/", function (req, res) {
  return res.send(404);
});

var server = http.createServer(app);
server.listen(4000, function () {
});
  2. Run ./bin/pm2 --no-daemon on master/2.6.1.

  3. Run ../../pm2/bin/pm2 start server.js -i 2 --name api

  4. Run ab -n 100000 -c 1 http://127.0.0.1:4000/v1/

  5. While ab is running, run pm2 reload api twenty times, two seconds apart (the original chained the commands into a single && one-liner):

for i in $(seq 1 20); do ../../pm2/bin/pm2 reload api && sleep 2; done

  6. Observe the following output:

Benchmarking 127.0.0.1 (be patient)
apr_pollset_poll: The timeout specified has expired (70007)
Total of 1407 requests completed
  7. Run git revert d0a3f49 and observe that ab completes without error.

I have also reproduced this issue with a small script:

var Promise = require("bluebird"),
    request = require("request");

// poll the server every 5 ms; log each response and exit on the
// first error or non-404 status (i.e. the first dropped request)
setInterval(function () {
  var startDate = new Date();
  Promise.promisify(request)({
    uri: "http://127.0.0.1:4000/v1/",
    timeout: 10000,
    forever: false
  }).then(function (res) {
    console.log(new Date(), res.statusCode, (new Date() - startDate));

    if (res.statusCode !== 404) {
      process.exit(0);
    }
  }).catch(function (err) {
    console.log(new Date(), err, (new Date() - startDate));
    process.exit(0);
  });
}, 5);

Solution

It appears that reverting d0a3f49 solves the issue, but I am not sure what the motivation for that change was. I have been running 2.6.1 in production for about a week and, since I regularly use pm2 reload, I have noticed a number of connection errors in my nginx logs. I suspect this issue is related to the above.

Update (10 Sep)

I downgraded to PM2 2.1.1 overnight on my production machine. See this chart from my status server for the past 24 hours: I use pm2 reload every hour, and you can clearly see frequent downtime while running PM2 2.6.1 and none while running PM2 2.1.1.

[Chart: PM2 2.1.1 vs PM2 2.6.1]

@vmarchaud (Contributor) commented Sep 10, 2017

Could you use this snippet (from the docs):

var http = require("http"),
    app = require("express")();

app.use("/", function (req, res) {
  return res.send(404);
});

var server = http.createServer(app);
server.listen(4000, function () {
  // tell PM2 the app is ready to accept connections (used with --wait-ready)
  process.send('ready');
});

// on reload, PM2 sends SIGINT: stop accepting new connections,
// finish in-flight requests, then exit
process.on('SIGINT', function() {
  server.close(function(err) {
    process.exit(err ? 1 : 0);
  });
});

And then start it with pm2 start server.js -i 2 --name api --wait-ready
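For reference, the same behaviour can be configured in an ecosystem file (a sketch based on PM2's documented graceful start/shutdown options; the timeout values here are illustrative, not from this thread):

// ecosystem.config.js (sketch; wait_ready, listen_timeout and kill_timeout are documented PM2 options)
module.exports = {
  apps: [{
    name: 'api',
    script: 'server.js',
    exec_mode: 'cluster',
    instances: 2,
    wait_ready: true,       // wait for process.send('ready') instead of assuming the app is up
    listen_timeout: 10000,  // ms to wait for the ready signal before giving up
    kill_timeout: 5000      // ms between SIGINT and SIGKILL when stopping the old instance
  }]
};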

@brendonboshell (Author)

I have tried with the above snippet and experience the same issue.

@laurentdebricon commented Nov 6, 2017

Hello, I have the same bug. It used to work, but now with node v8.5.0 and pm2 2.7.2, graceful reload is not doing its job: my processes are reloaded at the same time.

// simulate a slow startup: signal ready only after 5 seconds
setTimeout(function() {
	console.log('ready ' + new Date());
	process.send('ready');
}, 5000);

// log SIGINT but never exit, so PM2 has to wait for its kill timeout
process.on('SIGINT', function() {
	console.log('i quit ' + new Date());
});

pm2 start test.js -i 2 --wait-ready --listen-timeout 15000

I downgraded to pm2@2.1.1. It works if i > 2; with i = 2, both instances are rebooted at the same time.

@guyellis commented Nov 11, 2017

I was looking at some of the code in the commit that @brendonboshell referenced, and noticed that the cb(...) call in this snippet:

if (!proc.process.pid) {
  console.error('app=%s id=%d does not have a pid', proc.pm2_env.name, proc.pm2_env.pm_id);
  proc.pm2_env.status = cst.STOPPED_STATUS;
  return cb(null, { error : true, message : 'could not kill process w/o pid'});
}

passes what looks like an error object as the second parameter. Is that correct? We usually pass the error object as the first parameter of a callback, and that pattern is followed elsewhere in this file. But it might be that this isn't a "true" error condition, which is why it's signaled as a property of the result and passed in the second parameter.
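For illustration (my sketch, not PM2 code), the difference between the two conventions:

// Node's usual error-first convention: the failure goes in the first
// argument, so any caller that checks `if (err)` will see it
cb(new Error('could not kill process w/o pid'));

// What the snippet above does instead: a null error plus a result object
// carrying an `error` flag, which a caller will treat as success unless
// it also inspects the result
cb(null, { error: true, message: 'could not kill process w/o pid' });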

@guyellis

I can't get version 2.1.1 to reload with zero downtime in my environment using the same approach that @brendonboshell has used with ab.

@guyellis

@laurentdebricon on ver 2.1.1, I just added an extra process with pm2 scale myapp +1 and then did a gracefulReload; it looks like the first two are restarted at the same time and the 3rd later. I still can't get Apache's ab to hammer it without lost connections. You can see the first two started 53s ago and the 3rd 31s ago.

$ pm2 gracefulReload myapp
Restarts are now immutable, to update environment or conf use --update-env
[PM2] Applying action softReloadProcessId on app [myapp](ids: 0,1,2)
[PM2] [myapp](0) ✓
[PM2] [myapp](1) ✓
[PM2] [myapp](2) ✓
┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬─────┬────────────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu │ mem        │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼─────┼────────────┼──────────┤
│ myapp    │ 0  │ cluster │ 27844 │ online │ 3       │ 53s    │ 0%  │ 145.0 MB   │ disabled │
│ myapp    │ 1  │ cluster │ 27850 │ online │ 3       │ 53s    │ 0%  │ 143.3 MB   │ disabled │
│ myapp    │ 2  │ cluster │ 27973 │ online │ 2       │ 31s    │ 0%  │ 144.2 MB   │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴─────┴────────────┴──────────┘

@guyellis

It just seems to be those first two. If I scale out to 10 instances and reload, the first two are reloaded at the same time and the rest are staggered.

┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬─────┬────────────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu │ mem        │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼─────┼────────────┼──────────┤
│ myapp    │ 0  │ cluster │ 14228 │ online │ 5       │ 103s   │ 0%  │ 101.2 MB   │ disabled │
│ myapp    │ 1  │ cluster │ 14234 │ online │ 5       │ 103s   │ 0%  │ 99.2 MB    │ disabled │
│ myapp    │ 2  │ cluster │ 14315 │ online │ 3       │ 95s    │ 0%  │ 97.8 MB    │ disabled │
│ myapp    │ 3  │ cluster │ 14284 │ online │ 1       │ 99s    │ 0%  │ 99.6 MB    │ disabled │
│ myapp    │ 4  │ cluster │ 14343 │ online │ 2       │ 94s    │ 0%  │ 98.9 MB    │ disabled │
│ myapp    │ 5  │ cluster │ 14552 │ online │ 2       │ 75s    │ 0%  │ 99.8 MB    │ disabled │
│ myapp    │ 6  │ cluster │ 14598 │ online │ 2       │ 72s    │ 0%  │ 106.3 MB   │ disabled │
│ myapp    │ 7  │ cluster │ 14679 │ online │ 1       │ 61s    │ 0%  │ 142.2 MB   │ disabled │
│ myapp    │ 8  │ cluster │ 14716 │ online │ 1       │ 59s    │ 0%  │ 142.4 MB   │ disabled │
│ myapp    │ 9  │ cluster │ 14929 │ online │ 1       │ 41s    │ 0%  │ 141.9 MB   │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴─────┴────────────┴──────────┘

@djbobbydrake

Bump

@wirtsi commented Sep 19, 2018

Can confirm: 3.1.2 loses requests on reload; switching to 2.1.1 fixes the issue.

@curtisbelt commented Nov 29, 2018

@brendonboshell @laurentdebricon

I am running 3.2.2 and have fixed this problem for myself by using pm2 reload appname --parallel 1. The issue is that pm2 performs the reload concurrently in pairs, which is bad for anyone running only 2 instances. I created a suggestion issue to adjust how that works (#4047); I hope this solves your problems as well!
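Applied to the repro app from earlier in this thread, that would be, e.g.:

pm2 reload api --parallel 1   # reload one instance at a time instead of the default two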

@curtisbelt

When using an ecosystem.json file, I found that --parallel no longer works, and it is not configurable in the JSON file either. I was able to work around this with the PM2_CONCURRENT_ACTIONS environment variable, like so:

#!/bin/bash
cd /path/to/appname
PM2_CONCURRENT_ACTIONS=1 pm2 reload ecosystem.json

Source: pm2/constants.js, lines 88 to 93 at 9178610:

// Concurrent actions when doing start/restart/reload
CONCURRENT_ACTIONS : (function() {
  var concurrent_actions = parseInt(process.env.PM2_CONCURRENT_ACTIONS) || 2;
  debug('Using %d parallelism (CONCURRENT_ACTIONS)', concurrent_actions);
  return concurrent_actions;
})(),

@harshmandan

I'm still experiencing downtime when using pm2 reload {processname}. I started the server using an ecosystem file. PM2 is sitting behind an NGINX reverse proxy.

Most of the time when I run pm2 reload myproc, a hit on the API returns 502; only occasionally does it reload gracefully and without any downtime. What should I change?

@jeffreytkj

> I'm still experiencing downtime when using pm2 reload {processname}. […]

I am facing the same issue as you. Do you have any solution for it?

@ibraah88

> I'm still experiencing downtime when using pm2 reload {processname}. […]

I am facing the same issue. Any solution?

@nursultan156

> I'm still experiencing downtime when using pm2 reload {processname}. […]

+1, same issue: pm2 5.1.2 in cluster mode, a Node.js app with a JSON config in the app root.

@strokirk

Can confirm: even the "graceful restart" that is supposed to happen via max_memory_restart doesn't work well in v5.3.0. Downgrading to v2.1 removes over 80% of the "connection closed before message completed" errors I otherwise see.
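For context (my illustration, not part of the comment above), max_memory_restart is set in an ecosystem file like this:

// restart the process once its memory exceeds the threshold (value illustrative)
max_memory_restart: '300M'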

@hiepnsx commented Feb 28, 2024

+1, same issue with pm2 (5.3.0), Next.js and NGINX.
