
Issues starting after 3.5.1 upgrade #18496

Closed
calcium90 opened this issue Aug 6, 2020 · 15 comments · Fixed by #18547

Comments

@calcium90

Description:

We successfully upgraded to 3.5.1, but when we later tried to restart Rocket.Chat it hung at "SAUMonitor ➔ info [aggregate] - Start Cron." and progressed no further.

Since this was in Kubernetes, the liveness probe eventually killed it; it got stuck at the same point on each restart until, after 7 attempts, it managed to progress and start correctly. It's possible it would have succeeded with a longer wait on the liveness check, but there's no way to verify that now, as the instance is currently fine.

See additional context for some background to why we restarted.

Steps to reproduce:

Not able to reproduce now, but it occurred on startup of a new Kubernetes pod.

Expected behavior:

Rocket.Chat starts up in a timely fashion.

Actual behavior:

The Rocket.Chat container starts, reaches a certain point in the logs (see logs), and hangs until it is eventually killed by Kubernetes due to failing liveness/readiness probes.
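For reference, whether the restarts really came from probe failures can be confirmed from the pod events; a sketch of the check, reusing the namespace placeholder and pod name from the logs below:

kubectl -n xxxx describe pod rocketchat-7b4dc69ddd-4k77b

The Events section at the end of that output should list the failed liveness/readiness probes and the resulting container kills.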

Server Setup Information:

  • Version of Rocket.Chat Server: 3.5.1
  • Operating System: Official docker image 3.5.1
  • Deployment Method: docker in kubernetes
  • Number of Running Instances: 1
  • DB Replicaset Oplog:
  • NodeJS Version:
  • MongoDB Version:

Client Setup Information

  • Desktop App or Browser Version:
  • Operating System:

Additional context

One point to note: the reason we restarted was to enable push notifications, and I'm not sure whether there is any possible interaction here. We received the notice to enable them, did so, and then encountered this issue.

These are the Kubernetes liveness/readiness probes, which have served us well for a few years:

        livenessProbe:
          httpGet:
            path: /home
            port: 3000
          initialDelaySeconds: 90
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /home
            port: 3000
          initialDelaySeconds: 30
          timeoutSeconds: 10
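Since the reporter suspects a longer wait on the liveness check might have allowed startup to complete, one possible mitigation (not part of the original report) is a startupProbe, which holds off the liveness and readiness probes until it has succeeded. A minimal sketch, assuming Kubernetes 1.16+; the probe values are illustrative only:

        # Hypothetical addition, not from the original manifest: give the pod a
        # longer startup window before the liveness probe can kill it.
        startupProbe:
          httpGet:
            path: /home
            port: 3000
          periodSeconds: 10
          failureThreshold: 30   # periodSeconds * failureThreshold = up to ~300s to start

With a startup probe configured, the existing livenessProbe and readinessProbe only begin running once it has succeeded.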

Relevant logs:

kubectl -n xxxx logs rocketchat-7b4dc69ddd-4k77b
rocketchat:lib ➔ oauth_updated Accounts_OAuth_Facebook
...
...
rocketchat:lib ➔ oauth_updated Accounts_OAuth_Apple
Federation ➔ Setup.info Federation is disabled
Setting default file store to FileSystem
CAS ➔ info Disabling CAS login service
Search Logger ➔ info create search provider defaultProvider
Search Logger ➔ info create search provider chatpalProvider
{"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1596709208408},"level":"info"}
ufs: temp directory created at "/tmp/ufs"
SAUMonitor ➔ info [aggregate] - Start Cron.
@bkraul

bkraul commented Aug 13, 2020

This is also happening on docker develop-80ba8cf. I do not have the extra debugging options @cjpluss2k18 has, but I can see that it is getting hung at:

ufs: temp directory created at "/tmp/ufs"

What can we do to work around this problem?

@alyandon

alyandon commented Aug 13, 2020

Yep - looks like the snap package updated to 3.5.1, and after a restart rocketchat-server gets stuck at that same point. It isn't opening the listening TCP socket, so my reverse proxy gateway responds with a 502 Bad Gateway error.

Edit: Reverting the snap to 3.4.2 doesn't seem to help either. The server starts and then exits immediately without any indication of an error.

Further edit: stracing the node process shows it is definitely talking to mongo at least - it looks vaguely like it is looping over rocket_notification_queue, but nothing else is going on.

@cjhille

cjhille commented Aug 13, 2020

I have exactly the same issue after upgrading to 3.5.1. The first restart of the docker containers after upgrading worked, but on the second restart it gets stuck at ufs: temp directory created at "/tmp/ufs".

Same as @cjpluss2k18, I restarted the second time to enable push notifications, but my Rocket.Chat instance still hasn't started after 7 minutes (... or half an hour, for that matter) :/

EDIT: Now it progresses to this point and gets stuck here

rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | Setting default file store to GridFS
rocketchat_1          | {"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1597326036694},"level":"info"}
rocketchat_1          | ufs: temp directory created at "/tmp/ufs"
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | Setting default file store to GridFS
rocketchat_1          | {"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1597326196454},"level":"info"}

@bkraul

bkraul commented Aug 13, 2020

@sampaiodiego could we please get this triaged? Seems like a breaking issue :)

@o5k

o5k commented Aug 13, 2020

how many times does this have to happen

@Mixelito

same here

@cjhille

cjhille commented Aug 13, 2020

Well, this is quite nasty, since the system worked for a while after upgrading. Now I'm forced to revert to an older backup from 2 days ago... which means data loss for everyone. \o/

@acmeplus

Same problem here since our snap installation automatically upgraded to 3.5.1 during the night. I've tried to revert to a previous snap but the same issue persists. Still stuck at the SAUMonitor line.

@alyandon

For those curious - I popped into the community rocketchat server snap channel, and based on the conversation going on there it appears this problem is caused by the Rocket.Chat server attempting to reach out to cloud.rocket.chat, which currently appears to be non-responsive in some fashion.

Adding a firewall rule to block outbound traffic - iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT - appears to allow my instance to start normally, but presumably things like push notifications won't work.

Obviously, I don't recommend using that as anything other than a work-around.
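Once cloud.rocket.chat is responsive again (see the maintainers' replies further down), the temporary block from the workaround above can be removed with the matching delete; a sketch, assuming the rule was added exactly as shown:

iptables -D OUTPUT -d 51.81.11.138/24 -j REJECT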

@phrankerco

Adding a firewall rule to block outbound traffic - iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT - appears to allow my instance to start normally, but presumably things like push notifications won't work.

This work-around does work!
Will continue to follow this thread for an actual solution.

@sampaiodiego
Member

A proper fix will land in release 3.5.2. A Docker image will be available in a few minutes, and a snap release will be available in a few hours.

@bkraul

bkraul commented Aug 13, 2020

I wondered why my server started right up on my test instance, which is not joined to the Rocket.Chat cloud or connected to its services. Makes perfect sense. @sampaiodiego, I noticed the PR you merged only defers the startup checks; it doesn't seem to address the issue of trying to access cloud.rocket.chat when it is unavailable. Is this something that needs to be handled as well?

@iovcho

iovcho commented Aug 13, 2020

My Rocket.Chat server (running on a CentOS Linux release 7.8.2003 (Core) virtual machine with the latest updates) just crashed today. I applied all OS updates and upgraded MongoDB, NodeJS and Rocket.Chat to the latest versions, but without success.
I ran
iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT
systemctl restart rocketchat
and the server is now online.

@rodrigok
Member

@bkraul it's not the case of the service being unavailable, but of the service not replying and keeping the connection open, which stalled the startup. We didn't realize those requests were making the startup wait, so the first solution was to move them to async.

This PR changes the HTTP calls to always have a forced timeout of 20s, preventing those connections from hanging forever.

We are also fixing our cloud services to enforce a timeout, to prevent this from happening again on old versions of Rocket.Chat or anywhere else using the cloud APIs.

Unfortunately, this was the first time we had this kind of problem; it seems to be related to our services and their database connection, which made the requests wait forever. We are implementing all the necessary changes to prevent this from happening again.

Our cloud services have been back to normal response times for a few minutes now, so if you restart your servers they should be back to working again.

Thanks

@alyandon

alyandon commented Aug 13, 2020

@rodrigok Thanks for the update. Sounds painful. :(
