
Issues starting after 3.5.1 upgrade #18496

Closed
calcium90 opened this issue Aug 6, 2020 · 15 comments · Fixed by #18547

Comments

@calcium90

Description:

We successfully upgraded to 3.5.1, but when we later tried to restart Rocket.Chat it hung at "SAUMonitor ➔ info [aggregate] - Start Cron." and progressed no further.

Since this was in Kubernetes, the liveness probe eventually killed it; it got stuck at the same point on each restart until, after 7 attempts, it managed to progress and start correctly. It's possible it would have succeeded with a longer wait on the liveness check, but there's no way to verify that now, as the instance is currently fine.

See additional context for some background to why we restarted.

Steps to reproduce:

Not able to reproduce now, but it occurred on startup of a new Kubernetes pod.

Expected behavior:

Rocket.Chat starts up in a timely fashion.

Actual behavior:

The Rocket.Chat container starts, reaches a certain point in the logs (see logs), and hangs until it is eventually killed by Kubernetes due to failing liveness/readiness probes.
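For reference, whether the restarts really came from probe failures can be confirmed from the pod events; a sketch of the check, reusing the namespace placeholder and pod name from the logs below:

kubectl -n xxxx describe pod rocketchat-7b4dc69ddd-4k77b

The Events section at the end of that output should list the failed liveness/readiness probes and the resulting container kills.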

Server Setup Information:

  • Version of Rocket.Chat Server: 3.5.1
  • Operating System: Official docker image 3.5.1
  • Deployment Method: docker in kubernetes
  • Number of Running Instances: 1
  • DB Replicaset Oplog:
  • NodeJS Version:
  • MongoDB Version:

Client Setup Information

  • Desktop App or Browser Version:
  • Operating System:

Additional context

One point to note: the reason we restarted was to enable push notifications, and I'm not sure whether there is any possible interaction here. We received the notice to enable them, did so, and then encountered this issue.

These are the Kubernetes liveness/readiness probes, which have served us well for a few years:

        livenessProbe:
          httpGet:
            path: /home
            port: 3000
          initialDelaySeconds: 90
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /home
            port: 3000
          initialDelaySeconds: 30
          timeoutSeconds: 10
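Since the reporter suspects a longer wait on the liveness check might have allowed startup to complete, one possible mitigation (not part of the original report) is a startupProbe, which holds off the liveness and readiness probes until it has succeeded. A minimal sketch, assuming Kubernetes 1.16+; the probe values are illustrative only:

        # Hypothetical addition, not from the original manifest: give the pod a
        # longer startup window before the liveness probe can kill it.
        startupProbe:
          httpGet:
            path: /home
            port: 3000
          periodSeconds: 10
          failureThreshold: 30   # periodSeconds * failureThreshold = up to ~300s to start

With a startup probe configured, the existing livenessProbe and readinessProbe only begin running once it has succeeded.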

Relevant logs:

kubectl -n xxxx logs rocketchat-7b4dc69ddd-4k77b
rocketchat:lib ➔ oauth_updated Accounts_OAuth_Facebook
...
...
rocketchat:lib ➔ oauth_updated Accounts_OAuth_Apple
Federation ➔ Setup.info Federation is disabled
Setting default file store to FileSystem
CAS ➔ info Disabling CAS login service
Search Logger ➔ info create search provider defaultProvider
Search Logger ➔ info create search provider chatpalProvider
{"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1596709208408},"level":"info"}
ufs: temp directory created at "/tmp/ufs"
SAUMonitor ➔ info [aggregate] - Start Cron.
@bkraul

bkraul commented Aug 13, 2020

This is also happening on docker develop-80ba8cf. I do not have the extra debugging options @cjpluss2k18 has, but I can see that it is getting hung at:

ufs: temp directory created at "/tmp/ufs"

What can we do to work around this problem?

@alyandon

alyandon commented Aug 13, 2020

Yep - looks like the snap package updated to 3.5.1, and after a restart rocketchat-server gets stuck at that same point. It isn't opening the listening TCP socket, so my reverse proxy gateway responds with a 502 Bad Gateway error.

Edit: Reverting the snap to 3.4.2 doesn't seem to help either. The server starts and then exits immediately without any indication of an error.

Further edit: stracing the node process shows it is definitely talking to mongo at least - it looks vaguely like it is looping over rocket_notification_queue, but nothing else is going on.

@cjhille

cjhille commented Aug 13, 2020

I have exactly the same issue after upgrading to 3.5.1. The first restart of the docker containers after upgrading worked, but on the second restart it gets stuck at ufs: temp directory created at "/tmp/ufs".

Same as @cjpluss2k18, I restarted the second time to enable push notifications, but my Rocket.Chat instance still hasn't started after 7 minutes (... or half an hour, for that matter) :/

EDIT: Now it progresses to this point and gets stuck here

rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | Setting default file store to GridFS
rocketchat_1          | {"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1597326036694},"level":"info"}
rocketchat_1          | ufs: temp directory created at "/tmp/ufs"
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | LocalStore: store created at
rocketchat_1          | Setting default file store to GridFS
rocketchat_1          | {"line":"120","file":"migrations.js","message":"Migrations: Not migrating, already at version 202","time":{"$date":1597326196454},"level":"info"}

@bkraul

bkraul commented Aug 13, 2020

@sampaiodiego could we please get this triaged? Seems like a breaking issue :)

@o5k

o5k commented Aug 13, 2020

how many times does this have to happen

@Mixelito

same here

@cjhille

cjhille commented Aug 13, 2020

Well, this is quite nasty, since the system worked for a while after upgrading. Now I'm forced to revert to an older backup from 2 days ago... which means data loss for everyone. \o/

@acmeplus

Same problem here since our snap installation automatically upgraded to 3.5.1 during the night. I've tried to revert to a previous snap but the same issue persists. Still stuck at the SAUMonitor line.

@alyandon

For those curious - I popped into the community rocketchat server snap channel, and based on the conversation going on there it appears this problem is caused by the Rocket.Chat server attempting to reach out to cloud.rocket.chat, which currently appears to be non-responsive in some fashion.

Adding a firewall rule to block outbound traffic - iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT - appears to allow my instance to start normally, but presumably things like push notifications won't work.

Obviously, I don't recommend using that as anything other than a work-around.
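Once cloud.rocket.chat is responsive again (see the maintainers' replies further down), the temporary block from the workaround above can be removed with the matching delete; a sketch, assuming the rule was added exactly as shown:

iptables -D OUTPUT -d 51.81.11.138/24 -j REJECT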

@phrankerco

Adding a firewall rule to block outbound traffic - iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT - appears to allow my instance to start normally, but presumably things like push notifications won't work.

This work-around does work!
Will continue to follow this thread for an actual solution.

@sampaiodiego
Member

A proper fix will land in release 3.5.2. A Docker image will be available in a few minutes, and a snap release will be available in a few hours.

@bkraul

bkraul commented Aug 13, 2020

I wondered why my server started right up on my test instance, which is not joined to the Rocket.Chat cloud or connected to its services. Makes perfect sense. @sampaiodiego, I noticed the PR you merged only defers the startup checks; it doesn't seem to address the issue of trying to access cloud.rocket.chat when it is unavailable. Is this something that needs to be handled as well?

@iovcho

iovcho commented Aug 13, 2020

My Rocket.Chat server (running on a CentOS Linux release 7.8.2003 (Core) virtual machine with the latest updates) just crashed today. I applied all OS updates and upgraded MongoDB, NodeJS and Rocket.Chat to the latest versions, but without success.
I ran
iptables -I OUTPUT -d 51.81.11.138/24 -j REJECT
systemctl restart rocketchat
and the server is now online.

@rodrigok
Member

@bkraul it's not the case of the service being unavailable, but of the service not replying and keeping the connection open, which stalled the startup. We didn't realize those requests were making the startup wait, so the first solution was to move them to async.

This PR changes the HTTP calls to always have a forced timeout of 20s, preventing those connections from hanging forever.

We are also fixing our cloud services to enforce a timeout, to prevent this from happening again on old versions of Rocket.Chat or anywhere else using the cloud APIs.

Unfortunately, this was the first time we had this kind of problem; it seems to be related to our services and their database connection, which made the requests wait forever. We are implementing all the necessary changes to prevent this from happening again.

Our cloud services have been back to normal response times for a few minutes now, so if you restart your servers they should be back to working again.

Thanks

@alyandon

alyandon commented Aug 13, 2020

@rodrigok Thanks for the update. Sounds painful. :(
