REST endpoint failing #3206

Closed
5 tasks
sryps opened this issue Mar 29, 2023 · 6 comments

@sryps

sryps commented Mar 29, 2023

Summary of Bug

We keep running into a "Message too long" error on the WebSocket connection. Hermes reconnects to the WebSocket after this error and keeps relaying, but every time this happens the REST endpoint dies and won't come back up until we restart Hermes. Since we use the /state endpoint for some monitoring, our system alerts on this frequently.

2023-03-26T17:26:10.764153Z ERROR ThreadId(129) event_monitor{chain=osmosis-1}: failed to collect events: WebSocket driver failed: web socket error: failed to read from WebSocket connection: Space limit exceeded: Message too long: 203625165 > 16777216
2023-03-26T17:26:11.910211Z  INFO ThreadId(129) event_monitor{chain=osmosis-1}:event_monitor.reconnect{chain=osmosis-1}: successfully reconnected to WebSocket endpoint ws://re.dac.te.d:12345/websocket
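
For scale, the two numbers in the error can be pulled out of the log line and compared. This is only a minimal sketch for reading the log: the helper name `parse_too_long` is made up for illustration and is not part of Hermes. It shows the rejected message was roughly 194 MiB against a 16 MiB limit.

```python
import re

# Matches the size/limit pair in the "Message too long" error quoted above.
TOO_LONG = re.compile(r"Message too long: (\d+) > (\d+)")


def parse_too_long(line):
    """Return (message_bytes, limit_bytes), or None if the line is unrelated."""
    match = TOO_LONG.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "2023-03-26T17:26:10.764153Z ERROR ThreadId(129) event_monitor{chain=osmosis-1}: "
    "failed to collect events: WebSocket driver failed: web socket error: "
    "failed to read from WebSocket connection: Space limit exceeded: "
    "Message too long: 203625165 > 16777216"
)

size, limit = parse_too_long(line)
MIB = 1024 * 1024
print(f"message: {size / MIB:.1f} MiB, limit: {limit / MIB:.0f} MiB, ratio: {size / limit:.1f}x")
```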

Version

running: hermes version 1.3.0

Acceptance Criteria

REST endpoint survives any websocket disconnection, or specifically this event.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@github-project-automation github-project-automation bot moved this to 🩹 Triage in Hermes Mar 29, 2023
@seanchen1991 seanchen1991 moved this from 🩹 Triage to 📥 Todo in Hermes Mar 29, 2023
@seanchen1991 seanchen1991 added this to the v1.5 milestone Mar 29, 2023
@romac romac added the A: critical Admin: critical or important label Mar 31, 2023
@seanchen1991 seanchen1991 moved this from 📥 Todo to 🏗 In progress in Hermes Apr 10, 2023
@ancazamfir
Collaborator

ancazamfir commented Apr 17, 2023

REST endpoint dies and wont stand back up...

What does this mean exactly? For which chains do you see that, only osmosis? And how frequently?

@ancazamfir
Collaborator

Also, could you provide the node's config.toml? In particular, have you tried the parameters mentioned here?
We had a similar issue a while back, and I thought the workaround above fixed it.

@romac
Member

romac commented Apr 21, 2023

@sryps I unfortunately haven't been able to reproduce the issue with the REST endpoint even after killing the WebSocket.

If you have the bandwidth, would you be willing to build and deploy Hermes from this branch: romac/poll-block-results and see if Hermes and its REST endpoint survive the Osmosis epoch block?

That bypasses the WebSocket and relies on pulling data from the chain every block.
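
To illustrate the general idea of polling instead of subscribing, here is a rough sketch. It is not the code on that branch: `poll_once`, `get_latest_height` and `get_block_results` are hypothetical names, and the chain is faked with an in-memory dictionary so the example runs on its own.

```python
def poll_once(last_seen, get_latest_height, get_block_results):
    """Fetch results for every height after last_seen.

    Returns (new_last_seen, [(height, results), ...]). Both callables are
    stand-ins for whatever RPC client is actually used to query the node.
    """
    latest = get_latest_height()
    batch = []
    for height in range(last_seen + 1, latest + 1):
        batch.append((height, get_block_results(height)))
    return max(last_seen, latest), batch


# Demo against an in-memory stand-in for a chain (hypothetical data).
fake_results = {101: ["send_packet"], 102: [], 103: ["acknowledge_packet"]}
last_seen, batch = poll_once(100, lambda: 103, lambda h: fake_results[h])
for height, events in batch:
    print(height, events)
```

Because each block is fetched on its own, a single oversized block no longer arrives as one WebSocket frame; the trade-off is one extra round trip per block.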

We've also upgraded the built-in REST and telemetry server to use more modern libraries, which may also alleviate the issue. So if you'd rather not run an experimental branch, you can also try building Hermes from master and see if you notice any improvement.

@sryps
Author

sryps commented Apr 21, 2023

REST endpoint dies and wont stand back up...

What does this mean exactly? For which chains do you see that, only osmosis? And how frequently?

@ancazamfir We use the hermes REST endpoint to monitor the health of the node. So http://{INSERT_HERMES_IP}:3000/state gives us a response like this:

{
  "status": "success",
  "result": {
    "chains": [
      "cosmoshub-4",
      "evmos_9001-2",
      "osmosis-1",
      "panacea-3"
    ],
    "workers": {
      "Client": [
        {......etc

When we get a WebSocket error like the one shown above, the /state REST endpoint dies: it returns a 404 and never recovers until we restart hermes.
From what I've seen parsing the logs, this WebSocket error has only happened with osmosis.
The osmosis WebSocket connection re-establishes itself just fine and keeps relaying packets. The issue is the REST endpoint dying and requiring a restart before we can monitor hermes again.
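
For anyone wanting to reproduce the monitoring side, the check can be sketched roughly as below. This is an assumption about how such a probe might look, not our actual tooling: `classify_state` and `probe` are hypothetical names, and the only things taken from this thread are the /state path, the "status": "success" field, and the 404 symptom.

```python
import json
import urllib.error
import urllib.request


def classify_state(status_code, body):
    """Classify one /state response as 'ok', 'rest_down', or 'unexpected'."""
    if status_code == 404:
        # The symptom described in this issue.
        return "rest_down"
    if status_code != 200:
        return "unexpected"
    try:
        payload = json.loads(body)
    except ValueError:
        return "unexpected"
    if isinstance(payload, dict) and payload.get("status") == "success":
        return "ok"
    return "unexpected"


def probe(url, timeout=5.0):
    """Fetch the URL and classify it; connection failures count as 'rest_down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_state(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        return classify_state(err.code, "")
    except (urllib.error.URLError, OSError):
        return "rest_down"
```

Calling `probe("http://{INSERT_HERMES_IP}:3000/state")` with the real address filled in would return "rest_down" in the failure state described here, even while relaying continues.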

@romac I will try to find time next week to use our orchestration tools and test this branch.
The new REST/telemetry libraries might just do the trick, so I will check that as well.

Also, I will note this behaviour has been very sporadic, not on every epoch, which makes it even more difficult to troubleshoot :(

Thanks everyone! I will be in touch, hopefully with good results.

@romac romac mentioned this issue May 9, 2023
7 tasks
@romac romac modified the milestones: v1.5, v1.6 May 23, 2023
@romac romac removed the A: critical Admin: critical or important label May 30, 2023
@romac
Member

romac commented May 30, 2023

@sryps Are you still seeing this issue with Hermes 1.5?

@sryps
Author

sryps commented May 30, 2023

@romac so far so good. We can close for now and if it resurfaces I'll let you know! Thanks!

@sryps sryps closed this as completed May 30, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Hermes May 30, 2023