Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node health #1203

Merged
merged 37 commits into from
Dec 21, 2023
Merged

Node health #1203

merged 37 commits into from
Dec 21, 2023

Conversation

pavelkrolevets
Copy link
Contributor

@pavelkrolevets pavelkrolevets commented Nov 13, 2023

Health check at GET /v1/node/health

  • checks BN connection health
  • checks execution health
  • checks event sync health
  • checks amount of connected peers (less than 10 isnt good)
  • checks if inbound peer equal to 0 - probably local port isnt reachable by other peers

Returns:

{
  "p2p": "bad: no peers are connected",
  "beacon_node": "good",
  "execution_node": "good",
  "event_syncer": "good",
  "advanced": {
    "peers": 0,
    "inbound_conns": 0,
    "outbound_conns": 0,
    "p2p_listen_addresses": [
      "192.168.1.112:13001"
    ]
  }
}

Example:

curl -i -H "Accept: application/json" http://5.161.192.132:16052/v1/node/health
curl -i -H "Accept:  text/plain" http://5.161.192.132:16052/v1/node/health

@pavelkrolevets pavelkrolevets changed the base branch from main to stage November 13, 2023 12:10
api/handlers/node.go Outdated Show resolved Hide resolved
api/handlers/node.go Outdated Show resolved Hide resolved
}

func (h *Node) Topics(w http.ResponseWriter, r *http.Request) error {
allpeers, peerbytpc := h.TopicIndex.PeersByTopic()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tpc is more complex to understand than topic and saves just 2 symbols. Also, let's follow camel case (https://go.dev/doc/effective_go#mixed-caps)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Topics was here before the PR. But yes, we should follow camel case everywhere. Why linter isnt complaining? Maybe we need to make it more strict...

api/handlers/node.go Outdated Show resolved Hide resolved
api/handlers/node.go Outdated Show resolved Hide resolved
api/handlers/node.go Outdated Show resolved Hide resolved
api/handlers/node.go Outdated Show resolved Hide resolved
api/handlers/node.go Outdated Show resolved Hide resolved
api/handlers/node.go Outdated Show resolved Hide resolved
nodeprobe/nodeprobe.go Outdated Show resolved Hide resolved
return nil
}

func (p *Prober) CheckEventSycNodeHealth(ctx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (p *Prober) CheckEventSycNodeHealth(ctx context.Context) error {
func (p *Prober) CheckEventSyncNodeHealth(ctx context.Context) error {

nodeprobe/nodeprobe.go Outdated Show resolved Hide resolved
)

func (c HealthStatus) String() string {
str := [...]string{"NotHealthy", "Healthy"}
Copy link
Contributor

@y0sher y0sher Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confusing name, I'd chose something more bold like "OK"/"ERROR" and "GOOD"/"BAD"

Comment on lines 154 to 161
switch {
case activePeerCount >= healthyPeersAmount:
resp.PeersHealthStatus = fmt.Sprintf("%s: %d peers are connected", Healthy, activePeerCount)
case activePeerCount > 0 && activePeerCount < healthyPeersAmount:
resp.PeersHealthStatus = fmt.Sprintf("%s: less than %d peers are connected", NotHealthy, healthyPeersAmount)
case activePeerCount == 0:
resp.PeersHealthStatus = fmt.Sprintf("%s: %s", NotHealthy, "error: no peers are connected")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would break this into two. one that says GOOD/BAD (so status only), and one that says the number, or maybe reason.

@pavelkrolevets pavelkrolevets marked this pull request as ready for review November 15, 2023 07:23
MatusKysel
MatusKysel previously approved these changes Nov 15, 2023
This reverts commit 2f54f4e.
MatusKysel
MatusKysel previously approved these changes Nov 15, 2023
y0sher
y0sher previously approved these changes Nov 16, 2023
Copy link
Contributor

@y0sher y0sher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@moshe-blox
Copy link
Contributor

Please also merge #1141 with this

@pavelkrolevets
Copy link
Contributor Author

Please also merge #1141 with this

merged ssv-api-docs to here

MatusKysel
MatusKysel previously approved these changes Nov 27, 2023
* refactor: node health API
MatusKysel
MatusKysel previously approved these changes Dec 6, 2023
Comment on lines +116 to +122
peers, byTopic := h.TopicIndex.PeersByTopic()

resp := AllPeersAndTopicsJSON{
AllPeers: peers,
}
for topic, peers := range byTopic {
resp.PeersByTopic = append(resp.PeersByTopic, topicIndexJSON{TopicName: topic, Peers: peers})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using peers twice for different stuff, though I think its not a big issue since first one usage is done before scond is started

y0sher
y0sher previously approved these changes Dec 7, 2023
@moshe-blox moshe-blox dismissed stale reviews from y0sher and MatusKysel via 959b4ed December 13, 2023 14:08
Copy link
Contributor

@y0sher y0sher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moshe-blox @pavelkrolevets @MatusKysel what is the status here?

@MatusKysel
Copy link
Contributor

@moshe-blox @pavelkrolevets @MatusKysel what is the status here?

we need to cover case when you have no validators, you need to subscribe to random subnet, because we will show wrong statistics

y0sher
y0sher previously approved these changes Dec 19, 2023
@y0sher y0sher merged commit 84582f9 into stage Dec 21, 2023
5 checks passed
moshe-blox added a commit that referenced this pull request Jan 2, 2024
* fix: stale operator ID in `p2pNetwork` (#1229)

* fix: stale operator ID in `p2pNetwork`

---------

Co-authored-by: Lior Rutenberg <liorr@blox.io>

* fix: sufficient timeout for initial duty fetch (#1214)

* fix: sufficient timeout for initial duty fetch (#1214)
---------

Co-authored-by: Matus Kysel <matus@blox.io>

* deployment: change bootnode for holesky-stage (#1233)

* deploy new bootnode ENR for holesky stage

* Voluntary exit (#1200)

* Voluntary exit

---------

Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>

* feat: include 'connected' in peer scores log (#1224)

* feat: include 'connected' in peer scores log

* P2P metrics  (#1163)

* metrics: added p4 score (invalidMessageDeliveries)

* metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification

* deploy to the stage

# Conflicts:
#	.gitlab-ci.yml

* metrics: added counters flush once in 8h

* metrics: added counters for duties

* Added auto delete disconnected peers label metrics. Added RSA verifications metric

* disabled ci

* Updated p4 score metric update

* metrics: added p4 score (invalidMessageDeliveries)

* metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification

* deploy to the stage

# Conflicts:
#	.gitlab-ci.yml

* metrics: added counters flush once in 8h

* metrics: added counters for duties

* Added auto delete disconnected peers label metrics. Added RSA verifications metric

* disabled ci

* Updated p4 score metric update

* review comments fixes

* deploy

* deploy

* testing metrics differ

* added duties created and finalized

* removed extra monitoring.metricsreporter usage

* removed topic from p4score. made it a sum of squares

* deploy to all nodes

* deploy p2p_metrics

* pr review fixes

* trigger ci

* deploy 9-20

* Fix msg validation rsa verification counter metric

* allocate p2p_metrics

* inspect scores more frequently, but don't log every time

* renamed duties created/finalized metrics and refactored to record role

* deploy to 5--8

* deploy to 5--8

* rename metric

* blank space

* approve spec change (just a metric)

* refactors

* reset scores

* metric help msg

* revert gitlab

* deploy

* revert deploy

* Add metrics for signature verifications

* Update differ.config.yaml with approved changes

---------

Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: moshe-blox <moshe@blox.io>

* Node health (#1203)

* chore: link to SSV API docs in configs & README

* initial commit

* add node health route to ssv API

* update health route

* update health route

* update health route

* update health route

* deploy to stage

* add plaintext response

* lint

* lint

* change to good/bad

* lint

* lint

* refactor

* Revert "deploy to stage"

This reverts commit 2f54f4e.

* lint

* add inbound/outbound count for health + deploy to stage

* change ports back

* update count

* lint

* update conns

* lint

* remove connected peer count

* test blocked ports

* Revert "remove connected peer count"

This reverts commit 79e2b94.

* leave only active peers count

* Revert "test blocked ports"

This reverts commit 6fc9282.

* ci to stage

* add mutex to nodes access

* refactor: node health API (#1222)

* refactor: node health API

* added cpu_cores to healthcheck output

* fix inbound/outbound stats

* Remove CPU core reporting

---------

Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>

* feat: subscribe to a random subnet with 0 validators (#1245)

* feat: subscribe to a random subnet with 0 validators

* Fix: health route host addres (#1246)

* node health api route advertises the host addresses from the config.

* set up listenaddress directly

* feat: rate limit inbound connections by IP (#1226)

* feat: rate limit inbound connections by IP

* activate conngater

* deploy to 5--8

* fix

* fix

* Refactor connection gating in p2p setup

* Update ipLimiter parameters

* revert gitlab

* Revert "revert gitlab"

This reverts commit fcc7902.

* Revert "Revert "revert gitlab""

This reverts commit feb9e4e.

---------

Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>

---------

Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>
Co-authored-by: Lior Rutenberg <liorr@blox.io>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com>
Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: Anton Korpusenko <antokorp@gmail.com>
Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>
@MatusKysel MatusKysel deleted the node_health branch July 9, 2024 09:49
y0sher added a commit that referenced this pull request Jul 28, 2024
* fix: stale operator ID in `p2pNetwork` (#1229) (#1230)

* fix: stale operator ID in `p2pNetwork`

---------

Co-authored-by: Lior Rutenberg <liorr@blox.io>

* v1.2.2 (#1248)

* fix: stale operator ID in `p2pNetwork` (#1229)

* fix: stale operator ID in `p2pNetwork`

---------

Co-authored-by: Lior Rutenberg <liorr@blox.io>

* fix: sufficient timeout for initial duty fetch (#1214)

* fix: sufficient timeout for initial duty fetch (#1214)
---------

Co-authored-by: Matus Kysel <matus@blox.io>

* deployment: change bootnode for holesky-stage (#1233)

* deploy new bootnode ENR for holesky stage

* Voluntary exit (#1200)

* Voluntary exit

---------

Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>

* feat: include 'connected' in peer scores log (#1224)

* feat: include 'connected' in peer scores log

* P2P metrics  (#1163)

* metrics: added p4 score (invalidMessageDeliveries)

* metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification

* deploy to the stage

# Conflicts:
#	.gitlab-ci.yml

* metrics: added counters flush once in 8h

* metrics: added counters for duties

* Added auto delete disconnected peers label metrics. Added RSA verifications metric

* disabled ci

* Updated p4 score metric update

* metrics: added p4 score (invalidMessageDeliveries)

* metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification

* deploy to the stage

# Conflicts:
#	.gitlab-ci.yml

* metrics: added counters flush once in 8h

* metrics: added counters for duties

* Added auto delete disconnected peers label metrics. Added RSA verifications metric

* disabled ci

* Updated p4 score metric update

* review comments fixes

* deploy

* deploy

* testing metrics differ

* added duties created and finalized

* removed extra monitoring.metricsreporter usage

* removed topic from p4score. made it a sum of squares

* deploy to all nodes

* deploy p2p_metrics

* pr review fixes

* trigger ci

* deploy 9-20

* Fix msg validation rsa verification counter metric

* allocate p2p_metrics

* inspect scores more frequently, but don't log every time

* renamed duties created/finalized metrics and refactored to record role

* deploy to 5--8

* deploy to 5--8

* rename metric

* blank space

* approve spec change (just a metric)

* refactors

* reset scores

* metric help msg

* revert gitlab

* deploy

* revert deploy

* Add metrics for signature verifications

* Update differ.config.yaml with approved changes

---------

Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: moshe-blox <moshe@blox.io>

* Node health (#1203)

* chore: link to SSV API docs in configs & README

* initial commit

* add node health route to ssv API

* update health route

* update health route

* update health route

* update health route

* deploy to stage

* add plaintext response

* lint

* lint

* change to good/bad

* lint

* lint

* refactor

* Revert "deploy to stage"

This reverts commit 2f54f4e.

* lint

* add inbound/outbound count for health + deploy to stage

* change ports back

* update count

* lint

* update conns

* lint

* remove connected peer count

* test blocked ports

* Revert "remove connected peer count"

This reverts commit 79e2b94.

* leave only active peers count

* Revert "test blocked ports"

This reverts commit 6fc9282.

* ci to stage

* add mutex to nodes access

* refactor: node health API (#1222)

* refactor: node health API

* added cpu_cores to healthcheck output

* fix inbound/outbound stats

* Remove CPU core reporting

---------

Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>

* feat: subscribe to a random subnet with 0 validators (#1245)

* feat: subscribe to a random subnet with 0 validators

* Fix: health route host addres (#1246)

* node health api route advertises the host addresses from the config.

* set up listenaddress directly

* feat: rate limit inbound connections by IP (#1226)

* feat: rate limit inbound connections by IP

* activate conngater

* deploy to 5--8

* fix

* fix

* Refactor connection gating in p2p setup

* Update ipLimiter parameters

* revert gitlab

* Revert "revert gitlab"

This reverts commit fcc7902.

* Revert "Revert "revert gitlab""

This reverts commit feb9e4e.

---------

Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>

---------

Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>
Co-authored-by: Lior Rutenberg <liorr@blox.io>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com>
Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: Anton Korpusenko <antokorp@gmail.com>
Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>

* Revert "v1.2.2 (#1248)" (#1263)

This reverts commit ff2f594.

* chore: fix some function names

Signed-off-by: linghuying <1599935829@qq.com>

---------

Signed-off-by: linghuying <1599935829@qq.com>
Co-authored-by: Lior Rutenberg <liorr@blox.io>
Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>
Co-authored-by: rehs0y <lyosher@gmail.com>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com>
Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: Anton Korpusenko <antokorp@gmail.com>
Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants