-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node health #1203
Node health #1203
Conversation
api/handlers/node.go
Outdated
} | ||
|
||
func (h *Node) Topics(w http.ResponseWriter, r *http.Request) error { | ||
allpeers, peerbytpc := h.TopicIndex.PeersByTopic() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tpc
is more complex to understand than topic
and saves just 2 symbols. Also, let's follow camel case (https://go.dev/doc/effective_go#mixed-caps)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Topics was here before the PR. But yes, we should follow camel case everywhere. Why linter isnt complaining? Maybe we need to make it more strict...
nodeprobe/nodeprobe.go
Outdated
return nil | ||
} | ||
|
||
func (p *Prober) CheckEventSycNodeHealth(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (p *Prober) CheckEventSycNodeHealth(ctx context.Context) error { | |
func (p *Prober) CheckEventSyncNodeHealth(ctx context.Context) error { |
api/handlers/node.go
Outdated
) | ||
|
||
func (c HealthStatus) String() string { | ||
str := [...]string{"NotHealthy", "Healthy"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confusing name, I'd chose something more bold like "OK"/"ERROR" and "GOOD"/"BAD"
api/handlers/node.go
Outdated
switch { | ||
case activePeerCount >= healthyPeersAmount: | ||
resp.PeersHealthStatus = fmt.Sprintf("%s: %d peers are connected", Healthy, activePeerCount) | ||
case activePeerCount > 0 && activePeerCount < healthyPeersAmount: | ||
resp.PeersHealthStatus = fmt.Sprintf("%s: less than %d peers are connected", NotHealthy, healthyPeersAmount) | ||
case activePeerCount == 0: | ||
resp.PeersHealthStatus = fmt.Sprintf("%s: %s", NotHealthy, "error: no peers are connected") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would break this into two. one that says GOOD/BAD (so status only), and one that says the number, or maybe reason.
This reverts commit 2f54f4e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Please also merge #1141 with this |
merged ssv-api-docs to here |
* refactor: node health API
peers, byTopic := h.TopicIndex.PeersByTopic() | ||
|
||
resp := AllPeersAndTopicsJSON{ | ||
AllPeers: peers, | ||
} | ||
for topic, peers := range byTopic { | ||
resp.PeersByTopic = append(resp.PeersByTopic, topicIndexJSON{TopicName: topic, Peers: peers}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using peers twice for different stuff, though I think its not a big issue since first one usage is done before scond is started
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@moshe-blox @pavelkrolevets @MatusKysel what is the status here?
we need to cover case when you have no validators, you need to subscribe to random subnet, because we will show wrong statistics |
* fix: stale operator ID in `p2pNetwork` (#1229) * fix: stale operator ID in `p2pNetwork` --------- Co-authored-by: Lior Rutenberg <liorr@blox.io> * fix: sufficient timeout for initial duty fetch (#1214) * fix: sufficient timeout for initial duty fetch (#1214) --------- Co-authored-by: Matus Kysel <matus@blox.io> * deployment: change bootnode for holesky-stage (#1233) * deploy new bootnode ENR for holesky stage * Voluntary exit (#1200) * Voluntary exit --------- Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> * feat: include 'connected' in peer scores log (#1224) * feat: include 'connected' in peer scores log * P2P metrics (#1163) * metrics: added p4 score (invalidMessageDeliveries) * metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification * deploy to the stage # Conflicts: # .gitlab-ci.yml * metrics: added counters flush once in 8h * metrics: added counters for duties * Added auto delete disconnected peers label metrics. Added RSA verifications metric * disabled ci * Updated p4 score metric update * metrics: added p4 score (invalidMessageDeliveries) * metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification * deploy to the stage # Conflicts: # .gitlab-ci.yml * metrics: added counters flush once in 8h * metrics: added counters for duties * Added auto delete disconnected peers label metrics. Added RSA verifications metric * disabled ci * Updated p4 score metric update * review comments fixes * deploy * deploy * testing metrics differ * added duties created and finalized * removed extra monitoring.metricsreporter usage * removed topic from p4score. made it a sum of squares * deploy to all nodes * deploy p2p_metrics * pr review fixes * trigger ci * deploy 9-20 * Fix msg validation rsa verification counter metric * allocate p2p_metrics * inspect scores more frequently, but don't log every time * renamed duties created/finalized metrics and refactored to record role * deploy to 5--8 * deploy to 5--8 * rename metric * blank space * approve spec change (just a metric) * refactors * reset scores * metric help msg * revert gitlab * deploy * revert deploy * Add metrics for signature verifications * Update differ.config.yaml with approved changes --------- Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: moshe-blox <moshe@blox.io> * Node health (#1203) * chore: link to SSV API docs in configs & README * initial commit * add node health route to ssv API * update health route * update health route * update health route * update health route * deploy to stage * add plaintext response * lint * lint * change to good/bad * lint * lint * refactor * Revert "deploy to stage" This reverts commit 2f54f4e. * lint * add inbound/outbound count for health + deploy to stage * change ports back * update count * lint * update conns * lint * remove connected peer count * test blocked ports * Revert "remove connected peer count" This reverts commit 79e2b94. * leave only active peers count * Revert "test blocked ports" This reverts commit 6fc9282. * ci to stage * add mutex to nodes access * refactor: node health API (#1222) * refactor: node health API * added cpu_cores to healthcheck output * fix inbound/outbound stats * Remove CPU core reporting --------- Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> * feat: subscribe to a random subnet with 0 validators (#1245) * feat: subscribe to a random subnet with 0 validators * Fix: health route host addres (#1246) * node health api route advertises the host addresses from the config. * set up listenaddress directly * feat: rate limit inbound connections by IP (#1226) * feat: rate limit inbound connections by IP * activate conngater * deploy to 5--8 * fix * fix * Refactor connection gating in p2p setup * Update ipLimiter parameters * revert gitlab * Revert "revert gitlab" This reverts commit fcc7902. * Revert "Revert "revert gitlab"" This reverts commit feb9e4e. --------- Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> --------- Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> Co-authored-by: Lior Rutenberg <liorr@blox.io> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com> Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: Anton Korpusenko <antokorp@gmail.com> Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>
* fix: stale operator ID in `p2pNetwork` (#1229) (#1230) * fix: stale operator ID in `p2pNetwork` --------- Co-authored-by: Lior Rutenberg <liorr@blox.io> * v1.2.2 (#1248) * fix: stale operator ID in `p2pNetwork` (#1229) * fix: stale operator ID in `p2pNetwork` --------- Co-authored-by: Lior Rutenberg <liorr@blox.io> * fix: sufficient timeout for initial duty fetch (#1214) * fix: sufficient timeout for initial duty fetch (#1214) --------- Co-authored-by: Matus Kysel <matus@blox.io> * deployment: change bootnode for holesky-stage (#1233) * deploy new bootnode ENR for holesky stage * Voluntary exit (#1200) * Voluntary exit --------- Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> * feat: include 'connected' in peer scores log (#1224) * feat: include 'connected' in peer scores log * P2P metrics (#1163) * metrics: added p4 score (invalidMessageDeliveries) * metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification * deploy to the stage # Conflicts: # .gitlab-ci.yml * metrics: added counters flush once in 8h * metrics: added counters for duties * Added auto delete disconnected peers label metrics. Added RSA verifications metric * disabled ci * Updated p4 score metric update * metrics: added p4 score (invalidMessageDeliveries) * metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification * deploy to the stage # Conflicts: # .gitlab-ci.yml * metrics: added counters flush once in 8h * metrics: added counters for duties * Added auto delete disconnected peers label metrics. Added RSA verifications metric * disabled ci * Updated p4 score metric update * review comments fixes * deploy * deploy * testing metrics differ * added duties created and finalized * removed extra monitoring.metricsreporter usage * removed topic from p4score. made it a sum of squares * deploy to all nodes * deploy p2p_metrics * pr review fixes * trigger ci * deploy 9-20 * Fix msg validation rsa verification counter metric * allocate p2p_metrics * inspect scores more frequently, but don't log every time * renamed duties created/finalized metrics and refactored to record role * deploy to 5--8 * deploy to 5--8 * rename metric * blank space * approve spec change (just a metric) * refactors * reset scores * metric help msg * revert gitlab * deploy * revert deploy * Add metrics for signature verifications * Update differ.config.yaml with approved changes --------- Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: moshe-blox <moshe@blox.io> * Node health (#1203) * chore: link to SSV API docs in configs & README * initial commit * add node health route to ssv API * update health route * update health route * update health route * update health route * deploy to stage * add plaintext response * lint * lint * change to good/bad * lint * lint * refactor * Revert "deploy to stage" This reverts commit 2f54f4e. * lint * add inbound/outbound count for health + deploy to stage * change ports back * update count * lint * update conns * lint * remove connected peer count * test blocked ports * Revert "remove connected peer count" This reverts commit 79e2b94. * leave only active peers count * Revert "test blocked ports" This reverts commit 6fc9282. * ci to stage * add mutex to nodes access * refactor: node health API (#1222) * refactor: node health API * added cpu_cores to healthcheck output * fix inbound/outbound stats * Remove CPU core reporting --------- Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> * feat: subscribe to a random subnet with 0 validators (#1245) * feat: subscribe to a random subnet with 0 validators * Fix: health route host addres (#1246) * node health api route advertises the host addresses from the config. * set up listenaddress directly * feat: rate limit inbound connections by IP (#1226) * feat: rate limit inbound connections by IP * activate conngater * deploy to 5--8 * fix * fix * Refactor connection gating in p2p setup * Update ipLimiter parameters * revert gitlab * Revert "revert gitlab" This reverts commit fcc7902. * Revert "Revert "revert gitlab"" This reverts commit feb9e4e. --------- Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> --------- Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> Co-authored-by: Lior Rutenberg <liorr@blox.io> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com> Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: Anton Korpusenko <antokorp@gmail.com> Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com> * Revert "v1.2.2 (#1248)" (#1263) This reverts commit ff2f594. * chore: fix some function names Signed-off-by: linghuying <1599935829@qq.com> --------- Signed-off-by: linghuying <1599935829@qq.com> Co-authored-by: Lior Rutenberg <liorr@blox.io> Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> Co-authored-by: rehs0y <lyosher@gmail.com> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com> Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: Anton Korpusenko <antokorp@gmail.com> Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>
Health check at GET
/v1/node/health
Returns:
Example:
curl -i -H "Accept: application/json" http://5.161.192.132:16052/v1/node/health
curl -i -H "Accept: text/plain" http://5.161.192.132:16052/v1/node/health