Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BS: Add metrics for the keepalive messages #3151

Merged
merged 1 commit into from
Sep 18, 2019

Conversation

karampok
Copy link
Contributor

@karampok karampok commented Sep 16, 2019

This PR works the first bullet on #3104

# HELP bs_keepalive_receive_msgs_total Total number of received keepalive msgs.
# TYPE bs_keepalive_receive_msgs_total counter
bs_keepalive_receive_msgs_total{ifid="1",result="success"} 118
bs_keepalive_receive_msgs_total{ifid="2",result="success"} 118
# HELP bs_keepalive_transmit_msgs_total Total number of transmitted keepalive msgs.
# TYPE bs_keepalive_transmit_msgs_total counter
bs_keepalive_transmit_msgs_total{ifid="1",result="process_err"} 118
bs_keepalive_transmit_msgs_total{ifid="2",result="process_err"} 118

This change is Reviewable

@karampok karampok force-pushed the pub-keepalive-metrics branch from d32f7b0 to 8c3e535 Compare September 16, 2019 12:06
@oncilla oncilla changed the title Add metrics for the keepalive msgs in the beacon_srv BS: Add metrics for the keepalive messages Sep 16, 2019
Copy link
Contributor

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r1.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @karampok and @lukedirtwalker)


go/beacon_srv/internal/keepalive/handler.go, line 94 at r1 (raw file):

	keepalive, ok := h.request.Message.(*ifid.IFID)
	if !ok {
		increaseReceiveErrors("unknown")

Use labels as described below in metrics.go

make this a constant.
also unknown_ifid, otherwise it is ambiguous
use ifid = 0, since it is a placeholder ifid


go/beacon_srv/internal/keepalive/handler.go, line 102 at r1 (raw file):

	ifid, info, err := h.getIntfInfo()
	if err != nil {
		increaseReceiveErrors("unknown")

same, probably different err message. e.g. err_processing


go/beacon_srv/internal/keepalive/handler.go, line 109 at r1 (raw file):

		h.startPush(ifid)
		if err := h.dropRevs(ifid, keepalive.OrigIfID, info.TopoInfo().ISD_AS); err != nil {
			increaseReceiveErrors(ifid.String())

err_db


go/beacon_srv/internal/keepalive/metrics.go, line 25 at r1 (raw file):

)

//bs_keepalive_receive_errs_total

I assume these lines are here for easy grepping ?

In that case, please add a comment on top describing the purpose.

  • space between // and bs

go/beacon_srv/internal/keepalive/metrics.go, line 31 at r1 (raw file):

const (
	promNamespace = "bs_keepalive"

Namespace should be bs and the subsystem keepalive. see #3104 (comment)


go/beacon_srv/internal/keepalive/metrics.go, line 37 at r1 (raw file):

	initOnce sync.Once

	outMsg *prometheus.CounterVec

instead of having outMsg and outErr, have one counter vector with labels that distinguish the results.
This allows for easy sums in promql
e.g. labels: ifid: {1,2,...} , result`= { "success", "err_payload", "err_send", "err_db"}

bs_keepalive_sent_total(ifid, result) count
bs_keepalive_received_total(ifid, result) count


go/beacon_srv/internal/keepalive/metrics.go, line 44 at r1 (raw file):

// InitMetrics initializes the metrics
func InitMetrics() {

the way this is done now, you could also simply use a init function.
there is no way of using the package without calling InitMetrics first anyway.

Copy link
Collaborator

@lukedirtwalker lukedirtwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @karampok and @oncilla)


go/beacon_srv/internal/keepalive/metrics.go, line 25 at r1 (raw file):

Previously, Oncilla wrote…

I assume these lines are here for easy grepping ?

In that case, please add a comment on top describing the purpose.

  • space between // and bs

I would maybe add them to the variable declaration. So it's a bit closer and would maybe be seen when doing refactorings.

@karampok
Copy link
Contributor Author


go/beacon_srv/internal/keepalive/handler.go, line 102 at r1 (raw file):

Previously, Oncilla wrote…

same, probably different err message. e.g. err_processing

I have on purpose avoid having label on what error it is.
The idea is that we have an alert on a metric error that will show the errors on bs on server X are increasing. In that moment ops should go to logs to see details. What do you think?

@karampok
Copy link
Contributor Author


go/beacon_srv/internal/keepalive/metrics.go, line 37 at r1 (raw file):

Previously, Oncilla wrote…

instead of having outMsg and outErr, have one counter vector with labels that distinguish the results.
This allows for easy sums in promql
e.g. labels: ifid: {1,2,...} , result`= { "success", "err_payload", "err_send", "err_db"}

bs_keepalive_sent_total(ifid, result) count
bs_keepalive_received_total(ifid, result) count

Having labels with high cardinality will blow up eventually the prometheus server.
https://prometheus.io/docs/practices/naming/#labels (the yellow box).

And i followed the approach of node_exporter which is
node_network_{transmit,receive}_{bytes,error} and all metrics from component X should have the same labels (e.g. env=prod, os=linux)

@karampok
Copy link
Contributor Author


go/beacon_srv/internal/keepalive/metrics.go, line 25 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

I would maybe add them to the variable declaration. So it's a bit closer and would maybe be seen when doing refactorings.

I think this would add more confusion, so I remove them. In all case you can grep by transmit_msgs_total

@karampok
Copy link
Contributor Author


go/beacon_srv/internal/keepalive/metrics.go, line 44 at r1 (raw file):

Previously, Oncilla wrote…

the way this is done now, you could also simply use a init function.
there is no way of using the package without calling InitMetrics first anyway.

I like the init function. Do you think we still need the initOnce.Do?
Or you prefer to do exactly as your beaconing/metric implementation with a struct?

@karampok karampok force-pushed the pub-keepalive-metrics branch from 8c3e535 to 02be0ae Compare September 17, 2019 08:32
Copy link
Contributor Author

@karampok karampok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 7 unresolved discussions (waiting on @oncilla)


go/beacon_srv/internal/keepalive/handler.go, line 94 at r1 (raw file):

Previously, Oncilla wrote…

Use labels as described below in metrics.go

make this a constant.
also unknown_ifid, otherwise it is ambiguous
use ifid = 0, since it is a placeholder ifid

Done.


go/beacon_srv/internal/keepalive/handler.go, line 102 at r1 (raw file):

Previously, karampok (Konstantinos) wrote…

I have on purpose avoid having label on what error it is.
The idea is that we have an alert on a metric error that will show the errors on bs on server X are increasing. In that moment ops should go to logs to see details. What do you think?

done


go/beacon_srv/internal/keepalive/handler.go, line 109 at r1 (raw file):

Previously, Oncilla wrote…

err_db

need still to do that


go/beacon_srv/internal/keepalive/metrics.go, line 31 at r1 (raw file):

Previously, Oncilla wrote…

Namespace should be bs and the subsystem keepalive. see #3104 (comment)

Done.

@karampok karampok force-pushed the pub-keepalive-metrics branch 4 times, most recently from 85a237f to fbb7782 Compare September 18, 2019 09:17
Copy link
Contributor

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r2.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @karampok and @oncilla)


go/beacon_srv/internal/keepalive/handler.go, line 102 at r1 (raw file):

Previously, karampok (Konstantinos) wrote…

done

ops can just do err_* to get all error counts.
I think it is nice to a some coarse grained distinction in the errors.

Not in this particular case, but let's assume the following:
You have a handler that needs to respond with a message
and it can either fail in the database access, sending the response or some internal error

if they are 3 different errors, you can do better alerting. E.g. the severity of internal error is probably higher than db error. And the db error is probably higher than the response error.


go/beacon_srv/internal/keepalive/handler_test.go, line 166 at r2 (raw file):

}

func waitTimeout(wg *sync.WaitGroup, t *testing.T) {

testing usually should be the first argument.


go/beacon_srv/internal/keepalive/sender.go, line 57 at r2 (raw file):

	}
	var sentIfids []common.IFIDType
	for ifid, intf := range topo.IFInfoMap {

do l := metrics.KeepaliveLabel{IfID: ifid, Result: metrics.ErrProcess} here


go/beacon_srv/internal/keepalive/sender.go, line 75 at r2 (raw file):

		if err := s.Send(msg, ov); err != nil {
			logger.Error("[keepalive.Sender] Unable to send packet", "err", err)
			l := metrics.KeepaliveLabels{Result: metrics.ProcessErr}

drop l := ...


go/beacon_srv/internal/keepalive/sender.go, line 82 at r2 (raw file):

		sentIfids = append(sentIfids, ifid)

		l := metrics.KeepaliveLabels{IfID: ifid, Result: metrics.ProcessErr}

l.Result = metrics.Success


go/beacon_srv/internal/metrics/keepalive.go, line 35 at r2 (raw file):

}

func (l *KeepaliveLabels) values() []string {

Values


go/beacon_srv/internal/metrics/keepalive.go, line 56 at r2 (raw file):

}

// Transmits returns transmit counter

period at the end.


go/beacon_srv/internal/metrics/keepalive.go, line 61 at r2 (raw file):

}

// Receives returns the receive counter

period.


go/beacon_srv/internal/metrics/metrics.go, line 21 at r2 (raw file):

	// Success indicates a successful result.
	Success string = "success"

should be = prom.Success

(you need to rebase on master first)


go/beacon_srv/internal/metrics/metrics.go, line 23 at r2 (raw file):

	Success string = "success"
	// ProcessErr indicates an error during processing.
	ProcessErr string = "process_err"

Should be ErrProcess string = "process_err"

@karampok
Copy link
Contributor Author


go/beacon_srv/internal/metrics/keepalive.go, line 35 at r2 (raw file):

Previously, Oncilla wrote…

Values

Why do we need to export that?

@karampok karampok force-pushed the pub-keepalive-metrics branch 3 times, most recently from 42c7fc3 to 87e17ad Compare September 18, 2019 12:03
Copy link
Contributor Author

@karampok karampok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 8 files reviewed, 11 unresolved discussions (waiting on @oncilla)


go/beacon_srv/internal/keepalive/handler.go, line 102 at r1 (raw file):

Previously, Oncilla wrote…

ops can just do err_* to get all error counts.
I think it is nice to a some coarse grained distinction in the errors.

Not in this particular case, but let's assume the following:
You have a handler that needs to respond with a message
and it can either fail in the database access, sending the response or some internal error

if they are 3 different errors, you can do better alerting. E.g. the severity of internal error is probably higher than db error. And the db error is probably higher than the response error.

there is no wrong or right thing, but let's try not to pre-optimise.


go/beacon_srv/internal/keepalive/handler.go, line 109 at r1 (raw file):

Previously, karampok (Konstantinos) wrote…

need still to do that

As far as I see the only the dropRevs, I would let the generic error message.


go/beacon_srv/internal/keepalive/handler_test.go, line 166 at r2 (raw file):

Previously, Oncilla wrote…

testing usually should be the first argument.

Done.


go/beacon_srv/internal/keepalive/sender.go, line 57 at r2 (raw file):

Previously, Oncilla wrote…

do l := metrics.KeepaliveLabel{IfID: ifid, Result: metrics.ErrProcess} here

Done.


go/beacon_srv/internal/keepalive/sender.go, line 75 at r2 (raw file):

Previously, Oncilla wrote…

drop l := ...

Done.


go/beacon_srv/internal/keepalive/sender.go, line 82 at r2 (raw file):

Previously, Oncilla wrote…

l.Result = metrics.Success

Done.


go/beacon_srv/internal/metrics/keepalive.go, line 56 at r2 (raw file):

Previously, Oncilla wrote…

period at the end.

Done.


go/beacon_srv/internal/metrics/keepalive.go, line 61 at r2 (raw file):

Previously, Oncilla wrote…

period.

Done.


go/beacon_srv/internal/metrics/metrics.go, line 21 at r2 (raw file):

Previously, Oncilla wrote…

should be = prom.Success

(you need to rebase on master first)

Done.


go/beacon_srv/internal/metrics/metrics.go, line 23 at r2 (raw file):

Previously, Oncilla wrote…

Should be ErrProcess string = "process_err"

This should be also from prom or we prefer to keep that value?

Copy link
Contributor

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 7 files at r3.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @karampok and @oncilla)


go/beacon_srv/internal/keepalive/sender.go, line 57 at r2 (raw file):

Previously, karampok (Konstantinos) wrote…

Done.

ifid is still missing


go/beacon_srv/internal/metrics/keepalive.go, line 35 at r2 (raw file):

Previously, karampok (Konstantinos) wrote…

Why do we need to export that?

for them to be testable.


go/beacon_srv/internal/metrics/metrics.go, line 23 at r2 (raw file):

Previously, karampok (Konstantinos) wrote…

This should be also from prom or we prefer to keep that value?

hm, yeah it is probably general enough to go to prom.
You decide

* remove goconvey from keepalive tests

Contributes scionproto#3104
@karampok karampok force-pushed the pub-keepalive-metrics branch from 87e17ad to 4400d62 Compare September 18, 2019 14:04
Copy link
Contributor Author

@karampok karampok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 6 of 9 files reviewed, 1 unresolved discussion (waiting on @oncilla)


go/beacon_srv/internal/keepalive/sender.go, line 57 at r2 (raw file):

Previously, Oncilla wrote…

ifid is still missing

I cannot get used to reviewable :)


go/beacon_srv/internal/metrics/keepalive.go, line 35 at r2 (raw file):

Previously, Oncilla wrote…

for them to be testable.

Maybe we could consider something like that
https://medium.com/@robiplus/golang-trick-export-for-test-aa16cbd7b8cd

The risk is that people might start using it and then we want to refactor

Copy link
Contributor

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r4.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@karampok karampok merged commit 1634637 into scionproto:master Sep 18, 2019
@karampok karampok mentioned this pull request Sep 18, 2019
@karampok karampok deleted the pub-keepalive-metrics branch September 20, 2019 10:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants