Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prometheus metrics for antrea controller #325

Merged
merged 1 commit into from
Apr 29, 2020

Conversation

weiqiangt
Copy link
Contributor

This CL brings metrics of antrea controller events processing, includes: syncing time, processed number, and queue length.

Signed-off-by: Weiqiang TANG weiqiangt@vmware.com

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-all: to trigger all tests.
  • /skip-all: to skip all tests.

These commands can only be run by members of the vmware-tanzu organization.

@weiqiangt
Copy link
Contributor Author

/test-all

@antoninbas
Copy link
Contributor

@weiqiangt what's the relationship to Kobi's PR: #321?

@jianjuns
Copy link
Contributor

@salv-orlando @ksamoray @tnqn : do you think we should use prometheus packages and format in Controller/Agent code or we should define stats with our own or some generic format?

@weiqiangt
Copy link
Contributor Author

@weiqiangt what's the relationship to Kobi's PR: #321?

I think this CL is independent with #321, because it only provides some metrics about the controller. BTW, these metrics were used in the scale test.

@ksamoray
Copy link
Contributor

@weiqiangt what's the relationship to Kobi's PR: #321?

I think this CL is independent with #321, because it only provides some metrics about the controller. BTW, these metrics were used in the scale test.

Were you able to retrieve the metrics without initializing a listener? How do you retrieve these?

@ksamoray
Copy link
Contributor

@salv-orlando @ksamoray @tnqn : do you think we should use prometheus packages and format in Controller/Agent code or we should define stats with our own or some generic format?

Prometheus will provide some metrics for process, Prometheus listener, and go related metrics.
Other than this I think that we should provide our own metrics as we do with the monitor CRD.

Unless I'm answering the wrong question :)

@jianjuns
Copy link
Contributor

Prometheus will provide some metrics for process, Prometheus listener, and go related metrics.
Other than this I think that we should provide our own metrics as we do with the monitor CRD.

@ksamoray I am asking Prometheus packages, structs, and funcs, like promauto.NewCounter, Summary, etc.
Should we use them or define our own.

@ksamoray
Copy link
Contributor

ksamoray commented Jan 20, 2020

@ksamoray I am asking Prometheus packages, structs, and funcs, like promauto.NewCounter, Summary, etc.
Should we use them or define our own.
@jianjuns
Yeah we should use these unless they do not supply the required functionality for a specific statistic.

@jianjuns
Copy link
Contributor

@ksamoray I am asking Prometheus packages, structs, and funcs, like promauto.NewCounter, Summary, etc.
Should we use them or define our own.
@jianjuns
Yeah we should use these unless they do not supply the required functionality for a specific statistic.
So, besides prometheus I believe Antrea will have other consumers (through API and CLI) of the metrics and stats. Do you think the prometheus definition is generic enough for other consumers? @ksamoray

@ksamoray
Copy link
Contributor

So, besides prometheus I believe Antrea will have other consumers (through API and CLI) of the metrics and stats. Do you think the prometheus definition is generic enough for other consumers? @ksamoray
Does "generic enough" implies "has support to enough data types"?
Or "has enough means for data retrieval"?
As Prometheus' listener is an API endpoint , we could either leverage it - or do we need another mean maybe?
Maybe it's worthwhile to discuss this during the Antrea weekly. Or I could update that doc again with additional info which I gathered, and use it as a base for discussion.
I'm more concerned with scale issues and performance impact than with genericity - after all stats are about exposing floats, counters in various ways (tagged lists etc) which is what Prometheus does.
@jianjuns

@ksamoray
Copy link
Contributor

I still have to gather which metrics should be exposed by the agent - in the meantime I exposed the values which are exposed by the custom crd (# of pods, OVS table stats).
@weiqiangt collected quite a few of these from the controller already.

@jianjuns
Copy link
Contributor

jianjuns commented Jan 23, 2020

@ksamoray: I just wonder if Prometheus structs and funcs are generic enough and widely used for metrics that can be consumed by other consumers through other interfaces, besides Prometheus?
If we want to expose stats/metrics through CLI and API, we need to define Go structs for them (I guess Prometheus definition can not be used directly).

As Prometheus' listener is an API endpoint , we could either leverage it - or do we need another mean maybe?

Sorry, I did not know much about Prometheus. But the listener is in where? Controller/Agent, another process, or Promeheus?

Maybe it's worthwhile to discuss this during the Antrea weekly. Or I could update that doc again with additional info which I gathered, and use it as a base for discussion.

It will be great if you can update the doc with more info.

This CL brings metrics of antrea controller events
processing, includes: syncing time, processed number,
and queue length.

Signed-off-by: Weiqiang TANG <weiqiangt@vmware.com>
@weiqiangt
Copy link
Contributor Author

/test-all

1 similar comment
@weiqiangt
Copy link
Contributor Author

/test-all

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just want to confirm the program won't panic if metrics is disabled

@@ -838,6 +839,7 @@ func (n *NetworkPolicyController) deleteNamespace(old interface{}) {
func (n *NetworkPolicyController) enqueueAppliedToGroup(key string) {
klog.V(4).Infof("Adding new key %s to AppliedToGroup queue", key)
n.appliedToGroupQueue.Add(key)
metrics.LengthAppliedToGroupQueue.Set(float64(n.appliedToGroupQueue.Len()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it panic if metrics is disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this PR can pass kind e2e test.

@weiqiangt
Copy link
Contributor Author

/test-e2e

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@weiqiangt
Copy link
Contributor Author

/test-all

@weiqiangt weiqiangt merged commit e0d0543 into antrea-io:master Apr 29, 2020
@weiqiangt weiqiangt deleted the controller-prometheus branch April 29, 2020 08:56
Dyanngg pushed a commit to Dyanngg/antrea that referenced this pull request Apr 29, 2020
Add metrics for antrea controller events processing, 
includes syncing time, processed number, and queue length.

Signed-off-by: Weiqiang TANG <weiqiangt@vmware.com>
McCodeman pushed a commit to McCodeman/antrea that referenced this pull request Jun 2, 2020
Add metrics for antrea controller events processing, 
includes syncing time, processed number, and queue length.

Signed-off-by: Weiqiang TANG <weiqiangt@vmware.com>
McCodeman pushed a commit that referenced this pull request Jun 2, 2020
Add metrics for antrea controller events processing, 
includes syncing time, processed number, and queue length.

Signed-off-by: Weiqiang TANG <weiqiangt@vmware.com>
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Sep 22, 2020
Add metrics for antrea controller events processing,
includes syncing time, processed number, and queue length.

Signed-off-by: Weiqiang TANG <weiqiangt@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants