NETOBSERV-754: Optimize cache usage #476
Conversation
@jotak: This pull request references NETOBSERV-754 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Codecov Report

```
@@            Coverage Diff             @@
##             main     #476      +/-   ##
==========================================
+ Coverage   54.90%   55.61%   +0.71%
==========================================
  Files          49       52       +3
  Lines        6453     6660     +207
==========================================
+ Hits         3543     3704     +161
- Misses       2666     2701      +35
- Partials      244      255      +11
==========================================
```

Flags with carried forward coverage won't be shown.
```go
Owns(&appsv1.Deployment{}).
Owns(&appsv1.DaemonSet{}).
Owns(&ascv2.HorizontalPodAutoscaler{}).
```
My understanding of removing this is that we will not get notification for reconciliation if the HPA is manually modified.
https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/builder#Builder.Owns
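For context, a minimal sketch of how `Owns()` wires these triggers in a controller-runtime builder; the reconciler and primary resource type names here are illustrative, not the exact NetObserv code:

```go
// Illustrative builder setup: Owns() registers watches on objects the
// operator creates, so changes to them (including manual edits) enqueue
// a reconcile request for the owning primary resource.
err := ctrl.NewControllerManagedBy(mgr).
	For(&flowslatest.FlowCollector{}). // primary resource (assumed type name)
	Owns(&appsv1.Deployment{}).
	Owns(&appsv1.DaemonSet{}).
	// Dropping the line below would mean a manually modified HPA no
	// longer triggers reconciliation, so drift could go uncorrected:
	Owns(&ascv2.HorizontalPodAutoscaler{}).
	Complete(r)
```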
Yeah, tbh I'm not sure why I removed the HPA; my point is really about configmaps. I'll put the HPA back.
It's a trade-off: we might be losing some reconcile events, but the gain outweighs the disadvantages.
main.go (outdated)

```diff
@@ -132,6 +133,11 @@ func main() {
 		HealthProbeBindAddress: probeAddr,
 		LeaderElection:         enableLeaderElection,
 		LeaderElectionID:       "7a7ecdcd.netobserv.io",
+		Client: client.Options{
+			Cache: &client.CacheOptions{
+				DisableFor: []client.Object{&corev1.ConfigMap{}, &corev1.Secret{}},
```
This will reduce cache memory but will greatly increase the number of queries to the kube API.
Are we sure about this? It makes many GET requests to the kube API on each reconciliation loop.
Yes, but it will DRASTICALLY reduce cache memory, especially on clusters with many cm/secrets, because the default behavior caches the whole cluster's configmaps/secrets. I think the additional GET requests are peanuts compared to that (cf. Slack thread https://app.slack.com/client/T027F3GAJ/C02939DP5L5/thread/C02939DP5L5-1698237974.684409 )
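As a sketch of the trade-off being discussed, this is roughly what `DisableFor` does at manager setup (assuming recent controller-runtime option names, as in the diff above):

```go
// With DisableFor, client reads for these types bypass the informer
// cache and go straight to the API server; in exchange, no cluster-wide
// ConfigMap/Secret informers (and their memory) are ever started.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Client: client.Options{
		Cache: &client.CacheOptions{
			DisableFor: []client.Object{&corev1.ConfigMap{}, &corev1.Secret{}},
		},
	},
})
```

The cost is exactly what is raised here: every `Get` on a ConfigMap/Secret during reconciliation becomes a live API call instead of a cache lookup.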
For info, look at what others are doing: https://github.com/search?q=org%3Aopenshift%20DisableFor%3A&type=code ;-)
New images (they will expire after two weeks). To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:2fd7345 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-2fd7345
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-2fd7345
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
Force-pushed from 86631fb to 1a4ca67
@OlivierCazade I'll investigate creating a custom cache instead of removing it entirely for CM/Secrets
FTR here's my WIP attempt to create a custom cache, but I prefer to stop here since it's still not as good as disabling the cache: jotak@26c3fbc
LGTM, thanks for the investigations Joel
Thanks @OlivierCazade
/ok-to-test
New images (they will expire after two weeks). To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:e90e8d4 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-e90e8d4
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-e90e8d4
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
I ran a node-density-heavy job with this PR image, comparing it with our current baseline tracked by UUID 46dbc801-3b5d-471e-8172-6a71c0eedccb. In that baseline run [1], we see the following values for the memory of the controller pod:

In the new run [2], we see the following values:

While the values do not look much improved, the second run processed many more flows (almost double) than the original run [3].

@jotak let me know your thoughts around this data. Another idea I had was running our cluster-density test case, which typically requires raising the controller memory limit to 800MB, and seeing if we can run it using your PR image with the default limit of 400MB.

[1] https://docs.google.com/spreadsheets/d/11_ppiK42Y1KIssmmYLt9L678EtZEiQ0Qn3EdIOFD4SA/edit?usp=sharing
So it seems there are side effects that I didn't see at first: I did my performance tests with monolithic Loki, but it doesn't seem to work correctly when using the Loki operator instead. @nathan-weinberg you might be blocked by that.
Force-pushed from 793477f to 45811ce
New images (they will expire after two weeks). To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:2c8f84d make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-2c8f84d
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-2c8f84d
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
Force-pushed from 45811ce to 6243d54
New images (they will expire after two weeks). To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:9adc98c make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-9adc98c
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-9adc98c
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
@OlivierCazade @nathan-weinberg FYI I've changed a lot of things in my last commit. Disabling the cache like I did before was breaking our certificate watching, so I had to implement a whole new thing which should be optimal: still using a cache, but no more than what is needed.
@jotak can you rebase so we can run the regressions?
- Uses lower-level watches rather than informers, to watch more selectively by single resource
- Used exclusively for GET/WATCH on ConfigMaps/Secrets
- Other kinds of queries still use the usual client
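A rough illustration of the "lower-level watch" idea with client-go: watch a single named object via a field selector instead of running an informer over the whole kind. The namespace and ConfigMap name here are made up for the example, and this is a sketch of the technique, not the PR's actual implementation:

```go
// Watch exactly one ConfigMap instead of caching them all. Only events
// for "my-cert-cm" in namespace "netobserv" flow through the channel.
w, err := clientset.CoreV1().ConfigMaps("netobserv").Watch(ctx, metav1.ListOptions{
	FieldSelector: fields.OneTermEqualSelector("metadata.name", "my-cert-cm").String(),
})
if err != nil {
	return err
}
for event := range w.ResultChan() {
	if cm, ok := event.Object.(*corev1.ConfigMap); ok {
		// React to Added/Modified/Deleted for just this object,
		// e.g. trigger a reconcile when the certificate rotates.
		_ = cm
	}
}
```

Memory then scales with the handful of objects actually watched rather than with the cluster's total ConfigMap/Secret count.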
Force-pushed from 6243d54 to 7b7bd5f
New images (they will expire after two weeks). To deploy this build:

```shell
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:deecd40 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-deecd40
```

Or as a Catalog Source:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-deecd40
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
```
done

Ran the regressions against this PR: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/ginkgo-test/213434/console The only failing tests had to do with expecting the downstream metrics source instead of upstream, so those should pass once the changes make it downstream.

/label qe-approved
Nice! Thanks Joel.
LGTM
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jotak. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Description

Cut down memory usage: controller-runtime may cache every ConfigMap / Secret in the cluster to serve a single `client.Get` call; prevent that.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.