Bug 1779990: fix(packageserver): cache packagemanifests #1150
Conversation
Playing around with this on minikube w/ community-operators (87 packages total) reveals some pretty promising results comparing before and after: ~1000x faster 😅 The OperatorHub UI loads almost instantly too.
Force-pushed f4c1362 to fbc3413.
/retest
1 similar comment
/retest
/retry
```diff
@@ -165,9 +200,107 @@ func (p *RegistryProvider) syncCatalogSource(obj interface{}) (syncError error)
 	p.setClient(newRegistryClient(source, conn), key)
 	logger.Info("new grpc connection added")

+	syncError = p.refreshCache(context.Background(), key)
```
This is my only real design comment - we don't need to rebuild the cache on every sync. We should only need to do it when the catalogsource changes, which should be indicated by the lastupdatetime, or when the grpc connection dies, etc.
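The conditional rebuild this comment suggests could be sketched like the following (a stdlib-only illustration; the `cache`/`needsRefresh` names are hypothetical, and `ResourceVersion` stands in for the `lastUpdateTime` check mentioned above, not OLM's actual API):

```go
package main

import "fmt"

// catalogSource is a hypothetical, trimmed-down stand-in for the real
// CatalogSource object; ResourceVersion changes whenever content does.
type catalogSource struct {
	Name            string
	ResourceVersion string
}

type cache struct {
	// lastSynced records the ResourceVersion seen at the last cache rebuild.
	lastSynced map[string]string
}

// needsRefresh reports whether the cache must be rebuilt for this source:
// only when the source has actually changed since the previous sync,
// rather than on every periodic resync.
func (c *cache) needsRefresh(src catalogSource) bool {
	last, ok := c.lastSynced[src.Name]
	if !ok || last != src.ResourceVersion {
		c.lastSynced[src.Name] = src.ResourceVersion
		return true
	}
	return false
}

func main() {
	c := &cache{lastSynced: map[string]string{}}
	fmt.Println(c.needsRefresh(catalogSource{Name: "community", ResourceVersion: "1"})) // first sync: rebuild
	fmt.Println(c.needsRefresh(catalogSource{Name: "community", ResourceVersion: "1"})) // unchanged: skip
	fmt.Println(c.needsRefresh(catalogSource{Name: "community", ResourceVersion: "2"})) // updated: rebuild
}
```

The same guard would also fire when the gRPC connection is re-established, per the comment above.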
```go
const (
	defaultWakeupInterval = 5 * time.Minute
)

// config flags defined globally so that they appear on the test binary as well
var (
	ctx     = signals.Context()
	options = server.NewPackageServerOptions(os.Stdout, os.Stderr)
	cmd     = &cobra.Command{
		Short: "Launch a package-server",
		Long:  "Launch a package-server",
		RunE: func(c *cobra.Command, args []string) error {
			if err := options.Run(ctx); err != nil {
				return err
			}
			return nil
		},
	}
)

func init() {
	flags := cmd.Flags()

	flags.DurationVar(&options.WakeupInterval, "interval", options.WakeupInterval, "Interval at which to re-sync CatalogSources")
	flags.StringVar(&options.GlobalNamespace, "global-namespace", options.GlobalNamespace, "Name of the namespace where the global CatalogSources are located")
	flags.StringSliceVar(&options.WatchedNamespaces, "watched-namespaces", options.WatchedNamespaces, "List of namespaces the package-server will watch for CatalogSources")
	flags.StringVar(&options.Kubeconfig, "kubeconfig", options.Kubeconfig, "The path to the kubeconfig used to connect to the Kubernetes API server and the Kubelets (defaults to in-cluster config)")
	flags.BoolVar(&options.Debug, "debug", options.Debug, "use debug log level")

	options.SecureServing.AddFlags(flags)
	options.Authentication.AddFlags(flags)
	options.Authorization.AddFlags(flags)
	options.Features.AddFlags(flags)

	flags.AddGoFlagSet(flag.CommandLine)
	flags.Parse(flag.Args())
}
```
Decided to clean all this up while I'm here.
```diff
@@ -43,6 +43,7 @@ require (
 	k8s.io/client-go v8.0.0+incompatible
 	k8s.io/code-generator v0.0.0
 	k8s.io/component-base v0.0.0
+	k8s.io/klog v0.4.0
```
OLM codepaths using `glog` were brought into the packageserver dependency tree transitively. This was causing flag registration conflicts with `klog`, so I swapped all references from `glog` to `klog`.
```go
	globalNamespace string
	clients         map[sourceKey]registryClient
	sources         *registrygrpc.SourceStore
```
I replaced the local gRPC connection handling with @ecordell's `SourceStore`. It has a cleaner interface, and now with its use here, consolidates gRPC connection management logic across OLM.
```go
	}

	for _, namespace := range watchedNamespaces {
```
I got rid of multi-namespace configuration, since other OLM controllers already assume cluster-wide privileges. I also did this to avoid needing to register/access different `CatalogSource` listers for each namespace.
```go
	return
}

func (p *RegistryProvider) refreshCache(ctx context.Context, client *registryClient) error {
```
Decided to pass this a `*registryClient` instead of a key to make testing easier, since we have pre-existing tests that use a fake of this type to directly inject registry responses.
```diff
@@ -55,7 +56,7 @@ func server() {
 		logrus.Fatal(err)
 	}

-	loader := sqlite.NewSQLLoaderForDirectory(load, "manifests")
+	loader := sqlite.NewSQLLoaderForDirectory(load, filepath.Join("testdata", "manifests"))
```
We've been trying to use the `testdata` convention throughout operator-framework, so I updated it while I was here. I can remove it from the PR, but I don't think it's too risky to leave it.
```diff
 }

-func toPackageManifest(logger *logrus.Entry, pkg *api.Package, client registryClient) (*operators.PackageManifest, error) {
+func newPackageManifest(ctx context.Context, logger *logrus.Entry, pkg *api.Package, client *registryClient) (*operators.PackageManifest, error) {
```
Plumbed down the context so that cache building can enforce a timeout on packagemanifest generation.
```diff
@@ -39,9 +39,6 @@ var (
 )

-func TestMain(m *testing.M) {
-	if err := flag.Set("logtostderr", "true"); err != nil {
```
In `klog` this is `true` by default.
@njhale: This pull request references Bugzilla bug 1779990, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.3
@njhale: once the present PR merges, I will cherry-pick it on top of release-4.3 in a new PR and assign it to you. In response to this:
/retest
Flakes look related to OAuth.
/retest
/retest
One of the flakes this time was this marketplace one: https://bugzilla.redhat.com/show_bug.cgi?id=1780143.
This is some quality code. There are a lot of nice improvements here.
```go
	}
case connectivity.TransientFailure, connectivity.Shutdown:
```
I wonder if having transient failure here actually adds additional work. Did you find that gc-ing a source and then re-adding was quicker than letting a source transition back to connecting and (hopefully) ready?
We discussed offline that having something in the cache in this state is probably not the best, so we'll just leave it as is.
```go
		break
	}

	g.Go(func() (pkgErr error) {
```
Nice :)
/retest
```go
	pkg, err := client.GetPackage(ctx, &api.GetPackageRequest{Name: pkgName.GetName()})
	if err != nil {
		logger.WithField("err", err.Error()).Warnf("eliding package: error getting package")
		return
```
In the anonymous function you defined a named error return `pkgErr`, but then don't set `pkgErr` equal to the returned error before returning - does this function return the intended error or nil? Might be simpler to just return the error?
Ah, I found I didn't want to return an error anywhere, since this should be best effort, and didn't want to write `return nil` 4 times. I like the way errgroups manage goroutines, but it may not be the best abstraction to use here. Any suggestions? Using a waitgroup leaves something to be desired 🙁
Oh, I see what you mean... You're saying that based off the `*Group.Go()` signature you need the anonymous function to return an error (https://godoc.org/golang.org/x/sync/errgroup#Group.Go), but you don't actually want `*Group.Wait()` to return a non-nil error if one does come up. Since the derived Context is canceled the first time a function passed to Go returns a non-nil error, and this is best effort, we don't want that.
Maybe just have `_ = g.Wait()` instead, and comment that the error is being dropped intentionally because this is best effort, to make it clearer for a reader? I see what you mean about the abstraction though.
@exdx So I took a look back at errgroup and found that some assumptions I made around how it handles context cancellation were wrong. I'll switch this to using `WaitGroup` instead, and then we won't be returning an error from the anonymous function anyway. Should have this change up soon.
/retest
/retest
I did see one comment that might be worth addressing in a follow-up. But let's try and get this in today.
/retest
/retest
1 similar comment
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Bowenislandsong, jpeeler, njhale. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
```go
	}
	if sourceMeta := p.sources.GetMeta(key); sourceMeta != nil && sourceMeta.Address == address {
		// If the address hasn't changed, don't bother creating a new source
		logger.Debug("catalog address unchanged, skipping source creation")
```
Can’t the pod be updated with new content without the address changing?
I'm assuming that, at minimum, the gRPC connection changes state when catalog content is updated. That's probably a bad assumption in general, but I believe it's the behavior of all current registry implementations.
```go
	var err error
	switch state.State {
	case connectivity.Ready:
```
It’s unlikely, but there are cases where you might not receive an event for this (for example, if the connection is already in the ready state before its events are watched, or if the states change too rapidly). In practice this doesn’t seem to be an issue for grpc connections, but I wanted to mention it.
(This is due to relying on the grpc wait-for-state-change, which requires you to specify the state you’re waiting to change away from.)
It’s not an issue in catalog-operator because the state change is used as an optimization that requeues a catalogsource for processing: if an event is missed like that, a periodic resync will fix it.
Fair -- We can make caching more aggressive in the sync function to compensate.
/retest
@njhale: All pull requests linked via external trackers have merged. Bugzilla bug 1779990 has been moved to the MODIFIED state. In response to this:
@njhale: new pull request created: #1181 In response to this:
/cherry-pick release-4.2 |
@njhale: #1150 failed to apply on top of branch "release-4.2":
In response to this:
Description of the change:
Generate and maintain a running cache of `PackageManifests`. Serve get/list requests directly from this cache.

Motivation for the change:
Packageserver queries catalog gRPC APIs and constructs `PackageManifest` resources on the fly, per request. When dealing with large catalogs containing icon images, this can cause requests to be very slow. Poor response times from packageserver can impact the OperatorHub UI's load time and cause its tests to time out.

Reviewer Checklist
/docs