Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1779990: fix(packageserver): cache packagemanifests #1150

Merged
merged 3 commits into from
Dec 9, 2019

Conversation

njhale
Copy link
Member

@njhale njhale commented Nov 23, 2019

Description of the change:

Generate and maintain a running cache of PackageManifests. Serve get/list requests directly from this cache.

Motivation for the change:

Packageserver queries catalog gRPC APIs and constructs PackageManifest resources, on the fly, per request. When dealing with large catalogs containing icon images, this can cause requests to be very slow. Poor response times from packageserver can impact the OperatorHub UI's load time and cause its tests to timeout.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2019
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 23, 2019
@njhale
Copy link
Member Author

njhale commented Nov 23, 2019

Playing around with this on minikube w/ community-operators (87 packages total) reveals some pretty promising results running kubectl get packagemanifests --all-namespaces

Before:
GET /apis/packages.operators.coreos.com/v1/packagemanifests?limit=500: (3.84524649s) 200 [kubectl/v1.16.3 (darwin/amd64) kubernetes/b3cbbae 192.168.64.13:49110]

After:
GET /apis/packages.operators.coreos.com/v1/packagemanifests?limit=500: (3.679124ms) 200 [kubectl/v1.16.3 (darwin/amd64) kubernetes/b3cbbae 192.168.64.13:44272]

~1000x faster 😅

The OperatorHub UI loads almost instantly too.

@njhale njhale force-pushed the spur-pkgsvr branch 3 times, most recently from f4c1362 to fbc3413 Compare November 25, 2019 14:31
@njhale njhale changed the title WIP: fix(packageserver): cache packagemanifests fix(packageserver): cache packagemanifests Nov 25, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 25, 2019
@njhale
Copy link
Member Author

njhale commented Nov 26, 2019

/retest

1 similar comment
@njhale
Copy link
Member Author

njhale commented Nov 27, 2019

/retest

@exdx
Copy link
Member

exdx commented Nov 27, 2019

/retry

pkg/package-server/provider/registry.go Show resolved Hide resolved
@@ -165,9 +200,107 @@ func (p *RegistryProvider) syncCatalogSource(obj interface{}) (syncError error)
p.setClient(newRegistryClient(source, conn), key)
logger.Info("new grpc connection added")

syncError = p.refreshCache(context.Background(), key)
Copy link
Member

@ecordell ecordell Nov 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my only real design comment - we don't need to rebuild the cache on every sync. We should only need to do it when the catalogsource changes, which should be indicated by the lastupdatetime, or when the grpc connection dies, etc.

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 5, 2019
Comment on lines -16 to -53
const (
defaultWakeupInterval = 5 * time.Minute
)

// config flags defined globally so that they appear on the test binary as well
var (
ctx = signals.Context()
options = server.NewPackageServerOptions(os.Stdout, os.Stderr)
cmd = &cobra.Command{
Short: "Launch a package-server",
Long: "Launch a package-server",
RunE: func(c *cobra.Command, args []string) error {
if err := options.Run(ctx); err != nil {
return err
}
return nil
},
}
)

func init() {
flags := cmd.Flags()

flags.DurationVar(&options.WakeupInterval, "interval", options.WakeupInterval, "Interval at which to re-sync CatalogSources")
flags.StringVar(&options.GlobalNamespace, "global-namespace", options.GlobalNamespace, "Name of the namespace where the global CatalogSources are located")
flags.StringSliceVar(&options.WatchedNamespaces, "watched-namespaces", options.WatchedNamespaces, "List of namespaces the package-server will watch watch for CatalogSources")
flags.StringVar(&options.Kubeconfig, "kubeconfig", options.Kubeconfig, "The path to the kubeconfig used to connect to the Kubernetes API server and the Kubelets (defaults to in-cluster config)")
flags.BoolVar(&options.Debug, "debug", options.Debug, "use debug log level")

options.SecureServing.AddFlags(flags)
options.Authentication.AddFlags(flags)
options.Authorization.AddFlags(flags)
options.Features.AddFlags(flags)

flags.AddGoFlagSet(flag.CommandLine)
flags.Parse(flag.Args())
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to clean all this up while I'm here.

@@ -43,6 +43,7 @@ require (
k8s.io/client-go v8.0.0+incompatible
k8s.io/code-generator v0.0.0
k8s.io/component-base v0.0.0
k8s.io/klog v0.4.0
Copy link
Member Author

@njhale njhale Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OLM codepaths using glog were brought into the packageserver dependency tree transitively. This was causing flag registration conflicts with klog, so I swapped all references from glog to klog.

globalNamespace string
clients map[sourceKey]registryClient
sources *registrygrpc.SourceStore
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the local gRPC connection handling with @ecordell's SourceStore. It has a cleaner interface, and now with its use here, consolidates gRPC connection management logic across OLM.

}

for _, namespace := range watchedNamespaces {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of multi-namespace configuration, since other OLM controllers already assume cluster-wide privileges. I also did this to avoid needing to register/access different CatalogSource listers for each namespace.

return
}

func (p *RegistryProvider) refreshCache(ctx context.Context, client *registryClient) error {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to pass this a *registryClient instead of a key to make testing easier, since we have pre-existing tests that use a fake of this type to directly inject registry responses.

@@ -55,7 +56,7 @@ func server() {
logrus.Fatal(err)
}

loader := sqlite.NewSQLLoaderForDirectory(load, "manifests")
loader := sqlite.NewSQLLoaderForDirectory(load, filepath.Join("testdata", "manifests"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been trying to use the testdata convention throughout operator-framework, so I updated it while I was here. I can remove it from the PR, but I don't think it's too risky to leave it.

}

func toPackageManifest(logger *logrus.Entry, pkg *api.Package, client registryClient) (*operators.PackageManifest, error) {
func newPackageManifest(ctx context.Context, logger *logrus.Entry, pkg *api.Package, client *registryClient) (*operators.PackageManifest, error) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plumbed down the context so that cache building can enforce a timeout on packagemanifest generation.

@@ -39,9 +39,6 @@ var (
)

func TestMain(m *testing.M) {
if err := flag.Set("logtostderr", "true"); err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In klog this is true by default.

@njhale njhale changed the title fix(packageserver): cache packagemanifests Bug 1779990: fix(packageserver): cache packagemanifests Dec 5, 2019
@openshift-ci-robot
Copy link
Collaborator

@njhale: This pull request references Bugzilla bug 1779990, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1779990: fix(packageserver): cache packagemanifests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Dec 5, 2019
@njhale
Copy link
Member Author

njhale commented Dec 5, 2019

/cherry-pick release-4.3

@openshift-cherrypick-robot

@njhale: once the present PR merges, I will cherry-pick it on top of release-4.3 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@njhale
Copy link
Member Author

njhale commented Dec 5, 2019

/retest

Flakes look related to OAuth.

@njhale
Copy link
Member Author

njhale commented Dec 5, 2019

/retest

@jpeeler
Copy link

jpeeler commented Dec 5, 2019

/retest

One of the flakes this time was this marketplace one https://bugzilla.redhat.com/show_bug.cgi?id=1780143.

Copy link

@jpeeler jpeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some quality code. There are a lot of nice improvements here.

}
case connectivity.TransientFailure, connectivity.Shutdown:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if having transient failure here actually adds additional work. Did you find that gc-ing a source and then re-adding was quicker than letting a source transition back to connecting and (hopefully) ready?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline that having something in the cache in this state is probably not the best, so we'll just leave it as is.

break
}

g.Go(func() (pkgErr error) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :)

@njhale
Copy link
Member Author

njhale commented Dec 5, 2019

/retest

pkg, err := client.GetPackage(ctx, &api.GetPackageRequest{Name: pkgName.GetName()})
if err != nil {
logger.WithField("err", err.Error()).Warnf("eliding package: error getting package")
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the anonymous function you defined a named-error return type pkgErr but then don't set pkgErr equal to the returned error before returning - does this function return the intended error or nil? Might be simpler to just return the error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I found I didn't want to return an error anywhere since this should be best effort, and didn't want to write return nil 4 times. l like the way errorgroups manage goroutines, but it may not be the best abstraction to use here. Any suggestions, using a waitgroup leaves something to be desired 🙁

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see what you mean...You're saying based off the *Group.Go() signature you need the anonymous function to to return an error (https://godoc.org/golang.org/x/sync/errgroup#Group.Go) but you don't actually want *Group.Wait() to return a non-nil error if one does come up. Since the derived Context is canceled the first time a function passed to Go returns a non-nil error and this is best effort we don't want that.

Maybe just have _ := g.Wait() instead and comment that the error is being dropped intentionally as this is best effort to make it more clear for a reader? I see what you mean about the abstraction though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@exdx So I took a look back at error group and found that some assumptions I made around how it handles context cancellation were wrong. I'll switch this to using WaitGroup instead and then we won't be returning an error from the anonymous function anyway. Should have this change up soon.

@njhale
Copy link
Member Author

njhale commented Dec 6, 2019

/retest

@jpeeler
Copy link

jpeeler commented Dec 6, 2019

/retest
/lgtm

I did see one comment that might be worth addressing in a follow up. But let's try and get this in today.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 6, 2019
@jpeeler
Copy link

jpeeler commented Dec 6, 2019

/retest

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 6, 2019
@njhale
Copy link
Member Author

njhale commented Dec 6, 2019

/retest

1 similar comment
@njhale
Copy link
Member Author

njhale commented Dec 7, 2019

/retest

@Bowenislandsong
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 9, 2019
@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Bowenislandsong, jpeeler, njhale

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

}
if sourceMeta := p.sources.GetMeta(key); sourceMeta != nil && sourceMeta.Address == address {
// If the address hasn't changed, don't bother creating a new source
logger.Debug("catalog address unchanged, skipping source creation")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can’t the pod be updated with new content without the address changing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that, at minimum, the gRPC connection changes state when catalog content is updated. That's probably a bad assumption in general, but I believe it's the behavior of all current registry implementations.


var err error
switch state.State {
case connectivity.Ready:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s unlikely, but there are cases where you might not receive an event for this (for example, if the connection is already in the ready state before it’s events are watched, or if the states change too rapidly). In practice this doesn’t seem to be an issue for grpc connections, but I wanted to mention it.

(This is due to relying on the grpc wait for state change, which requires you specify the state you’re waiting to change away from)

It’s not an issue in catalog-operator because the state change is used as an optimization and requeues a catalogsource for processing - if an event is missed like that, a periodic resync will fix it.

Copy link
Member Author

@njhale njhale Dec 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair -- We can make caching more aggressive in the sync function to compensate.

@njhale
Copy link
Member Author

njhale commented Dec 9, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit 0e18a81 into operator-framework:master Dec 9, 2019
@openshift-ci-robot
Copy link
Collaborator

@njhale: All pull requests linked via external trackers have merged. Bugzilla bug 1779990 has been moved to the MODIFIED state.

In response to this:

Bug 1779990: fix(packageserver): cache packagemanifests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@njhale: new pull request created: #1181

In response to this:

/cherry-pick release-4.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@njhale
Copy link
Member Author

njhale commented Jan 20, 2020

/cherry-pick release-4.2

@openshift-cherrypick-robot

@njhale: #1150 failed to apply on top of branch "release-4.2":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	go.mod
M	go.sum
M	pkg/package-server/provider/registry_test.go
M	vendor/modules.txt
Falling back to patching base and 3-way merge...
Auto-merging vendor/modules.txt
Auto-merging pkg/package-server/provider/registry_test.go
Auto-merging go.sum
CONFLICT (content): Merge conflict in go.sum
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
Patch failed at 0001 fix(packageserver): cache packagemanifests

In response to this:

/cherry-pick release-4.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@njhale njhale deleted the spur-pkgsvr branch March 31, 2020 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants