[epic] v1.0.0 Performance and Scale #920

Open
joelanford opened this issue Jun 11, 2024 · 2 comments

Comments

@joelanford
Member

Epic Goal

  • Measure and implement any necessary improvements to ensure OLM v1.0.0 meets or exceeds OCP guidelines around performance and scalability.

Why is this important?

  • OLM v1.0.0 will be a payload component that always runs in OCP clusters. In order to reduce SD and customer costs, we need to minimize this overhead.
  • OLM v1.0.0 is intended to be used on a wide variety of clusters, ranging from single node clusters with just a few namespaces to clusters 2-3 orders of magnitude larger. We need to make sure that it runs just as well on a small cluster as it does on a large cluster.
  • In order to reduce user frustration, we need to provide a responsive user experience. Reconciliation needs to be fast and non-blocking to ensure users receive the experience they have come to expect from OCP. To the extent possible, long-running tasks (e.g. catalog fetching/caching and image pulling) should be performed asynchronously.

Scenarios

  1. Collect pprof profiles for CPU and memory when running standard user flows around installing, upgrading, and removing operators from public catalogs (e.g. operatorhub).
  2. Find the most resource intensive code paths. Provide documentation and recommendations related to making improvements in those areas.
  3. Coordinate with OLM maintainers to make improvements in areas deemed to provide the most significant performance and scale gain.
  4. Implement automated performance and scale regression tests in the existing upstream CI test suite.

Examples of known areas for improvement include:

  • When reconciling a ClusterExtension to resolve a bundle from the criteria provided by a user, the reconciler should return a desired bundle within 100ms and allocate no more memory than the size of the catalog metadata for the named spec.packageName.
  • When the ClusterExtension reconciler does not have the contents of a resolved image bundle available, it does not block waiting for the image to be pulled and processed. Rather, it starts an asynchronous job, reports the pending image pull via the ClusterExtension status, and returns from reconcile.
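To make the second bullet concrete, here is a minimal sketch of the non-blocking pattern in plain Go. It is not the actual reconciler: `asyncPuller`, `reconcile`, and the status strings are hypothetical names, and the slow pull is a stand-in function. The idea is that the first call starts a background job and returns a pending status right away, and a later call observes the finished result.

```go
package main

import (
	"fmt"
	"sync"
)

type pullState int

const (
	pullPending pullState = iota
	pullDone
)

// asyncPuller (hypothetical) tracks background image pulls by image reference.
type asyncPuller struct {
	mu     sync.Mutex
	states map[string]pullState
	pull   func(ref string) // slow operation; stand-in for pull + unpack
	wg     sync.WaitGroup   // lets callers wait for in-flight pulls
}

func newAsyncPuller(pull func(string)) *asyncPuller {
	return &asyncPuller{states: make(map[string]pullState), pull: pull}
}

// reconcile never waits for the pull. It starts one if none is in flight
// and returns a status string describing the current state.
func (p *asyncPuller) reconcile(ref string) string {
	p.mu.Lock()
	defer p.mu.Unlock()

	state, started := p.states[ref]
	if !started {
		p.states[ref] = pullPending
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			p.pull(ref)
			p.mu.Lock()
			p.states[ref] = pullDone
			p.mu.Unlock()
		}()
		return "Pending: pulling " + ref
	}
	if state == pullPending {
		return "Pending: pulling " + ref
	}
	return "Unpacked: " + ref
}

func main() {
	release := make(chan struct{})
	p := newAsyncPuller(func(string) { <-release }) // pull blocks until released

	fmt.Println(p.reconcile("example.com/bundle:v1")) // returns without waiting
	close(release)
	p.wg.Wait()
	fmt.Println(p.reconcile("example.com/bundle:v1")) // pull has finished
}
```

In a real controller the completed job would presumably trigger a new reconcile rather than relying on the caller to poll, and failures would need their own state; both are left out here to keep the sketch short.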
@joelanford joelanford added v1.0 Issues related to the initial stable release of OLMv1 epic labels Jun 11, 2024
@OchiengEd

/assign

@joelanford
Member Author

joelanford commented Jul 14, 2024

I think I've found one unexpected slowdown: the bundle handler that converts a registry+v1 bundle to plain and then to helm. It takes 5s on my machine in the "Force upgrade" e2e test.
