Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/dag: validate Secrets only when used #4788

Merged
merged 9 commits into from
Oct 21, 2022

Conversation

skriss
Copy link
Member

@skriss skriss commented Oct 12, 2022

Changes Secret validation to happen when Secrets
are used by processors, rather than as they are added
to the DAG cache. Also caches the validation results
on the DAG objects for efficiency.

This eliminates the need for Contour to inspect Secrets
that are not used by an Ingress/HTTPProxy/Gateway. It
also replaces the confusing "Secret not found" errors
with specific validation errors when a Secret does exist,
but is not valid.

This also allows Secret validation logic to be more
precise, since we know what the processors are trying
to use them for.

Fixes #2616.
Fixes #2908.

@skriss
Copy link
Member Author

skriss commented Oct 12, 2022

Still testing this, but putting up as a draft in case anyone wants to check out the approach.

@codecov
Copy link

codecov bot commented Oct 12, 2022

Codecov Report

Merging #4788 (17d1286) into main (f9ae406) will increase coverage by 0.04%.
The diff coverage is 91.39%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4788      +/-   ##
==========================================
+ Coverage   76.16%   76.20%   +0.04%     
==========================================
  Files         140      140              
  Lines       16914    16921       +7     
==========================================
+ Hits        12882    12895      +13     
+ Misses       3780     3776       -4     
+ Partials      252      250       -2     
Impacted Files Coverage Δ
internal/dag/dag.go 96.62% <ø> (ø)
internal/dag/cache.go 92.39% <85.00%> (+0.90%) ⬆️
internal/dag/secret.go 75.62% <95.45%> (+1.23%) ⬆️
internal/dag/extension_processor.go 95.79% <100.00%> (ø)
internal/dag/gatewayapi_processor.go 94.72% <100.00%> (ø)
internal/dag/httpproxy_processor.go 93.20% <100.00%> (ø)
internal/dag/ingress_processor.go 97.07% <100.00%> (ø)

kc.secrets[k8s.NamespacedNameOf(obj)] = obj
// Secret validation status is intentionally cleared, it needs
// to be re-validated after an insert.
kc.secrets[k8s.NamespacedNameOf(obj)] = &Secret{Object: obj}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we end up caching more Secrets this way, but we never look at them unless they're actually referenced. I'm not sure it matters all that much memory-wise, since these are all in the controller-runtime cache already anyway, and we're just storing pointers to them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah even if it stores a few more references, we do gain some information since you have the validation status cached which stops redoing some work if a secret is used in multiple resources

Copy link

@ravilr ravilr Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skriss @sunjayBhatia

Noticing a regression in Contour's resource usage after upgrading to contour-v1.24.1 (from contour-v1.22.x) . IMO, above change could be the reason ? Contour's memory usage in clusters having very large secret resources (like secrets of type helm.sh/release.v1 ) keeps growing and leading to OOM due to reaching container's resources.memory.limits. contour-v1.22.x on the same cluster (with similar K8s resources profile and large secrets) didn't had this issue.

1.) heap pprof from a Contour-v1.24.1 instance shows 3GB+ in use memory towards v1.Secrets resources unmarshalling ( the above change is holding back references to secrets in informer cache, preventing the unmarshall'ed bytes from being go-GC'ed ?) :
contour-heap

2.) memory usage graph (before 03/06 12 pm and after):

Screen Shot 2023-03-07 at 11 24 14 PM

3.) Also, the cpu usage of Contour has gone up considerably since the v1.22.x to v1.24.1 upgrade (before 03/06 12pm and after) :

Screen Shot 2023-03-07 at 11 24 26 PM

Also cc @tsaarni (who's related work on filtering out unwanted Secrets from the informer cache is also relevant, IMO).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravilr sounds like a definite possibility. One change we could make here is to re-introduce a filter to only cache Secrets of type Opaque and TLS. I will also take a further look at @tsaarni's work here.

Copy link

@ravilr ravilr Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further debugging, this turned out to be a bug in a controller causing it to patch Secret resources in a tight loop(more than 20 times per second). since the cluster in question went through the upgrade of multiple components including Contour, the change in contour's resource usage was wrongly attributed to contour upgrade change.

but, IMO, contour should still filter out unrelated secrets like helm release secrets which are very large, from its secret informer cache, to prevent contour being affected with high resource usage and event processing overhead from those unrelated Secrets resources in a cluster. So, I hope we can take tsaarni's PR forward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that info @ravilr, and yes, I agree this still seems like a good optimization to make. Will continue to follow @tsaarni's work in #5099.

internal/dag/cache.go Outdated Show resolved Hide resolved
@skriss skriss added this to the 1.24.0 milestone Oct 13, 2022
@skriss skriss added the release-note/minor A minor change that needs about a paragraph of explanation in the release notes. label Oct 13, 2022
@skriss skriss marked this pull request as ready for review October 20, 2022 15:45
@skriss skriss requested a review from a team as a code owner October 20, 2022 15:45
@skriss skriss requested review from stevesloka and sunjayBhatia and removed request for a team October 20, 2022 15:45
@skriss
Copy link
Member Author

skriss commented Oct 20, 2022

Rebased, should be ready for review.

@skriss skriss requested a review from tsaarni October 20, 2022 15:45
Copy link
Member

@sunjayBhatia sunjayBhatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to summarize:

  • instead of saving secrets directly, we save secrets with their validation errors
  • secrets can simultaneously be of different types
  • secrets no longer validated when inserted to dag, which means more secret references
  • secrets validated when validating resources that reference them

got := Result{Valid: valid, Err: err}

assert.Equal(t, want, got)
assert.Equal(t, tc.tlsSecretError, validTLSSecret(tc.secret))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice way of testing this, since the way the Secret type is constructed it means we can have a single secret used in different ways

@@ -928,7 +928,10 @@ func (s *ServiceCluster) Rebalance() {
// Secret represents a K8s Secret for TLS usage as a DAG Vertex. A Secret is
// a leaf in the DAG.
type Secret struct {
Object *v1.Secret
Object *v1.Secret
ValidTLSSecret *SecretValidationStatus
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this has the wrapper around the error since we need the distinction between not yet set, no error, and error

Signed-off-by: Steve Kriss <krisss@vmware.com>
Signed-off-by: Steve Kriss <krisss@vmware.com>
Signed-off-by: Steve Kriss <krisss@vmware.com>
Signed-off-by: Steve Kriss <krisss@vmware.com>
Signed-off-by: Steve Kriss <krisss@vmware.com>
Signed-off-by: Steve Kriss <krisss@vmware.com>
Signed-off-by: Steve Kriss <krisss@vmware.com>
Signed-off-by: Steve Kriss <krisss@vmware.com>
Signed-off-by: Steve Kriss <krisss@vmware.com>
@skriss skriss merged commit 07f0c60 into projectcontour:main Oct 21, 2022
@skriss skriss deleted the secrets branch October 21, 2022 16:40
moeyui1 pushed a commit to moeyui1/contour that referenced this pull request Oct 26, 2022
Changes Secret validation to happen when Secrets
are used by processors, rather than as they are added
to the DAG cache. Also caches the validation results
on the DAG objects for efficiency.

This eliminates the need for Contour to inspect Secrets
that are not used by an Ingress/HTTPProxy/Gateway. It
also replaces the confusing "Secret not found" errors
with specific validation errors when a Secret does exist,
but is not valid.

This also allows Secret validation logic to be more
precise, since we know what the processors are trying
to use them for.

Fixes projectcontour#2616.
Fixes projectcontour#2908.

Signed-off-by: Steve Kriss <krisss@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/minor A minor change that needs about a paragraph of explanation in the release notes.
Projects
None yet
3 participants