internal/dag: validate Secrets only when used #4788

skriss · 2022-10-12T20:46:39Z

Changes Secret validation to happen when Secrets
are used by processors, rather than as they are added
to the DAG cache. Also caches the validation results
on the DAG objects for efficiency.

This eliminates the need for Contour to inspect Secrets
that are not used by an Ingress/HTTPProxy/Gateway. It
also replaces the confusing "Secret not found" errors
with specific validation errors when a Secret does exist,
but is not valid.

This also allows Secret validation logic to be more
precise, since we know what the processors are trying
to use them for.

Fixes #2616.
Fixes #2908.

skriss · 2022-10-12T20:48:28Z

Still testing this, but putting up as a draft in case anyone wants to check out the approach.

codecov · 2022-10-12T20:56:19Z

Codecov Report

Merging #4788 (17d1286) into main (f9ae406) will increase coverage by 0.04%.
The diff coverage is 91.39%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4788      +/-   ##
==========================================
+ Coverage   76.16%   76.20%   +0.04%     
==========================================
  Files         140      140              
  Lines       16914    16921       +7     
==========================================
+ Hits        12882    12895      +13     
+ Misses       3780     3776       -4     
+ Partials      252      250       -2

Impacted Files	Coverage Δ
internal/dag/dag.go	`96.62% <ø> (ø)`
internal/dag/cache.go	`92.39% <85.00%> (+0.90%)`	⬆️
internal/dag/secret.go	`75.62% <95.45%> (+1.23%)`	⬆️
internal/dag/extension_processor.go	`95.79% <100.00%> (ø)`
internal/dag/gatewayapi_processor.go	`94.72% <100.00%> (ø)`
internal/dag/httpproxy_processor.go	`93.20% <100.00%> (ø)`
internal/dag/ingress_processor.go	`97.07% <100.00%> (ø)`

skriss · 2022-10-12T21:59:50Z

internal/dag/cache.go

-			kc.secrets[k8s.NamespacedNameOf(obj)] = obj
+			// Secret validation status is intentionally cleared, it needs
+			// to be re-validated after an insert.
+			kc.secrets[k8s.NamespacedNameOf(obj)] = &Secret{Object: obj}


we end up caching more Secrets this way, but we never look at them unless they're actually referenced. I'm not sure it matters all that much memory-wise, since these are all in the controller-runtime cache already anyway, and we're just storing pointers to them.

yeah even if it stores a few more references, we do gain some information since you have the validation status cached which stops redoing some work if a secret is used in multiple resources

@skriss @sunjayBhatia

Noticing a regression in Contour's resource usage after upgrading to contour-v1.24.1 (from contour-v1.22.x) . IMO, above change could be the reason ? Contour's memory usage in clusters having very large secret resources (like secrets of type helm.sh/release.v1 ) keeps growing and leading to OOM due to reaching container's resources.memory.limits. contour-v1.22.x on the same cluster (with similar K8s resources profile and large secrets) didn't had this issue.

1.) heap pprof from a Contour-v1.24.1 instance shows 3GB+ in use memory towards v1.Secrets resources unmarshalling ( the above change is holding back references to secrets in informer cache, preventing the unmarshall'ed bytes from being go-GC'ed ?) :

2.) memory usage graph (before 03/06 12 pm and after):

3.) Also, the cpu usage of Contour has gone up considerably since the v1.22.x to v1.24.1 upgrade (before 03/06 12pm and after) :

Also cc @tsaarni (who's related work on filtering out unwanted Secrets from the informer cache is also relevant, IMO).

@ravilr sounds like a definite possibility. One change we could make here is to re-introduce a filter to only cache Secrets of type Opaque and TLS. I will also take a further look at @tsaarni's work here.

On further debugging, this turned out to be a bug in a controller causing it to patch Secret resources in a tight loop(more than 20 times per second). since the cluster in question went through the upgrade of multiple components including Contour, the change in contour's resource usage was wrongly attributed to contour upgrade change.

but, IMO, contour should still filter out unrelated secrets like helm release secrets which are very large, from its secret informer cache, to prevent contour being affected with high resource usage and event processing overhead from those unrelated Secrets resources in a cluster. So, I hope we can take tsaarni's PR forward.

Thanks for that info @ravilr, and yes, I agree this still seems like a good optimization to make. Will continue to follow @tsaarni's work in #5099.

internal/dag/cache.go

skriss · 2022-10-20T15:45:46Z

Rebased, should be ready for review.

sunjayBhatia

to summarize:

instead of saving secrets directly, we save secrets with their validation errors
secrets can simultaneously be of different types
secrets no longer validated when inserted to dag, which means more secret references
secrets validated when validating resources that reference them

sunjayBhatia · 2022-10-20T21:36:20Z

internal/dag/secret_test.go

-			got := Result{Valid: valid, Err: err}
-
-			assert.Equal(t, want, got)
+			assert.Equal(t, tc.tlsSecretError, validTLSSecret(tc.secret))


nice way of testing this, since the way the Secret type is constructed it means we can have a single secret used in different ways

sunjayBhatia · 2022-10-20T21:49:34Z

internal/dag/dag.go

@@ -928,7 +928,10 @@ func (s *ServiceCluster) Rebalance() {
 // Secret represents a K8s Secret for TLS usage as a DAG Vertex. A Secret is
 // a leaf in the DAG.
 type Secret struct {
-	Object *v1.Secret
+	Object         *v1.Secret
+	ValidTLSSecret *SecretValidationStatus


I guess this has the wrapper around the error since we need the distinction between not yet set, no error, and error

Signed-off-by: Steve Kriss <krisss@vmware.com>

Changes Secret validation to happen when Secrets are used by processors, rather than as they are added to the DAG cache. Also caches the validation results on the DAG objects for efficiency. This eliminates the need for Contour to inspect Secrets that are not used by an Ingress/HTTPProxy/Gateway. It also replaces the confusing "Secret not found" errors with specific validation errors when a Secret does exist, but is not valid. This also allows Secret validation logic to be more precise, since we know what the processors are trying to use them for. Fixes projectcontour#2616. Fixes projectcontour#2908. Signed-off-by: Steve Kriss <krisss@vmware.com>

skriss commented Oct 12, 2022

View reviewed changes

internal/dag/cache.go Outdated Show resolved Hide resolved

skriss added this to the 1.24.0 milestone Oct 13, 2022

skriss force-pushed the secrets branch from 849d3f6 to 51ef996 Compare October 13, 2022 19:44

skriss added the release-note/minor A minor change that needs about a paragraph of explanation in the release notes. label Oct 13, 2022

skriss mentioned this pull request Oct 17, 2022

allow TLS secrets to be of type Opaque #4799

Merged

skriss force-pushed the secrets branch from 9afd3e6 to 9fd5248 Compare October 20, 2022 15:45

skriss marked this pull request as ready for review October 20, 2022 15:45

skriss requested a review from a team as a code owner October 20, 2022 15:45

skriss requested review from stevesloka and sunjayBhatia and removed request for a team October 20, 2022 15:45

skriss requested a review from tsaarni October 20, 2022 15:45

sunjayBhatia approved these changes Oct 20, 2022

View reviewed changes

skriss added 9 commits October 21, 2022 10:01

move Secrets validation to be on read

aa01e05

Signed-off-by: Steve Kriss <krisss@vmware.com>

store Secret validation status in cache

cababaa

Signed-off-by: Steve Kriss <krisss@vmware.com>

split Secret validation by usage

9d61dfe

Signed-off-by: Steve Kriss <krisss@vmware.com>

update tests

267fd6e

Signed-off-by: Steve Kriss <krisss@vmware.com>

cleanup

42e7105

Signed-off-by: Steve Kriss <krisss@vmware.com>

simplify code structure

42f4f8e

Signed-off-by: Steve Kriss <krisss@vmware.com>

changelog

1e54b92

Signed-off-by: Steve Kriss <krisss@vmware.com>

add unit test case

4cec8cb

Signed-off-by: Steve Kriss <krisss@vmware.com>

fix merge conflict

17d1286

Signed-off-by: Steve Kriss <krisss@vmware.com>

skriss force-pushed the secrets branch from 9fd5248 to 17d1286 Compare October 21, 2022 16:01

skriss merged commit 07f0c60 into projectcontour:main Oct 21, 2022

skriss deleted the secrets branch October 21, 2022 16:40

This was referenced Mar 8, 2023

Reduce the memory footprint by optimizing Kubernetes informer cache #5099

Merged

Avoid comparing full objects during update to optimize performance #5064

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

internal/dag: validate Secrets only when used #4788

internal/dag: validate Secrets only when used #4788

skriss commented Oct 12, 2022

skriss commented Oct 12, 2022

codecov bot commented Oct 12, 2022 •

edited

Loading

skriss Oct 12, 2022

sunjayBhatia Oct 14, 2022

ravilr Mar 8, 2023 •

edited

Loading

skriss Mar 8, 2023

ravilr Mar 9, 2023 •

edited

Loading

skriss Mar 9, 2023

skriss commented Oct 20, 2022

sunjayBhatia left a comment

sunjayBhatia Oct 20, 2022

sunjayBhatia Oct 20, 2022

internal/dag: validate Secrets only when used #4788

internal/dag: validate Secrets only when used #4788

Conversation

skriss commented Oct 12, 2022

skriss commented Oct 12, 2022

codecov bot commented Oct 12, 2022 • edited Loading

Codecov Report

skriss Oct 12, 2022

Choose a reason for hiding this comment

sunjayBhatia Oct 14, 2022

Choose a reason for hiding this comment

ravilr Mar 8, 2023 • edited Loading

Choose a reason for hiding this comment

skriss Mar 8, 2023

Choose a reason for hiding this comment

ravilr Mar 9, 2023 • edited Loading

Choose a reason for hiding this comment

skriss Mar 9, 2023

Choose a reason for hiding this comment

skriss commented Oct 20, 2022

sunjayBhatia left a comment

Choose a reason for hiding this comment

sunjayBhatia Oct 20, 2022

Choose a reason for hiding this comment

sunjayBhatia Oct 20, 2022

Choose a reason for hiding this comment

codecov bot commented Oct 12, 2022 •

edited

Loading

ravilr Mar 8, 2023 •

edited

Loading

ravilr Mar 9, 2023 •

edited

Loading