
Feature/cache expiration #31785

Merged: 17 commits merged into elastic:main on Jun 17, 2022
Conversation

@gsantoro (Contributor) commented May 31, 2022

What does this PR do?

Make the internal cache PerfMetricsCache, used by various metricsets of the kubernetes module in Metricbeat, effectively never expire. This avoids missing or incorrectly calculated metrics, as explained in issue #31778.

This PR dynamically changes the timeout of the internal PerfMetricsCache in the kubernetes module so that it is always larger than the scraping period set by users in the configuration. In particular, the cache timeout is always set to 2x the largest period configured for any of the metricsets.
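
For illustration only, here is a minimal Go sketch of that idea; the helper name, the fallback to the old 2-minute default, and the exact multiplier handling are assumptions, not the merged code.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTimeout mirrors the old hardcoded cache expiration of 2 minutes.
const defaultTimeout = 120 * time.Second

// cacheTimeoutFor derives the cache expiration from the configured scraping
// period so that cached entries always outlive at least two scrapes.
func cacheTimeoutFor(period time.Duration) time.Duration {
	timeout := 2 * period
	if timeout < defaultTimeout {
		// Never go below the previous default for very short periods (assumed).
		timeout = defaultTimeout
	}
	return timeout
}

func main() {
	for _, p := range []time.Duration{10 * time.Second, 15 * time.Minute} {
		fmt.Printf("period=%v -> cache timeout=%v\n", p, cacheTimeoutFor(p))
	}
}
```

With a 15-minute period this yields a 30-minute cache timeout, so cached entries are still present when the next scrape arrives.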

Why is it important?

It fixes a bug, reported by a customer in an SDH, that prevented some metrics such as kubernetes.container.cpu.usage.limit.pct from being populated.

A customer first reported this issue after setting a period of 15 minutes in the kubernetes module settings. The bug was caused by the cache timeout being hardcoded to 2 minutes.

This PR fixes the missing-metrics issue and makes it possible to set any value for period.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  • Configure Metricbeat so that any of the period settings in the kubernetes module is larger than 2 minutes. Observe that the following metrics are correctly computed (values >= 0, NOT missing):
    • kubernetes.container.cpu.usage.limit.pct
    • kubernetes.container.memory.usage.limit.pct
    • kubernetes.pod.cpu.usage.limit.pct
    • kubernetes.pod.memory.usage.limit.pct

Before the code change:
Some memory metrics for kubernetes.container.memory.usage.limit.pct are computed but most of them are missing.
(screenshots omitted)

Some metrics for kubernetes.container.cpu.usage.limit.pct are correctly computed but most of them are missing.
(screenshots omitted)

The percentage of missing metrics depends on the period configured for the various metricsets. With the fix, these metrics should be present all the time with a value >= 0.

After the code changes:
(screenshots omitted)

Related issues

  • #31778

@gsantoro added the bug, enhancement, in progress, and Team:Cloudnative-Monitoring labels on May 31, 2022
@gsantoro requested review from a team as code owners on May 31, 2022 15:09
@gsantoro self-assigned this on May 31, 2022
@gsantoro requested review from cmacknz and fearful-symmetry and removed the request for a team on May 31, 2022 15:09
@gsantoro added the discuss and Draft labels on May 31, 2022
@elasticmachine (Collaborator) commented May 31, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-06-17T12:20:42.851+0000

  • Duration: 90 min 6 sec

Test stats 🧪

  • Failed: 0

  • Passed: 21685

  • Skipped: 1798

  • Total: 23483

💚 Flaky test report

Tests succeeded.


@@ -83,7 +84,7 @@ metadata:
   data:
     system.yml: |-
       - module: system
-        period: 10s
+        period: 1m
Contributor:
Please revert these values to the original ones before merging.

Author (@gsantoro):
Sure.

@ChrsMark (Member) left a comment

A couple of comments we already discussed offline:

  1. Can you revert the changes to the config files that are not needed for the patch itself?
  2. At a quick glance, I think we can use your logic to set the expiration's default value and also expose it as an optional setting. That way, users can simply ignore the setting and the defaults will do the trick, while anyone who wants to specify it still has an option available.

Some key code parts that might help in adding the setting (see the sketch after this list):

  1. type kubernetesConfig struct {
  2. Most probably you will need to unpack the config at the metricset level, like we do inside the Enricher's implementation:
    config := ValidatedConfig(base)
  3. Check the Enricher's initialisation:
    enricher: util.NewContainerMetadataEnricher(base, true),
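
As a rough illustration of that suggestion, a hedged sketch is below; the cache_expiration field name, the config tags, and the fallback multiplier are hypothetical, not the merged API.

```go
package kubernetes

import "time"

// kubernetesConfig sketches an optional cache expiration setting alongside
// the existing period; a zero value means "derive it from the period".
type kubernetesConfig struct {
	Period          time.Duration `config:"period"`
	CacheExpiration time.Duration `config:"cache_expiration"` // hypothetical optional setting
}

// effectiveCacheExpiration prefers the user-provided value and otherwise
// falls back to a multiple of the scraping period (multiplier assumed).
func (c kubernetesConfig) effectiveCacheExpiration() time.Duration {
	if c.CacheExpiration > 0 {
		return c.CacheExpiration
	}
	return 5 * c.Period
}
```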

@@ -72,6 +72,12 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
 	if !ok {
 		return nil, fmt.Errorf("must be child of kubernetes module")
 	}
 
+	newTimeout := base.Module().Config().Period * 2
@MichaelKatsoulis (Contributor) commented Jun 1, 2022

I think we should have logic so that only when base.Module().Config().Period is >= 120 seconds do we set newTimeout to base.Module().Config().Period * 2, and only if the user hasn't specified a timeout value in the config.
If we always set newTimeout to scraping_period * 2, then in the most common case where the period is 10s, the cache will expire too soon, and maybe that defeats its purpose.

Author (@gsantoro) replied:

I tried to explain the current behaviour in a PR comment at #31785 (comment). Happy to have a longer chat about this.
In short, with a 10s period and a 20s cache timeout, the cache expires every 20s but is still replaced every 10s (if all scrapes work).

I would suggest using period * 5 so the cache lives longer and accounts for possible failures during scraping.

@gsantoro (Author) commented Jun 7, 2022

I put scraping.period * 2 in as an initial placeholder; I wasn't sure about 2 as a final value. Maybe we should use scraping.period * 5 to be more resilient. I don't have a strong opinion on the multiplier value.
My point, though, is that this behaviour should be the same regardless of the length of the scraping period. The value in the cache is replaced at every successful scrape, and the expiration is pushed forward by a full timeout (cache period) even if the entries have not expired yet, as long as at least one full period has passed since the last scrape or it is the first scrape:

if statsCache.lastFetchTimestamp.IsZero() || now.Sub(statsCache.lastFetchTimestamp) > m.Config().Period {
	// ...
	http.FetchContent()
	// ...
}

A cache expiration time of 5 * scraping.Period also prevents the cache from expiring when one or more scrapes do NOT take place.

Scenarios with the current code (before the modifications in this PR):

  • Example 1

    • scrape every 10s
    • cache expiration at 2m
    • the cache is replaced every 10s (at every scrape)
    • the cache expires 2m after the last scrape
    • metrics are correctly calculated since the cache is always full
  • Example 2 (failure discovered in the SDH)

    • scrape every 10m
    • cache expiration at 2m
    • the cache expires 2m after the last scrape; it stays empty for 8m, so Metricbeat reads an empty cache and calculates incorrect metrics
    • the cache is refilled but only read at the next period, 10m later, when it is already empty again

Scenarios with the modifications in this PR:

  • Example 1

    • scrape every 10s
    • cache expiration at 50s
    • with no scrape failures, the cache is replenished every 10s and expires 50s after the last scrape
    • if some scrapes fail to run but there is at least one scrape every 50s, the cache still expires 50s after the last scrape and the right metrics are calculated; you can have up to 4 or 5 consecutive failed scrapes and still behave correctly
  • Example 2

    • scrape every 10m
    • cache expiration at 50m
    • similarly to Example 1, the cache effectively never expires and is resilient to scraping failures
    • the user doesn't need to guess what a good value for the cache expiration is

There is a similar configuration for the cache expiration in beats/x-pack/metricbeat/module/prometheus/collector/data.go at line 29: counters := collector.NewCounterCache(base.Module().Config().Period * 5)
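
A rough sketch of the analogous wiring in a kubernetes metricset follows. It is a fragment that assumes the usual metricbeat scaffolding (mb.BaseMetricSet, util.NewPerfMetricsCache, and a MetricSet struct with a perfMetrics field) and uses the period * 5 multiplier under discussion, so treat it as illustrative rather than the merged code.

```go
// Fragment only: assumes the usual metricbeat scaffolding (mb.BaseMetricSet,
// util.NewPerfMetricsCache, and a MetricSet struct with a perfMetrics field).
func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
	// Let cached entries survive several missed scrapes before expiring.
	cacheTimeout := base.Module().Config().Period * 5

	return &MetricSet{
		BaseMetricSet: base,
		perfMetrics:   util.NewPerfMetricsCache(cacheTimeout),
	}, nil
}
```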

@gsantoro (Author) commented Jun 7, 2022

Comments on #31785 (review):

  • I will revert the changes. I left them there for you to check the behaviour of the new code.
  • I agree we should expose an optional config setting so users can customise the cache behaviour, and default to cache_expiration = period * 5 otherwise. Thanks for the suggestions, I'll have a look.

@ChrsMark (Member) commented Jun 8, 2022

Comments on #31785 (review):

> I will revert the changes. I left them there for you guys to check the behaviour of the new code

In that case you can just add everything needed to the "How to test this PR locally" section in the PR's description (see for example #25640).

@gsantoro added the backport-7.17 label (Automated backport to the 7.17 branch with mergify) on Jun 14, 2022
@mergify bot (Contributor) commented Jun 17, 2022

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature/cache_expiration upstream/feature/cache_expiration
git merge upstream/main
git push upstream feature/cache_expiration

@ChrsMark (Member) left a comment

A minor suggestion/fix for the changelog. Keep in mind that changelog entries are user-facing content and thus should be accurate and not overly technical.

CHANGELOG.next.asciidoc (review thread resolved)
Co-authored-by: Chris Mark <chrismarkou92@gmail.com>
@ChrsMark (Member) left a comment

Thanks for the detailed work here!
Wait for green CI and ship it! :shipit:

@gizas (Contributor) left a comment

The only thing missing is a screenshot after the implementation, to show that the final metrics actually appear.

Also, is there a simple way to verify the value of the cache timeout without debugging?

@@ -27,7 +27,7 @@ data:
   - config:
       - module: kubernetes
         hosts: ["kube-state-metrics:8080"]
-        period: 10s
+        period: 3m
Contributor:
Same here, revert before merge.

Author (@gsantoro):
You are looking at an outdated commit; this has already been resolved.

@@ -156,12 +157,18 @@ spec:
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: metricbeat
          image: docker.elastic.co/beats/metricbeat:8.4.0
Contributor:
I think you need to add this back? Align with master.

Author (@gsantoro):
You are looking at an outdated commit; this has already been resolved.

image: docker.elastic.co/beats/metricbeat:8.4.0
# image: docker.elastic.co/beats/metricbeat:8.3.0-SNAPSHOT
image: metricbeat-debugger-image:latest
imagePullPolicy: Never
Contributor:
Also, can you remove all the debug fields, as they should not be merged?
I think you should create a new folder that describes how to test.

Maybe add test/debug instructions to the README.

Author (@gsantoro):
You are looking at an outdated commit; this has already been resolved.

@@ -33,13 +33,19 @@ func init() {
 const defaultTimeout = 120 * time.Second
 
 // NewPerfMetricsCache initializes and returns a new PerfMetricsCache
-func NewPerfMetricsCache() *PerfMetricsCache {
+func NewPerfMetricsCache(timeout time.Duration) *PerfMetricsCache {
+	if timeout <= 0 {
Contributor:
Does it make sense to check timeout != nil as well?

Author (@gsantoro):
timeout cannot be nil since it is not a reference type; it is always initialized to some value.
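
For context, a hedged reconstruction of the guard shown in the diff above: time.Duration is a value type (an int64), so it can never be nil, but a zero or negative value can still be passed in. Falling back to defaultTimeout inside the branch is an assumption about the behaviour, not a quote of the merged code.

```go
package util

import "time"

const defaultTimeout = 120 * time.Second

// PerfMetricsCache is trimmed to the timeout field for this sketch; the real
// type also holds the per-metric value caches (NodeMemAllocatable, etc.).
type PerfMetricsCache struct {
	timeout time.Duration
}

// NewPerfMetricsCache initializes and returns a new PerfMetricsCache.
func NewPerfMetricsCache(timeout time.Duration) *PerfMetricsCache {
	if timeout <= 0 {
		// A non-positive timeout would make entries expire immediately,
		// so fall back to the previous default (assumed behaviour).
		timeout = defaultTimeout
	}
	return &PerfMetricsCache{timeout: timeout}
}
```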

func (c *PerfMetricsCache) GetTimeout() time.Duration {
	var ans time.Duration = 0

	nmATimeout := c.NodeMemAllocatable.GetTimeout()
Contributor:
Instead of multiple if comparisons you could initialise an array and make only one assignment, e.g. https://learningprogramming.net/golang/golang-golang/find-max-and-min-of-array-in-golang/

Just an idea to consider.
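
A sketch of that idea applied to GetTimeout, collecting the sub-cache timeouts into a slice and taking the maximum in a single loop. Only NodeMemAllocatable appears in the snippet above; the other field names are assumed from the module's cache, so treat this as illustrative.

```go
// Fragment only: assumes the PerfMetricsCache fields each expose GetTimeout().
func (c *PerfMetricsCache) GetTimeout() time.Duration {
	timeouts := []time.Duration{
		c.NodeMemAllocatable.GetTimeout(),
		c.NodeCoresAllocatable.GetTimeout(),
		c.ContainerMemLimit.GetTimeout(),
		c.ContainerCoresLimit.GetTimeout(),
	}

	// Return the largest timeout across the sub-caches.
	var max time.Duration
	for _, t := range timeouts {
		if t > max {
			max = t
		}
	}
	return max
}
```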

@gsantoro merged commit 4b625fc into elastic:main on Jun 17, 2022
mergify bot pushed a commit that referenced this pull request Jun 17, 2022
* fixed cache expiration bug

(cherry picked from commit 4b625fc)

# Conflicts:
#	metricbeat/module/kubernetes/container/container.go
#	metricbeat/module/kubernetes/pod/pod.go
#	metricbeat/module/kubernetes/state_cronjob/state_cronjob.go
#	metricbeat/module/kubernetes/state_daemonset/state_daemonset.go
#	metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
#	metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
#	metricbeat/module/kubernetes/util/kubernetes.go
#	metricbeat/module/kubernetes/volume/volume.go
mergify bot pushed a commit that referenced this pull request Jun 17, 2022
* fixed cache expiration bug

(cherry picked from commit 4b625fc)

# Conflicts:
#	metricbeat/module/kubernetes/container/container.go
#	metricbeat/module/kubernetes/pod/pod.go
#	metricbeat/module/kubernetes/state_daemonset/state_daemonset.go
#	metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
#	metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
#	metricbeat/module/kubernetes/util/kubernetes.go
#	metricbeat/module/kubernetes/volume/volume.go
mergify bot pushed a commit that referenced this pull request Jun 17, 2022
* fixed cache expiration bug

(cherry picked from commit 4b625fc)
@gsantoro (Author) commented Jun 17, 2022

@gizas

> Only missing a screenshot after the implementation to see that final metrics actually appear.

Sorry, I forgot to add these. I took the screenshots but did not attach them; they are in the description now.

> Also is there a simple way to verify the value of cache timeout without debugging?

I have another Dockerfile I am using to test the backport to 7.17... without debugging. I haven't pushed it to git yet. I am not sure if there is a more official way to do this.

gsantoro pushed a commit that referenced this pull request Jun 20, 2022
* fixed cache expiration bug
@gsantoro deleted the feature/cache_expiration branch on July 12, 2022 12:52
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* fixed cache expiration bug
Labels

  • backport-7.17 Automated backport to the 7.17 branch with mergify
  • backport-v8.0.0 Automated backport with mergify
  • backport-v8.1.0 Automated backport with mergify
  • backport-v8.2.0 Automated backport with mergify
  • backport-v8.3.0 Automated backport with mergify
  • bug
  • discuss Issue needs further discussion.
  • enhancement
  • in progress Pull request is currently in progress.
  • Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
  • test-plan Add this PR to be manual test plan

Projects
None yet

6 participants