Assert CPU <50% at the end of ducktape tests #10939

Merged 1 commit on Jun 13, 2023

Conversation

rockwotj
Contributor

@rockwotj rockwotj commented May 22, 2023

Assert that nodes have <50% CPU usage before teardown

During ducktape test teardown, poll metrics to assert that nodes are not still busy with work that was never cleaned up properly.

We poll over a 1-second interval, then fall back to a 5-second interval in case compaction is running in the background.

These checks are disabled for now on a few tests that do not shut down nodes cleanly, because the metrics requests fail. Follow-up work should enable the checks on those tests.
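
As a rough illustration of the polling described above, here is a minimal Python sketch, not the code from this PR. The `Sample` type, the `read_samples` and `sleep` callables, and the `assert_idle` name are all hypothetical; the sketch assumes each shard exposes a cumulative busy-time counter and that the process uptime is sampled alongside it, so the measured period comes from uptime rather than from the requested sleep.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Sample:
    """Hypothetical metrics reading for one shard."""
    shard: str
    busy_seconds: float    # cumulative busy time reported for the shard
    uptime_seconds: float  # process uptime when the sample was taken


def utilization(start: Sample, end: Sample) -> float:
    # Divide by the uptime delta so scheduling jitter in the sleep does not
    # distort the result.
    actual_period = end.uptime_seconds - start.uptime_seconds
    if actual_period <= 0:
        raise ValueError("end sample must be taken after start sample")
    return (end.busy_seconds - start.busy_seconds) / actual_period


def assert_idle(read_samples: Callable[[], List[Sample]],
                sleep: Callable[[float], None],
                max_utilization: float = 0.5,
                intervals: Sequence[float] = (1.0, 5.0)) -> None:
    """Pass if every shard is below max_utilization in any one interval."""
    busy: Dict[str, float] = {}
    for interval in intervals:
        start = {s.shard: s for s in read_samples()}
        sleep(interval)
        end = {s.shard: s for s in read_samples()}
        measured = {
            shard: utilization(start[shard], end[shard])
            for shard in start
        }
        busy = {s: u for s, u in measured.items() if u >= max_utilization}
        if not busy:
            return  # all shards idle, no need to try the longer interval
    raise AssertionError(
        f"cpu utilization too high: {busy}, expected below {max_utilization}")
```

In a real test, `sleep` would be `time.sleep` and `read_samples` would scrape each node's metrics endpoint; both are injected here only so the sketch can be exercised without a cluster.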

Fixes: #10837

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

@rockwotj
Contributor Author

/ci-repeat

@rockwotj rockwotj force-pushed the rockwood/idle branch 20 times, most recently from 8fd7ac4 to d38293d June 1, 2023 16:51
@rockwotj
Contributor Author

rockwotj commented Jun 5, 2023

/ci-repeat 5
release
skip-unit
dt-repeat=100

@rockwotj
Contributor Author

rockwotj commented Jun 5, 2023

/ci-repeat 5
release
skip-unit
dt-repeat=10
tests/rptest/test_suite_quick.yml

@rockwotj
Contributor Author

rockwotj commented Jun 6, 2023

/ci-repeat 10
release
skip-unit
tests/rptest/test_suite_quick.yml

@rockwotj rockwotj marked this pull request as ready for review June 7, 2023 14:56
@rockwotj
Contributor Author

rockwotj commented Jun 7, 2023

/ci-repeat 10
release
skip-unit
tests/rptest/test_suite_quick.yml

@rockwotj rockwotj requested review from andijcr and andrwng June 7, 2023 15:34
@@ -20,6 +21,15 @@ def __init__(self, test_context):
        super(SimpleK8sTest, self).__init__(test_context)
        self.redpanda = RedpandaServiceK8s(test_context, 1)

    @property
    def debug_mode(self):

Contributor

Since we are reading an environment variable, and this property is read only from the cluster decorator, do we need this here?

Contributor Author

Are you suggesting moving the read from the environment check directly into the cluster decorator?

I think there really should probably be some BaseTest class that adds this to all our tests. Thoughts?

Contributor

Sorry, I misread the code. I thought I was reading RedpandaTest class here.

Yeah, I'm not sure what would be best, probably it's appropriate to have this method/property in the base classes like you did. Personally, I would have just read the environment variable inside the cluster decorator, but it's not a great place either.

Did you try a debug build and see greater CPU utilization even when the test was done?

Contributor Author

Did you try a debug build and see greater CPU utilization even when the test was done?

Yeah, I added a comment in cluster, but a ton of debug build tests were triggering this check, even for tests that seemingly did very little.
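
To make the thread above concrete, here is a hedged sketch of the arrangement being discussed: a base test class exposing `debug_mode` from an environment variable, and a `cluster` decorator that skips the idle check on debug builds. It is not the PR's code. The variable name `EXAMPLE_BUILD_TYPE`, the `check_cpu_idle` parameter, and the `check_idle` hook are placeholders invented for illustration.

```python
import functools
import os


class BaseTest:
    """Hypothetical base class shared by all tests."""
    idle_checked = False

    @property
    def debug_mode(self) -> bool:
        # EXAMPLE_BUILD_TYPE is a placeholder, not the real variable name.
        return os.environ.get("EXAMPLE_BUILD_TYPE", "release") == "debug"

    def check_idle(self) -> None:
        # Stand-in for the real CPU utilization assertion.
        self.idle_checked = True


def cluster(check_cpu_idle: bool = True):
    """Hypothetical decorator that runs the idle check after the test body."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapped(self, *args, **kwargs):
            result = test_fn(self, *args, **kwargs)
            # Debug builds were reported to trip the check even for tests
            # that did very little, so skip it there.
            if check_cpu_idle and not self.debug_mode:
                self.check_idle()
            return result
        return wrapped
    return decorator


class ExampleTest(BaseTest):
    @cluster()
    def test_nothing(self):
        return "ok"

    @cluster(check_cpu_idle=False)
    def test_unclean_shutdown(self):
        return "ok"
```

Keeping the environment read on the test class leaves the decorator free of build-specific knowledge, which is one way to read the trade-off raised in the comments; reading the variable inside the decorator would work equally well in this sketch.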

@andijcr
Contributor

andijcr commented Jun 7, 2023

looks good, nice idea to use the metric uptime

andijcr
andijcr previously approved these changes Jun 7, 2023
actual_utilization = (end_sample.value -
                      start_sample.value) / actual_period
shard_id = start_sample.labels["shard"]
assert actual_utilization < max_utilization, f"Node: {node.name} shard: {shard_id} cpu utilization too high, actual: {actual_utilization}, expected: {max_utilization}"

Member

@dotnwat dotnwat Jun 9, 2023

haha i'm surprised the linter was cool with this long line

@rockwotj
Contributor Author

rockwotj commented Jun 9, 2023

/ci-repeat

In ducktape during test teardown, poll metrics to assert
that we don't have a bunch of work that isn't being cleaned
up properly.

We poll over a 1 second interval, then fall back to a 5 second interval
in the case of compaction happening in the background.

Fixes: redpanda-data#10837

Signed-off-by: Tyler Rockwood <rockwood@redpanda.com>
@rockwotj
Contributor Author

Force pushed to disable the check on a config test that disables metrics

@rockwotj rockwotj requested a review from andijcr June 12, 2023 14:23
Development

Successfully merging this pull request may close these issues.

tests: validate that all shards on idle nodes have <50% CPU utilization at end of tests
3 participants