
Fix flaky Handle EBS Ack timer test for windows #3904

Merged 1 commit into aws:dev on Sep 20, 2023
Conversation

mye956 (Contributor) commented Sep 12, 2023

Summary

The goal of this PR is to deflake the TestHandleEBSAckTimeout test on Windows, which fails intermittently due to a timing issue. The current mitigation is to re-run the Windows unit tests until they pass.

Implementation details

  • ebs/watcher_test.go: Once the ack timeout is hit, TestHandleEBSAckTimeout now sleeps for 5 ms and re-checks in a loop until the EBS ResourceAttachment has been removed from the local map of attachments within Agent's Docker task state engine (see the sketch below).
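
A minimal sketch of what this poll loop could look like, assuming a stand-in accessor on the task engine state (the interface and helper names here are illustrative, not the actual identifiers in ebs/watcher_test.go):

```go
package ebs_test

import (
	"testing"
	"time"
)

// attachmentState is a stand-in for the Agent's Docker task state engine;
// only the lookup the test needs is modeled here.
type attachmentState interface {
	HasEBSAttachment(volumeID string) bool
}

// waitForAttachmentRemoval re-checks every 5 ms until the expired EBS
// ResourceAttachment has been removed from the local map of attachments.
func waitForAttachmentRemoval(t *testing.T, state attachmentState, volumeID string) {
	t.Helper()
	for state.HasEBSAttachment(volumeID) {
		time.Sleep(5 * time.Millisecond)
	}
}
```

Note that, as written, this loop is unbounded; the review discussion below covers bounding the wait so the test cannot spin forever.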

Testing

Unit test

Created a dummy PR that runs TestHandleEBSAckTimeout 100x -> mye956#53

Description for the changelog

Fix flaky EBS watcher test for windows.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mye956 mye956 changed the title Do not review, dummy PR [WIP] Fix flaky Handle EBS Ack timer test for windows Sep 13, 2023
@mye956 mye956 changed the title [WIP] Fix flaky Handle EBS Ack timer test for windows Fix flaky Handle EBS Ack timer test for windows Sep 13, 2023
@mye956 mye956 marked this pull request as ready for review September 13, 2023 18:49
@mye956 mye956 requested a review from a team as a code owner September 13, 2023 18:49
danehlim (Contributor)

Just curious, do we know the root cause of the timing issue with Windows exactly? Is this expected behavior, or is it concerning that for Windows the EBS ResourceAttachment may not be removed from the local map of attachments within Agent's Docker task state engine even after double the specified timeout amount? (i.e., expiresAt is set to 5ms, but the ResourceAttachment is still not removed after 10ms)

Realmonia previously approved these changes Sep 13, 2023
mye956 (Contributor, Author) commented Sep 13, 2023


This is definitely not the expected behavior. I suspect that while we're still handling the attachment, the test is already sleeping for 10 ms (or has started to sleep). In a real-world situation, 5 ms wouldn't be an ideal timeout/wait duration for an ack, mainly because actually attaching an EBS volume to the instance takes a while. Here's an example of the test being flaky: https://github.com/aws/amazon-ecs-agent/actions/runs/6164709032/job/16731863418

That said, and per an offline discussion with @singholt, it might be better to just increase the sleep timer to something in the 30 ms ballpark. We don't want to run into a situation where the attachment never gets removed and the test is stuck in an infinite loop.

danehlim (Contributor) commented Sep 13, 2023


Thanks for the explanation! The proposed change to increase the sleep timer in this test sounds good to me. I'd suggest adding a code comment next to the sleep timer with the context from your explanation of why we chose 30 ms.
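
A bounded variant along those lines, sketched with testify's require.Eventually; the 30 ms cap, the comment wording, and the state accessor are assumptions for illustration rather than the merged change:

```go
package ebs_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// requireAttachmentRemoved waits up to 30 ms (polling every 5 ms) for the EBS
// ResourceAttachment to be removed, failing the test instead of looping
// forever. The 30 ms bound follows the discussion above: removal on the
// Windows runner has been observed to lag past double the 5 ms ack timeout,
// so the extra headroom avoids flakes while still keeping the wait finite.
func requireAttachmentRemoved(
	t *testing.T,
	state interface{ HasEBSAttachment(volumeID string) bool },
	volumeID string,
) {
	t.Helper()
	require.Eventually(t, func() bool {
		return !state.HasEBSAttachment(volumeID)
	}, 30*time.Millisecond, 5*time.Millisecond,
		"EBS ResourceAttachment was not removed after the ack timeout")
}
```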

Realmonia previously approved these changes Sep 14, 2023
@Realmonia Realmonia merged commit 19c3a32 into aws:dev Sep 20, 2023
40 checks passed