fix: #3607 Metrics data loss in K8S controller #3692
Build Failed 😱 Build Id: 299f196e-0f1d-4108-a7c8-07a029fcb03b To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Thanks for taking a pass at this! Looks like some unit tests are failing. Let us know if you would like some help debugging them. As part of the PR description and commit, can you provide a description of what your fix does and why it's required? It would definitely help with review, as otherwise we have to reverse engineer the fix from the bug 😄 |
This PR addresses two issues. The first occurs when a Fleet is deleted, which triggers two events: an Update (with a non-nil DeletionTimestamp) and a Delete. Because the DeletionTimestamp is non-nil, handling the Update event eventually calls resyncFleets. Inside resyncFleets, the Update function is called for each Fleet, which re-enters the same path while the lock is still held and results in a deadlock. As a consequence, the lock can no longer be acquired and metrics are not recorded. The second arises because the deletion of GameServers only begins after their Fleet has been deleted. When a GameServer's status is modified, the controller writes the Fleet statuses held in memory to the metrics, including those of Fleets that have already been deleted. To address this, I changed the code so that only the metrics for the GameServer's current Fleet are recorded at that point. |
sequenceDiagram
actor e as event
participant c as controller
participant l as locker
e->>c: Fleet is being deleted
c->>+c: recordFleetChanges
c->>+c: recordFleetDeletion
c->>+c: resyncFleets
c->>+l: lock
alt In fleet.list, the deleted Fleet no longer exists.
c ->>- e: end function
else the deleted Fleet still exists (deadlock)
c->>+c: recordFleetChanges
c->>+c: recordFleetDeletion
c->>+c: resyncFleets
c->>+l: lock [deadlock]
end
|
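The call chain in the diagram can be reproduced in a few lines of Go. This is a minimal sketch, not the Agones code: the `fleet` and `controller` types, the `listed` field and `errReentered` are hypothetical names, and only the three function names are taken from the diagram. `sync.Mutex` is not reentrant, so a plain `Lock()` on re-entry would block forever; the sketch uses `TryLock` (Go 1.18+) purely so that it terminates and reports the re-entry instead of hanging.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fleet is a hypothetical, trimmed-down stand-in for an Agones Fleet.
type fleet struct {
	name     string
	deleting bool // true when DeletionTimestamp would be non-nil
}

// controller is a hypothetical stand-in for the metrics controller.
type controller struct {
	mu     sync.Mutex
	listed []fleet // what the fleet lister currently returns
}

var errReentered = errors.New("resyncFleets re-entered while the lock is held")

// recordFleetChanges follows the diagram: a Fleet that is being deleted
// goes through recordFleetDeletion, which triggers a resync.
func (c *controller) recordFleetChanges(f fleet) error {
	if f.deleting {
		return c.recordFleetDeletion(f)
	}
	return nil
}

func (c *controller) recordFleetDeletion(f fleet) error {
	return c.resyncFleets()
}

// resyncFleets takes the lock and replays every listed Fleet.
// With a plain Lock() the nested call would never return; TryLock is
// used here only so the sketch can report the problem.
func (c *controller) resyncFleets() error {
	if !c.mu.TryLock() {
		return errReentered
	}
	defer c.mu.Unlock()
	for _, f := range c.listed {
		if err := c.recordFleetChanges(f); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	gone := fleet{name: "gone", deleting: true}

	// The lister no longer returns the deleted Fleet: the resync completes.
	ok := &controller{listed: []fleet{{name: "live"}}}
	fmt.Println("deleted fleet absent:", ok.recordFleetChanges(gone))

	// The lister still returns the Fleet being deleted: the resync
	// re-enters itself while the lock is held.
	stale := &controller{listed: []fleet{gone}}
	fmt.Println("deleted fleet listed:", stale.recordFleetChanges(gone))
}
```

The two cases in `main` correspond to the two branches of the `alt` block: the resync only re-enters itself when the lister still returns the Fleet that is being deleted.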
TIL you can create mermaid diagrams in GitHub issues! |
Is it also possible to write a unit test, to ensure this issue doesn't show up again? |
I attempted to write or modify test cases to cover this scenario, but the challenge is that it would require special modifications to the existing fakeFleetLister.List in order to reproduce this issue, otherwise it cannot be recreated. |
What would you need to do to replicate it? I'm sure we could probably work it out. |
I trust these steps are clear and provide a solid framework for us to replicate the Fleet deletion scenario accurately. Should you have any questions, require further clarifications, or if there's anything more I can assist you with, please do not hesitate to reach out. |
Build Failed 😱 Build Id: 020786da-320b-4a37-9d9a-69a880fce667 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Succeeded 👏 Build Id: 169752ed-4559-472d-8fd2-f2ed4a0300b9 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
You could absolutely do this with a new Unit Test for sure. The Fake clients for Kubernetes don't have any smarts about them, and basically only do what they are told. Option No. 1 agones/pkg/metrics/controller_test.go Line 118 in 130c99a
Option No. 2 agones/pkg/gameserversets/controller_test.go Lines 573 to 575 in 130c99a
And that can provide you the state you need. It still will require a new test I expect, but it would be good to test things to ensure that this doesn't regress, as right now there is no guarantee -- you found a tricky bug 😄 , so want to make sure we don't have to call you back again! |
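The idea of a fake that "only does what it is told" can be sketched with the standard library alone. Everything below is hypothetical: `fleetLister`, `scriptedLister` and `deletingFleets` are illustrative names, not the Agones `fakeFleetLister` or any client-go API. The point is only that a scripted lister lets a test pin the exact state it needs, such as a list that still contains a Fleet that is being deleted.

```go
package main

import "fmt"

// fleet is a hypothetical, trimmed-down stand-in for an Agones Fleet.
type fleet struct {
	Name     string
	Deleting bool // true when DeletionTimestamp would be non-nil
}

// fleetLister is the narrow interface the code under test depends on.
type fleetLister interface {
	List() ([]fleet, error)
}

// scriptedLister has no logic of its own: it returns whatever the test
// scripted and counts how often it was asked.
type scriptedLister struct {
	fleets []fleet
	err    error
	calls  int
}

func (s *scriptedLister) List() ([]fleet, error) {
	s.calls++
	return s.fleets, s.err
}

// deletingFleets is a hypothetical function under test: it reports the
// listed Fleets that are in the middle of being deleted.
func deletingFleets(l fleetLister) ([]string, error) {
	fleets, err := l.List()
	if err != nil {
		return nil, err
	}
	var names []string
	for _, f := range fleets {
		if f.Deleting {
			names = append(names, f.Name)
		}
	}
	return names, nil
}

func main() {
	// Script the state that is hard to reach otherwise: the list still
	// contains a Fleet that is being deleted.
	lister := &scriptedLister{fleets: []fleet{
		{Name: "fleet-a"},
		{Name: "fleet-b", Deleting: true},
	}}
	names, err := deletingFleets(lister)
	fmt.Println(names, err, lister.calls)
}
```

In a real test the scripted value would be whatever list reproduces the bug, and the assertion would check that the controller still records metrics afterwards.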
Thank you very much for your guidance. I have completed writing the test cases and have submitted them. Could you please review them? |
Build Failed 😱 Build Id: b70a2e76-7a62-4197-9e0f-8cd15259465a To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Heading to GDC, but will check when I can 👍🏻 |
Build Succeeded 👏 Build Id: db870010-ddf6-4a05-b4e7-9b750f60e3b8 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Build Failed 😱 Build Id: 5a2c484c-1003-49ec-8bac-4db0ace257c4 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Succeeded 👏 Build Id: 3126e07c-a5fd-4579-86e1-867d1cda0c5b The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Just letting you know I haven't forgotten about you! I'm still recovering from GDC but this is at the top of my queue. |
Thank you so much for the update! I hope you had a great time at GDC and are recovering well. Please take all the time you need, and let me know if there's anything I can do to assist with the review process. Looking forward to your feedback when you're ready. |
Just a couple of small tweaks, and this is good to merge 👍🏻 nice job!
Co-authored-by: Mark Mandel <markmandel@google.com>
Build Succeeded 👏 Build Id: 45c4b7ca-621a-4b25-a716-348c5f142a56 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Thank you very much for your feedback and suggestions! I have made the requested adjustments and ensured that all test cases have passed. |
Thanks! Approved and ready to go. |
Build Failed 😱 Build Id: 5f1629d4-3913-4d02-ab01-68a46b4d338a To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Succeeded 👏 Build Id: a8faf0fb-2034-493b-9f2d-ac7d79cd22e0 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Received, and issue #3486 has already been fixed by this PR. |
Referenced by a dependency-update PR generated by Renovate Bot (https://github.com/renovatebot/renovate): agones (https://agones.dev, source: https://github.com/googleforgames/agones), minor update, `1.39.0` -> `1.40.0`. The v1.40.0 release notes (2024-04-23) list this change under "Fixed bugs": fix: #3607 Metrics data loss in K8S controller by @alvin-7 in https://github.com/googleforgames/agones/pull/3692
This PR addresses two issues.
The first occurs when a Fleet is deleted, which triggers two events: an Update (with a non-nil DeletionTimestamp) and a Delete. Because the DeletionTimestamp is non-nil, handling the Update event eventually calls resyncFleets. Inside resyncFleets, the Update function is called for each Fleet, which re-enters the same path while the lock is still held and results in a deadlock. As a consequence, the lock can no longer be acquired and metrics are not recorded.
The second arises because the deletion of GameServers only begins after their Fleet has been deleted. When a GameServer's status is modified, the controller writes the Fleet statuses held in memory to the metrics, including those of Fleets that have already been deleted. To address this, I changed the code so that only the metrics for the GameServer's current Fleet are recorded at that point.
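The second issue can be illustrated with a small sketch. It assumes a hypothetical `recorder` with a `cached` map (the Fleet statuses held in memory) and an `exported` map standing in for the metrics backend; none of these names come from the Agones code. `recordAll` shows the problematic shape, where any GameServer change re-exports every cached Fleet, and `recordFor` shows the narrower shape described above, where only the GameServer's own Fleet is written.

```go
package main

import "fmt"

// fleetStatus is a hypothetical subset of the per-Fleet counters.
type fleetStatus struct {
	Ready     int
	Allocated int
}

// recorder is a hypothetical stand-in for the metrics controller state.
type recorder struct {
	cached   map[string]fleetStatus // Fleet statuses held in memory
	exported map[string]fleetStatus // stands in for the metrics backend
}

// recordAll re-exports every cached Fleet, so an entry for a Fleet that
// has already been deleted is written again.
func (r *recorder) recordAll() {
	for name, st := range r.cached {
		r.exported[name] = st
	}
}

// recordFor exports only the Fleet that owns the changed GameServer.
func (r *recorder) recordFor(fleetName string) {
	if st, ok := r.cached[fleetName]; ok {
		r.exported[fleetName] = st
	}
}

// newRecorder builds a cache that still holds a deleted Fleet.
func newRecorder() *recorder {
	return &recorder{
		cached: map[string]fleetStatus{
			"deleted-fleet": {Ready: 3},
			"live-fleet":    {Ready: 2, Allocated: 1},
		},
		exported: map[string]fleetStatus{},
	}
}

func main() {
	broad := newRecorder()
	broad.recordAll()
	fmt.Println("recordAll exported fleets:", len(broad.exported))

	scoped := newRecorder()
	scoped.recordFor("live-fleet")
	_, stale := scoped.exported["deleted-fleet"]
	fmt.Println("recordFor exported fleets:", len(scoped.exported), "stale entry present:", stale)
}
```

Scoping the write to one Fleet means a status change on a GameServer in `live-fleet` cannot bring back metrics for `deleted-fleet`, whatever is still in the cache.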
Fixes #3607