fix: Prevent data race from global metrics round-tripper #13641
Signed-off-by: Andrew Melnick <meln5674@kettering.edu>
Can anyone comment on this test failure? Why is it expecting a 404? That metric seems to be reporting healthily here. Was this maybe a hack added to work around something related to this same issue?
I believe this is a flaky test that does just sometimes fail. I don't remember introducing it when the round-tripper was added. I've kicked it to run again. Anyway, thanks for finding this and testing the RC. I will take a proper look when I have access to a proper computer.
Regarding the |
Thanks for investigating and your detailed root cause analysis!
If you want to improve anything else in the repo, like other data races, flaky tests (which are themselves sometimes due to races), race-detecting tests, or anything else, that'd be great 🙂
This LGTM but I'll defer to Alan for final approval and merge since he wrote this code recently
Nice find, thank you for fixing it.
Fixes #13637
Motivation
This line introduces a data race by globally storing a round-tripper used by the Kubernetes client. If a new request starts before the first one completes, and the first request attempts to use the same round-tripper it originally created before the second finishes upgrading its connection, it will result in a panic due to a nil `net.Conn` in the underlying SPDY implementation. As a result, if the controller makes too many new connections to the API server in a short enough period of time, it will crash and restart.

While it is sensible to store a global handle to the metrics that this round-tripper records, storing the round-tripper itself is not.
Modifications
This patch retains the "context" of the metrics (i.e. the actual `ctx` value, plus the handle to the metrics themselves) as global, but scopes the round-tripper to each connection by refactoring the context to its own type and global variable, and uses an embedded pointer to it in the round-tripper implementation to avoid needing downstream changes.
Verification
E2E functional tests were run locally. Additionally, the PR tests identified in the issue as failing as a result of this race now pass, and the controller logs from running without the race detector were visually inspected to confirm that no panics occurred after multiple re-runs of the originally failing test.