Shuffle hardware inventory for tinkerbell before reservation #8264
Signed-off-by: Rahul Ganesh <rahulgab@amazon.com>
Codecov Report: All modified and coverable lines are covered by tests ✅
Coverage Diff
             main     #8264    +/-
Coverage     73.42%   73.42%
Files        578      578
Lines        36054    36054
Hits         26471    26471
Misses       7905     7905
Partials     1678     1678
overall lgtm, only nit comments
Have we already planned the work to clean up the boot entries? I'm totally OK merging this, it's a good patch, but it doesn't guarantee the problem won't happen again. In fact, if I'm understanding this correctly, it will 100% happen; it will just take longer. And it doesn't seem like an easy issue to diagnose.
@@ -592,3 +596,10 @@ func logTinkerbellTestHardwareInfo(conf *instanceRunConf, action string) {
	}
	conf.Logger.V(1).Info(action+" hardware for TestRunner", "hardwarePool", strings.Join(hardwareInfo, ", "))
}

func shuffleHardwareInventory(invCatalogue *hardwareCatalogue) {
nit: why the use of inventory and catalogue? aren't they representing the same thing?
@@ -217,6 +218,9 @@ func RunTests(conf instanceRunConf, inventoryCatalogue map[string]*hardwareCatal
	} else {
		hardwareCatalogue = inventoryCatalogue[nonAirgappedHardware]
	}
	conf.Logger.Info("Shuffling hardware inventory for tinkerbell")
	// shuffle hardware to introduce randomness during hardware reservation.
nit: I would expand more on why randomness is desired. We don't do this to introduce randomness, we do this to avoid picking up the same machines on every run. Randomness is just the mechanism to achieve that goal.
@@ -592,3 +596,10 @@ func logTinkerbellTestHardwareInfo(conf *instanceRunConf, action string) {
	}
	conf.Logger.V(1).Info(action+" hardware for TestRunner", "hardwarePool", strings.Join(hardwareInfo, ", "))
}

func shuffleHardwareInventory(invCatalogue *hardwareCatalogue) {
Why not make this a method on hardwareCatalogue? It's manipulating the internal structure, so it seems like a good idea to abstract that in a method instead of exposing it like this.
In fact, don't you need to use the mutex? If I'm not mistaken, the hardwareCatalogue is shared between runner threads, and all of them are going to try to call this method concurrently.
Description of changes:
Shuffle the hardware inventory before reserving hardware for Tinkerbell E2E tests. As we run quick E2E tests more frequently, the boot list on each machine fills up with boot entries quickly, leading to an error when there is no space left to add another entry. Ideally we should have automation that periodically removes the boot entries on the BMCs, but until then we should reserve hardware in random order for quick tests so that boot entries do not pile up on the first few machines. Randomness also reduces the likelihood of repeatedly picking up a faulty machine during repeated quick E2E runs.
Testing (if applicable):
Kicked off a run against my branch and verified that the hardware reserved for the test was different from the regular hardware (eksa-ci01 to eksa-ci12) that gets reserved at present.
Documentation added/planned (if applicable):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.