Retry GPU devices check during env vars load if instance supports GPU #4387
Merged
mye956 reviewed on Oct 8, 2024
mye956 previously approved these changes on Oct 8, 2024
Yiyuanzzz previously approved these changes on Oct 8, 2024
singholt reviewed on Oct 8, 2024
danehlim force-pushed the retry-gpu-devices-check branch from a5062b9 to 096d34a on October 8, 2024 23:02
danehlim force-pushed the retry-gpu-devices-check branch from 096d34a to 3711ed1 on October 8, 2024 23:11
mye956 approved these changes on Oct 8, 2024
singholt approved these changes on Oct 8, 2024
thank you Dane! 🚀
Yiyuanzzz approved these changes on Oct 8, 2024
danehlim force-pushed the retry-gpu-devices-check branch from 3711ed1 to ef69a25 on October 11, 2024 19:08
Summary
Currently, when ecs-init loads environment variables and the environment variable with key config.GPUSupportEnvVar has value true, it checks whether NVIDIA GPU devices are present on the instance. If they are not yet present at the time of the check, ecs-init does not load that environment variable and effectively gives up on any later PRESTART GPU setup (such as NVIDIA GPU info file creation).
Under usual circumstances, NVIDIA GPU devices are present by the time ecs-init checks for them on a container instance that supports GPU workloads, but nothing guarantees this. A race condition is therefore possible: the devices may not yet be present when ecs-init checks (perhaps due to delays in device initialization) but become present shortly afterwards. In that case, ecs-init GPU setup is skipped and ECS Agent later runs into an error when attempting to initialize its NVIDIA GPU manager.
Implement a retry mechanism for checking NVIDIA GPU device availability while ecs-init loads environment variables, applied only when the underlying container instance supports GPU workloads. This ensures that ecs-init waits a reasonable amount of time for NVIDIA GPU devices to be present before continuing. If the devices are still not present after the maximum number of retries, ecs-init gives up and continues as per the current behavior. This does not affect existing scenarios where NVIDIA GPU devices are already available when ecs-init checks for the first time.
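The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the constant gpuSupportEnvVar, the function filterGPUEnvVar, and the injected devicesPresent callback are hypothetical names chosen for the example, and the key string is a placeholder for whatever config.GPUSupportEnvVar holds.

```go
package main

// gpuSupportEnvVar is a hypothetical placeholder for the key stored in
// config.GPUSupportEnvVar; the real key is not reproduced here.
const gpuSupportEnvVar = "GPU_SUPPORT_EXAMPLE"

// filterGPUEnvVar returns a copy of env. If the GPU support variable is set
// to "true" but devicesPresent reports false, the variable is dropped from
// the copy so that later GPU setup is skipped. devicesPresent is injected so
// the caller decides whether the check is a single attempt or one with retries.
func filterGPUEnvVar(env map[string]string, devicesPresent func() bool) map[string]string {
	out := make(map[string]string, len(env))
	for k, v := range env {
		out[k] = v
	}
	if out[gpuSupportEnvVar] == "true" && !devicesPresent() {
		delete(out, gpuSupportEnvVar)
	}
	return out
}
```

Passing the check in as a function keeps the load step unchanged whether the check runs once or retries, which mirrors the intent that instances whose devices are already present see no difference.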
Implementation details
If the environment variable with key config.GPUSupportEnvVar has value true, NVIDIA GPU devices are expected to eventually be present on the instance. Thus, do not give up and continue right away if the devices are not yet present.
Define nvidiaGPUDevicesPresentWithRetries() to be called in this case; it retries and waits for some time for NVIDIA GPU devices to be present before continuing. The retry logic is configured to:
This results in 30 seconds of retry in the worst-case scenario.
Also, rename some existing constants in the docker package of ecs-init to improve readability.
Testing
New tests cover the changes: yes
Also manually tested with a custom ecs-init built from the changes in this pull request, against an internal repro environment that reliably reproduces the race condition with the current ecs-init. With the custom ecs-init, the race condition behavior is no longer observed.
Partial ecs-init logs for the above:
Description for the changelog
Bugfix: Retry GPU devices check during env vars load if instance supports GPU
Additional Information
Does this PR include breaking model changes? If so, have you added transformation functions?
No
Does this PR include the addition of new environment variables in the README?
No
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.