
Retry GPU devices check during env vars load if instance supports GPU #4387

Merged: merged 1 commit into aws:dev on Oct 11, 2024

Conversation

@danehlim (Contributor) commented on Oct 8, 2024

Summary

The current behavior of ecs-init while loading environment variables is that if the environment variable with key config.GPUSupportEnvVar has the value true, it checks whether NVIDIA GPU devices are present on the instance. If they are not yet present at the time of the check, ecs-init does not load the aforementioned environment variable and effectively gives up later on attempting any PRESTART GPU setup (such as NVIDIA GPU info file creation).

Although under usual circumstances NVIDIA GPU devices are present by the time ecs-init checks for them on a container instance that supports GPU workloads, nothing guarantees this. There could therefore be a race condition where NVIDIA GPU devices are not yet present on the instance at the time ecs-init checks (perhaps due to delays in device initialization) but become present shortly afterwards. In such a case, ecs-init GPU setup is skipped and ECS Agent later runs into an error when attempting to initialize its NVIDIA GPU manager.

Implement a retry mechanism for checking NVIDIA GPU device availability while ecs-init loads environment variables, for the case where the underlying container instance supports GPU workloads. This ensures that ecs-init waits a reasonable amount of time for NVIDIA GPU devices to become present before continuing. If NVIDIA GPU devices are still not present after the maximum number of retries, ecs-init gives up and continues as per the current behavior. This does not impact existing scenarios where NVIDIA GPU devices are already available the first time ecs-init checks.

Implementation details

If the environment variable with key config.GPUSupportEnvVar has the value true, NVIDIA GPU devices are expected to eventually be present on the instance. Thus, do not give up and continue right away if NVIDIA GPU devices are not yet present.

Define nvidiaGPUDevicesPresentWithRetries() to be called in this case, which retries and waits for NVIDIA GPU devices to become present before continuing (see the sketch after this list). The retry logic is configured to:

  • retry on failure
  • every 3 seconds
  • for a maximum of 10 attempts

This results in 30 seconds of retry in the worst case scenario.
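For illustration, below is a minimal, self-contained sketch of what such a retry wrapper could look like. It is not the actual ecs-init code: the constant names, the ECS_ENABLE_GPU_SUPPORT variable standing in for config.GPUSupportEnvVar, and the /dev/nvidia* glob used for the presence check are assumptions made for this sketch.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// Hypothetical constants for this sketch; the real names live in the docker
// package of ecs-init and may differ.
const (
	gpuDeviceCheckRetryInterval    = 3 * time.Second
	gpuDeviceCheckMaxRetryAttempts = 10
)

// nvidiaGPUDevicesPresent reports whether any NVIDIA GPU device files
// (for example /dev/nvidia0) are visible on the instance. The /dev/nvidia*
// glob is an assumption made for this sketch.
func nvidiaGPUDevicesPresent() bool {
	matches, err := filepath.Glob("/dev/nvidia*")
	return err == nil && len(matches) > 0
}

// nvidiaGPUDevicesPresentWithRetries polls for NVIDIA GPU devices, waiting
// gpuDeviceCheckRetryInterval after each failed attempt, for up to
// gpuDeviceCheckMaxRetryAttempts attempts (10 x 3s = 30s in the worst case).
func nvidiaGPUDevicesPresentWithRetries() bool {
	for attempt := 1; attempt <= gpuDeviceCheckMaxRetryAttempts; attempt++ {
		if nvidiaGPUDevicesPresent() {
			return true
		}
		log.Printf("NVIDIA GPU devices are not yet present, retrying (attempt %d/%d) in %v",
			attempt, gpuDeviceCheckMaxRetryAttempts, gpuDeviceCheckRetryInterval)
		time.Sleep(gpuDeviceCheckRetryInterval)
	}
	return false
}

func main() {
	// Only wait for devices when the instance is expected to support GPU
	// workloads; the variable name below is a placeholder for the check
	// that ecs-init performs via config.GPUSupportEnvVar.
	if os.Getenv("ECS_ENABLE_GPU_SUPPORT") != "true" {
		return
	}
	if nvidiaGPUDevicesPresentWithRetries() {
		log.Print("NVIDIA GPU devices detected; proceeding with GPU setup")
	} else {
		log.Print("NVIDIA GPU devices still not present after retries; continuing without GPU setup")
	}
}
```

Sleeping after every failed attempt keeps the worst-case wait at 10 x 3s = 30 seconds, matching the figure above, while the common case (devices already present) returns immediately on the first check.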

Also, rename some existing constants in the docker package of ecs-init to improve readability.

Testing

New tests cover the changes: yes

Also manually tested using a custom ecs-init built with the changes in this pull request against an internal repro environment that reliably reproduces the race condition with the current ecs-init. With the custom ecs-init, the race condition behavior is no longer observed.

Partial ecs-init logs for the above:

./ecs-init.log:level=warn time=2024-10-07T23:27:18Z msg="NVIDIA GPU devices are not yet present, retrying (attempt 1/10) in 3000000000 nanoseconds"
./ecs-init.log:level=info time=2024-10-07T23:27:21Z msg="pre-start: setting up GPUs"
./ecs-init.log:level=info time=2024-10-07T23:27:21Z msg="By using the GPU Optimized AMI, you agree to Nvidia’s End User License Agreement: https://www.nvidia.com/en-us/about-nvidia/eula-agreement/"
./ecs-init.log:level=info time=2024-10-07T23:27:21Z msg="pre-start: done setting up GPUs"

Description for the changelog

Bugfix: Retry GPU devices check during env vars load if instance supports GPU

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions?

No

Does this PR include the addition of new environment variables in the README?

No

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@danehlim requested a review from a team as a code owner on October 8, 2024 19:20
@danehlim marked this pull request as draft on October 8, 2024 19:20
@danehlim marked this pull request as ready for review on October 8, 2024 21:23
Review comment on ecs-init/docker/docker.go (outdated; resolved)
mye956 previously approved these changes on Oct 8, 2024
Yiyuanzzz previously approved these changes on Oct 8, 2024
@danehlim dismissed stale reviews from Yiyuanzzz and mye956 via 096d34a on October 8, 2024 23:02
@danehlim force-pushed the retry-gpu-devices-check branch from a5062b9 to 096d34a on October 8, 2024 23:02
@danehlim force-pushed the retry-gpu-devices-check branch from 096d34a to 3711ed1 on October 8, 2024 23:11
@singholt (Contributor) left a comment:

thank you Dane! 🚀

@danehlim force-pushed the retry-gpu-devices-check branch from 3711ed1 to ef69a25 on October 11, 2024 19:08
@danehlim merged commit bf9c2a7 into aws:dev on Oct 11, 2024
40 checks passed
@mye956 mentioned this pull request on Oct 30, 2024