fix(eks): pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type #29651
The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.
A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed, add Clarification Request to a comment.
private addNeuronDevicePluginRbac() {
  if (!this._neuronDevicePluginRbacClusterRole) {
    const clusterRoleFileContents = fs.readFileSync(path.join(__dirname, 'addons', 'neuron-device-plugin-rbac-cluster-role.yaml'), 'utf8');
    const sanitizedClusterRole = YAML.parse(clusterRoleFileContents);
If I used `parseAllDocuments`, I would not need to split `k8s-neuron-device-plugin-rbac.yml` into three files. However, the return type of `parseAllDocuments` differs from the return type of `parse`, so the `addManifest` function cannot handle the parsed YAML.
I think splitting `k8s-neuron-device-plugin-rbac.yml` into three files and using `parse` is the simplest solution.
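For illustration only, a third option would be to keep the RBAC manifest in a single file and split it into single-document strings, each of which could then go through `parse`. The sketch below is not code from this PR: `splitYamlDocuments` is a hypothetical helper, the sample manifest is placeholder content rather than the real RBAC file, and it assumes documents are separated by lines containing only `---`.

```typescript
// Hypothetical helper (not part of this PR or of any YAML library).
// Splits a multi-document YAML string into one string per document.
// Assumption: documents are separated by lines that contain only `---`
// (optionally followed by spaces or tabs). YAML directives and `...`
// end-of-document markers are not handled.
function splitYamlDocuments(text: string): string[] {
  return text
    .split(/^---[ \t]*$/m) // split on separator lines only
    .map((doc) => doc.trim()) // drop surrounding blank lines
    .filter((doc) => doc.length > 0); // ignore empty documents
}

// Placeholder content standing in for a three-document RBAC manifest.
const sample = [
  "kind: ClusterRole",
  "metadata:",
  "  name: example-role",
  "---",
  "kind: ServiceAccount",
  "metadata:",
  "  name: example-account",
  "---",
  "kind: ClusterRoleBinding",
  "metadata:",
  "  name: example-binding",
].join("\n");

const docs = splitYamlDocuments(sample);
console.log(docs.length);
```

Each returned string could be passed to `parse` on its own. The helper ignores parts of the YAML syntax, which is one reason separate files may still be the simpler choice.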
Exemption Request: I updated
This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.
This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.
The pull request linter fails with the following errors:
PRs must pass status checks before we can provide a meaningful review. If you would like to request an exemption from the status checks or clarification on feedback, please leave a comment on this PR containing Exemption Request and/or Clarification Request.
✅ An exemption request has been requested. Please wait for a maintainer's review.
Issue # (if applicable)
#29262
Reason for this change
When an INFERENTIA or TRAINIUM instance type is used, https://github.com/aws/aws-cdk/blob/main/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml is applied to the cluster, but the Pod goes into CrashLoopBackOff (detailed log: #29262 (comment)).
The source URL of the current YAML, https://github.com/aws-neuron/aws-neuron-sdk/blob/master/docs/neuron-container-tools/k8s-neuron-device-plugin.yml, now returns "File not found".
aws-cdk/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml
Line 1 in dffedca
Description of changes
Download k8s-neuron-device-plugin.yml and k8s-neuron-device-plugin-rbac.yml from https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html and copy their contents into the addon YAML files
Add a function to apply the RBAC YAML files
Add unit tests
Update integ.eks-inference-nodegroup and integ.eks-inference
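As a rough illustration of the "apply once" guard that the RBAC function relies on, the sketch below applies a set of manifests on the first call and reuses them afterwards. Everything in it is an assumption made for the example: `ClusterSketch`, `AppliedManifest`, the simplified `addManifest` signature, and the placeholder manifest bodies are hypothetical and are not the aws-cdk-lib API or the code in this PR.

```typescript
// Hypothetical stand-ins: these types are NOT the aws-cdk-lib API.
type ManifestBody = Record<string, unknown>;

interface AppliedManifest {
  id: string;
  body: ManifestBody;
}

class ClusterSketch {
  // Records every manifest that was applied, in order.
  public readonly applied: AppliedManifest[] = [];

  // Cached result of the first RBAC application.
  private neuronRbac?: AppliedManifest[];

  // Simplified stand-in for a method that applies one manifest.
  addManifest(id: string, body: ManifestBody): AppliedManifest {
    const manifest: AppliedManifest = { id, body };
    this.applied.push(manifest);
    return manifest;
  }

  // Applies the RBAC manifests on the first call only; later calls
  // return the cached manifests without applying anything again.
  addNeuronDevicePluginRbac(bodies: ManifestBody[]): AppliedManifest[] {
    if (!this.neuronRbac) {
      this.neuronRbac = bodies.map((body, i) =>
        this.addManifest(`NeuronDevicePluginRbac${i}`, body),
      );
    }
    return this.neuronRbac;
  }
}

// Placeholder bodies; the real manifests would come from parsed YAML files.
const rbacBodies: ManifestBody[] = [
  { kind: "ClusterRole" },
  { kind: "ServiceAccount" },
  { kind: "ClusterRoleBinding" },
];

const cluster = new ClusterSketch();
const first = cluster.addNeuronDevicePluginRbac(rbacBodies);
const second = cluster.addNeuronDevicePluginRbac(rbacBodies);
console.log(cluster.applied.length, first === second);
```

With a guard like this, adding several Inferentia or Trainium node groups to the same cluster would not apply the RBAC manifests more than once.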
Description of how you validated changes
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license