Unclean exit of bootstrap.sh on Neuron instances #1826

Closed
GyandeepKalita opened this issue May 30, 2024 · 6 comments
@GyandeepKalita

Hi,
I was trying to create an instance group with the inf2.xlarge instance type in an EKS cluster. According to the AWS docs (here) and the AWS Neuron docs (here), the EKS optimized accelerated AMIs should support it. I tried creating it using /aws/service/eks/optimized-ami/1.28/amazon-linux-2-gpu/recommended/image_id as the SSM parameter for the AMI.
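
For reference, the SSM parameter named above can be resolved to an AMI ID with the AWS CLI. This is a minimal sketch: the function names `eks_ami_param` and `resolve_eks_ami` are hypothetical helpers introduced here, and the region is taken from the cfn-signal call in the log further down.

```shell
#!/usr/bin/env bash
# Build the SSM parameter name for an EKS optimized AMI.
# Usage: eks_ami_param <kubernetes-version> <ami-family>
# (hypothetical helper; the path layout is the one quoted in this report)
eks_ami_param() {
  local k8s_version="$1" family="$2"
  printf '/aws/service/eks/optimized-ami/%s/%s/recommended/image_id\n' \
    "$k8s_version" "$family"
}

# Resolve the parameter to an AMI ID.
# Requires the AWS CLI and credentials allowed to call ssm:GetParameter.
# Usage: resolve_eks_ami <kubernetes-version> <ami-family> <region>
resolve_eks_ami() {
  aws ssm get-parameter \
    --name "$(eks_ami_param "$1" "$2")" \
    --region "$3" \
    --query 'Parameter.Value' \
    --output text
}

# Example, using the values from this report:
#   resolve_eks_ami 1.28 amazon-linux-2-gpu us-west-2
```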

But creation of the instance group failed with the following error message in the CloudFormation stack:
Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement.

To troubleshoot further, I SSHed into the EC2 instance and found the following errors in cloud-init.log and cloud-init-output.log:

  • cloud-init.log:
May 30 07:02:00 cloud-init[3101]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-001'] with allowed return codes [0] (shell=True, capture=False)
May 30 07:02:13 cloud-init[3101]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
May 30 07:02:13 cloud-init[3101]: util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 913, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 2108, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-001']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
May 30 07:02:13 cloud-init[3101]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
May 30 07:02:13 cloud-init[3101]: handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
May 30 07:02:13 cloud-init[3101]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
May 30 07:02:13 cloud-init[3101]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/stages.py", line 851, in _run_modules
    freq=freq)
  File "/usr/lib/python2.7/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/site-packages/cloudinit/helpers.py", line 187, in run
    results = functor(*args)
  File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.py", line 45, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 920, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 1 attempted commands
May 30 07:02:13 cloud-init[3101]: stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance
  • cloud-init-output.log:
+ gpu-ami-util has-nvidia-devices
false
+ echo 'no NVIDIA devices are present, nothing to do!'
no NVIDIA devices are present, nothing to do!
+ exit 0
2024-05-30T07:02:10+0000 [eks-bootstrap] INFO: completed GPU bootstrap helper!
Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
2024-05-30T07:02:10+0000 [eks-bootstrap] INFO: nvidia-smi found
Exited with error on line 649
++ /opt/aws/bin/cfn-signal --exit-code 1 --stack inf2-test --resource NodeGroup --region us-west-2
++ ec2-metadata -t
++ awk -F . '{print $2}'
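
The `Exited with error on line 649` message in cloud-init-output.log can be mapped back to the script line it refers to. The helper below is a rough sketch: `failing_line` is a hypothetical name, and the example paths assume cloud-init's default log location (`/var/log/cloud-init-output.log`), which this report does not state.

```shell
#!/usr/bin/env bash
# Print the script line that an "Exited with error on line N" message points at.
# Usage: failing_line <log-file> <script-file>
failing_line() {
  local log_file="$1" script_file="$2" line_no
  # Take the last reported line number in the log, if any.
  line_no="$(sed -n 's/^Exited with error on line \([0-9][0-9]*\).*/\1/p' "$log_file" | tail -n 1)"
  if [ -z "$line_no" ]; then
    echo "no 'Exited with error on line' message found in $log_file" >&2
    return 1
  fi
  # Print "N: <source line>" for that line of the script.
  printf '%s: ' "$line_no"
  sed -n "${line_no}p" "$script_file"
}

# On the node (log path is an assumption, script path is from this report):
#   failing_line /var/log/cloud-init-output.log /etc/eks/bootstrap.sh
```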

The failing line, line 649 of /etc/eks/bootstrap.sh, is:

nvidia-smi -q > /tmp/nvidia-smi-check

Please let me know how I can resolve this.
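
The log suggests what is happening: the GPU AMI ships `nvidia-smi`, so the script finds the binary, but an inf2 instance has no NVIDIA devices, so the query presumably returns a non-zero status and the script aborts. One way to keep such a check from failing on a Neuron instance is to gate it on the device probe seen in the log. This is only a sketch, not the actual fix: `check_nvidia`, `PROBE_CMD`, and `SMI_CMD` are hypothetical names, and it assumes `gpu-ami-util has-nvidia-devices` prints `true` or `false` as it does in the excerpt above.

```shell
#!/usr/bin/env bash
# Run the nvidia-smi sanity check only when NVIDIA devices are present.
# PROBE_CMD and SMI_CMD are overridable so the logic can be tried off-node;
# the defaults are the commands that appear in the log and script above.
PROBE_CMD="${PROBE_CMD:-gpu-ami-util has-nvidia-devices}"
SMI_CMD="${SMI_CMD:-nvidia-smi -q}"

check_nvidia() {
  local out_file="${1:-/tmp/nvidia-smi-check}" has_devices
  # Assumption: the probe prints "true" or "false", as in the log excerpt.
  has_devices="$($PROBE_CMD 2>/dev/null || true)"
  if [ "$has_devices" != "true" ]; then
    echo "no NVIDIA devices present, skipping nvidia-smi check"
    return 0
  fi
  # Devices are present, so a failure here is a real error. Report it
  # explicitly instead of letting an error trap abort the caller.
  if ! $SMI_CMD > "$out_file"; then
    echo "nvidia-smi check failed" >&2
    return 1
  fi
  echo "nvidia-smi check passed"
}
```

With the defaults, `check_nvidia` would return 0 on an instance where the probe prints `false`, which matches the "nothing to do" behaviour the GPU helper already shows in the log.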

@cartermckinnon
Member

The bootstrap script shouldn't exit with a non-zero code in this way -- we need to move this nvidia-smi bit into the GPU helper script to resolve that -- but the use of cfn-signal is what's ultimately causing your CFN stack to fail. I'll get a PR out to fix the unclean termination of the bootstrap script; in the meantime you can change/disable the cfn-signal bit as a workaround 👍
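
A rough sketch of what changing the cfn-signal step could look like: report to CloudFormation based on a health check of your choosing rather than on the exit status of bootstrap.sh. Everything here beyond the `cfn-signal` flags seen in the log is an assumption, including the `send_signal` name, the overridable `HEALTH_CMD` and `SIGNAL_CMD` variables, and the choice of kubelet status as the health check. Whether kubelet is actually healthy after the early exit is not established in this thread.

```shell
#!/usr/bin/env bash
# Decide what to report to CloudFormation independently of bootstrap.sh's own
# exit status. HEALTH_CMD and SIGNAL_CMD are overridable for dry runs.
HEALTH_CMD="${HEALTH_CMD:-systemctl is-active --quiet kubelet}"
SIGNAL_CMD="${SIGNAL_CMD:-/opt/aws/bin/cfn-signal}"

# Usage: send_signal <stack> <resource> <region>
send_signal() {
  local stack="$1" resource="$2" region="$3" code=0
  # Report failure only if the chosen health check fails.
  $HEALTH_CMD || code=1
  $SIGNAL_CMD --exit-code "$code" --stack "$stack" \
    --resource "$resource" --region "$region"
}

# In user data (stack, resource, and region taken from the log above):
#   /etc/eks/bootstrap.sh <cluster-name> || true
#   send_signal inf2-test NodeGroup us-west-2
```

A dry run with `SIGNAL_CMD=echo` prints the arguments that would be passed to `cfn-signal` without contacting CloudFormation.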

@GyandeepKalita
Author

Hey, thanks for the prompt response!
I am hoping to see the issue being fixed soon.

@cartermckinnon changed the title on Jun 4, 2024
From: Unable to create Instance Group with 'Inf2.xlarge' instance type. Shows error as "Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement"
To: Unclean exit of bootstrap.sh on Neuron instances

@GyandeepKalita
Author

Hi, when can I expect this fix to be reflected in the actual systems? I was actually kinda blocked on it as the workaround is not applicable for my project.

@cartermckinnon
Member

This will land in an AMI build next week. 👍

@GyandeepKalita
Author

Thanks!

@cartermckinnon
Member

This has been resolved 👍
