Unclean exit of bootstrap.sh on Neuron instances #1826

GyandeepKalita · 2024-05-30T16:38:19Z

Hi,
I was trying to create an instance group with the inf2.xlarge instance type in an EKS cluster. According to the AWS docs and the AWS Neuron docs, the EKS optimized accelerated AMIs should support it. I created the group using /aws/service/eks/optimized-ami/1.28/amazon-linux-2-gpu/recommended/image_id as the SSM parameter for the AMI.

But creation of the instance group failed with the following error message in the CloudFormation stack:
Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement.

To troubleshoot further, I SSHed into the EC2 instance and found the following errors in cloud-init.log and cloud-init-output.log:

cloud-init.log:

May 30 07:02:00 cloud-init[3101]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-001'] with allowed return codes [0] (shell=True, capture=False)
May 30 07:02:13 cloud-init[3101]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
May 30 07:02:13 cloud-init[3101]: util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 913, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 2108, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-001']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
May 30 07:02:13 cloud-init[3101]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
May 30 07:02:13 cloud-init[3101]: handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
May 30 07:02:13 cloud-init[3101]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
May 30 07:02:13 cloud-init[3101]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cloudinit/stages.py", line 851, in _run_modules
    freq=freq)
  File "/usr/lib/python2.7/site-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/site-packages/cloudinit/helpers.py", line 187, in run
    results = functor(*args)
  File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.py", line 45, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 920, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 1 attempted commands
May 30 07:02:13 cloud-init[3101]: stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance

cloud-init-output.log:

+ gpu-ami-util has-nvidia-devices
false
+ echo 'no NVIDIA devices are present, nothing to do!'
no NVIDIA devices are present, nothing to do!
+ exit 0
2024-05-30T07:02:10+0000 [eks-bootstrap] INFO: completed GPU bootstrap helper!
Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
2024-05-30T07:02:10+0000 [eks-bootstrap] INFO: nvidia-smi found
Exited with error on line 649
++ /opt/aws/bin/cfn-signal --exit-code 1 --stack inf2-test --resource NodeGroup --region us-west-2
++ ec2-metadata -t
++ awk -F . '{print $2}'

The failing line is line 649 of /etc/eks/bootstrap.sh (amazon-eks-ami/templates/al2/runtime/bootstrap.sh at commit 0fdc793):

nvidia-smi -q > /tmp/nvidia-smi-check

Please let me know how I can resolve this.
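
For readers following along: the "Exited with error on line 649" message followed by cfn-signal --exit-code 1 is consistent with a script that runs under errexit with an ERR trap. The sketch below is a stand-in, not the real bootstrap.sh; `false` plays the role of the failing nvidia-smi call, and the trap text is copied from the log only for illustration.

```shell
#!/bin/sh
# Stand-in for a script that aborts on the first failing command.
# Assumption: bash is installed; the child script is NOT the real bootstrap.sh.
output="$(bash -s <<'EOF'
set -o errexit
trap 'echo "Exited with error on line $LINENO"' ERR
false > /dev/null   # stand-in for: nvidia-smi -q > /tmp/nvidia-smi-check
echo "not reached"
EOF
)" && status=0 || status=$?

# The non-zero status is what a caller would pass on to cfn-signal --exit-code.
echo "child exit status: $status"
echo "child output: $output"
```

The child shell runs the trap and then exits with the failing command's status, so nothing after the failing line executes.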


cartermckinnon · 2024-05-30T17:40:12Z

The bootstrap script shouldn't exit with a non-zero code in this way -- we need to move this nvidia-smi bit into the GPU helper script to resolve that -- but the use of cfn-signal is what's ultimately causing your CFN stack to fail. I'll get a PR out to fix the unclean termination of the bootstrap script; in the meantime you can change/disable the cfn-signal bit as a workaround 👍
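
A minimal sketch of the kind of guard described above, assuming the GPU-only step should simply be skipped when no NVIDIA device is usable. This is not the change from the eventual PR. `has_nvidia_devices` and `configure_gpu_if_present` are hypothetical names; per the log, the AMI's own check is `gpu-ami-util has-nvidia-devices`. Treating a missing or failing `nvidia-smi -L` as "no devices" is also an assumption.

```shell
#!/bin/sh
set -eu

# Hypothetical probe: succeeds only if nvidia-smi exists and can list a device.
has_nvidia_devices() {
  command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1
}

# Hypothetical wrapper: a failing probe inside `if` does not trigger errexit,
# so hosts without NVIDIA devices (e.g. Neuron instances) fall through cleanly.
configure_gpu_if_present() {
  if has_nvidia_devices; then
    nvidia-smi -q > /tmp/nvidia-smi-check
    echo "configured: NVIDIA devices found"
  else
    echo "skipped: no NVIDIA devices"
  fi
}

configure_gpu_if_present
```

Because the probe runs as an `if` condition, its failure selects the else branch instead of aborting the script, so the caller sees exit code 0.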

GyandeepKalita · 2024-05-31T11:21:52Z

Hey, thanks for the prompt response!
I am hoping to see the issue being fixed soon.

GyandeepKalita · 2024-06-05T08:50:28Z

Hi, when can I expect this fix to be reflected in the actual systems? I was actually kinda blocked on it as the workaround is not applicable for my project.

cartermckinnon · 2024-06-06T20:35:08Z

This will land in an AMI build next week. 👍

GyandeepKalita · 2024-06-07T05:51:41Z

Thanks!

cartermckinnon · 2024-07-15T17:58:56Z

This has been resolved 👍

cartermckinnon assigned wwvela May 30, 2024

wwvela mentioned this issue May 31, 2024

move gpu boost clock to gpu boostrap helper #1827

Merged

cartermckinnon changed the title ~~Unable to create Instance Group with 'Inf2.xlarge' instance type. Shows error as "Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement"~~ Unclean exit of bootstrap.sh on Neuron instances Jun 4, 2024

cartermckinnon closed this as completed Jul 15, 2024
