Skip to content

(2.11.2 and earlier) Custom AMI creation (pcluster createami) fails when building SGE

Luca Carrogu edited this page Oct 19, 2021 · 1 revision

The issue

Custom AMI creation with command pcluster createami or cluster creation using custom AMI with no pre-baked ParallelCluster software stack are failing with the following error:

    ======================================================================..
    error executing action `run` on resource 'execute[sge_preinstall]'
    ======================================================================..     
    mixlib::shellout::shellcommandfailed
    ------------------------------------
    expected process to exit with [0], but received '28'
    ---- begin output of sh /tmp/sge_preinstall.sh ----
    stdout: downloading and extracting source packages for sge-8.1.9
    stderr: % total    % received % xferd  average speed   time    time   ..
                                     dload  upload   total   spent    left..
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--  ..
    curl: (28) failed to connect to arc.liv.ac.uk port 443: connection tim..
    warning: problem : timeout. will retry in 5 seconds. 3 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--  ..
    curl: (28) failed to connect to arc.liv.ac.uk port 443: connection tim..
    warning: problem : timeout. will retry in 5 seconds. 2 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--  ..
    curl: (28) failed to connect to arc.liv.ac.uk port 443: connection tim..
    warning: problem : timeout. will retry in 5 seconds. 1 retries left.
  0     0    0     0    0     0      0      0 --:--:--  0:02:10 --:--:--  ..
    curl: (28) failed to connect to arc.liv.ac.uk port 443: connection tim..
    ---- end output of sh /tmp/sge_preinstall.sh ----
    ran sh /tmp/sge_preinstall.sh returned 28

Root cause is because arc.liv.ac.uk is down and it’s not possible to download SGE sources from https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-8.1.9.tar.gz

Affected ParallelCluster versions and OS

2.11.2 and older are affected, OS affected according to the following table:

version\os AL1 AL2 CentOS6 CentOS7 CentOS8 Ubuntu16 Ubuntu18 Ubuntu20
2.11.2 not supported AFFECTED not supported AFFECTED AFFECTED not supported OK OK
2.10.4 AFFECTED OK not supported AFFECTED AFFECTED AFFECTED OK not supported
2.9.1 AFFECTED OK AFFECTED AFFECTED not supported AFFECTED OK not supported

Bug affects the following functionalities

  • create custom AMI using command pcluster createami
  • SGE cluster creation with runtime bake (HeadNode cannot be created)
  • existing SGE clusters with version older than 2.3.0 (included) using runtime bake

Existing clusters since 2.4.0 (included) aren't affected because SGE isn't installed on compute nodes, but it's shared from HeadNode

The workarounds

For createami

  • For custom AMI creation, instead of using the pcluster createami command, follow the guide "Building a Custom AWS ParallelCluster AMI -> Modify an AWS ParallelCluster AMI" from this doc

  • As alternative, If you aren't using SGE and still want to use the pcluster createami command, do the following

    • Create a script named patch_skip_sge_installation.sh with the following contents. Replace ParallelCluster version with the version you want to patch:
    #!/bin/bash
    
    VERSION="<parallelcluster-version>"
    git clone https://github.com/aws/aws-parallelcluster-cookbook.git
    cd ./aws-parallelcluster-cookbook
    git fetch origin
    git checkout v$VERSION
    
    sed -i "s/^include_recipe 'aws-parallelcluster::sge_install'/#&/g" recipes/default.rb
    # if you are using "BSD sed" replace the above sed line with the following one
    # sed -i "" "s/^include_recipe 'aws-parallelcluster::sge_install'/#&/g" recipes/default.rb
    
    cd ..
    tar zcvf "$(pwd)/aws-parallelcluster-cookbook-${VERSION}.tgz" ./aws-parallelcluster-cookbook
    
    • Execute the script
    bash patch_skip_sge_installation.sh
    
    • A tar file with name aws-parallelcluster-cookbook-$VERSION.tgz will be generated. Upload the generated cookbook to your S3 bucket.
    • Use the custom cookbook as parameter for the createami command, e.g.
    pcluster createami -ai $BASE_AMI_ID -os $BASE_AMI_OS -cc <custom-cookbook-s3-url>
    
  • If, instead, you are required to use SGE with one of the affected OS, do the following:

    • Spin up an instance from an official ParalleCLuster AMI, OS doesn't matter, version not greater than 2.11.2, e.g. ami-05c34afa8785834ae in us-east-1
    • From the instance, retrieve the file /opt/parallelcluster/sources/sge-8.1.9.tar.gz and upload it into a S3 bucket
    • Create a script named patch_sge_url.sh with the following contents. Replace ParallelCluster version with the version you want to patch and SGE source with the S3 URL where the SGE sources were uploaded:
    #!/bin/bash
    
    VERSION="<parallelcluster-version>"
    git clone https://github.com/aws/aws-parallelcluster-cookbook.git
    cd ./aws-parallelcluster-cookbook
    git fetch origin
    git checkout v$VERSION
    
    sed -i "s/default\['cfncluster'\]\['sge'\]\['url'\]/#&/g" attributes/default.rb
    # if you are using "BSD sed" replace the above sed line with the following one
    # sed -i "" "s/default\['cfncluster'\]\['sge'\]\['url'\]/#&/g" attributes/default.rb
    
    echo "default['cfncluster']['sge']['url'] = '<sge-sources-s3-url>'" >> attributes/default.rb
    
    cd ..
    tar zcvf "$(pwd)/aws-parallelcluster-cookbook-${VERSION}.tgz" ./aws-parallelcluster-cookbook
    
    • Execute the script
    bash patch_sge_url.sh
    
    • A tar file with name aws-parallelcluster-cookbook-$VERSION.tgz will be generated. Upload the generated cookbook to your S3 bucket.
    • Use the custom cookbook as parameter for the createami command, e.g.
    pcluster createami -ai $BASE_AMI_ID -os $BASE_AMI_OS -cc <custom-cookbook-s3-url>
    

For cluster creation For SGE cluster creation, avoid using an AMI not backed with ParallelCluster software stack and build an AMI in advance following one of the above workaround.

Clone this wiki locally