
(3.3.0‐3.9.0) Potential data loss issue when removing storage with update‐cluster in AWS ParallelCluster 3.3.0‐3.9.0


The issue

Starting with ParallelCluster 3.3.0, users can add and remove shared storage from a cluster with a pcluster update-cluster operation. When unmounting a filesystem, ParallelCluster normally performs a lazy unmount of the filesystem and then cleans up the mount point by deleting the mountdir and all subfolders under it. We identified an issue in AWS ParallelCluster versions 3.3.0 to 3.9.0 where this cleanup can race with the lazy unmount: if the filesystem has not been fully detached when the recursive delete runs, data on the shared storage itself may be deleted, resulting in unintended data loss if appropriate backup policies are not in place.
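
The sketch below illustrates the difference between the two cleanup behaviors. It is not the actual ParallelCluster code; /shared is only a placeholder mount point, and the functions are defined but never invoked.

#!/bin/bash
# Illustration only: contrasts the pre-3.9.1 cleanup with the patched behavior.

cleanup_pre_391() {
    # Lazy unmount returns immediately, then the mount point is removed
    # recursively. If the filesystem is still attached when the recursive
    # delete runs, files on the shared storage itself can be deleted.
    umount -l /shared
    rm -rf /shared
}

cleanup_391() {
    # The mountdir is deleted only if it is empty, i.e. only once the unmount
    # has actually completed, so data on the shared storage is never removed.
    umount -l /shared
    rmdir /shared 2>/dev/null || echo "mountdir not empty, skipping delete"
}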

Affected versions (OSes, schedulers)

This issue impacts all ParallelCluster versions from 3.3.0 to 3.9.0, across all OSes, schedulers, and shared storage types.

Mitigation

To mitigate this issue on your existing cluster, we suggest choosing one of the options below based on your use case and the ParallelCluster version you are using. If you choose not to apply any of the options below, we recommend avoiding unmounting your filesystems; if you do unmount them, apply backup policies first to avoid any unintended data loss.

Option 1: Upgrade to the patch release v3.9.1

On 2024-04-11, we published the patch release v3.9.1, which prevents this issue by deleting the mountdir only when it is empty, rather than deleting it recursively together with its subfolders, thereby preventing unintended loss of data. Follow these instructions to upgrade your cluster to ParallelCluster 3.9.1.
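
As a first step of the upgrade, you would typically install the 3.9.1 version of the pcluster CLI in your environment. The commands below are only a minimal sketch (the virtual environment path is an example); the linked instructions cover the full upgrade procedure.

# Install the patched CLI in a Python virtual environment (example path)
python3 -m venv ~/pcluster-3.9.1-venv
source ~/pcluster-3.9.1-venv/bin/activate
pip install --upgrade "aws-parallelcluster==3.9.1"
pcluster version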

Option 2: Apply in-place patch for clusters deployed using AWS ParallelCluster v3.3.0 - v3.8.0

If upgrading your clusters is not the right option, you can apply the patch to the head node using the following instructions (a consolidated sketch of these steps is shown after the list):

  • Download the script to a working directory on your head node using one of the following commands:

    • curl https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh -o patch-recursive-delete.sh

      OR

    • aws s3api get-object --bucket us-east-1-aws-parallelcluster --key patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh patch-recursive-delete.sh

      Note: The EC2 instance from which you run the s3api get-object command needs s3:GetObject permissions.

  • Make the script executable using the following command:

    • chmod +x patch-recursive-delete.sh
  • Choose your preferred download method (https or s3) and execute one of the commands below. Run the script with sudo privileges so it can modify the files in /etc/chef:

    • with https: sudo ./patch-recursive-delete.sh https
    • with s3 (GetObject permissions and AWS credentials are required): sudo ./patch-recursive-delete.sh s3
  • Following the successful execution of the script, you'll see a "Cookbook successfully patched" message.
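
For convenience, the steps above can be combined into the following sequence, shown here for the https download method and assuming it is run from a working directory on the head node:

# Download, make executable, and run the patch script over https (run on the head node)
curl https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh -o patch-recursive-delete.sh
chmod +x patch-recursive-delete.sh
sudo ./patch-recursive-delete.sh https
# On success the script prints "Cookbook successfully patched"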

Option 3: Apply in-place patch for clusters deployed using AWS ParallelCluster v3.9.0

As ParallelCluster 3.9.0 allows updating shared storage without requiring a compute fleet stop, the in-place patch needs to be applied to both new nodes and running nodes:

Patching new nodes

To patch new compute nodes, execute the patching script with an OnNodeStart custom action.

  • Add one of the configurations below to your cluster configuration, based on the download method you prefer:
# Using S3
CustomActions:
  OnNodeStart:
    Script: s3://us-east-1-aws-parallelcluster/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh
    Args:
      - s3

OR

# Using https
CustomActions:
  OnNodeStart:
    Script: https://us-east-1-aws-parallelcluster.s3.us-east-1.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh
    Args:
      - https
  • Set QueueUpdateStrategy to COMPUTE_FLEET_STOP in the cluster configuration to prevent the replacement of running compute nodes during the update. Revert the strategy to the one you were using only once the forced update completes successfully. If the replacement of compute nodes is acceptable in your case, you can set the strategy to DRAIN or TERMINATE, so that running compute nodes are replaced by new ones with the patch applied.
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: COMPUTE_FLEET_STOP
  • Request a forced update by submitting the pcluster update-cluster command as follows (a sketch for verifying the update status follows the command):
pcluster update-cluster \
--region REGION \
--cluster-name CLUSTER_NAME \
--cluster-configuration CONFIG_PATH \
--force-update True
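
Once the forced update completes, you can optionally confirm that the cluster is back in a stable state; REGION and CLUSTER_NAME are the same placeholders used above.

# Check the cluster status after the forced update
pcluster describe-cluster \
--region REGION \
--cluster-name CLUSTER_NAME
# In the output, clusterStatus should be "UPDATE_COMPLETE" once the update has finished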

Patching running nodes

To patch running nodes, you must execute the patching script on the whole fleet, either using SSM or manually leveraging the scheduler.

With SSM (recommended approach)

This procedure requires the user to have the arn:aws:iam::aws:policy/AmazonSSMFullAccess policy and the cluster nodes to have the arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore policy.
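
If your cluster nodes do not already have the instance policy attached, one possible way to add it with the AWS CLI is sketched below; the role name is a placeholder for the instance role used by your cluster nodes.

# Attach the SSM managed-instance policy to a cluster instance role (role name is a placeholder)
aws iam attach-role-policy \
--role-name YOUR_CLUSTER_INSTANCE_ROLE \
--policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore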

  1. Take note of the cluster name, as you will use it in the command below.

  2. Create an S3 bucket in the same region where the cluster is deployed; it will be used to store the logs generated by SSM.

  3. Execute the patching script on your running fleet using one of the commands below:

    • Using HTTPS to download objects from S3
# Set variables with your values
CLUSTER_NAME="dloss-0417-1"
PATCHING_SCRIPT_HTTPS_URL="https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh"
BUCKET="mgiacomo-workspace-eu-west-1"
REGION="eu-west-1"

aws ssm send-command \
--document-name "AWS-RunShellScript" \
--document-version "1" \
--targets "[{\"Key\":\"tag:parallelcluster:cluster-name\",\"Values\":[\"$CLUSTER_NAME\"]}]" \
--parameters "{\"workingDirectory\":[\"\"],\"executionTimeout\":[\"3600\"],\"commands\":[\"curl $PATCHING_SCRIPT_HTTPS_URL -o patch-recursive-delete.sh\",\"chmod +x patch-recursive-delete.sh\",\"sudo ./patch-recursive-delete.sh https\"]}" \
--comment "pcluster-patch-recursive-delete" \
--timeout-seconds 600 \
--max-concurrency "50" \
--max-errors "0" \
--output-s3-bucket-name "$BUCKET" \
--output-s3-key-prefix "ssm/run-command/pcluster-patch-recursive-delete" \
--cloud-watch-output-config '{"CloudWatchOutputEnabled":true}' \
--region $REGION
    • Using AWS CLI to download objects from S3
# Set variables with your values
CLUSTER_NAME="dloss-0417-1"
PATCHING_SCRIPT_BUCKET="us-east-1-aws-parallelcluster"
PATCHING_SCRIPT_KEY="patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh"
BUCKET="mgiacomo-workspace-eu-west-1"
REGION="eu-west-1"

aws ssm send-command \
--document-name "AWS-RunShellScript" \
--document-version "1" \
--targets "[{\"Key\":\"tag:parallelcluster:cluster-name\",\"Values\":[\"$CLUSTER_NAME\"]}]" \
--parameters "{\"workingDirectory\":[\"\"],\"executionTimeout\":[\"3600\"],\"commands\":[\"aws s3api get-object --bucket $PATCHING_SCRIPT_BUCKET --key $PATCHING_SCRIPT_KEY patch-recursive-delete.sh\",\"chmod +x patch-recursive-delete.sh\",\"sudo ./patch-recursive-delete.sh s3\"]}" \
--comment "pcluster-patch-recursive-delete" \
--timeout-seconds 600 \
--max-concurrency "50" \
--max-errors "0" \
--output-s3-bucket-name "$BUCKET" \
--output-s3-key-prefix "ssm/run-command/pcluster-patch-recursive-delete" \
--cloud-watch-output-config '{"CloudWatchOutputEnabled":true}' \
--region $REGION
  4. Monitor the execution in the SSM Console, or check it from the CLI as sketched below.
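
COMMAND_ID below is a placeholder for the CommandId value returned by the send-command call.

# Check the status of the patching command on each targeted instance
aws ssm list-command-invocations \
--command-id COMMAND_ID \
--details \
--region $REGION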

Without SSM

  1. Patch the head node by executing the Usage Instructions on it.

  2. Patch the login nodes by executing the Usage Instructions on each of them.

  3. Patch the compute nodes by running the patching script as a Slurm job, as follows (a sketch for building NODE_LIST follows the command):

    • sbatch -w NODE_LIST --wrap "curl https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh -o patch-recursive-delete.sh ; chmod +x patch-recursive-delete.sh ; sudo ./patch-recursive-delete.sh https"
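
In the command above, NODE_LIST is a Slurm hostlist of the compute nodes to patch. One way to build it, assuming all currently responding compute nodes should be patched, is sketched below:

# Build a comma-separated hostlist of all responding compute nodes (adjust the filter to your needs)
NODE_LIST=$(sinfo --responding --noheader --format "%N" | paste -sd, -)
echo "$NODE_LIST"
# Then pass $NODE_LIST as the -w argument of the sbatch command above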

Note

New login nodes cannot be patched at launch because they do not support OnNodeStart actions. Every new login node must be manually patched by following the procedure described earlier in the Patching running nodes section.
