[EMR] ValidationException: An instance group may only be modified when the cluster is running or waiting #9400

Closed
ghost opened this issue Jul 18, 2019 · 6 comments · Fixed by #10425

@ghost commented Jul 18, 2019

This issue was originally opened by @hdryx as hashicorp/terraform#22116. It was migrated here as a result of the provider split. The original body of the issue is below.


Terraform Version

v0.12.3

Terraform Configuration Files

resource "aws_emr_cluster" "cluster" {
  name          = "${var.project}-${var.cluster_name}-${var.environment}"
  release_label = "${var.emr_release_label}"
  applications  = "${var.emr_application}"

  ec2_attributes {
    subnet_id                         = "${data.aws_subnet.private_1.id}"
    emr_managed_master_security_group = "${data.aws_security_group.sg_emr_master.id}"
    emr_managed_slave_security_group  = "${data.aws_security_group.sg_emr_slave.id}"
    instance_profile                  = "${data.aws_iam_instance_profile.iam_emr_instance_profile.arn}"


    # Because we are launching the EMR in a private subnet we should use a service_access_security_group
    # service_access_security_group = "${aws_security_group.emr_service_access.id}"
    service_access_security_group = "${data.aws_security_group.sg_emr_service_access.id}"
  }
  master_instance_group {
    instance_type = "${var.master_instance_type}"
    bid_price      = "${var.spot_core_bid_price}"
    
    ebs_config {
      size                 = "${var.ebs_config_size}"
      type                 = "${var.ebs_config_type}"
      volumes_per_instance = "${var.ebs_config_volume_per_instance}"
    }
  }


  core_instance_group {
    instance_type  = "${var.core_instance_type}"
    instance_count = "${var.core_instance_count}"
    bid_price      = "${var.spot_core_bid_price}"

    ebs_config {
      size                 = "${var.ebs_config_size}"
      type                 = "${var.ebs_config_type}"
      volumes_per_instance = "${var.ebs_config_volume_per_instance}"
    }
  }

  ebs_root_volume_size = "${var.ebs_root_volume_size}"

  tags = "${merge(var.resource_tagging, map("Name", "${var.project}-${var.environment}-emr-cluster"))}"

  lifecycle {
    create_before_destroy = true
  }

  log_uri = "s3://${data.aws_s3_bucket.logs.id}/emr"

  # Terminate cluster when steps are done
  keep_job_flow_alive_when_no_steps = "${var.keep_job_no_steps}"

  bootstrap_action {
    path = "s3://${data.aws_s3_bucket.sources.id}/src/shell/${var.bootstrap_file}"
    name = "${var.bootstrap_name}"
    args = "${var.bootstrap_args}"
  }

  # Configuration of the cluster
  configurations_json = "${var.configuration_file != "" ? file("config/${var.configuration_file}") : ""}"

  # Role for the cluster
  service_role = "${data.aws_iam_role.iam_emr_service_role.arn}"


  # Steps to be executed by the cluster
  dynamic "step" {
    # for_each = jsondecode(templatefile("${path.module}/steps/${var.step_file}", {
    for_each = jsondecode(templatefile("../03_emr/steps/${var.step_file}", {
      # General Variables
      s3_sources      = "${data.aws_s3_bucket.sources.id}"
    }))

    content {
      action_on_failure = step.value.action_on_failure
      name              = step.value.name
      hadoop_jar_step {
        jar  = step.value.hadoop_jar_step.jar
        args = step.value.hadoop_jar_step.args
      }
    }
  }

}


resource "aws_emr_instance_group" "task" {
  name           = "${var.project}-${var.cluster_name}-instance-${var.environment}"
  cluster_id     = "${aws_emr_cluster.cluster.id}"
  instance_count = "${var.spot_core_instance_count}"
  instance_type  = "${var.spot_core_instance_type}"
  bid_price      = "${var.spot_core_bid_price}"

  depends_on = ["aws_emr_cluster.cluster"]
}

...
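
For context, the dynamic "step" block above decodes a JSON template; a hypothetical example of such a step file follows (the file contents and args are illustrative, not from the original report; only the keys the content block actually reads are included, and ${s3_sources} is substituted by templatefile):

[
  {
    "action_on_failure": "CONTINUE",
    "name": "example-spark-step",
    "hadoop_jar_step": {
      "jar": "command-runner.jar",
      "args": ["spark-submit", "s3://${s3_sources}/src/python/job.py"]
    }
  }
]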

Crash Output

Error: error draining EMR Instance Group (ig-24CP9QNA1THDI): ValidationException: An instance group may only be modified when the cluster is running or waiting.
status code: 400, request id: f6b923cc-a935-11e9-97e8-993b41767f35

Expected Behavior

I expected the task instance group to be created and added to the EMR cluster.

Actual Behavior

The EMR cluster launches, but the task instance group is not created (see the error above).

@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Jul 18, 2019
@aeschright aeschright added the service/emr Issues and PRs that pertain to the emr service. label Aug 2, 2019
@klsnreddy commented

I am facing this same issue as well. Is there a workaround?

@rlvrs commented Sep 15, 2019

Terraform Version: 0.11.14
Terraform AWS provider Version: 2.25

I have the same problem. As I investigate further, I will update this answer with more details.
I have found two workarounds so far, but I am not happy with either of them; I am counting on you to help me reach a neater solution :)
In the meantime, they might help you better understand the problem.

If you need more details than below, please let me know.

In our scenario, we use an EMR cluster with instance groups. The cluster creation is defined in a Terraform module, so we try to be as DRY as possible.
Each time we want to run a Spark job, it creates an EMR cluster for itself, runs, and dies (in the case of a batch job); see the sketch of the relevant settings below.
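
To make that lifecycle concrete, here is a minimal sketch of the auto-terminate settings involved (keep_job_flow_alive_when_no_steps also appears in the full configuration above; all other arguments are omitted, and the termination_protection line is an assumption about a typical setup):

resource "aws_emr_cluster" "cluster" {
  # ... name, release_label, instance groups, etc. as in the full config above ...

  # When the last step finishes, the cluster shuts itself down
  keep_job_flow_alive_when_no_steps = false

  # Termination protection must be off for the cluster to self-terminate
  termination_protection = false
}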

After the job finishes, the EMR cluster itself terminates as well.
At this point, the Terraform state file still declares an aws_emr_instance_group resource, say ig-24CP9QNA1THDI. However, as you can see here, the aws_emr_instance_group should be deleted after termination:

Instance Groups are destroyed when the EMR Cluster is destroyed

The next Terraform plan will say that a new resource is required (-/+ module.emr_cluster_module.aws_emr_instance_group.task (new resource required)), because the instance group is still in the state file. However, the apply will fail with the above error:

Error: error draining EMR Instance Group (ig-24CP9QNA1THDI): ValidationException: An instance group may only be modified when the cluster is running or waiting.
status code: 400, request id: f6b923cc-a935-11e9-97e8-993b41767f35

This happens because the cluster is, obviously, terminated, so it is in neither the "running" nor the "waiting" state. I presume this error is due to the aws_emr_instance_group resource now being in the state file, which was not the case with the previous syntax (but this is just a hunch).
However, I was expecting it to create a new aws_emr_instance_group for the tasks if the previous one has already been destroyed, rather than throwing this error.

The following workarounds work for me, but as previously stated, I am looking for a better solution.
Workaround 1: Fall back to the previous syntax and roll back the changes here. This is not advisable, since the previous syntax will be deprecated, as seen here and here.
Workaround 2: Remove the instance group directly from the state file (see the command sketch below). This is even worse, but I include it for debugging purposes.
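
For Workaround 2, Terraform's built-in state rm command removes the stale resource from the state file without touching real infrastructure; the resource address below is the one from the plan output above, so substitute your own:

terraform state rm module.emr_cluster_module.aws_emr_instance_group.task

The next plan should then propose creating the instance group fresh instead of trying to drain the already-terminated one.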

This feature was introduced here and here. I only skimmed through the code, but I don't see this behavior being tested (I must say this was my first time looking at this code).

For more information on the issue:
https://www.terraform.io/docs/providers/aws/guides/version-3-upgrade.html
#8245
https://github.com/terraform-providers/terraform-provider-aws/pull/8459/files
https://github.com/terraform-providers/terraform-provider-aws/pull/8078/files

@joelthompson (Contributor) commented
I think this is a duplicate of #1355

@bflad bflad added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Oct 10, 2019
@bflad bflad added this to the v2.32.0 milestone Oct 10, 2019
bflad pushed a commit that referenced this issue Oct 10, 2019
In situations where Terraform needs to replace an aws_emr_cluster
resource that has aws_emr_instance_group resources associated with it,
Terraform tries to execute a destroy on the instance group, but this
fails: the notion of a "destroy" on an instance group is to set its
instance count to zero, and AWS doesn't let you modify the instance
count of an instance group on a terminated EMR cluster. This fixes the
issue by treating an instance group whose cluster has terminated as no
longer existing, so Terraform won't try to execute a "destroy" and
error out.

Fixes #1355
Fixes #9400
@bflad (Contributor) commented Oct 10, 2019

The fix for this has been merged and will be released in version 2.32.0 of the Terraform AWS Provider later today. Thanks to @joelthompson for the implementation. 🎉

@ghost (Author) commented Oct 10, 2019

This has been released in version 2.32.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.
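
If you pin the provider version, picking up the fix is a one-line change; a minimal sketch, assuming a Terraform 0.12-style provider block (the region value is illustrative):

provider "aws" {
  version = "~> 2.32" # allows >= 2.32.0, < 3.0.0
  region  = "eu-west-1"
}

Then run terraform init to download the updated provider.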

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@ghost (Author) commented Nov 9, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost locked and limited conversation to collaborators Nov 9, 2019