Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All vpc_security_group_ids being removed when updating instance Security Groups #9043

Closed
scottgeary opened this issue Sep 26, 2016 · 17 comments · Fixed by #12897
Closed

All vpc_security_group_ids being removed when updating instance Security Groups #9043

scottgeary opened this issue Sep 26, 2016 · 17 comments · Fixed by #12897
Assignees

Comments

@scottgeary
Copy link

When trying to modify AWS Security Groups on an already Terraformed EC2 instance, it seems to forget about all existing SGs (even though they appear in the statefile), and the plan output initially looks legit.

We originally found this issue, when trying to add a brand-new SG. Terraform removed all existing SGs (and caused some chaos!), but successfully added the new one. We've since reduced the problem down to just removing an existing SG, or somewhere parsing the existing SGs from the vpc_security_group_ids field.

In the following example:

  • One EC2 instance called ganglia-a
  • 4 existing Security Groups AAAAAA, BBBBBB, CCCCCC, & DDDDDD
  • Trying to remove BBBBBB.
  • Statefile shows the existing 4 SGs, (and all have their definitions too).
  • Terraform plan shows the reduction in count from 4=>3
  • Terraform plan shows the SG being removed
  • Terraform apply attempts to clear out all SGs. Only limited in this case by the AWS API complaining about lack of parameter payload.

Terraform Version

Terraform v0.7.4

Affected Resource(s)

  • aws_instance
  • vpc_security_group_ids

Terraform Configuration Files

Existing Statefile showing existing 4 SGs:

                "aws_instance.ganglia-a": {
                    "type": "aws_instance",
                    "depends_on": [
                        "aws_security_group.CCCCCC",
                        "aws_security_group.DDDDDD",
                        "aws_security_group.AAAAAA",
                        "aws_subnet.blah-public",
                        "data.template_file.ganglia-cloudinit"
                    ],
                    "primary": {
                        "id": "i-INSTANCE-ID-HERE",
                        "attributes": {
                            "ami": "ami-AMI-ID",
                            "associate_public_ip_address": "false",
                            "availability_zone": "us-west-2a",
                            "disable_api_termination": "false",
                            "ebs_block_device.#": "0",
                            "ebs_optimized": "false",
                            "ephemeral_block_device.#": "0",
                            "iam_instance_profile": "",
                            "id": "i-INSTANCE-ID-HERE",
                            "instance_state": "running",
                            "instance_type": "m4.large",
                            "key_name": "dom",
                            "monitoring": "false",
                            "network_interface_id": "eni-xxxxxx",
                            "private_dns": "blah.compute.internal",
                            "private_ip": "SOME_IP",
                            "public_dns": "blah.compute.amazonaws.com",
                            "public_ip": "SOME_IP",
                            "root_block_device.#": "1",
                            "root_block_device.0.delete_on_termination": "true",
                            "root_block_device.0.iops": "192",
                            "root_block_device.0.volume_size": "64",
                            "root_block_device.0.volume_type": "gp2",
                            "security_groups.#": "0",
                            "source_dest_check": "true",
                            "subnet_id": "subnet-ID",
                            "tags.%": "2",
                            "tags.Cost Center": "Platform",
                            "tags.Name": "ganglia",
                            "tenancy": "default",
                            "user_data": "SOME_HASH_HERE",
                            "vpc_security_group_ids.#": "4",
                            "vpc_security_group_ids.1389827218": "sg-AAAAAA",
                            "vpc_security_group_ids.258711328": "sg-DDDDDD",
                            "vpc_security_group_ids.4034046821": "sg-CCCCCCC",
                            "vpc_security_group_ids.4249215419": "sg-BBBBBB"
                        },
                        "meta": {
                            "schema_version": "1"
                        },
                        "tainted": false
                    },
                    "deposed": [],
                    "provider": ""
                }

Existing Terraform file:

# ganglia
resource "aws_instance" "ganglia-a" {
  ami = "${var.ubuntu_ami_id}"
  availability_zone = "${var.vpc_region}a"
  instance_type = "m4.large"
  key_name = "dom"
  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.BBBBBB.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}" ]
  subnet_id = "${aws_subnet.blah-public.0.id}"
  user_data = "${data.template_file.SOMEDATA.rendered}"

  lifecycle {
    ignore_changes = ["user_data"]
  }

  root_block_device {
    volume_type = "gp2"
    volume_size = "${var.default_volume_size}"
  }

  provisioner "local-exec" {
    command = "../../../scripts/autosign ${var.foreman_proxy_url} ganglia-${replace(aws_instance.ganglia-a.id, "i-", "")}.${var.vpc_domain}"
  }

  tags {
    Name = "ganglia"
    "Cost Center" = "Platform"
  }
}

Actual change we're trying to apply (removal of BBBBBB)

-  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.BBBBBB.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}"]
+  vpc_security_group_ids = ["${aws_security_group.AAAAAA.id}", "${aws_security_group.CCCCCC.id}", "${aws_security_group.DDDDDD.id}"]

Terraform Plan

~ aws_instance.ganglia-a
    vpc_security_group_ids.#:          "4" => "3"
    vpc_security_group_ids.4249215419: "sg-BBBBBB" => ""

Debug Output

Expected Behavior

  • A single SG aws_security_group.BBBBBB being removed
  • Terraform plan usually shows elements to be kept.

Actual Behavior

  • Attempts to removes all SGs
  • Causes AWS API error, about lack of parameters.

Steps to Reproduce

  1. terraform plan
  2. terraform apply

Important Factoids

  • I've removed a lot of IDs from the various output.
  • If we try adding a brand new SG, it removes all existing SGs, and leaves the new one as the only one. (This is how we discovered the bug)

References

  • None that I could find that clear the existing SGs.
@kwilczynski
Copy link
Contributor

@scottgeary hi there! I am very sorry that you are affected by this!

We are having a look, as this is a very hard to catch issue, as I personally could not reproduce it successfully. I will keep you posted once we find out what might be creeping there!

@rafaelmagu
Copy link

I walked @stack72 through this on my machine, but we still couldn't figure out why. Does not appear to be an API-limit-related issue.

@br0ch0n
Copy link
Contributor

br0ch0n commented Oct 26, 2016

@scottgeary After two days of head banging I just realized that I'm having this exact same problem. Mine manifested when I tried to add 1 SG to an existing array of 4. Repeated applies flip-flop between just the new 1 and just the original 4. When doing a delete it's the same behavior except aws won't let you send it 0 SGs. Otherwise yours would flip-flop too between setting your instance to 4 then 0 then 4 then 0.

I believe it's yet another ignore_changes bug. I had just recently added ignore_changes['user_data'] to my resource just like above. I thought it was working now (to be fair, it does ignore the user_data now) :(

One can reliably reproduce the issue like below. First, create an instance:

resource "aws_instance" "mystery_test" {
  provider = "aws.mystery"
  instance_type = "t2.nano"
  user_data = "foo"
  lifecycle {
    ignore_changes = [ "user_data" ]
  }
  vpc_security_group_ids = ["sg-AAAAAAAA", "sg-BBBBBBBB"]
  ami       = "ami-a1288ec2"
  subnet_id = "${var.subnetid}"
}

Then, change user_data "foo" to "bar" and also add a new SG to the array.

user_data = "bar"
  lifecycle {
    ignore_changes = [ "user_data" ]
  }
  vpc_security_group_ids = ["sg-AAAAAAAA", "sg-BBBBBBBB", "sg-CCCCCCCC]

a plan will give you strange output:

~ module.mysterymod.aws_instance.mystery_test
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.3333333333: "" => "sg-CCCCCCCC"

Note how the plan doesn't show the expected

vpc_security_group_ids.1111111111: "sg-AAAAAAAA" => "sg-AAAAAAAA"
vpc_security_group_ids.222222222: "sg-BBBBBBBB" => "sg-BBBBBBBB"

entries. This is bad :)
In the trace log of the apply, you'll see it POST just

Action=ModifyInstanceAttribute&GroupId.1=sg-CCCCCCCC&InstanceId=i123455678&Version=2016-09-15

The action should be all 3 groups (GroupId.1 and GroupId.2 and GroupId.3)
What seems to be happening is that ignore_changes catches the user_data change (foo => bar) which then makes it ignore other diffs (or at least mess things up)

2016/10/26 14:35:37 [DEBUG] root.mystery: eval: *terraform.EvalDiff
2016/10/26 14:35:37 [DEBUG] Removing 'id' diff and setting Destroy to false because after ignore_changes, this diff no longer requires replacement
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: user_data
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: vpc_security_group_ids.11111111
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: security_groups.#
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: root_block_device.#
2016/10/26 14:35:37 [DEBUG] [EvalIgnoreChanges] aws_instance.mystery - Ignoring diff attribute: ebs_block_device.#
...

If you then remove the ignore_changes line and run a plan at this point, you'll get a plan that of course wants to recreate the instance, but the sg change will at least look correct:

-/+ module.mystery.aws_instance.mystery
    ami:                               "ami-a1288ec2" => "ami-a1288ec2"
...
    user_data:                         "0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33" => "62cdb7020ff920e5aa642c3d4066950dd1f01f4d" (forces new resource)
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.3333333333: "" => "sg-CCCCCCCC"
    vpc_security_group_ids.11111111111: "sg-AAAAAAAA" => "sg-AAAAAAAA"
    vpc_security_group_ids.2222222222: "sg-BBBBBBBB" => "sg-BBBBBBBB"

I of course added ignore_changes recently because I needed to update my user_data without reprovisioning lots of nodes, thus I'm now stuck in a state where I get to pick between adding a SG and reprovisioning lots of nodes :(

I will add that as a workaround, you can manually attach the new SG to your instance via console/cli and Terraform will not complain about it. Probably works for removing one also.

Anyway, seems ignore_changes needs to be examined further. Hope this helps!
@phinze

@br0ch0n
Copy link
Contributor

br0ch0n commented Oct 28, 2016

Just to add a bit more proof, I played a bit last night with eval_diff.go and tried (around line 240) :

   // If we didn't hit any of our early exit conditions, we can filter the diff.
    for k := range ignorableAttrKeys {
        if !(strings.Contains(k, "vpc_security_group_ids")) {
            log.Printf("[DEBUG] [EvalIgnoreChanges] %s - Ignoring diff attribute: %s",
                n.Resource.Id(), k)
            diff.DelAttribute(k)
        }
    }

running this against my earlier example resource results in the correct plan (didn't try applying). This suggests that one approach might be to make sure we're not deleting diff attributes that we care about in this particular run. Note that if I run a plan using the above hacked TF against other resources that have a pending recreate due to user data, are blocked from doing so by ignore_changes, but have no other changes (vpc_security_group_ids or otherwise), I get a useless diff in my plan for all of them, e.g.

"A" => "A"
"B" => "B"
"C" => <computed>

Looking further it seems the problem is a bit earlier where we do:

   if changeType == DiffDestroyCreate {
        for k, v := range diff.CopyAttributes() {
            if v.Empty() || v.NewComputed {
                ignorableAttrKeys[k] = true
            }
        }

problem being that this ends up deleting part of our diff, e.g. A => A, since that's matched by .Empty(). This is likely a problem for all list types of attributes. Seems that we should look for the existence of a parent counter key, e.g. "vpc_security_group_ids.#" and if it's changing, we should exclude all child entries from the ignorableAttrKeys map.

@br0ch0n
Copy link
Contributor

br0ch0n commented Nov 2, 2016

What we actually need to do is check if any siblings are changing. Potential fix in #9791

@ekristen
Copy link

ekristen commented Jan 8, 2017

I'm seeing this bug too, I'm adding a new security group to the instance and the original 2 are disappearing and it everytime I run plan it flips flops back and forth.

~ aws_instance.management.0
    vpc_security_group_ids.#:          "2" => "3"
    vpc_security_group_ids.1111564906: "" => "sg-aaaa4197"

then

~ aws_instance.management.0
    vpc_security_group_ids.#:          "1" => "3"
    vpc_security_group_ids.111142371:  "" => "sg-aaaa417a"
    vpc_security_group_ids.1111450837: "" => "sg-aaaa4184"

@Gary-Armstrong
Copy link

Gary-Armstrong commented Jan 13, 2017

I'm seeing this as well. Earlier in the week it was that a vpc_security_group_ids entry was not being removed, and so I removed it via console. The plan output looked much like ekristen's comment. This was relatively benign and so I didn't bother to enter an issue on it.

Today however, this same issue had some destructive behavior. It removed all SG except the one I was adding; luckily I was doing this on a dev machine. But I must proceed with modification of SG across many more instances.

Early warning from plan:

~ aws_instance.compute_100_dev
    vpc_security_group_ids.#:         "3" => "4"
    vpc_security_group_ids.423628628: "" => "sg-asdfasdf"

Count goes to 4 so I'm sure everything will be OK and so I apply.

aws_instance.compute_100_dev: Modifying...
  vpc_security_group_ids.#:         "3" => "4"
  vpc_security_group_ids.423628628: "" => "sg-asdfasdf"
aws_instance.compute_100_dev: Modifications complete

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Close enough, right? Probably just some display error.

Then I validate in console and the only SG attached to the instance is sg-asdfasdf, so I begin to cry again. I wipe my face and then plan:

~ aws_instance.compute_100_dev
    vpc_security_group_ids.#:          "1" => "4"
    vpc_security_group_ids.1967715036: "" => "sg-12341234"
    vpc_security_group_ids.2402598217: "" => "sg-qwerqwer"
    vpc_security_group_ids.257549500:  "" => "sg-zxcvzxcv"


Plan: 0 to add, 1 to change, 0 to destroy.

Hooray. This looks more correct. I mean, it isn't showing sg-asdfasdf but surely this is just a display bug.
Apply:

aws_instance.compute_100_dev: Modifying...
  vpc_security_group_ids.#:          "1" => "4"
  vpc_security_group_ids.1967715036: "" => "sg-12341234"
  vpc_security_group_ids.2402598217: "" => "sg-qwerqwer"
  vpc_security_group_ids.257549500:  "" => "sg-zxcvzxcv"
aws_instance.compute_100_dev: Modifications complete

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Console reveals that I again have only 3 SG and the new one isn't being put into place. It is too early in the day to start drinking.

Maybe I can make a workaround. Advice appreciated.

@br0ch0n
Copy link
Contributor

br0ch0n commented Jan 13, 2017

@ekristen @Gary-Armstrong can you guys test my fix from #9791 ? Maybe it'll get more traction if it works for you.

@Gary-Armstrong
Copy link

Gary-Armstrong commented Jan 13, 2017

Update:

  vpc_security_group_ids = [
    "${aws_security_group.wxmix_http_sg.id}",
    "${aws_security_group.wxmix_admin_sg.id}"
  ]

Using this HCL where the 2 SG are completely unrelated, when I do a plan/apply cycle it will switch to whichever SG isn't applied to the instance.

] aws ec2 describe-instances --instance-id i-xxxxxxxx|jq .Reservations[].Instances[].SecurityGroups[]
{
  "GroupName": "wxmix_http_sg",
  "GroupId": "sg-aaaaaaaa"
}
...TF APPLY...
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
...
] aws ec2 describe-instances --instance-id i-xxxxxxxx|jq .Reservations[].Instances[].SecurityGroups[]
{
  "GroupName": "wxmix_admin_sg",
  "GroupId": "sg-bbbbbbbb"
}

Just by doing an apply, it will flip back and forth between the two SG.

@Gary-Armstrong
Copy link

More experiments reveal the pattern (in my case) is to replace existing list items with new list items, rather than merge the list and replace existing list items with new merged list items.

@Gary-Armstrong
Copy link

@br0ch0n If it involves compiling a new terraform, I'll not participate. I have to manually change SG assignments on many instances today.

@mpasternacki
Copy link

Still happens in 0.8.6

@br0ch0n
Copy link
Contributor

br0ch0n commented Mar 2, 2017

@stack72 is the prototype you mentioned in #1887 meant to work around this as well (i.e. you're planning on completely refactoring ignore_changes)? Otherwise can we move forward with #9791? Happy to tweak it if it's not an ideal solution.

@cswilliams
Copy link

just got bit by this bug as well today. Added a new security group to my vpc_security_group_ids list and it removed all existing security groups from my instance and only added the new security group.

@catsby catsby added core and removed provider/aws labels Mar 17, 2017
@catsby
Copy link
Contributor

catsby commented Mar 17, 2017

Thanks for the research here. I've reproduced this issue with the following gist and instructions above:

I'm relabeling this as a core bug, the root of it seems to involve ignore_changes as mentioned.

@catsby
Copy link
Contributor

catsby commented Mar 21, 2017

Hey all – looks like this should be patched up with #12897 , will go out in the next release

@ghost
Copy link

ghost commented Apr 15, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.