*: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) #151

wking · 2018-08-20T23:03:54Z

Patterned on the existing worker_ingress_kubelet_insecure_from_master from b620c16 (coreos/tectonic-installer#264).

This should address errors like:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/150/pull-ci-origin-installer-e2e-aws/546/artifacts/e2e-aws/nodes/ip-10-0-52-134.ec2.internal/journal.gz | zcat >journal
$ journalread journal | grep 'current config label' | head -n3
2018-08-20T20:30:08.000827895Z I0820 20:30:08.826840       1 tnc.go:375] Node ip-10-0-134-147.ec2.internal does not have a current config label
2018-08-20T20:30:08.00082814Z  I0820 20:30:08.826860       1 tnc.go:375] Node ip-10-0-153-195.ec2.internal does not have a current config label
2018-08-20T20:30:08.000828371Z I0820 20:30:08.826866       1 tnc.go:375] Node ip-10-0-166-239.ec2.internal does not have a current config label

on the master node:

$ journalread journal | grep -A15 'Starting Ignition' | grep -v INFO
2018-08-20T20:21:40.00097323Z  Starting Ignition (files)...
2018-08-20T20:21:40.000991225Z DEBUG    : parsed url from cmdline: ""
2018-08-20T20:21:41.000010266Z DEBUG    : parsing config: {
2018-08-20T20:21:41.0000122Z   "ignition": {
2018-08-20T20:21:41.000014165Z "config": {
2018-08-20T20:21:41.000016186Z "append": [
2018-08-20T20:21:41.000018193Z {
2018-08-20T20:21:41.000023296Z "source": "http://ci-op-imi5mbig-68485-tnc.origin-ci-int-aws.dev.rhcloud.com:80/config/master",

which were resulting in:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/150/pull-ci-origin-installer-e2e-aws/546/build-log.txt | grep Ginkgo
   |  Ginkgo timed out waiting for all parallel nodes to report back!  |
 Ginkgo ran 1 suite in 10m6.626061944s

Inbound 10250 is the kubelet API used by the control plane. @smarterclayton suspects the e2e-aws tests are trying to get metrics from the kubelets, and hanging on the etcd kubelet because this rule was missing. I'm not clear why we've only been seeing this issue for the last week though.

The third commit in this PR adds the new rule. The previous two commits pivot from inline ingress and egress rules to stand-alone aws_security_group_rule resources, finishing a transition away from inline ingress/egress rules begun by coreos/tectonic-installer#264. More details in the first two commit messages.

wking · 2018-08-21T01:59:28Z

The e2e-aws error was:

Waiting for API at https://ci-op-ng25jj4j-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
...
Waiting for API at https://ci-op-ng25jj4j-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
Interrupted

I don't know if that's further along than the Ginkgo error or not, but I'll check the node logs later.

wking · 2018-08-21T16:13:19Z

I don't know if that's further along than the Ginkgo error or not, but I'll check the node logs later.

Unfortunately, it looks like job 553 failed to capture node logs, so I don't know if this has addressed the "does not have a current config label" issue or not. On the off chance that the failed-log-capture was a flake, I'll try the test again:

/retest

eparis · 2018-08-21T18:18:40Z

modules/aws/vpc/sg-etcd.tf

-    self      = true
-  }
+  protocol  = "tcp"
+  from_port = 10250


Are all of the 'from' correct? don't most things trying to connect to here use a random high port?

Are all of the 'from' correct? don't most things trying to connect to here use a random high port?

Yeah, that doesn't make sense to me either. But it's what we have had in master since forever; see, for example, here. Still, I'll drop them and see if that helps.

Are all of the 'from' correct? don't most things trying to connect to here use a random high port?

Yeah, that doesn't make sense to me either...

Ah, from_port and to_port are for a range of ports on a single host, not the ports for both hosts involved in the connection. So having from_port makes sense.

wking · 2018-08-21T21:07:30Z

I got the same error again with build 555, so probably not a flake. Still not sure what's going on there...

wking · 2018-08-21T21:25:30Z

I've spun off some tangential changes into #155 and #156. I don't think they were making semantic changes, but they were at least touching the API ingress rule (on port 6443) and therefore might have been causing our API timeouts.

wking · 2018-08-21T23:55:50Z

I've spun off some tangential changes into #155 and #156.

And we're still getting:

Waiting for API at https://ci-op-hpjfcvwz-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
...
Waiting for API at https://ci-op-hpjfcvwz-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
Interrupted

wking · 2018-08-22T16:37:44Z

Looks like I may have broken the Terraform ignition setup? The e2e-aws test failed with:


* module.assets_base.data.ignition_config.etcd: 3 error(s) occurred:

* module.assets_base.data.ignition_config.etcd[2]: data.ignition_config.etcd.2: unexpected EOF
* module.assets_base.data.ignition_config.etcd[0]: data.ignition_config.etcd.0: unexpected EOF
* module.assets_base.data.ignition_config.etcd[1]: data.ignition_config.etcd.1: unexpected EOF
* module.assets_base.module.bootkube.data.ignition_file.bootkube_sh: 1 error(s) occurred:

* module.assets_base.module.bootkube.data.ignition_file.bootkube_sh: data.ignition_file.bootkube_sh: unexpected EOF

wking · 2018-08-22T16:46:25Z

Looks like I may have broken the Terraform ignition setup?

Ah, I'd missed a worker -> etcd replacement in a copy/paste. Fixed with bcb86ea -> 7fc19bd.

This includes the fixes needed to work with SELinux.

wking · 2018-08-22T17:26:13Z

And we're back to:

Ginkgo timed out waiting for all parallel nodes to report back!

Which is where we were before this PR, and distinct from the:

Waiting for API at https://ci-op-ng25jj4j-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443...

we saw during earlier versions of this PR (that just tried to open master -> etcd:10250).

wking · 2018-08-22T19:51:05Z

I've fixed a:

* aws_security_group_rule.worker_ingress_flannel_from_worker: [WARN] A duplicate Security Group rule was found on (sg-008d3915f65c9e488). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule “peer: sg-03a1212a84b765ddf, UDP, from port: 4789, to port: 4789, ALLOW” already exists

reported by @smarterclayton with 7fc19bd -> 81373c3. Looks like I had accidentally pasted some worker stuff into the master rules.

wking · 2018-08-22T21:00:35Z

And we're still getting Ginkgo timeouts with 81373c3fb8.

wking · 2018-08-23T03:11:58Z

And we're still getting Ginkgo timeouts with cd83ddb.

wking · 2018-08-23T19:16:44Z

Hooray, with b17dedc19c6a6aa5432675f5418c023962ae437d we're off the Ginkgo timeouts, and are only getting:

• Failure [34.901 seconds]
[Conformance][Area:Networking][Feature:Router]
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:21
  The HAProxy router
  /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:40
    should set Forwarded headers appropriately [Suite:openshift/conformance/parallel] [It]
    /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:41

    Aug 23 18:38:47.491: Unexpected header: '10.2.6.0' (expected 10.0.192.68); All headers: http.Header{"X-Forwarded-Port":[]string{"8080"}, "X-Forwarded-Proto":[]string{"http"}, "Forwarded":[]string{"for=10.2.6.0;host=router-headers.example.com;proto=http"}, "X-Forwarded-For":[]string{"10.2.6.0"}, "User-Agent":[]string{"curl/7.61.0"}, "Accept":[]string{"*/*"}, "X-Forwarded-Host":[]string{"router-headers.example.com"}}

    /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:103

With b17dedc -> 2f1d81c, I've:

Touched up the Calico -> Flannel commit to also update some fixtures,
Returned to respecting user AMI overrides, and
Rebased this on top of Fix perm errors with selinux enabled #134 to pick up the SELinux changes needed to work with new RHCOS AMIs.

We're having trouble accessing service IPs from pods with host network namespaces, which was keeping the metrics API from coming up (I think that's what the problem is ;) and eventually blocking namespace deletion. Defaulting to Flannel fixes that issue.

wking · 2018-08-23T19:28:41Z

I've pushed 2f1d81c -> b67f809 to leave the Calico fixtures alone. They seem to be for internal unit tests unaffected by the default change, or (for tests/smoke/aws/vars/aws-basic.yaml) they may be dead code (I'm dropping tests/smoke/aws in #143).

wking · 2018-08-23T19:43:18Z

And we've hit our quotas in the Jenkins account...

Error: Error applying plan:

7 error(s) occurred:

* module.vpc.aws_nat_gateway.nat_gw[1]: 1 error(s) occurred:

* aws_nat_gateway.nat_gw.1: Error creating NAT Gateway: NatGatewayLimitExceeded: Performing this operation would exceed the limit of 5 NAT gateways
	status code: 400, request id: fe819d69-b3d5-44da-b71d-5d0113f5d7fd
* module.vpc.aws_elb.api_internal: 1 error(s) occurred:

* aws_elb.api_internal: TooManyLoadBalancers: Exceeded quota of account 846518947292
	status code: 400, request id: 34db82f2-a70b-11e8-b742-172fdc8dbc5d
* module.vpc.aws_elb.api_external: 1 error(s) occurred:

* aws_elb.api_external: TooManyLoadBalancers: Exceeded quota of account 846518947292
	status code: 400, request id: 34e016ce-a70b-11e8-ac02-dd9cffd48ed9
* module.vpc.aws_nat_gateway.nat_gw[0]: 1 error(s) occurred:

* aws_nat_gateway.nat_gw.0: Error creating NAT Gateway: NatGatewayLimitExceeded: Performing this operation would exceed the limit of 5 NAT gateways
	status code: 400, request id: a6ac1e8c-e85d-4578-8966-57c92850e5f9
* module.vpc.aws_nat_gateway.nat_gw[2]: 1 error(s) occurred:

* aws_nat_gateway.nat_gw.2: Error creating NAT Gateway: NatGatewayLimitExceeded: Performing this operation would exceed the limit of 5 NAT gateways
	status code: 400, request id: a409d983-ed38-4b80-83fe-1796e5f383c0
* module.vpc.aws_elb.tnc: 1 error(s) occurred:

* aws_elb.tnc: TooManyLoadBalancers: Exceeded quota of account 846518947292
	status code: 400, request id: 351fde8c-a70b-11e8-8aa8-937689137735
* module.vpc.aws_elb.console: 1 error(s) occurred:

* aws_elb.console: TooManyLoadBalancers: Exceeded quota of account 846518947292
	status code: 400, request id: 352695ff-a70b-11e8-85bf-3f8eb4d4beb0

This happened before here. I'll see about reaping leaked resources in that account.

wking · 2018-08-23T20:23:45Z

The most recent e2e-aws job (based on b67f809) got:


• Failure [43.888 seconds]
[Conformance][Area:Networking][Feature:Router]
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:21
  The HAProxy router
  /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:40
    should set Forwarded headers appropriately [Suite:openshift/conformance/parallel] [It]
    /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:41

    Aug 23 19:47:24.715: Unexpected header: '10.2.6.0' (expected 10.0.161.5); All headers: http.Header{"X-Forwarded-Proto":[]string{"http"}, "Forwarded":[]string{"for=10.2.6.0;host=router-headers.example.com;proto=http"}, "X-Forwarded-For":[]string{"10.2.6.0"}, "User-Agent":[]string{"curl/7.61.0"}, "Accept":[]string{"*/*"}, "X-Forwarded-Host":[]string{"router-headers.example.com"}, "X-Forwarded-Port":[]string{"8080"}}

    /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:103

again. Do we need to tweak something in the release repo to catch up with Calico -> Flannel?

smarterclayton · 2018-08-23T20:26:32Z

Yay! This is an actual bug in the way that openshift/installer configures the router, because source ip isn't preserved.

I'll probably switch to another test in the job definition.

smarterclayton · 2018-08-23T20:51:48Z

I switched to a kube conformance test that should pass and added a new job that runs all tests but is optional.

/retest

wking · 2018-08-23T21:05:41Z

I'll see about reaping leaked resources in that account.

I've removed a lot of cruft from our Jenkins smoke-test account. Let's kick that off again:

retest this please

wking · 2018-08-23T21:07:24Z

I switched to a kube conformance test that should pass...

Cross-linking openshift/release#1271.

eparis · 2018-08-23T21:12:44Z

/me hopes and hopes and hopes...

smarterclayton · 2018-08-23T21:28:03Z

2018/08/23 21:12:28 Container test in pod e2e-aws completed successfully

!!!!

wking · 2018-08-23T21:38:08Z

e2e-aws is green :) (although the smoke tests are still running). Someone want to drop an /lgtm onto this?

eparis · 2018-08-23T21:38:59Z

/lgtm
assuming I count

openshift-ci-robot · 2018-08-23T21:39:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2018-08-23T21:39:35Z

assuming I count

I think all members of GitHub's openshift org count for /lgtm.

eparis · 2018-08-23T21:47:05Z

Ohhh snap! Thank you for sticking with this @wking

openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 20, 2018

wking force-pushed the etcd-kublet-ingress branch 2 times, most recently from 735c4ae to 1ea04c7 Compare August 20, 2018 23:05

wking mentioned this pull request Aug 20, 2018

Create a dependancy between the libvirt network and virsh command adding a dns record #150

Merged

wking force-pushed the etcd-kublet-ingress branch from 1ea04c7 to 8e751d7 Compare August 20, 2018 23:24

eparis reviewed Aug 21, 2018

View reviewed changes

wking force-pushed the etcd-kublet-ingress branch from e26e251 to 132c176 Compare August 21, 2018 21:14

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 21, 2018

wking mentioned this pull request Aug 21, 2018

modules/aws/vpc/sg-elb: Split out aws_security_group_rule #156

Merged

wking force-pushed the etcd-kublet-ingress branch from 132c176 to bcb86ea Compare August 22, 2018 16:19

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 22, 2018

wking force-pushed the etcd-kublet-ingress branch from bcb86ea to 7fc19bd Compare August 22, 2018 16:45

derekwaynecarr and others added 2 commits August 22, 2018 13:22

Fix perm errors with selinux enabled

0fa4eb1

config.tf: bump operator images

06ceaee

This includes the fixes needed to work with SELinux.

wking force-pushed the etcd-kublet-ingress branch 2 times, most recently from f2665fd to 81373c3 Compare August 22, 2018 19:50

wking force-pushed the etcd-kublet-ingress branch 2 times, most recently from 13ad9b1 to 59691f4 Compare August 23, 2018 19:24

wking force-pushed the etcd-kublet-ingress branch from 59691f4 to b67f809 Compare August 23, 2018 19:26

wking changed the title ~~modules/aws/vpc/sg-etcd: Add ingress 10250 from master~~ *: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) Aug 23, 2018

openshift-ci-robot assigned eparis Aug 23, 2018

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2018

This was referenced Aug 23, 2018

wip: steps: Ignore AMI overrides #157

Closed

steps: Assign master and etcd security groups to both node classes #161

Closed

openshift-merge-robot merged commit b8bf8ca into openshift:master Aug 23, 2018

This was referenced Aug 23, 2018

tests/run: Shift "Generating SSH key-pair..." message into if block #147

Merged

modules/aws/vpc/sg-elb: Fix name for TNC security group #155

Merged

Fix perm errors with selinux enabled #134

Merged

wking deleted the etcd-kublet-ingress branch August 23, 2018 22:12

abhinavdahiya mentioned this pull request Aug 28, 2018

Error "unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:" when trying to create a new-app #153

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) #151

*: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) #151

wking commented Aug 20, 2018

wking commented Aug 21, 2018

wking commented Aug 21, 2018

eparis Aug 21, 2018

wking Aug 21, 2018

wking Aug 21, 2018 •

edited

Loading

wking commented Aug 21, 2018

wking commented Aug 21, 2018

wking commented Aug 21, 2018

wking commented Aug 22, 2018

wking commented Aug 22, 2018

wking commented Aug 22, 2018

wking commented Aug 22, 2018

wking commented Aug 22, 2018

wking commented Aug 23, 2018

wking commented Aug 23, 2018

wking commented Aug 23, 2018 •

edited

Loading

wking commented Aug 23, 2018

wking commented Aug 23, 2018

smarterclayton commented Aug 23, 2018

smarterclayton commented Aug 23, 2018

wking commented Aug 23, 2018

wking commented Aug 23, 2018

eparis commented Aug 23, 2018

smarterclayton commented Aug 23, 2018

wking commented Aug 23, 2018

eparis commented Aug 23, 2018

openshift-ci-robot commented Aug 23, 2018

wking commented Aug 23, 2018

eparis commented Aug 23, 2018

*: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) #151

*: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) #151

Conversation

wking commented Aug 20, 2018

wking commented Aug 21, 2018

wking commented Aug 21, 2018

eparis Aug 21, 2018

Choose a reason for hiding this comment

wking Aug 21, 2018

Choose a reason for hiding this comment

wking Aug 21, 2018 • edited Loading

Choose a reason for hiding this comment

wking commented Aug 21, 2018

wking commented Aug 21, 2018

wking commented Aug 21, 2018

wking commented Aug 22, 2018

wking commented Aug 22, 2018

wking commented Aug 22, 2018

wking commented Aug 22, 2018

wking commented Aug 22, 2018

wking commented Aug 23, 2018

wking commented Aug 23, 2018

wking commented Aug 23, 2018 • edited Loading

wking commented Aug 23, 2018

wking commented Aug 23, 2018

smarterclayton commented Aug 23, 2018

smarterclayton commented Aug 23, 2018

wking commented Aug 23, 2018

wking commented Aug 23, 2018

eparis commented Aug 23, 2018

smarterclayton commented Aug 23, 2018

wking commented Aug 23, 2018

eparis commented Aug 23, 2018

openshift-ci-robot commented Aug 23, 2018

wking commented Aug 23, 2018

eparis commented Aug 23, 2018

wking Aug 21, 2018 •

edited

Loading

wking commented Aug 23, 2018 •

edited

Loading