-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
*: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) #151
*: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) #151
Conversation
735c4ae
to
1ea04c7
Compare
1ea04c7
to
8e751d7
Compare
The e2e-aws error was:
I don't know if that's further along than the Ginkgo error or not, but I'll check the node logs later. |
Unfortunately, it looks like job 553 failed to capture node logs, so I don't know if this has addressed the "does not have a current config label" issue or not. On the off chance that the failed-log-capture was a flake, I'll try the test again: /retest |
modules/aws/vpc/sg-etcd.tf
Outdated
self = true | ||
} | ||
protocol = "tcp" | ||
from_port = 10250 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all of the 'from' correct? don't most things trying to connect to here use a random high port?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all of the 'from' correct? don't most things trying to connect to here use a random high port?
Yeah, that doesn't make sense to me either. But it's what we have had in master since forever; see, for example, here. Still, I'll drop them and see if that helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all of the 'from' correct? don't most things trying to connect to here use a random high port?
Yeah, that doesn't make sense to me either...
Ah, from_port
and to_port
are for a range of ports on a single host, not the ports for both hosts involved in the connection. So having from_port
makes sense.
I got the same error again with build 555, so probably not a flake. Still not sure what's going on there... |
e26e251
to
132c176
Compare
I've spun off some tangential changes into #155 and #156. I don't think they were making semantic changes, but they were at least touching the API ingress rule (on port 6443) and therefore might have been causing our API timeouts. |
And we're still getting:
|
132c176
to
bcb86ea
Compare
Looks like I may have broken the Terraform ignition setup? The e2e-aws test failed with:
|
bcb86ea
to
7fc19bd
Compare
Ah, I'd missed a |
This includes the fixes needed to work with SELinux.
And we're back to:
Which is where we were before this PR, and distinct from the:
we saw during earlier versions of this PR (that just tried to open master -> etcd:10250). |
f2665fd
to
81373c3
Compare
I've fixed a:
reported by @smarterclayton with 7fc19bd -> 81373c3. Looks like I had accidentally pasted some worker stuff into the master rules. |
And we're still getting Ginkgo timeouts with 81373c3fb8. |
And we're still getting Ginkgo timeouts with cd83ddb. |
Hooray, with b17dedc19c6a6aa5432675f5418c023962ae437d we're off the Ginkgo timeouts, and are only getting:
With b17dedc -> 2f1d81c, I've:
|
13ad9b1
to
59691f4
Compare
We're having trouble accessing service IPs from pods with host network namespaces, which was keeping the metrics API from coming up (I think that's what the problem is ;) and eventually blocking namespace deletion. Defaulting to Flannel fixes that issue.
59691f4
to
b67f809
Compare
And we've hit our quotas in the Jenkins account...
This happened before here. I'll see about reaping leaked resources in that account. |
The most recent e2e-aws job (based on b67f809) got:
again. Do we need to tweak something in the release repo to catch up with Calico -> Flannel? |
Yay! This is an actual bug in the way that openshift/installer configures the router, because source ip isn't preserved. I'll probably switch to another test in the job definition. |
I switched to a kube conformance test that should pass and added a new job that runs all tests but is optional. /retest |
I've removed a lot of cruft from our Jenkins smoke-test account. Let's kick that off again: retest this please |
Cross-linking openshift/release#1271. |
/me hopes and hopes and hopes... |
2018/08/23 21:12:28 Container test in pod e2e-aws completed successfully !!!! |
e2e-aws is green :) (although the smoke tests are still running). Someone want to drop an /lgtm onto this? |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: eparis, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I think all members of GitHub's openshift org count for |
Ohhh snap! Thank you for sticking with this @wking |
Patterned on the existing
worker_ingress_kubelet_insecure_from_master
from b620c16 (coreos/tectonic-installer#264).This should address errors like:
on the master node:
which were resulting in:
Inbound 10250 is the kubelet API used by the control plane. @smarterclayton suspects the e2e-aws tests are trying to get metrics from the kubelets, and hanging on the etcd kubelet because this rule was missing. I'm not clear why we've only been seeing this issue for the last week though.
The third commit in this PR adds the new rule. The previous two commits pivot from inline
ingress
andegress
rules to stand-aloneaws_security_group_rule
resources, finishing a transition away from inline ingress/egress rules begun by coreos/tectonic-installer#264. More details in the first two commit messages.