
Extended.[k8s.io] Services should be able to create a functioning NodePort service #13108

Closed · smarterclayton opened this issue Feb 24, 2017 · 8 comments · Fixed by #18355
Labels: component/networking, kind/test-flake, priority/P1
@smarterclayton (Contributor)

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_gce/965/testReport/junit/(root)/Extended/_k8s_io__Services_should_be_able_to_create_a_functioning_NodePort_service/

Never seen this flake before (in roughly 600 runs); it looks like the port never became available within 5 minutes. This is the default OpenShift SDN setup.

/tmp/openshift/tito/rpmbuild-originz5K5EQ/BUILD/origin-1.5.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/service.go:485
Feb 24 06:46:56.965: expected node port 30633 to be in use, stdout: . err: error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl [kubectl --server=https://internal-api.prtest965.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=e2e-tests-services-wd35q hostexec -- /bin/sh -c for i in $(seq 1 300); do if ss -ant46 'sport = :30633' | grep ^LISTEN; then exit 0; fi; sleep 1; done; exit 1] []  <nil>   [] <nil> 0xc421bed7d0 exit status 1 <nil> <nil> true [0xc420a1a628 0xc420a1a640 0xc420a1a658] [0xc420a1a628 0xc420a1a640 0xc420a1a658] [0xc420a1a638 0xc420a1a650] [0x986eb0 0x986eb0] 0xc421bdaea0 <nil>}:
Command stdout:

stderr:

error:
exit status 1

/tmp/openshift/tito/rpmbuild-originz5K5EQ/BUILD/origin-1.5.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/service.go:483

Very weird

STEP: Building a namespace api object
Feb 24 06:41:33.468: INFO: About to run a Kube e2e test, ensuring namespace is privileged
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [k8s.io] Services
  /tmp/openshift/tito/rpmbuild-originz5K5EQ/BUILD/origin-1.5.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/service.go:90
[It] should be able to create a functioning NodePort service
  /tmp/openshift/tito/rpmbuild-originz5K5EQ/BUILD/origin-1.5.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/service.go:485
STEP: creating service nodeport-test with type=NodePort in namespace e2e-tests-services-wd35q
STEP: creating pod to be part of service nodeport-test
Feb 24 06:41:34.470: INFO: Waiting up to 2m0s for 1 pods to be created
Feb 24 06:41:34.522: INFO: Found 0/1 pods - will retry
Feb 24 06:41:36.562: INFO: Found all 1 pods
Feb 24 06:41:36.562: INFO: Waiting up to 2m0s for 1 pods to be running and ready: [nodeport-test-dd39k]
Feb 24 06:41:36.562: INFO: Waiting up to 2m0s for pod nodeport-test-dd39k status to be running and ready
Feb 24 06:41:36.595: INFO: Waiting for pod nodeport-test-dd39k in namespace 'e2e-tests-services-wd35q' status to be 'running and ready'(found phase: "Pending", readiness: false) (32.700389ms elapsed)
Feb 24 06:41:38.682: INFO: Waiting for pod nodeport-test-dd39k in namespace 'e2e-tests-services-wd35q' status to be 'running and ready'(found phase: "Pending", readiness: false) (2.11980505s elapsed)
Feb 24 06:41:40.716: INFO: Wanted all 1 pods to be running and ready. Result: true. Pods: [nodeport-test-dd39k]
STEP: hitting the pod through the service's NodePort
Feb 24 06:41:40.716: INFO: Testing HTTP reachability of http://104.154.147.57:30633/echo?msg=hello
Feb 24 06:41:45.716: INFO: Got error testing for reachability of http://104.154.147.57:30633/echo?msg=hello: Get http://104.154.147.57:30633/echo?msg=hello: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Feb 24 06:41:47.717: INFO: Testing HTTP reachability of http://104.154.147.57:30633/echo?msg=hello
STEP: verifying the node port is locked
Feb 24 06:41:51.599: INFO: Running '/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl --server=https://internal-api.prtest965.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=e2e-tests-services-wd35q hostexec -- /bin/sh -c for i in $(seq 1 300); do if ss -ant46 'sport = :30633' | grep ^LISTEN; then exit 0; fi; sleep 1; done; exit 1'
Feb 24 06:46:56.965: INFO: rc: 127
Feb 24 06:46:56.965: INFO: expected node port 30633 to be in use, stdout: . err: error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl [kubectl --server=https://internal-api.prtest965.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=e2e-tests-services-wd35q hostexec -- /bin/sh -c for i in $(seq 1 300); do if ss -ant46 'sport = :30633' | grep ^LISTEN; then exit 0; fi; sleep 1; done; exit 1] []  <nil>   [] <nil> 0xc421bed7d0 exit status 1 <nil> <nil> true [0xc420a1a628 0xc420a1a640 0xc420a1a658] [0xc420a1a628 0xc420a1a640 0xc420a1a658] [0xc420a1a638 0xc420a1a650] [0x986eb0 0x986eb0] 0xc421bdaea0 <nil>}:
Command stdout:

stderr:

error:
exit status 1

[AfterEach] [k8s.io] Services
  /tmp/openshift/tito/rpmbuild-originz5K5EQ/BUILD/origin-1.5.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:142
STEP: Collecting events from namespace "e2e-tests-services-wd35q".
STEP: Found 11 events.
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:34 -0500 EST - event for nodeport-test: {replication-controller } SuccessfulCreate: Created pod: nodeport-test-dd39k
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:34 -0500 EST - event for nodeport-test-dd39k: {default-scheduler } Scheduled: Successfully assigned nodeport-test-dd39k to ci-prtest965-ig-n-fsv9
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:36 -0500 EST - event for nodeport-test-dd39k: {kubelet ci-prtest965-ig-n-fsv9} Created: Created container with docker id 25b24e578d03; Security:[seccomp=unconfined]
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:36 -0500 EST - event for nodeport-test-dd39k: {kubelet ci-prtest965-ig-n-fsv9} Started: Started container with docker id 25b24e578d03
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:36 -0500 EST - event for nodeport-test-dd39k: {kubelet ci-prtest965-ig-n-fsv9} Pulled: Container image "gcr.io/google_containers/netexec:1.7" already present on machine
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:47 -0500 EST - event for hostexec: {default-scheduler } Scheduled: Successfully assigned hostexec to ci-prtest965-ig-n-fsv9
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:49 -0500 EST - event for hostexec: {kubelet ci-prtest965-ig-n-fsv9} Pulled: Container image "gcr.io/google_containers/hostexec:1.2" already present on machine
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:50 -0500 EST - event for hostexec: {kubelet ci-prtest965-ig-n-fsv9} Created: Created container with docker id 86e8055fe394; Security:[seccomp=unconfined]
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:50 -0500 EST - event for hostexec: {kubelet ci-prtest965-ig-n-fsv9} Started: Started container with docker id 86e8055fe394
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:41:59 -0500 EST - event for nodeport-test-dd39k: {kubelet ci-prtest965-ig-n-fsv9} Unhealthy: Readiness probe failed: Get http://172.16.4.40:80/hostName: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Feb 24 06:46:56.997: INFO: At 2017-02-24 06:45:10 -0500 EST - event for nodeport-test-dd39k: {kubelet ci-prtest965-ig-n-fsv9} Unhealthy: Readiness probe failed: Get http://172.16.4.40:80/hostName: dial tcp 172.16.4.40:80: getsockopt: no route to host
Feb 24 06:46:57.090: INFO: POD      
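
For context on the failing step: the test shells into the hostexec pod and polls ss -ant46 for a LISTEN socket on the NodePort (the exact command is in the log above). Below is a minimal Go sketch of an equivalent wait, written against the values from this run. waitForNodePort is a hypothetical helper, not code from the e2e suite, and dialing from outside is a looser check than the ss LISTEN test, since it also requires the connection to be accepted.

package main

import (
	"fmt"
	"net"
	"time"
)

// waitForNodePort polls until something accepts TCP connections on
// host:port, giving up after timeout. The e2e test instead checks for
// a LISTEN entry with ss on the node itself, which also catches ports
// that are held open by kube-proxy but not reachable from outside.
func waitForNodePort(host string, port int, timeout time.Duration) error {
	addr := net.JoinHostPort(host, fmt.Sprintf("%d", port))
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("node port %d on %s not reachable within %v", port, host, timeout)
}

func main() {
	// Node IP and port taken from the failing run above.
	if err := waitForNodePort("104.154.147.57", 30633, 5*time.Minute); err != nil {
		fmt.Println(err)
	}
}
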
@smarterclayton smarterclayton added component/networking kind/test-flake Categorizes issue or PR as related to test flakes. labels Feb 24, 2017
@smarterclayton (Contributor, Author)

This is not #11016 because 2m < 5m.

@smarterclayton (Contributor, Author)

P1 until triaged

@smarterclayton smarterclayton added this to the 1.5.0 milestone Feb 24, 2017
@smarterclayton smarterclayton modified the milestones: 1.5.0, 1.6.0 Mar 12, 2017
@bparees (Contributor) commented Mar 13, 2017

@bparees bparees changed the title flake: NodePort test failed, unable to listen Extended.[k8s.io] Services should be able to create a functioning NodePort service Mar 13, 2017
@smarterclayton (Contributor, Author) commented Mar 13, 2017 via email

@bparees (Contributor) commented Mar 13, 2017

Link updated, sorry.

@bparees (Contributor) commented Jun 14, 2017

@bparees (Contributor) commented Jul 23, 2017

@danwinship (Contributor)

Hm... I found an instance of this flake in a recent test_branch_origin_extended_conformance_gce run:

STEP: verifying the node port is locked
Oct 25 16:44:20.632: INFO: Running '/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl --server=https://internal-api.prtest-3c63a87-1798.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=e2e-tests-services-l7272 hostexec -- /bin/sh -c for i in $(seq 1 300); do if ss -ant46 'sport = :31798' | grep ^LISTEN; then exit 0; fi; sleep 1; done; ss -ant46; ip a; exit 1'
Oct 25 16:44:21.057: INFO: rc: 127
Oct 25 16:44:21.057: INFO: expected node port 31798 to be in use, stdout: . err: error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl [kubectl --server=https://internal-api.prtest-3c63a87-1798.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=e2e-tests-services-l7272 hostexec -- /bin/sh -c for i in $(seq 1 300); do if ss -ant46 'sport = :31798' | grep ^LISTEN; then exit 0; fi; sleep 1; done; ss -ant46; ip a; exit 1] []  <nil>  Error from server: error dialing backend: dial tcp 10.142.0.4:10250: getsockopt: connection refused [] <nil> 0xc420b951d0 exit status 1 <nil> <nil> true [0xc42000f938 0xc42000f950 0xc42000f968] [0xc42000f938 0xc42000f950 0xc42000f968] [0xc42000f948 0xc42000f960] [0x17c27d0 0x17c27d0] 0xc42134e240 <nil>}:

The important bit there is "dial tcp 10.142.0.4:10250: getsockopt: connection refused", meaning that running kubectl exec failed, not that the "verifying the node port is locked" check itself failed. I'm not sure whether that's what was failing before (and the logs are just clearer about it now) or whether this is a different problem.
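
Assuming the script's exit-code contract holds (the in-pod loop only ever exits 0 or 1), the rc by itself separates the two failure modes. A sketch of that triage logic follows; classifyExecResult is a hypothetical helper, not anything in the test.

package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// classifyExecResult separates "the poll ran and timed out" from
// "kubectl exec never got the script running". The script in the log
// only exits 0 (port seen in LISTEN) or 1 (300s timeout), so any other
// code, like the rc 127 above, points at the exec path itself, e.g.
// the apiserver failing to dial the kubelet on :10250.
func classifyExecResult(cmd *exec.Cmd) string {
	err := cmd.Run()
	if err == nil {
		return "node port is LISTENing on the node"
	}
	var ee *exec.ExitError
	if errors.As(err, &ee) {
		if ee.ExitCode() == 1 {
			return "exec ran; the port never showed up within 300s"
		}
		return fmt.Sprintf("exec transport failure (rc %d); the check never ran", ee.ExitCode())
	}
	return "kubectl could not be started locally: " + err.Error()
}

func main() {
	// Hypothetical invocation mirroring the test's command line.
	cmd := exec.Command("kubectl", "exec", "--namespace=e2e-tests-services-l7272",
		"hostexec", "--", "/bin/sh", "-c",
		"for i in $(seq 1 300); do if ss -ant46 'sport = :31798' | grep ^LISTEN; then exit 0; fi; sleep 1; done; exit 1")
	fmt.Println(classifyExecResult(cmd))
}
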

openshift-merge-robot added a commit that referenced this issue Feb 2, 2018
Automatic merge from submit-queue (batch tested with PRs 18376, 18355).

Move pod-namespace calls out of process

As discussed in #15991, we need to move all operations in the pod's network namespace out of process, due to a golang issue that lets the namespace set by setns() on a locked thread leak into other threads, causing random lossage as operations intended for the main network namespace end up running in other namespaces instead. (This is fixed in golang 1.10, but we need a fix before then.)

Fixes #15991
Fixes #14385
Fixes #13108
Fixes #18317
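
For readers unfamiliar with the bug class being fixed here: below is a minimal sketch of the in-process pattern this PR moves away from, assuming golang.org/x/sys/unix on Linux. runInNetNS is a hypothetical helper for illustration, not OpenShift's actual code; the real danger is any path where the deferred restore doesn't run.

package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// runInNetNS runs fn inside the network namespace referred to by nsFD,
// then switches the thread back. Before Go 1.10, if the restore was
// ever skipped (for example the locked thread was killed mid-call),
// the kernel thread stayed in the pod's namespace and the scheduler
// could later run unrelated goroutines on it, in the wrong namespace.
func runInNetNS(nsFD int, fn func() error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember the current (host) network namespace so we can restore it.
	hostNS, err := unix.Open("/proc/self/ns/net", unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(hostNS)

	if err := unix.Setns(nsFD, unix.CLONE_NEWNET); err != nil {
		return err
	}
	defer unix.Setns(hostNS, unix.CLONE_NEWNET)

	return fn()
}

func main() {
	// Demo only: "enter" our own netns, which is a no-op but exercises
	// the same code path. Requires CAP_SYS_ADMIN, like any setns call.
	selfNS, err := unix.Open("/proc/self/ns/net", unix.O_RDONLY, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(selfNS)
	_ = runInNetNS(selfNS, func() error { return nil })
}

Moving these calls into a short-lived child process, as the PR does, sidesteps the problem entirely: the child's threads never host unrelated goroutines.
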