
github.com/openshift/origin/test/setup/start-master: /healthz failure #16273

Closed
0xmichalis opened this issue Sep 10, 2017 · 18 comments
Assignee: mfojtik
Labels: kind/test-flake, lifecycle/rotten, priority/P1, sig/master

0xmichalis (Contributor) commented Sep 10, 2017

=== BEGIN TEST CASE ===
hack/lib/start.sh:281: executing 'oc get --raw /healthz --config='/tmp/openshift/test-cmd/openshift.local.config/master/admin.kubeconfig'' expecting any result and text 'ok'; re-trying every 0.25s until completion or 160.000s
FAILURE after 159.412s: hack/lib/start.sh:281: executing 'oc get --raw /healthz --config='/tmp/openshift/test-cmd/openshift.local.config/master/admin.kubeconfig'' expecting any result and text 'ok'; re-trying every 0.25s until completion or 160.000s: the command timed out
Standard output from the command:
Standard error from the command:
The connection to the server 172.17.0.2:28443 was refused - did you specify the right host or port?
... repeated 11 times
Error from server (Forbidden): User "system:admin" cannot get path "/healthz": User "system:admin" cannot "get" on "/healthz"
... repeated 15 times
Unable to connect to the server: read tcp 172.17.0.2:43986->172.17.0.2:28443: read: connection reset by peer
The connection to the server 172.17.0.2:28443 was refused - did you specify the right host or port?
... repeated 353 times
=== END TEST CASE ===
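
For reference, the harness is polling /healthz in a retry loop until it returns "ok" or the 160s deadline passes. A minimal standalone sketch of that pattern (hypothetical script, not the actual hack/lib helper; the kubeconfig path is copied from the log above):

#!/bin/bash
# Poll /healthz every 0.25s until it reports "ok" or 160s elapse,
# mirroring the retry loop in the test case above.
KUBECONFIG_PATH=/tmp/openshift/test-cmd/openshift.local.config/master/admin.kubeconfig
DEADLINE=$((SECONDS + 160))

until oc get --raw /healthz --config="${KUBECONFIG_PATH}" 2>/dev/null | grep -q ok; do
  if (( SECONDS >= DEADLINE )); then
    echo "FAILURE: the command timed out" >&2
    exit 1
  fi
  sleep 0.25
done
echo "/healthz reports ok"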

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/15199/test_pull_request_origin_cmd/2492/
/kind test-flake

@openshift-ci-robot openshift-ci-robot added the kind/test-flake Categorizes issue or PR as related to test flakes. label Sep 10, 2017
@0xmichalis 0xmichalis changed the title github.com/openshift/origin/test/setup/start-master: hack/lib/start.sh:281: executing 'oc get --raw /healthz --config='/tmp/openshift/test-cmd/openshift.local.config/master/admin.kubeconfig'' expecting any result and text 'ok'; re-trying every 0.25s until completion or 160.000s github.com/openshift/origin/test/setup/start-master: /healthz failure Sep 10, 2017
simo5 (Contributor) commented Sep 11, 2017

@pweil- this does not look like an auth issue; the server refuses to respond.

pweil- (Contributor) commented Sep 12, 2017

Yeah, strangely it seems to hit both no response and then an Error from server (Forbidden): User "system:admin" cannot get path "/healthz": User "system:admin" cannot "get" on "/healthz". Sending this over to @mfojtik and the Master team.

@pweil- pweil- assigned mfojtik and unassigned simo5 Sep 12, 2017
0xmichalis (Contributor, Author) commented:

This is getting flakier lately:
https://openshift-gce-devel.appspot.com/pr/16490
https://openshift-gce-devel.appspot.com/pr/16483

/priority P1
/remove-priority P2

jim-minter (Contributor) commented:

AFAICS, the following sequence happens when getting /healthz during startup (probed by hand in the sketch below):
1. Until the listener is open, you can't connect (obviously).
2. 403 until the RBAC roles are set up.
3. 500 until all the post-start hooks are done.
4. 200.
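
A quick way to watch those phases by hand (sketch only; assumes curl, the address and port from the logs above, and the admin client cert names used by the test config):

#!/bin/bash
# Probe /healthz once a second and map the response onto the four
# startup phases above. Paths and address are assumptions from the logs.
CERT_DIR=/tmp/openshift/test-cmd/openshift.local.config/master
while true; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' \
    --cert "${CERT_DIR}/admin.crt" --key "${CERT_DIR}/admin.key" \
    https://172.17.0.2:28443/healthz)
  case "${code}" in
    000) echo "phase 1: listener not open (connection refused)" ;;
    403) echo "phase 2: RBAC roles not set up yet" ;;
    500) echo "phase 3: post-start hooks still running" ;;
    200) echo "phase 4: healthy"; break ;;
    *)   echo "unexpected status: ${code}" ;;
  esac
  sleep 1
done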

jim-minter (Contributor) commented Oct 4, 2017

(I'm not saying that's what should happen, just that's what I'm seeing).

In https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16678/test_pull_request_origin_cmd/4092/, the server never finishes all its post-start hooks (evidently including initialising RBAC); it dies after 18 seconds with

I1004 15:34:24.700820   20760 trace.go:76] Trace[279607293]: "GuaranteedUpdate etcd3: *api.RangeAllocation" (started: 2017-10-04 15:34:11.099325642 +0000 UTC) (total time: 13.601468908s):
Trace[279607293]: [6.600515232s] [6.600515232s] initial value restored
Trace[279607293]: [6.600596809s] [81.577µs] Transaction prepared
Trace[279607293]: [13.601468908s] [7.000872099s] END
F1004 15:34:24.700863   20760 controller.go:128] Unable to perform initial IP allocation check: unable to persist the updated service IP allocations: etcdserver: request timed out

The test waiting for healthz "ok" keeps retrying for the full 160 seconds, which explains the series of:

The connection to the server 172.17.0.2:28443 was refused - did you specify the right host or port?
... repeated 11 times
Error from server (Forbidden): User "system:admin" cannot get path "/healthz": User "system:admin" cannot "get" on "/healthz"
... repeated 15 times
Unable to connect to the server: read tcp 172.17.0.2:43986->172.17.0.2:28443: read: connection reset by peer
The connection to the server 172.17.0.2:28443 was refused - did you specify the right host or port?
... repeated 353 times

jim-minter (Contributor) commented:

Also

2017-10-04 15:34:17.699597 W | etcdserver: apply entries took too long [5.317380054s for 11 entries]
2017-10-04 15:34:17.699616 W | etcdserver: avoid queries with large range/delete range!
2017-10-04 15:34:19.382340 W | etcdserver: timed out waiting for read index response
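
If slow storage is the culprit, etcd's usual disk sanity check is one way to confirm it. Something like the following measures small-write + fdatasync latency on the directory etcd writes to (the fio invocation and the data dir path are assumptions, not something the CI job runs; etcd generally wants p99 fsync latency well under 10ms):

# Benchmark small sequential writes with an fdatasync after each,
# approximating etcd's WAL write pattern. Point --directory at the
# actual etcd data dir.
fio --name=etcd-fsync-check \
    --directory=/tmp/openshift/test-cmd/etcd \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300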

jim-minter (Contributor) commented:

Current theory: I think we intend to run etcd on tmpfs, but we're not, and we're getting hit by underlying filesystem latency.

From scripts/env/logs/scripts.log:

[DEBUG] Creating container: `docker create  --privileged -v /var/run/docker.sock:/var/run/docker.sock -v origin-build-tmp-5ccc8607c03af855aa88252b227f97e888281083:/tmp -v origin-build-5ccc8607c03af855aa88252b227f97e888281083:/go/src/github.com/openshift/origin -e OS_VERSION_FILE= -e JUNIT_REPORT=true openshift/origin-release:golang-1.8 make test-cmd -k
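
Two hedged sketches to go with that theory: checking what actually backs /tmp in that container, and making the volume tmpfs-backed. Both are assumptions about the setup, not commands the CI currently runs:

# 1) Verify the filesystem type behind /tmp inside the container.
#    If the theory holds, this prints ext4/xfs/overlay rather than tmpfs.
docker run --rm \
  -v origin-build-tmp-5ccc8607c03af855aa88252b227f97e888281083:/tmp \
  openshift/origin-release:golang-1.8 df -T /tmp

# 2) One way to get tmpfs: create the named volume with tmpfs driver
#    options before running the `docker create` above (size is a guess).
docker volume create \
  --opt type=tmpfs --opt device=tmpfs --opt o=size=4g \
  origin-build-tmp-5ccc8607c03af855aa88252b227f97e888281083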

mfojtik (Contributor) commented Oct 6, 2017

Yes, this is a known etcd issue. I believe #16686 should make this better.

openshift-bot (Contributor) commented Apr 10, 2018

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2018
openshift-bot (Contributor) commented May 10, 2018

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 10, 2018
openshift-bot (Contributor) commented:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
