
github.com/openshift/origin/test/setup/start-master: /healthz failure #16273

Closed
0xmichalis opened this issue Sep 10, 2017 · 18 comments
Assignee: mfojtik
Labels: kind/test-flake, lifecycle/rotten, priority/P1, sig/master

0xmichalis (Contributor) commented Sep 10, 2017

=== BEGIN TEST CASE ===
hack/lib/start.sh:281: executing 'oc get --raw /healthz --config='/tmp/openshift/test-cmd/openshift.local.config/master/admin.kubeconfig'' expecting any result and text 'ok'; re-trying every 0.25s until completion or 160.000s
FAILURE after 159.412s: hack/lib/start.sh:281: executing 'oc get --raw /healthz --config='/tmp/openshift/test-cmd/openshift.local.config/master/admin.kubeconfig'' expecting any result and text 'ok'; re-trying every 0.25s until completion or 160.000s: the command timed out
Standard output from the command:
Standard error from the command:
The connection to the server 172.17.0.2:28443 was refused - did you specify the right host or port?
... repeated 11 times
Error from server (Forbidden): User "system:admin" cannot get path "/healthz": User "system:admin" cannot "get" on "/healthz"
... repeated 15 times
Unable to connect to the server: read tcp 172.17.0.2:43986->172.17.0.2:28443: read: connection reset by peer
The connection to the server 172.17.0.2:28443 was refused - did you specify the right host or port?
... repeated 353 times
=== END TEST CASE ===
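
For reference, the harness is polling /healthz in a retry loop until it returns "ok" or the 160s deadline passes. A minimal standalone sketch of that pattern (hypothetical script, not the actual hack/lib helper; the kubeconfig path is copied from the log above):

#!/bin/bash
# Poll /healthz every 0.25s until it reports "ok" or 160s elapse,
# mirroring the retry loop in the test case above.
KUBECONFIG_PATH=/tmp/openshift/test-cmd/openshift.local.config/master/admin.kubeconfig
DEADLINE=$((SECONDS + 160))

until oc get --raw /healthz --config="${KUBECONFIG_PATH}" 2>/dev/null | grep -q ok; do
  if (( SECONDS >= DEADLINE )); then
    echo "FAILURE: the command timed out" >&2
    exit 1
  fi
  sleep 0.25
done
echo "/healthz reports ok"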

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/15199/test_pull_request_origin_cmd/2492/
/kind test-flake

@openshift-ci-robot openshift-ci-robot added the kind/test-flake Categorizes issue or PR as related to test flakes. label Sep 10, 2017
@0xmichalis 0xmichalis changed the title github.com/openshift/origin/test/setup/start-master: hack/lib/start.sh:281: executing 'oc get --raw /healthz --config='/tmp/openshift/test-cmd/openshift.local.config/master/admin.kubeconfig'' expecting any result and text 'ok'; re-trying every 0.25s until completion or 160.000s github.com/openshift/origin/test/setup/start-master: /healthz failure Sep 10, 2017
simo5 (Contributor) commented Sep 11, 2017

@pweil- this does not look like an auth issue; the server refuses to respond.

pweil- (Contributor) commented Sep 12, 2017

Yeah, strangely it seems to hit both no response and then an Error from server (Forbidden): User "system:admin" cannot get path "/healthz": User "system:admin" cannot "get" on "/healthz". Sending this over to @mfojtik and the Master team.

@pweil- pweil- assigned mfojtik and unassigned simo5 Sep 12, 2017
0xmichalis (Contributor, Author) commented:

This is getting flakier lately:
https://openshift-gce-devel.appspot.com/pr/16490
https://openshift-gce-devel.appspot.com/pr/16483

/priority P1
/remove-priority P2

jim-minter (Contributor) commented:

AFAICS, the following sequence happens when getting /healthz during startup (probed by hand in the sketch below):
1. Until the listener is open, you can't connect (obviously).
2. 403 until the RBAC roles are set up.
3. 500 until all the post-start hooks are done.
4. 200.
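
A quick way to watch those phases by hand (sketch only; assumes curl, the address and port from the logs above, and the admin client cert names used by the test config):

#!/bin/bash
# Probe /healthz once a second and map the response onto the four
# startup phases above. Paths and address are assumptions from the logs.
CERT_DIR=/tmp/openshift/test-cmd/openshift.local.config/master
while true; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' \
    --cert "${CERT_DIR}/admin.crt" --key "${CERT_DIR}/admin.key" \
    https://172.17.0.2:28443/healthz)
  case "${code}" in
    000) echo "phase 1: listener not open (connection refused)" ;;
    403) echo "phase 2: RBAC roles not set up yet" ;;
    500) echo "phase 3: post-start hooks still running" ;;
    200) echo "phase 4: healthy"; break ;;
    *)   echo "unexpected status: ${code}" ;;
  esac
  sleep 1
done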

jim-minter (Contributor) commented Oct 4, 2017

(I'm not saying that's what should happen, just that's what I'm seeing).

In https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16678/test_pull_request_origin_cmd/4092/, the server never finishes all its post-start hooks (evidently including initialising RBAC); it dies after 18 seconds with

I1004 15:34:24.700820   20760 trace.go:76] Trace[279607293]: "GuaranteedUpdate etcd3: *api.RangeAllocation" (started: 2017-10-04 15:34:11.099325642 +0000 UTC) (total time: 13.601468908s):
Trace[279607293]: [6.600515232s] [6.600515232s] initial value restored
Trace[279607293]: [6.600596809s] [81.577µs] Transaction prepared
Trace[279607293]: [13.601468908s] [7.000872099s] END
F1004 15:34:24.700863   20760 controller.go:128] Unable to perform initial IP allocation check: unable to persist the updated service IP allocations: etcdserver: request timed out

The test waiting for healthz "ok" keeps retrying for the full 160 seconds, which explains the series of:

The connection to the server 172.17.0.2:28443 was refused - did you specify the right host or port?
... repeated 11 times
Error from server (Forbidden): User "system:admin" cannot get path "/healthz": User "system:admin" cannot "get" on "/healthz"
... repeated 15 times
Unable to connect to the server: read tcp 172.17.0.2:43986->172.17.0.2:28443: read: connection reset by peer
The connection to the server 172.17.0.2:28443 was refused - did you specify the right host or port?
... repeated 353 times

jim-minter (Contributor) commented:

Also

2017-10-04 15:34:17.699597 W | etcdserver: apply entries took too long [5.317380054s for 11 entries]
2017-10-04 15:34:17.699616 W | etcdserver: avoid queries with large range/delete range!
2017-10-04 15:34:19.382340 W | etcdserver: timed out waiting for read index response
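
If slow storage is the culprit, etcd's usual disk sanity check is one way to confirm it. Something like the following measures small-write + fdatasync latency on the directory etcd writes to (the fio invocation and the data dir path are assumptions, not something the CI job runs; etcd generally wants p99 fsync latency well under 10ms):

# Benchmark small sequential writes with an fdatasync after each,
# approximating etcd's WAL write pattern. Point --directory at the
# actual etcd data dir.
fio --name=etcd-fsync-check \
    --directory=/tmp/openshift/test-cmd/etcd \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300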

jim-minter (Contributor) commented:

Current theory: I think we intend to run etcd on tmpfs, but we're not, and we're getting hit by underlying filesystem latency.

From scripts/env/logs/scripts.log:

[DEBUG] Creating container: `docker create  --privileged -v /var/run/docker.sock:/var/run/docker.sock -v origin-build-tmp-5ccc8607c03af855aa88252b227f97e888281083:/tmp -v origin-build-5ccc8607c03af855aa88252b227f97e888281083:/go/src/github.com/openshift/origin -e OS_VERSION_FILE= -e JUNIT_REPORT=true openshift/origin-release:golang-1.8 make test-cmd -k
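
Two hedged sketches to go with that theory: checking what actually backs /tmp in that container, and making the volume tmpfs-backed. Both are assumptions about the setup, not commands the CI currently runs:

# 1) Verify the filesystem type behind /tmp inside the container.
#    If the theory holds, this prints ext4/xfs/overlay rather than tmpfs.
docker run --rm \
  -v origin-build-tmp-5ccc8607c03af855aa88252b227f97e888281083:/tmp \
  openshift/origin-release:golang-1.8 df -T /tmp

# 2) One way to get tmpfs: create the named volume with tmpfs driver
#    options before running the `docker create` above (size is a guess).
docker volume create \
  --opt type=tmpfs --opt device=tmpfs --opt o=size=4g \
  origin-build-tmp-5ccc8607c03af855aa88252b227f97e888281083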

mfojtik (Contributor) commented Oct 6, 2017

Yes, this is a known etcd issue. I believe #16686 should make this better.

openshift-bot (Contributor) commented Apr 10, 2018

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2018
openshift-bot (Contributor) commented May 10, 2018

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 10, 2018
openshift-bot (Contributor) commented:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
