
hack/test-cmd.sh:114: executing 'oc new-project 'cmd-admin'' timeout #15900

Closed · mfojtik opened this issue Aug 22, 2017 · 13 comments

Labels: component/kubernetes, dependency/etcd, kind/test-flake, lifecycle/rotten, priority/P1

mfojtik commented Aug 22, 2017

Seen here: https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_cmd/1307/console

Console:

hack/test-cmd.sh:114: executing 'oc new-project 'cmd-admin'' expecting success
FAILURE after 30.287s: hack/test-cmd.sh:114: executing 'oc new-project 'cmd-admin'' expecting success: the command returned the wrong error code
There was no output from the command.
Standard error from the command:

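For reference, the assertion at hack/test-cmd.sh:114 just runs the command and expects a zero exit code within the suite's timeout; a rough standalone equivalent (paraphrasing the os::cmd helpers the suite actually uses, with a 30s limit mirroring the ~30s the run hit):

# rough standalone equivalent of the failing assertion; the real test goes
# through the os::cmd helpers rather than calling timeout directly
if ! timeout 30s oc new-project 'cmd-admin'; then
  echo "FAILURE: 'oc new-project cmd-admin' returned a non-zero exit code" >&2
fi
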
From the master server logs, it seems like the rolebinding creation failed:

I0822 11:26:47.002253   20501 wrap.go:42] POST /apis/rbac.authorization.k8s.io/v1beta1/namespaces/cmd-admin/rolebindings: (7.002020944s) 500
goroutine 32608 [running]:
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc42628c310, 0x1f4)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:207 +0xdd
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc42628c310, 0x1f4)

// tons of stack traces...

logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x123\n\x04\n\x00\x12\x00\x12\aFailure\x1a\x1detcdserver: request timed out\"\x000\xf4\x03\x1a\x00\"\x00"

And that seems to be due to an etcd timeout:

E0822 11:26:47.001865   20501 status.go:62] apiserver received an error that is not an metav1.Status: etcdserver: request timed out

Which seems to be related to:

etcdserver/api/v3rpc: Failed to dial 172.17.0.2:24001: connection error: desc = "transport: remote error: tls: bad certificate"; please retry

and

2017-08-22 11:26:26.957356 W | etcdserver: timed out waiting for read index response
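
For anyone triaging a similar run, it's worth confirming whether etcd itself was unhealthy during that window rather than just the apiserver; the endpoint, port, and cert paths below are placeholders (the log above only shows the 172.17.0.2:24001 peer address):

# check etcd health directly; endpoint and cert paths are placeholders
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
  endpoint health

# and correlate with the timeouts in the master log
grep -E 'etcdserver: (request timed out|timed out waiting for read index response)' master.log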

mfojtik commented Aug 22, 2017

@ironcladlou you were looking for something to look at? ;-)

@ironcladlou

Looks like etcd writes were timing out in general for ~30-40s between ~11:26:40 and 11:27:28.

@stevekuznetsov

This might be an issue with EBS block allocation -- @deads2k recently reconfigured the job and the etcd data dir may not be in tmpfs anymore
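
If that's what changed, putting the etcd data dir back on tmpfs would take EBS latency out of the picture; a minimal sketch, with an arbitrary mount point and size (the job's actual wiring may differ):

# mount a tmpfs for the etcd data dir (placeholder path and size)
sudo mkdir -p /tmp/etcd-data
sudo mount -t tmpfs -o size=512m tmpfs /tmp/etcd-data
# then start etcd (or point the master's etcd config) at it
etcd --data-dir=/tmp/etcd-data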

@ironcladlou

This old friend? #6542 😬
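
If it is the same fsync-latency problem, the etcd /metrics endpoint should show it directly; the port and TLS setup below are placeholders for however the integration etcd is exposed:

# look at WAL fsync and backend commit latency histograms (placeholder endpoint)
curl -s http://127.0.0.1:2379/metrics | \
  grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'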

deads2k commented Aug 31, 2017

I0831 14:52:32.816638   20662 trace.go:76] Trace[73421993]: "GuaranteedUpdate etcd3: *api.ServiceAccount" (started: 2017-08-31 14:52:25.815545555 +0000 UTC) (total time: 7.001059231s):
Trace[73421993]: [31.869µs] [31.869µs] initial value restored
Trace[73421993]: [101.779µs] [69.91µs] Transaction prepared
Trace[73421993]: [7.001059231s] [7.000957452s] END
E0831 14:52:32.816669   20662 status.go:62] apiserver received an error that is not an metav1.Status: etcdserver: request timed out
I0831 14:52:32.816846   20662 trace.go:76] Trace[186387414]: "Update /api/v1/namespaces/kube-system/serviceaccounts/daemon-set-controller" (started: 2017-08-31 14:52:25.81547282 +0000 UTC) (total time: 7.001354058s):
Trace[186387414]: [15.846µs] [15.846µs] About to convert to expected version
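
The ~7s totals line up with etcd's default server-side request timeout, so these apiserver traces are just surfacing etcd giving up; grepping the slow traces out of the master log (file name is a placeholder) shows how long the stall lasted:

# pull slow apiserver traces out of the master log (placeholder file name)
grep 'total time:' openshift-master.log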

ironcladlou commented Sep 8, 2017

@php-coder I'm not sure why the error in #15558 (comment) was attributed to this issue; I see no evidence provided to support the claim. I only mention it because we have no fancy flake analytics like upstream and so I don't want the frequency of this flake to be misrepresented.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Feb 23, 2018
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 25, 2018
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
