OADM diagnostic SDN errors #18392

Closed
parthasarathi-DXC opened this issue Feb 1, 2018 · 7 comments
Assignees
Labels
kind/question, lifecycle/rotten (denotes an issue or PR that has aged beyond stale and will be auto-closed), sig/networking

Comments

@parthasarathi-DXC

parthasarathi-DXC commented Feb 1, 2018

After multiple reinstallations (on newly built RHEL7 VMs), oadm diagnostics reports two SDN errors, and I am unable to create pods. Kindly help me resolve these errors.

### oc version

oc v3.6.173.0.49
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master1.cat.test:8443
openshift v3.6.173.0.49
kubernetes v1.6.1+5115d708d7

Steps To Reproduce
  1. oadm diagnostics
Results

failed

       Info:  Service account token successfully authenticated to master
       ERROR: [DP1014 from diagnostic PodCheckAuth@openshift/origin/pkg/diagnostics/pod/auth.go:174]
              Request to integrated registry timed out; this typically indicates network or SDN problems.


ERROR: [DNet2005 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:119]
       Setting up test environment for network diagnostics failed: Failed to run network diags test pod and service: Failed to run network diags test pods, failed: 21, total: 24
Expected Result

Pass

Additional Information

Attached error and oc get all output.
putty.log
oadm_diag_error.txt

@parthasarathi-DXC
Author

Hi Team,

Greetings. Any updates on this?

@sosiouxme
Member

@parthasarathi-DXC are you noticing any problems aside from diagnostics reporting errors?

Diagnostics are intended to point you in a helpful direction and give information when something is going wrong. It would be nice if they could say exactly what is going wrong and why, but it's unusual for problems to be so simply solved, so we kind of have to settle for pointers.

Here we're seeing a few things going wrong.

  1. Error logs in the registry indicate it's having trouble authorizing against the OpenShift API: Get user failed with error: Get https://172.30.0.1:443/oapi/v1/users/~: Service Unavailable -- that's pretty interesting; you may want to rsh into the registry pod and try a connection manually. I'm honestly not sure what would cause that. Check the registry logs around that point, as well as the master logs / API audit logs. "Service Unavailable" tends to point at something wrong with the API, or a proxy in the middle, or something like that -- you'd get a more helpful error if the request were actually reaching the service.
  2. PodCheckAuth reporting Request to integrated registry timed out -- this may be related to the previous problem, or could be networking-related. Unfortunately, when networking is broken the main symptom is simply that connections fail, and you have to dig a lot deeper to figure out why.
  3. NetworkCheck failed with Failed to run network diags test pods, failed: 21, total: 24 -- I would like this diagnostic to be a lot more helpful when the test pods fail to run; 3.7 should be a bit better about this, so you might want to try a 3.7 client. Three of the 24 pods did run, so it would be interesting to see what happened to the others. You could list the projects while NetworkCheck is running, see which pods are succeeding or failing in the network-check projects, and go from there.
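The three checks above can be sketched as shell commands. The pod label selector, the service IP (172.30.0.1, taken from the error output above), and the `network-diag` project prefix are assumptions; adjust them for your cluster:

```shell
#!/bin/sh
# Sketch of the manual checks suggested above (OpenShift 3.x client assumed).
# Guarded so it is a no-op on machines without an oc client installed.
if command -v oc >/dev/null 2>&1; then
  # 1. From inside the registry pod, test the API connection manually.
  #    (docker-registry=default is the label the default registry dc applies.)
  REGISTRY_POD=$(oc get pods -n default -l docker-registry=default -o name | head -1)
  oc rsh -n default "$REGISTRY_POD" curl -kv https://172.30.0.1:443/healthz

  # 2. Check recent registry logs around the authorization failures.
  oc logs -n default "$REGISTRY_POD" --tail=50

  # 3. While `oadm diagnostics NetworkCheck` runs in another terminal,
  #    list the temporary diagnostic projects and their pod states to see
  #    which of the 24 test pods are failing and why.
  oc get projects | grep network-diag
  oc get pods --all-namespaces | grep network-diag
  RESULT="checks attempted"
else
  RESULT="oc client not found; run these steps on a cluster host"
fi
echo "$RESULT"
```

For the failing test pods found in step 3, `oc describe pod <name>` and `oc get events -n <network-diag project>` would be the next place to look.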

@0xmichalis
Contributor

@openshift/sig-networking

@sosiouxme
Member

/unassign

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 8, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

8 participants