Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NetworkCheck blocks keyboard interrupt #16847

Closed
sosiouxme opened this issue Oct 12, 2017 · 3 comments
Closed

NetworkCheck blocks keyboard interrupt #16847

sosiouxme opened this issue Oct 12, 2017 · 3 comments
Assignees

Comments

@sosiouxme
Copy link
Member

There is some code in the NetworkCheck diagnostic to handle an interrupt; however it has the effect of blocking them until the diagnostic completes, which is not user friendly.

Version
$ oc version
oc v3.7.0-alpha.1+86fefdf-1060-dirty
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth

But this is probably true since the diagnostic was introduced.

Steps To Reproduce
  1. oc adm diagnostics NetworkCheck
  2. Ctrl-C
Current Result

Continues until the diagnostic completes

Expected Result

Aborts

@pravisankar
Copy link

pravisankar commented Oct 13, 2017

@sosiouxme I could not reproduce the issue on my local machine (using dind cluster). What's the issue with the current code?

[ravi@dhcp-16-230 origin]$ oc version
oc v3.7.0-alpha.1+c80317c-1034-dirty
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth

[ravi@dhcp-16-230 origin]$ oc adm diagnostics NetworkCheck
[Note] Determining if client configuration exists for client/cluster diagnostics
Info:  Successfully read a client config file at '/tmp/openshift-dind-cluster/openshift/openshift.local.config/master/admin.kubeconfig'

[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

^CERROR: [DNet2006 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:137]
       Creating network diagnostic pod "network-diag-pod-bg916" on node "openshift-node-1" with command "openshift infra network-diagnostic-pod -l 1" failed: namespaces "network-diag-ns-2ml4c" not found

[Note] Summary of diagnostics execution (version v3.7.0-alpha.1+c80317c-1034-dirty):
[Note] Errors seen: 1
[ravi@dhcp-16-230 origin]$

@sosiouxme
Copy link
Member Author

sosiouxme commented Oct 13, 2017

The current code does nothing to end the diagnostic when an interrupt is received. The goroutine performs cleanup and exits, but nothing else happens until the diagnostic completes whatever it was doing (in your case, it completed with an error because the namespace has been deleted out from under it; if the pods are already created it takes a lot longer to finish). Worse, the signal handler is still directing interrupts to the channel, but nothing is listening to that channel, so any further interrupts go to limbo.

$ oc adm diagnostics  NetworkCheck  ClusterRoleBindings
[...]
[Note] Running diagnostic: NetworkCheck
       Description: Create a pod on all schedulable nodes and run network diagnostics from the application standpoint
       
^C^C^C^C^C^CERROR: [DNet2006 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:137]
       Creating network diagnostic pod "network-diag-pod-x1vhb" on node "ip-172-18-0-248.ec2.internal" with command "openshift infra network-diagnostic-pod -l 1" failed: pods "network-diag-pod-x1vhb" is forbidden: unable to create new content in namespace network-diag-ns-ss4qd because it is being terminated.
       
[Note] Running diagnostic: ClusterRoleBindings
       Description: Check that the default ClusterRoleBindings are present and contain the expected subjects
       
^C^C^C^C^C^C^C^CInfo:  clusterrolebinding/cluster-readers has more subjects than expected.
       
       Use the `oc adm policy reconcile-cluster-role-bindings` command to update the role binding to remove extra subjects.
[...]

@pravisankar
Copy link

@sosiouxme Thanks, now I understood the actual issue and the fix.
Main changes:

  • Running actual check in a go routine along with a channel helps to terminate the diagnostic faster.
  • Stopping the signal notification once we catch it in a diagnostic allows other diagnostic checks to react to the signal if needed.

openshift-merge-robot added a commit that referenced this issue Oct 16, 2017
Automatic merge from submit-queue (batch tested with PRs 16848, 16874).

Fix some diagnostic error handling (NetworkCheck and DiagnosticPod)

Fixes #16847

A keyboard interrupt on the NetworkCheck diagnostic will actually abort it (giving it a chance to clean up) and proceed to the next diagnostic.

The same is done for DiagnosticPod (which previously did not catch the signal and cleanup at all).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants