Issue 1174 appears not to be resolved "Probe fails to find default gw with ubuntu NETLINK_GET_STRICT_CHK, loops" #1196
That image has nmstate-2.2.11-1.el9.x86_64. @cathay4t is this the correct version that has the nispor NETLINK_GET_STRICT_CHK fix? It looks like it is still not showing routes properly:
routes:
  config: []
  running: [] |
Only 2.2.13 has the |
@cathay4t can we please expedite that? I reported the problem in 1174 back in April and need to get this as soon as possible. Thank you tremendously for your support. |
Looks like it's a matter of doing another release; we did it too early, I think.
|
@k8scoder192 can you try "v0.80.0" it contains |
@qinqon yes in the morning. Please let's not close until validated. |
Still seeing the issue; it loops multiple times. Exec'ing into the handler pod, I see new warnings from nispor :-/ and it fails to apply the VLAN manifest.
More info
Worker node OS / Kernel info
FYI, this isn't the same exact cluster as the original one I posted previously; that cluster was taken down. This cluster is similar, but only worker sandbox3 has NetworkManager installed, hence my yaml targets that node only (the handler pods for the other nodes errored out due to no NetworkManager). yaml
NNCE for failed apply
Target Worker Node Network Config
FYI all resources in the nmstate namespace
(again, only sandbox3 (the worker node) had NetworkManager, hence the rest of the handler pods failing; all testing only targeted sandbox3) EDIT
dnf list of packages in handler pod (exec -it) 1.2.10-1.el9 :-(
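For anyone retracing this, the package check above boils down to something like the following sketch (the nmstate namespace and the app=kubernetes-nmstate label come from this thread; sandbox3 is the affected worker here, adjust as needed):

```sh
# Pick the handler pod running on the affected worker node (sandbox3 in this thread).
HANDLER_POD=$(kubectl -n nmstate get pods -l app=kubernetes-nmstate \
  --field-selector spec.nodeName=sandbox3 -o name | head -n 1)

# List the nmstate/nispor packages baked into the handler image.
kubectl -n nmstate exec "$HANDLER_POD" -- dnf list installed 'nmstate*' 'nispor*'
```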
|
Hey @cathay4t, we will have to take a look at it. Thanks @k8scoder192 |
@cathay4t can you please look into this? It's still failing with the kernel version listed |
The nispor rpm does not matter any more, as nmstate bundles it. @k8scoder192 Can you try to upgrade the rpm from https://people.redhat.com/fge/tmp/nmstate-2.2.15-1.el9_2.x86_64.rpm in the nmstate operator? |
Sure, but the problem still exists in the latest nmstate release. Please have someone look into this.
|
@k8scoder192 Can you share a way to set up a cluster on an Ubuntu kernel? |
Never mind. The fix in nispor is incorrect (I was guessing; my bad). |
@k8scoder192 Please try https://people.redhat.com/fge/tmp/nmstate-2.2.15-2.ubuntu_fix.el9.x86_64.rpm I am also installing Ubuntu 20.04 in my VM to reproduce this problem. |
@cathay4t please ensure the kernel version is 4.15.x (for example, I had 4.15.0-041500-generic; the problem went away with 5.4.0-155-generic). Thank you |
Ubuntu 20.04 ships with a 5.4+ kernel. Why should I spend my PlayStation time on an old, unsupported kernel shipped by Ubuntu 18.04? |
Anyway, the above nmstate-2.2.15-2.ubuntu_fix.el9.x86_64.rpm should fix the problem on old kernels. I will try Ubuntu 18.04 tomorrow. |
@cathay4t because the original issue #1174, which was closed prematurely, was with 18.04 and 4.15... That cluster got torn down and I was given a new cluster based on 20.04. To replicate the issue, I simply downgraded the kernel to 4.15.0-041500-generic. This proved that something is not working well with 4.x kernels. Anyway, ideally just install 18.04 and give it a try (less work than downgrading, since it already ships with 4.15.x). |
I will try 18.04 tomorrow. Please try the above rpm in the meantime. |
Will do. Thanks
|
Issue reproduced on Ubuntu 18.04 and PR nispor/nispor#235 tested. Will do an official build for OpenShift and CentOS Stream 9. |
The latest version of nmstate is 2.2.14-1.el9 per dnf (queried in the handler). When is the fixed version going to be upstream? A new build of kubernetes-nmstate will also be needed to pull that in.
Edit: more issues with v0.80.0, even with nmstate-2.2.15-2.ubuntu_fix.el9.x86_64.rpm. Also, did you verify the fix worked with v0.80.0? It doesn't seem to on my end at all; it looks like a regression. I used to be able to apply successfully (it would just poll for the gw forever), but now I can't even apply: the nnce shows FailedToConfigure. Here is handler log
exec into handler and run nmstatectl show
dnf shows nispor is NOT installed and nmstate is at 2.2.13-1.el9
So then I installed nmstate-2.2.15-2.ubuntu_fix.el9.x86_64.rpm (per your link above; a command sketch is at the end of this comment)
And it still fails to apply
nnce output
handler pod log
yaml I am applying
So I then uninstalled v0.80.0 and tried v0.74.0 just to be thorough; here is the handler pod output for v0.74.0
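For reference, the manual upgrade test described in this comment amounts to roughly the following (a sketch only; the rpm URL is the one Gris posted above, and anything installed this way is lost when the handler pod restarts):

```sh
# Open a shell in the handler pod on the affected node
# (replace <handler-pod> with the handler pod running on that node).
kubectl -n nmstate exec -it <handler-pod> -- bash

# Inside the pod: check what is currently installed (nispor is bundled into nmstate here).
rpm -q nmstate nispor

# Install the test build linked above on top of the bundled nmstate...
dnf install -y https://people.redhat.com/fge/tmp/nmstate-2.2.15-2.ubuntu_fix.el9.x86_64.rpm

# ...and check whether the default gateway now shows up under routes.running.
nmstatectl show
```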
|
The error message Let me try the same in Ubuntu 18.04. |
Cannot reproduce this. @k8scoder192 Can you share the steps for me to reproduce this problem locally? |
@cathay4t I just confirmed with the team that we are migrating to Ubuntu 22.04 with a 5.x kernel, so hopefully we don't need to worry about this issue. I did some preliminary testing on 22.04 and still ran into an issue with nmstate not picking up dns-resolver info, so it would loop on the probe for a very long time. I need to do more testing and make sure I'm on v0.80 before I comment further. If you're still interested in the issue I reported previously: I upped the version of Ubuntu on the worker node to 20.04 with kernel 4.15.0-041500-generic, since that's the kernel I saw the issue with; the master is still on 18.04 / 4.15.0-208-generic, but that shouldn't matter. Version of nmcli on the worker node (1.22.10)
k-nmstate
yaml for vlan
start fresh, delete vlan600 on worker node
delete all nncp/nnce
Start a log tail on the handler
Apply the vlan600 yaml (a rough command sketch of these steps is at the end of this comment)
Results: Handler shows very long looping on the default gw probe. Then
Then AND
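A rough command-level sketch of the steps above (the manifest filename vlan600-nncp.yaml is a placeholder; vlan600, the nmstate namespace, and node sandbox3 come from this thread):

```sh
# On the worker node: start fresh by removing the existing VLAN interface
# (one way to do it; the connection could also be removed via nmcli).
sudo ip link delete vlan600

# From a machine with cluster access: clear previous policies and enactments.
kubectl delete nncp --all
kubectl delete nnce --all

# Tail the handler log for the affected node (sandbox3).
HANDLER_POD=$(kubectl -n nmstate get pods -l app=kubernetes-nmstate \
  --field-selector spec.nodeName=sandbox3 -o name | head -n 1)
kubectl -n nmstate logs -f "$HANDLER_POD" &

# Apply the vlan600 policy and watch for the "default gw missing" probe messages.
kubectl apply -f vlan600-nncp.yaml
```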
|
What happened:
Please reference Issue 1174; I'm running into the same issue
Using the latest version
quay.io/nmstate/kubernetes-nmstate-handler:v0.79.0
As such it keeps looping on
{"level":"info","ts":"2023-07-14T21:47:42.674Z","logger":"probe","msg":"default gw missing","path":"routes.running.next-hop-address","table-id":254} {"level":"error","ts":"2023-07-14T21:47:42.674Z","logger":"probe","msg":"failed to retrieve default gw","error":"default gw missing","errorVerbose":"default gw missing\ngit.luolix.top/nmstate/kubernetes-nmstate/pkg/probe.defaultGw\n\t/opt/app-root/src/pkg/probe/probes.go:160\ngit.luolix.top/nmstate/kubernetes-nmstate/pkg/probe.runPing\n\t/opt/app-root/src/pkg/probe/probes.go:177\ngit.luolix.top/nmstate/kubernetes-nmstate/pkg/probe.pingCondition.func1\n\t/opt/app-root/src/pkg/probe/probes.go:167\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.WaitForWithContext\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594\nk8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:526\nk8s.io/apimachinery/pkg/util/wait.PollImmediate\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:512\ngit.luolix.top/nmstate/kubernetes-nmstate/pkg/probe.Select\n\t/opt/app-root/src/pkg/probe/probes.go:264\ngit.luolix.top/nmstate/kubernetes-nmstate/pkg/client.ApplyDesiredState\n\t/opt/app-root/src/pkg/client/client.go:159\ngit.luolix.top/nmstate/kubernetes-nmstate/controllers/handler.(*NodeNetworkConfigurationPolicyReconciler).Reconcile\n\t/opt/app-root/src/controllers/handler/nodenetworkconfigurationpolicy_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571","stacktrace":"github.com/nmstate/kubernetes-nmstate/pkg/probe.runPing\n\t/opt/app-root/src/pkg/probe/probes.go:179\ngit.luolix.top/nmstate/kubernetes-nmstate/pkg/probe.pingCondition.func1\n\t/opt/app-root/src/pkg/probe/probes.go:167\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.WaitForWithContext\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594\nk8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext\n\t/opt/app-root/src/vendor/k8s.io/apimach
inery/pkg/util/wait/wait.go:526\nk8s.io/apimachinery/pkg/util/wait.PollImmediate\n\t/opt/app-root/src/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:512\ngit.luolix.top/nmstate/kubernetes-nmstate/pkg/probe.Select\n\t/opt/app-root/src/pkg/probe/probes.go:264\ngit.luolix.top/nmstate/kubernetes-nmstate/pkg/client.ApplyDesiredState\n\t/opt/app-root/src/pkg/client/client.go:159\ngit.luolix.top/nmstate/kubernetes-nmstate/controllers/handler.(*NodeNetworkConfigurationPolicyReconciler).Reconcile\n\t/opt/app-root/src/controllers/handler/nodenetworkconfigurationpolicy_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
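From the stack trace, the probe (defaultGw in pkg/probe/probes.go) polls the reported state via wait.PollImmediate for a running default route (routes.running.next-hop-address, table-id 254) and keeps retrying until the poll times out. A quick way to compare what nmstate reports against what the kernel actually has, from inside the handler pod (a sketch; replace <handler-pod> with the pod on the affected node):

```sh
# What nmstate reports; the probe expects a default route with a next-hop-address
# under routes.running for table-id 254.
kubectl -n nmstate exec <handler-pod> -- nmstatectl show

# What the kernel has in the main routing table (id 254), for comparison
# (assumes iproute2 is available in the handler image).
kubectl -n nmstate exec <handler-pod> -- ip route show table 254
```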
What you expected to happen:
Not to loop for over 4 minutes (for 1 interface on 1 node) before moving on to the next probe; configuring multiple nodes takes even longer.
How to reproduce it (as minimally and precisely as possible):
Use the OS and Kernel version listed below and this manifest yaml (an illustrative sketch follows below).
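The actual manifest is in the collapsed yaml attachment; purely for orientation, here is a minimal sketch of what such a policy could look like (the policy name, base interface eth1, and DHCP addressing are illustrative assumptions; only the VLAN ID 600 and node sandbox3 come from this thread):

```sh
kubectl apply -f - <<'EOF'
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: vlan600-sandbox3              # illustrative name
spec:
  nodeSelector:
    kubernetes.io/hostname: sandbox3  # only the node that has NetworkManager installed
  desiredState:
    interfaces:
      - name: vlan600
        type: vlan
        state: up
        vlan:
          base-iface: eth1            # assumption: replace with the real uplink NIC
          id: 600
        ipv4:
          enabled: true
          dhcp: true                  # illustrative; the real manifest may use static addressing
EOF
```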
Anything else we need to know?:
OS and Kernel Version
Environment:
- NodeNetworkState on affected nodes (use kubectl get nodenetworkstate <node_name> -o yaml):
- NodeNetworkConfigurationPolicy: kubectl get nncp usb-int-v300 -o yaml
- kubernetes-nmstate image (use kubectl get pods --all-namespaces -l app=kubernetes-nmstate -o jsonpath='{.items[0].spec.containers[0].image}'): quay.io/nmstate/kubernetes-nmstate-handler:v0.79.0
- NetworkManager version (use nmcli --version): nmcli tool, version 1.22.10
- Kubernetes version (use kubectl version): v1.26.0
- Routing table
nmstate namespace (NOTE: the handler does not run on the master node since I do not have NetworkManager installed there)
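For completeness, the namespace listing referenced above can be reproduced with something along these lines:

```sh
# Everything in the nmstate namespace (handler DaemonSet pods, webhook, operator, ...).
kubectl -n nmstate get all -o wide
```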