Fix NudmSDM url after udm crash #155

ghislainbourgeois · 2023-11-30T13:28:22Z

Because the AMF keeps a cache of all the UEs it sees and stores the NF URIs in it, when the UDM is restarted with a new IP for any reason, the first connection by a previously seen user is rejected.

The previous error is easy to reproduce: deploy the core and gnbsim, configure a subscriber, and run a successful simulation. Then delete the UDM pod in Kubernetes and wait for the pod to restart properly. Run a new simulation, and it will fail. Running the simulation again after the failure works, because the failure clears the cache.

This change makes the AMF always ask the NRF for the URI to use.
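
As a rough illustration of the problem described above, here is a minimal, self-contained Go sketch. It is not the AMF code: the `nrf` type, the `discover` method, the `"nudm-sdm"` key, and the addresses are all hypothetical stand-ins; only the `NudmSDMUri` field name is taken from this PR. It contrasts a lookup that trusts the per-UE cached URI with one that always asks the NRF.

```go
package main

import "fmt"

// nrf is a hypothetical stand-in for the NRF: it maps a service name to the
// URI currently registered for that service.
type nrf struct {
	registered map[string]string
}

// discover returns the URI currently registered for the given service.
func (n *nrf) discover(service string) (string, error) {
	uri, ok := n.registered[service]
	if !ok {
		return "", fmt.Errorf("no NF registered for %s", service)
	}
	return uri, nil
}

// amfUe is a simplified per-UE context that remembers the SDM URI.
type amfUe struct {
	NudmSDMUri string
}

// cachedSdmUri models the old behaviour: the NRF is queried only when the
// per-UE field is empty, so a stale URI survives a UDM restart.
func cachedSdmUri(ue *amfUe, n *nrf) (string, error) {
	if ue.NudmSDMUri == "" {
		uri, err := n.discover("nudm-sdm")
		if err != nil {
			return "", err
		}
		ue.NudmSDMUri = uri
	}
	return ue.NudmSDMUri, nil
}

// freshSdmUri models the approach of this change: always ask the NRF.
func freshSdmUri(n *nrf) (string, error) {
	return n.discover("nudm-sdm")
}

func main() {
	// Made-up addresses, for illustration only.
	n := &nrf{registered: map[string]string{"nudm-sdm": "http://10.0.0.5:8080"}}
	ue := &amfUe{}
	first, _ := cachedSdmUri(ue, n)

	// The UDM pod restarts and registers with a new IP.
	n.registered["nudm-sdm"] = "http://10.0.0.9:8080"

	stale, _ := cachedSdmUri(ue, n)
	fresh, _ := freshSdmUri(n)
	fmt.Println("first:", first)
	fmt.Println("cached after restart:", stale)
	fmt.Println("fresh after restart:", fresh)
}
```

In this sketch the cached lookup keeps returning the old address after the restart, while the fresh lookup returns the new one, at the cost of one NRF query per call.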

onf-bot · 2023-11-30T13:28:26Z

Can one of the admins verify this patch?

onf-bot · 2023-11-30T13:28:26Z

Can one of the admins verify this patch?

gab-arrobo · 2023-11-30T17:50:21Z

test this please

gab-arrobo · 2023-12-10T17:52:13Z

Hi @ghislainbourgeois, I am not able to reproduce the issue you describe here. I am deploying SD-Core using AiaB. After deleting the UDM pod, the new test/simulation completes successfully. Any suggestions on how to reproduce the issue, or am I missing something?

ghislainbourgeois · 2023-12-11T18:22:22Z

Hi @gab-arrobo, I tried deploying AiaB to test directly with it, but the UPF does not deploy properly, with an error on the arping container. I deployed OnRamp instead and tested the procedure above, and the first simulation indeed fails, as you can see here where 1 UE fails:

ubuntu@onramp:~/aether-onramp$ docker exec -it gnbsim-1 cat summary.log
time="2023-12-11T18:16:07Z" level=info msg="Profile Name: profile2 , Profile Type: pdusessest" category=Summary component=GNBSIM
time="2023-12-11T18:16:07Z" level=info msg="Ue's Passed: 4 , Ue's Failed: 1" category=Summary component=GNBSIM

The first try takes a long time because the AMF tries to contact the old UDM IP and has to time out. At that point the cache on the AMF is cleared and the next attempts work fine. This PR makes the timeout unnecessary.
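
The fail-once-then-recover sequence described above can be modelled with a short Go sketch. This is only a toy model under stated assumptions: `ueContext`, `attempt`, and the URIs are hypothetical names invented for illustration, and the real timeout is replaced by an immediate error.

```go
package main

import (
	"errors"
	"fmt"
)

// ueContext is a hypothetical per-UE cache entry holding the UDM URI.
type ueContext struct {
	sdmURI string
}

// attempt simulates one registration attempt against the UDM that is
// currently reachable at liveURI. If the cached URI is stale, the attempt
// fails (standing in for the timeout) and the cached entry is cleared, so
// the next attempt rediscovers the URI.
func attempt(ue *ueContext, liveURI string) error {
	if ue.sdmURI == "" {
		ue.sdmURI = liveURI // stands in for NRF discovery
	}
	if ue.sdmURI != liveURI {
		stale := ue.sdmURI
		ue.sdmURI = "" // cache cleared after the failure
		return errors.New("timeout contacting " + stale)
	}
	return nil
}

func main() {
	ue := &ueContext{}
	fmt.Println("before restart:", attempt(ue, "http://udm-old"))
	fmt.Println("first try after restart:", attempt(ue, "http://udm-new"))
	fmt.Println("second try after restart:", attempt(ue, "http://udm-new"))
}
```

Only the first attempt after the restart returns an error in this model, which matches the single failed UE reported in the summary log.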

I also just rebased this PR to be up-to-date with the master branch.

gab-arrobo · 2023-12-11T18:34:56Z

@ghislainbourgeois, FYI, a temporary workaround for deploying AiaB is to make the following changes in the sd-core-5g-values.yaml file:

diff --git a/sd-core-5g-values.yaml b/sd-core-5g-values.yaml
index d4e145c..50e1312 100644
--- a/sd-core-5g-values.yaml
+++ b/sd-core-5g-values.yaml
@@ -234,9 +234,10 @@ omec-user-plane:
   images:
     repository: "registry.opennetworking.org/docker.io/"
     # uncomment below section to add update bess image tag
-    #tags:
+    tags:
     #  bess: <bess image tag>
     #  pfcpiface: <pfcp image tag>
+      tools: busybox:stable
   config:
     upf:
       name: "oaisim"

gab-arrobo · 2023-12-11T18:35:07Z

ok to test

gab-arrobo · 2023-12-11T18:40:43Z

I am going to give it a try using OnRamp.

ghislainbourgeois · 2023-12-11T19:05:08Z

Thanks, I tested with AiaB, and I cannot reproduce this issue with it. However, I have not yet figured out what is different in that deployment.

gab-arrobo · 2023-12-11T19:42:52Z

I think the issue might be related to the image tag used for the AMF (the Helm Charts used by AiaB use this image: omecproject/5gc-amf:master-a4759db). What image tag is used by OnRamp?

ghislainbourgeois · 2023-12-11T20:20:00Z

It is using the same tag: omecproject/5gc-amf:master-a4759db

gab-arrobo · 2023-12-12T00:18:22Z

How should we proceed to reproduce the issue? It would be good to understand why the issue shows up in OnRamp but not in AiaB.

ghislainbourgeois · 2023-12-12T14:49:25Z

I am working on this today, trying to understand the difference there.

ghislainbourgeois · 2023-12-12T21:14:22Z

I cannot say for sure, but it looks like the biggest difference is that the OnRamp quick start guide does not deploy Aether, only SD-Core. In our distribution, we also do not currently deploy Aether. I am not sure why Aether would prevent this bug from happening, but I would argue that SD-Core should not depend on additional software to do the right thing.

gab-arrobo

+1

gab-arrobo · 2023-12-12T21:24:38Z

gmm/handler.go

@@ -1173,18 +1173,16 @@ func communicateWithUDM(ue *context.AmfUe, accessType models.AccessType) error {

 func getSubscribedNssai(ue *context.AmfUe) {
 	amfSelf := context.AMF_Self()
-	if ue.NudmSDMUri == "" {


Given that you are removing this variable from here, would it be possible to also remove it from the other places? I do not think it is actually used anywhere in the code (besides some "self-checks" as shown below). It can be done as part of this PR or in another PR.

ue.NudmSDMUri = sdmUri
if ue.NudmUECMUri == "" || ue.NudmSDMUri == "" {

gab-arrobo · 2023-12-12

@ghislainbourgeois, please let me know what you think about my previous comment. Thanks!

ghislainbourgeois · 2023-12-12

I am not sure what the impact of removing it from the other places would be. I assume the URI of the UDM is saved in the UE context as a form of cache to make things faster, but I definitely do not have the whole context here.

I think I could propose a separate PR with this change, which would make it safer to test.

If you have any input on running multiple UDMs, that would also be interesting, as I think this per-UE cache would only really be useful with multiple UDM instances.

What do you think?

gab-arrobo · 2023-12-12

Sure, opening another PR for the removal of NudmSDMUri would be fine.
For the deployment of multiple UDMs, I think @thakurajayL would be the best person to help with it.

ghislainbourgeois added 2 commits December 11, 2023 13:15

Improve error logging when SDM cannot be contacted

208da5e

Get NudmSDM URI from NRF, not from cache

95f22e6

ghislainbourgeois force-pushed the fix-sdm-url-after-udm-crash branch from 9f0c5bc to 95f22e6 December 11, 2023 18:16

gab-arrobo approved these changes Dec 12, 2023

gab-arrobo merged commit a16a52f into omec-project:master Dec 12, 2023
8 checks passed

ghislainbourgeois deleted the fix-sdm-url-after-udm-crash branch December 12, 2023 21:49

gab-arrobo mentioned this pull request Apr 1, 2024

Update documentation for release omec-project/sdcore-docs#18

Merged
