Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NudmSDM url after udm crash #155

Merged

Conversation

ghislainbourgeois
Copy link
Contributor

Because the AMF is keeping a cache of all the UEs it sees and saves the NFs URI in it, when the UDM is restarted with a new IP for any reason, the first connection by a previously seen user would get rejected.

It is easy to test the previous error by deploying the core and gnbsim, configuring a subscriber and executing a succesful simulation. After that, deleting the UDM pod in Kubernetes and wait for the pod to be restarted properly. Run a new simulation, and it will fail. Running the simulation again after a failure will work, because it will have cleaned the cache.

This change makes it so that we always ask the NRF for the URI to use.

@onf-bot
Copy link
Contributor

onf-bot commented Nov 30, 2023

Can one of the admins verify this patch?

1 similar comment
@onf-bot
Copy link
Contributor

onf-bot commented Nov 30, 2023

Can one of the admins verify this patch?

@gab-arrobo
Copy link
Contributor

test this please

@gab-arrobo
Copy link
Contributor

Because the AMF is keeping a cache of all the UEs it sees and saves the NFs URI in it, when the UDM is restarted with a new IP for any reason, the first connection by a previously seen user would get rejected.

It is easy to test the previous error by deploying the core and gnbsim, configuring a subscriber and executing a succesful simulation. After that, deleting the UDM pod in Kubernetes and wait for the pod to be restarted properly. Run a new simulation, and it will fail. Running the simulation again after a failure will work, because it will have cleaned the cache.

This change makes it so that we always ask the NRF for the URI to use.

Hi @ghislainbourgeois, I am not able to reproduce the issue you are indicating here. I am deploying the sd-core using AiaB. After deleting the UDM pod, the new test/simulation successfully completes. Any suggestion on how to reproduce the issue? or am I missing something?

@ghislainbourgeois
Copy link
Contributor Author

Because the AMF is keeping a cache of all the UEs it sees and saves the NFs URI in it, when the UDM is restarted with a new IP for any reason, the first connection by a previously seen user would get rejected.
It is easy to test the previous error by deploying the core and gnbsim, configuring a subscriber and executing a succesful simulation. After that, deleting the UDM pod in Kubernetes and wait for the pod to be restarted properly. Run a new simulation, and it will fail. Running the simulation again after a failure will work, because it will have cleaned the cache.
This change makes it so that we always ask the NRF for the URI to use.

Hi @ghislainbourgeois, I am not able to reproduce the issue you are indicating here. I am deploying the sd-core using AiaB. After deleting the UDM pod, the new test/simulation successfully completes. Any suggestion on how to reproduce the issue? or am I missing something?

Hi @gab-arrobo, I tried deploying AiaB to test directly with it, but the UPF does not deploy properly, with an error on the arping container. I deployed OnRamp instead and tested the procedure above, and the first simulation indeed fails, as you can see here where 1 UE fails:

ubuntu@onramp:~/aether-onramp$ docker exec -it gnbsim-1 cat summary.log
time="2023-12-11T18:16:07Z" level=info msg="Profile Name: profile2 , Profile Type: pdusessest" category=Summary component=GNBSIM
time="2023-12-11T18:16:07Z" level=info msg="Ue's Passed: 4 , Ue's Failed: 1" category=Summary component=GNBSIM

The first try is taking a long time, because the AMF tries to contact the old UDM IP, and needs to timeout. At that point the cache on the AMF will be cleared and the next attempts work fine. This PR makes it so that the timeout will not need to happen.

I also just rebased this PR to be up-to-date with the master branch.

@gab-arrobo
Copy link
Contributor

Hi @gab-arrobo, I tried deploying AiaB to test directly with it, but the UPF does not deploy properly, with an error on the arping container.

@ghislainbourgeois, FYI, a temporary solution to deploy AiaB is by making the following changes in the sd-core-5g-values.yaml file:

diff --git a/sd-core-5g-values.yaml b/sd-core-5g-values.yaml
index d4e145c..50e1312 100644
--- a/sd-core-5g-values.yaml
+++ b/sd-core-5g-values.yaml
@@ -234,9 +234,10 @@ omec-user-plane:
   images:
     repository: "registry.opennetworking.org/docker.io/"
     # uncomment below section to add update bess image tag
-    #tags:
+    tags:
     #  bess: <bess image tag>
     #  pfcpiface: <pfcp image tag>
+      tools: busybox:stable
   config:
     upf:
       name: "oaisim"

@gab-arrobo
Copy link
Contributor

ok to test

@gab-arrobo
Copy link
Contributor

Hi @gab-arrobo, .... I deployed OnRamp instead and tested the procedure above, and the first simulation indeed fails, as you can see here where 1 UE fails:

ubuntu@onramp:~/aether-onramp$ docker exec -it gnbsim-1 cat summary.log
time="2023-12-11T18:16:07Z" level=info msg="Profile Name: profile2 , Profile Type: pdusessest" category=Summary component=GNBSIM
time="2023-12-11T18:16:07Z" level=info msg="Ue's Passed: 4 , Ue's Failed: 1" category=Summary component=GNBSIM

The first try is taking a long time, because the AMF tries to contact the old UDM IP, and needs to timeout. At that point the cache on the AMF will be cleared and the next attempts work fine. This PR makes it so that the timeout will not need to happen.

I also just rebased this PR to be up-to-date with the master branch.

I am going to give it a try using OnRamp

@ghislainbourgeois
Copy link
Contributor Author

Hi @gab-arrobo, I tried deploying AiaB to test directly with it, but the UPF does not deploy properly, with an error on the arping container.

@ghislainbourgeois, FYI, a temporary solution to deploy AiaB is by making the following changes in the sd-core-5g-values.yaml file:

diff --git a/sd-core-5g-values.yaml b/sd-core-5g-values.yaml
index d4e145c..50e1312 100644
--- a/sd-core-5g-values.yaml
+++ b/sd-core-5g-values.yaml
@@ -234,9 +234,10 @@ omec-user-plane:
   images:
     repository: "registry.opennetworking.org/docker.io/"
     # uncomment below section to add update bess image tag
-    #tags:
+    tags:
     #  bess: <bess image tag>
     #  pfcpiface: <pfcp image tag>
+      tools: busybox:stable
   config:
     upf:
       name: "oaisim"

Thanks, I tested with AiaB, and I cannot reproduce this issue with it. I have not figured out what is different in that deployment however.

@gab-arrobo
Copy link
Contributor

Thanks, I tested with AiaB, and I cannot reproduce this issue with it. I have not figured out what is different in that deployment however.

I think the issue might be related to the image tag used for the AMF (the Helm Charts used by AiaB use this image: omecproject/5gc-amf:master-a4759db). What image tag is used by OnRamp?

@ghislainbourgeois
Copy link
Contributor Author

Thanks, I tested with AiaB, and I cannot reproduce this issue with it. I have not figured out what is different in that deployment however.

I think the issue might be related to the image tag used for the AMF (the Helm Charts used by AiaB use this image: omecproject/5gc-amf:master-a4759db). What image tag is used by OnRamp?

It is using the same tag: omecproject/5gc-amf:master-a4759db

@gab-arrobo
Copy link
Contributor

gab-arrobo commented Dec 12, 2023

It is using the same tag: omecproject/5gc-amf:master-a4759db

How should we proceed to reproduce the issue? It would be good to understand why the issue shows up in OnRamp but not in AiaB

@ghislainbourgeois
Copy link
Contributor Author

It is using the same tag: omecproject/5gc-amf:master-a4759db

How should we proceed to reproduce the issue? It would be good to understand why the issue shows up in OnRamp but not in AiaB

I am working on this today, trying to understand the difference there.

@ghislainbourgeois
Copy link
Contributor Author

It is using the same tag: omecproject/5gc-amf:master-a4759db

How should we proceed to reproduce the issue? It would be good to understand why the issue shows up in OnRamp but not in AiaB

I am working on this today, trying to understand the difference there.

I cannot say for sure, but it looks like the biggest difference is that the onramp quick start guide does not deploy Aether, but only SD-Core. In our distribution, we also do not currently deploy Aether. I am not sure why Aether would prevent this bug from happening, but I would argue that SD-Core should not depend on additional software to do the right thing.

Copy link
Contributor

@gab-arrobo gab-arrobo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -1173,18 +1173,16 @@ func communicateWithUDM(ue *context.AmfUe, accessType models.AccessType) error {

func getSubscribedNssai(ue *context.AmfUe) {
amfSelf := context.AMF_Self()
if ue.NudmSDMUri == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you are removing this variable from here, would it be possible that you also remove from the other places? because I think it is actually used anywhere in the code (besides some "self-checks" as shown below). It can be done as part of this PR or in another PR.

	ue.NudmSDMUri = sdmUri
	if ue.NudmUECMUri == "" || ue.NudmSDMUri == "" {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghislainbourgeois, please let me know what you think about my previous comment. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what the impact would be in removing it from the other places. I assume the URI of the UDM is saved in the UE context as a form of cache to make things faster, but I definitely do not have the whole context here.

I think I could propose a separate PR with this change, and it would make it safer to test.

If you have any input on running multiple UDMs, that would also be interesting, as I think this cache per UE would only really be useful in the case of multiple UDM instances.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, opening another PR for the removal of NudmSDMUri would be fine.
For the deployment of multiple UDMs, I think @thakurajayL would be the best person to with it.

@gab-arrobo gab-arrobo merged commit a16a52f into omec-project:master Dec 12, 2023
8 checks passed
@ghislainbourgeois ghislainbourgeois deleted the fix-sdm-url-after-udm-crash branch December 12, 2023 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants