Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NSM vl3 with selective composition errors #6305

Closed
yuraxdrumz opened this issue May 31, 2022 · 12 comments
Closed

NSM vl3 with selective composition errors #6305

yuraxdrumz opened this issue May 31, 2022 · 12 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request
Milestone

Comments

@yuraxdrumz
Copy link

yuraxdrumz commented May 31, 2022

Hello guys,
I have a requirement to do selective composition with VL3.

vl3-composition

While working on this solution, I encountered a few issues:

  • The example you provided on the website seems to not work with VL3 endpoint, because nse-vl3-vpp tries to connect to all other NSE'S in the name service, which leads to a passthrough I have (the IPS in your example), to get connected to.
  • Log level of INFO is not respected in nse-vl3-vpp due to hardcoded logrus.SetLevel(logrus.TraceLevel)
  • If I run X NSE's and I restart all of them, the newly created NSE's connect to the old NSE's as well as the new ones, which leads to the creation of wireguard and memif interfaces, that are not active and are never cleaned up. Checking the new NSE's, I see errors that the old NSE's cannot be found and no Close events or any other remediation is being called.
  • Passing boolean without "", for example secured: false and not secured: "false", results in nsc response of nsm-ns not found, which was caught only in registry after debug.

Here is the registry template error from the missing ""

May 30 14:55:53.573 [ERRO] [type:registry] (1.2)   Error returned from sdk/pkg/registry/switchcase/switchCaseNSServer.Find: v1.NetworkServiceList.Items: []v1.NetworkService: v1.NetworkService.Spec: v1.NetworkServiceSpec.Matches: []*registry.Match: registry.Match.SourceSelector: ReadString: expects " or n, but found f, error found in #10 byte of ...|secured":false}},{"r|..., bigger context ...|p":"passthrough"}}],"source_selector":{"secured":false}},{"routes":[{"destination_selector":{"app":"|...

To fix the NSE connecting to all other NSE's above, I forked cmd-nse-vl3-vpp and added a filter by NSM labels, so that only vl3 NSE's connect.

I have an example kustomize attached below.
vl3-composition.zip

Once I run everything, I immediately see various errors, attaching all logs here:
nsc-pt-7fb56c5fd5-9snqw.log
nsc-direct-54f685c899-nwdcr.log
nse-56fc9c6644-57cqc.log
registry-k8s-5dccfdb7dc-p6gsk.log
passthrough-6fc6b6dc84-lp8df.log
forwarder-vpp-wsr54.log
forwarder-vpp-bfd2k.log
nse-56fc9c6644-876vm.log

I just saw you published roadmap for 1.4.0, but the examples I tried are with v1.3.1, please tell me if some of the above can be fixed by upgrading to v1.4.0

Thanks

EDIT: forgot to mention

  • EKS 1.22
  • Cilium CNI
  • 3 Nodes of m5.large with NSM v1.3.1 and spire 1.3.0
  • VPP NSM forwarders with registry-k8s
@glazychev-art
Copy link
Contributor

Hi @yuraxdrumz,
Thanks for the report!
I've looked at your logs, and it seems we've already fixed a few of these errors. Could you please use the latest version? I think you can try v1.4.0-rc.1 or main branch.

As for the issues you found, I'm not sure if we can fix this in v1.4.0. If you have any progress (e.g. for cmd-nse-vl3-vpp repository) feel free to open the PR!

@denis-tingaikin
Copy link
Member

If I run X NSE's and I restart all of them, the newly created NSE's connect to the old NSE's as well as the new ones, which leads to the creation of wireguard and memif interfaces, that are not active and are never cleaned up. Checking the new NSE's, I see errors that the old NSE's cannot be found and no Close events or any other remediation is being called.

This sounds as a bug. Could you attach steps to reproduce and open the issue in cmd-nse-vl3-vpp repostiory?

@denis-tingaikin
Copy link
Member

To fix the NSE connecting to all other NSE's above, I forked cmd-nse-vl3-vpp and added a filter by NSM labels, so that only vl3 NSE's connect.

@yuraxdrumz Be free to open PR :)

@denis-tingaikin denis-tingaikin added bug Something isn't working enhancement New feature or request labels Jun 1, 2022
@yuraxdrumz
Copy link
Author

yuraxdrumz commented Jun 2, 2022

Thanks for the feedback, I will open a PR as soon as its working!
@denis-tingaikin Opened the issue here

I found the issue with my initial problem of connecting NSC -> PT -> NSE <- NSE <- NSC

The initial test run was with 2 NSE's and 2 NSC's, one NSC direct and one NSC through passthrough.
What I saw was:

  1. NSC Connects to Passthrough
  2. Passthrough asks to connect to NSE with label app:nse, which both NSE's have
  3. Passthrough connects to the NSE returned, let's say the NSE has subnet 169.254.0.0/24
  4. Connection is established
  5. Refresh occurs after 2 minutes
  6. Passthrough asks to connect to NSE with label app:nse, which both NSE's have
  7. Passthrough connects to the NSE returned, this time it receives the other NSE which has subnet 169.254.1.0/24
  8. Client sees destination ip address is changed, logs data plane is down and calls .Close

To confirm this issue, I ran the example above with 1 NSE, and everything works.

It seems like there is an issue with stickiness here.
I would think that as soon as I have a healthy e2e connection NSC -> PT -> NSE, the returned NSE that the PT requests should always be the same, unless the connection itself is down.

If I'm right, how does the pinger in the NSC work when there are multiple NSE's with different subnets?

Attaching the log from when the NSC connects to the 2nd NSE and everything fails.
I added logs to make sure the pinger works as expected, so dont mind that.
nsc-pt-9ff8cd865-nmsqw.log

Attaching log from when NSC works through passthrough, when there is only 1 NSE and everything works.
nsc-pt-9ff8cd865-mmhph.log

Thanks

EDIT: tested with regular composition with 2 NSE's with different subnets and I get the same Data plane is down error with .Close event

Attaching kustomize example
passthrough.zip

Attaching cmd-nsc log with Dataplane is down error
alpine-composition.log

Looks like the IP is getting excluded from the request

@edwarnicke
Copy link
Member

@yuraxdrumz

To fix the NSE connecting to all other NSE's above, I forked cmd-nse-vl3-vpp and added a filter by NSM labels, so that only vl3 NSE's connect.

Would it perhaps make sense to have the NSE's themselves advertise a distinct 'interconnect' network service that they use to interconnect between each other that is not the same as the 'vl3' that is advertised to NSCs ?

@yuraxdrumz
Copy link
Author

yuraxdrumz commented Jun 2, 2022

@edwarnicke I thought about it as well. Adding the labels gives enough room to interconnect however I want in the future and is almost zero code changes, do you see a specific use case for your suggestion?

Edit:
I guess we can do the interconnect @edwarnicke suggested and add override by labels if needed, that would cover all cases

@denis-tingaikin denis-tingaikin added this to the v1.5.0 milestone Jun 2, 2022
@denis-tingaikin
Copy link
Member

denis-tingaikin commented Jun 2, 2022

Small issues like hardcoded log level we'll fix in v1.4.0.

We plan to resolve other things in v1.5.0

@edwarnicke
Copy link
Member

@edwarnicke I thought about it as well. Adding the labels gives enough room to interconnect however I want in the future and is almost zero code changes, do you see a specific use case for your suggestion?

Yeah... what I was getting at was something that would just 'work' in an intuitive way. I value your hack to make it work with labels because it is the minimum set of changes that proves the root of the problem (super helpful).

Edit: I guess we can do the interconnect @edwarnicke suggested and add override by labels if needed, that would cover all cases

It would make 'composition' of things before the vL3 work roughly as one would expect.

@yuraxdrumz
Copy link
Author

@denis-tingaikin sounds good.
Can you perhaps point me to a guide / explain how the refresh and the excluded prefixes work, so I could debug the issue?

@edwarnicke I agree

@glazychev-art
Copy link
Contributor

glazychev-art commented Jul 27, 2022

Hi @yuraxdrumz,
We've pushed a few PRs. And they should fix the problems after the refresh.

Could you please check this on the latest main version or on v1.5.0-rc.1?

@denis-tingaikin
Copy link
Member

I hope this should be fixed with networkservicemesh/cmd-nse-firewall-vpp#286

@yuraxdrumz Feel free to reopen this one if we missed something!

@yuraxdrumz
Copy link
Author

Will check it out next week, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants