Pings disappear from some peers to others after one week #3245

Open
netandreus opened this issue Jan 28, 2025 · 25 comments

@netandreus

netandreus commented Jan 28, 2025

Describe the problem

I have 2 peers that provide HA routes to multiple VLANs:

  • "node-3" with NetBird address 100.81.65.41
  • "node-2" with NetBird address 100.81.94.114

Also I have:

  • "uk-node-1" with NetBird address 100.81.73.30, and
  • an ELK VM with Heartbeat and the NetBird client installed, with address 100.81.167.156.

My problem is that after about a week, node-2 and node-3 lose their connection to the ELK node, and the ELK node loses its connection to node-2 and node-3: I can't ping them by their NetBird IP addresses. At the same time, I can still ping other peers from the ELK peer, and I can ping peers other than ELK from node-2 and node-3.

To Reproduce

Steps to reproduce the behavior:

  1. Run two peers in HA mode.
  2. Run a third peer at another location.
  3. Wait for at least one week.
  4. The connection between the peers is lost.

Expected behavior

Stable connection between peers.

Are you using NetBird Cloud?

No, I use self-hosted netbird.

NetBird version

0.36.3

NetBird status -dA output:

When I SSH into node-3, I see this:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Last WireGuard handshake: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Transfer status (received/sent) 110.7 MiB/838.7 MiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 129.747422ms

And same from node-2:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Last WireGuard handshake: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Transfer status (received/sent) 579.3 KiB/6.3 MiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 130.665153ms
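
The two outputs above show a peer reported as "Connected" while its last WireGuard handshake is more than a day old. A minimal sketch for spotting such peers automatically is given below. It assumes the `netbird status -d` layout shown in this report (a line holding only the peer name followed by a colon, then a "Last WireGuard handshake:" line), and it treats any age measured in hours or days as stale, which is an arbitrary threshold chosen for illustration. The function name `stale_peers` is hypothetical.

```shell
# Read `netbird status -d` style text on stdin and print "<peer><TAB><age>"
# for every peer whose last WireGuard handshake is hours or days old.
# Assumption: the layout matches the output quoted in this thread.
stale_peers() {
  awk '
    # A peer block starts with a line holding only "<name>:".
    /^[ \t]*[^ \t:]+:[ \t]*$/ {
      peer = $1
      sub(/:$/, "", peer)
      next
    }
    /Last WireGuard handshake:/ {
      age = $0
      sub(/^.*Last WireGuard handshake:[ \t]*/, "", age)
      # Drop trailing "<<<<" markers such as the ones added in this report.
      sub(/[ \t]*<+[ \t]*$/, "", age)
      if (age ~ /(hour|day)/) print peer "\t" age
    }
  '
}
```

It could be run as `netbird status -d | stale_peers`; an empty result would mean no peer crossed the assumed threshold.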

Workaround

After I run:

netbird down
netbird up

the WireGuard connection is re-established.
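
As a stopgap, the workaround can be limited to the moments when the peer is actually unreachable instead of restarting on a fixed schedule. The sketch below is only an illustration: the function name `reconnect_if_unreachable` and the `PING_CMD` / `NETBIRD_CMD` variables are hypothetical, and the ping flags assume Linux iputils. It uses only the `netbird down` and `netbird up` commands from the workaround.

```shell
# Commands are held in variables so they can be replaced in a dry run
# (assumption made for this sketch, not a NetBird convention).
PING_CMD=${PING_CMD:-ping}
NETBIRD_CMD=${NETBIRD_CMD:-netbird}

# Ping a peer by its NetBird IP and cycle the connection only if it fails.
reconnect_if_unreachable() {
  peer_ip=$1
  # Three probes with a two-second timeout each (Linux iputils flags).
  if "$PING_CMD" -c 3 -W 2 "$peer_ip" >/dev/null 2>&1; then
    echo "reachable: $peer_ip"
    return 0
  fi
  echo "unreachable: $peer_ip, cycling the NetBird connection"
  "$NETBIRD_CMD" down && "$NETBIRD_CMD" up
}
```

On node-2 or node-3 this could be called from a scheduled job as `reconnect_if_unreachable 100.81.167.156`. It masks the symptom and does not address the root cause.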

Related issues

Can you please fix this?

@pappz
Contributor

pappz commented Jan 28, 2025

Hello @netandreus

Could you reproduce the issue with verbose logging enabled and send me the logs?

@pappz pappz self-assigned this Jan 28, 2025
@pappz
Contributor

pappz commented Jan 28, 2025

@netandreus Could you send me the public keys of node-2 and node-3? The connection mechanism has some logic that depends on the public keys of the two peers. If we know which branch of the algorithm runs on the ELK side and which on the node side, we can get closer to the root cause of the issue.

@netandreus
Author

@pappz sure, how can I fetch the public keys of my nodes?

@pappz
Contributor

pappz commented Jan 28, 2025

@pappz sure, how can I fetch the public keys of my nodes?

The netbird status -d command prints them. Look for the "Public key" field, just like in the example from your original report:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=

@netandreus
Author

@pappz here they are:

node-2:

 node-2.netbird.selfhosted:
  NetBird IP: 100.81.94.114
  Public key: A0/k9FWRkF+JspDjOVIhk0YaaRDvvTZo3C+kEL0feR0=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 10.30.200.47:51820/10.30.200.28:51820
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 9 hours, 45 minutes ago
  Last WireGuard handshake: 1 minute, 49 seconds ago
  Transfer status (received/sent) 97.1 KiB/40.5 KiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 493.405µs

node-3:

 node-3.netbird.selfhosted:
  NetBird IP: 100.81.65.41
  Public key: 9i5W38NvBSXk7oK+v0KeaXn0csEn5AOXIG2mg3TwOAo=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 172.30.1.97:51820/10.30.200.47:51820
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 9 hours, 47 minutes ago
  Last WireGuard handshake: 1 minute, 48 seconds ago
  Transfer status (received/sent) 25.2 KiB/93.4 KiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 502.284µs

@pappz
Contributor

pappz commented Jan 28, 2025

Great, thank you!

Do you know which restart solved the issue:

  • Restarting the Netbird agent on the ELK side
  • Restarting the Netbird agent on the node side
  • Or did it not matter?

@netandreus
Author

@pappz It does not matter what I do on the ELK side; only restarting the NetBird agent (netbird stop && netbird start) on the node side affects the situation.

@netandreus
Author

netandreus commented Jan 30, 2025

Good morning, @pappz! Is there anything I can do on my side to help you?

@pappz
Contributor

pappz commented Jan 30, 2025

Hello @netandreus,
Thank you for the information. I found a potential bug and I have prepared a fix for it. You can track the changes here. Nevertheless, I am not sure this is the root cause of your issue because I haven't been able to reproduce it on my machine.

How easy is it to reproduce the issue? Could you enable verbose logging on your agent and collect the logs?

@netandreus
Author

Thank you for your efforts, @pappz !
I can only disable the cron restart and wait for one week. What exactly should I do if (or when) it occurs?

@pappz
Contributor

pappz commented Jan 30, 2025

With these commands, you can set the logging level:
netbird debug log level debug
or
netbird debug log level verbose
The verbose level may log too much data to your disk over a week, so choose the debug level based on your preference. When an issue occurs, please collect the log files and send them to me.
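
Since the concern here is disk usage over a week of logging, a small check on the log directory may help. This is a minimal sketch: the function name `check_log_usage` is hypothetical, the 1 GiB default limit is an arbitrary example, and `/var/log/netbird` is the log location mentioned elsewhere in this thread.

```shell
# Report how much space a log directory uses and flag it when it exceeds
# a limit given in KiB. Defaults are assumptions for illustration only.
check_log_usage() {
  dir=${1:-/var/log/netbird}
  limit_kib=${2:-1048576}   # 1 GiB, arbitrary example threshold
  used_kib=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
  if [ -z "$used_kib" ]; then
    echo "cannot read $dir" >&2
    return 2
  fi
  if [ "$used_kib" -gt "$limit_kib" ]; then
    echo "over: ${used_kib} KiB used in $dir (limit ${limit_kib} KiB)"
  else
    echo "ok: ${used_kib} KiB used in $dir (limit ${limit_kib} KiB)"
  fi
}
```

Running `check_log_usage` once a day during the test week would show whether the chosen log level is filling the disk faster than expected.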

@pappz
Contributor

pappz commented Jan 30, 2025

@netandreus
When you ping the unreachable server, what error message do you see in the ping output? Is it something like this, or just a plain Destination Host Unreachable?

ubuntu@machine1:~$ ping 100.108.186.247
PING 100.108.186.247 (100.108.186.247) 56(84) bytes of data.
From 100.108.200.99 icmp_seq=1 Destination Host Unreachable
ping: sendmsg: Required key not available
From 100.108.200.99 icmp_seq=2 Destination Host Unreachable
ping: sendmsg: Required key not available
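
To answer this question consistently across several nodes, the ping output can be classified by the two messages shown above. A minimal sketch follows; the function name `classify_ping_output` and the labels it prints are hypothetical, and it only matches the literal strings from the example output without interpreting what they mean.

```shell
# Read ping output (stdout and stderr combined) on stdin and print one label.
# Only the literal messages quoted in this thread are recognised.
classify_ping_output() {
  awk '
    /Required key not available/   { key = 1 }
    /Destination Host Unreachable/ { unreachable = 1 }
    /bytes from/                   { reply = 1 }
    END {
      if (reply)            print "reply-received"
      else if (key)         print "required-key-not-available"
      else if (unreachable) print "destination-host-unreachable"
      else                  print "no-reply"
    }
  '
}
```

Because the `sendmsg` line is written to standard error, the call needs `2>&1`, for example `ping -c 3 100.81.167.156 2>&1 | classify_ping_output`.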

@pappz
Contributor

pappz commented Jan 30, 2025

I am working on additional logic that can better handle possible anomalies. Would you be able to test a custom build with the patches?

@netandreus
Author

@pappz yes, I can. Should I deploy the custom build on both nodes? How can I roll back if something goes wrong? And how can I collect the logs?

@pappz
Contributor

pappz commented Jan 31, 2025

@netandreus

I prepared the test version. Here is the package for Linux. If you are using a different OS, I will send you different artifacts.

Default installation path is /usr/bin/netbird. Create a backup for easy rollback:

netbird down
cp -a /usr/bin/netbird /usr/bin/netbird.bkp
cat /path/to/downloaded/netbird > /usr/bin/netbird
netbird up
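
The rollback itself is not spelled out, so here is a minimal sketch of what it might look like, assuming the backup was created as `/usr/bin/netbird.bkp` by the steps above. The function name `rollback_netbird` and the `NETBIRD_CMD` variable are hypothetical. The backup is copied next to the target and then renamed, so the binary is never left half-written. Whether the background service also needs a restart to pick up the restored binary is not covered in this thread.

```shell
# The CLI is held in a variable so the sketch can be dry-run (assumption).
NETBIRD_CMD=${NETBIRD_CMD:-netbird}

# Restore the binary from "<path>.bkp"; the default path is the one given above.
rollback_netbird() {
  bin=${1:-/usr/bin/netbird}
  backup="$bin.bkp"
  if [ ! -f "$backup" ]; then
    echo "no backup at $backup" >&2
    return 1
  fi
  # Keep going even if the test build fails to disconnect cleanly.
  "$NETBIRD_CMD" down || true
  # Copy beside the target, then rename over it in one step.
  cp -a "$backup" "$bin.restore" && mv -f "$bin.restore" "$bin" || return 1
  "$NETBIRD_CMD" up
}
```

Calling `rollback_netbird` with no argument acts on the default installation path; the backup file is left in place so the rollback can be repeated.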

Don't forget to set the proper debug level!

Logs are in /var/log/netbird. Clean them before testing for easier handling.

If you are testing with the same machines as before (node-2, node-3), there is no need to update the ELK peer.

And do not forget, this is just a test version, so be careful about using it in a production environment.

I hope this fix will solve your issue, but in the meantime I will dig deeper into this topic.

@netandreus
Author

Good morning, @pappz !
Can you please update the link? I can't download the test version.

Image

@pappz
Contributor

pappz commented Feb 3, 2025

Strange. Here is the updated link.

@pappz
Contributor

pappz commented Feb 3, 2025

@netandreus could you send me a debug bundle? You can generate it with this command: netbird debug bundle -S

I would like to get a better picture of your network-related settings. This package contains all the necessary information.

@netandreus
Author

@pappz Sure, but I can't download it. Maybe there is a permissions issue on your side?

Image

When I click on the file, I can only copy its name. Maybe you need my Google account or something else from my side?

@pappz
Contributor

pappz commented Feb 3, 2025

The file that I uploaded is a ZIP archive. I think you opened it in the browser. It would be easier to download the full ZIP and work with it on your machine.

@netandreus
Author

netandreus commented Feb 3, 2025

With these commands, you can set the logging level:
netbird debug log level debug
or
netbird debug log level verbose
The verbose level may log too much data to your disk over a week, so choose the debug level based on your preference. When an issue occurs, please collect the log files and send them to me.

Done.

@netandreus could you send to me a debug bundle? You can generate it with this command: netbird debug bundle -S

Done.

You can find all the files (the logs from when the error occurred on the stable version and the debug bundle for the test version) here:

https://drive.google.com/drive/folders/1sRO8GprHSPS5LYgUqWg5iP2wCTHm7PHa?usp=drive_link

I hope this fix will solve your issue but meantime I will dig deeper into this topic.

I deployed the test version on both nodes.

node-2
Image

node-3
Image

Then I restarted NetBird on the ELK node:

Image

And I see this for node-2 on the ELK node:

Image

and for node-3 on the ELK node:

Image

@netandreus
Author

@pappz and now I can't ping either node-3 or node-2 from ELK.

@pappz
Contributor

pappz commented Feb 3, 2025

@netandreus
Thank you for the logs. Can we schedule a call where we can investigate the situation together? That would resolve this issue faster than going back and forth here.

@netandreus
Author

netandreus commented Feb 3, 2025

@pappz sure, we can schedule a call for tomorrow 2025-02-04 from 10:00 GMT+4. I can give you access to Anydesk / ssh to these nodes. Please find me on Telegram - https://t.me/netandreus

@pappz
Contributor

pappz commented Feb 4, 2025

Thank you. I think you need to accept my messages. https://t.me/pzolinb
