Pings disappear from some peers to others after one week #3245

netandreus · 2025-01-28T07:23:02Z

Describe the problem

I have 2 peers:

"node-3" with netbird address 100.81.65.41
"node-2" with netbird address 100.81.94.114
supporting HA-routes to multiple vlans.

Also I have:

"uk-node-1" with netbird address 100.81.73.30 and
an ELK VM with HeartBeat and the NetBird client installed, with address 100.81.167.156.

My problem is that after a week these nodes (node-2 and node-3) lose their connection to the ELK node, and the ELK node loses its connection to node-2 and node-3: I can't ping them by their NetBird IP addresses. At the same time, I can still ping other peers from the ELK peer, and I can ping peers other than ELK from node-2 and node-3.

To Reproduce

Steps to reproduce the behavior:

Run two peers in HA mode.
Run a third peer in another location
Wait at least one week
The connection between the peers is lost

Expected behavior

Stable connection between peers.

Are you using NetBird Cloud?

No, I use self-hosted netbird.

NetBird version

0.36.3

NetBird status -dA output:

When I SSH into node-3 I see this:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Last WireGuard handshake: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Transfer status (received/sent) 110.7 MiB/838.7 MiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 129.747422ms

And same from node-2:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=
  Status: Connected
  -- detail --
  Connection type: Relayed
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Last WireGuard handshake: 1 day, 7 hours ago <<<<<<<<<<<<<<<<<
  Transfer status (received/sent) 579.3 KiB/6.3 MiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 130.665153ms
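
The visible symptom in both outputs is a handshake that is more than a day old on a peer that still reports Status: Connected. WireGuard normally renews its session about every two minutes while traffic is flowing, so on a link that carries regular traffic a handshake that is hours old suggests a stale tunnel. A minimal sketch for flagging such peers, assuming the `netbird status -d` field names and the "N unit ago" wording shown above; the `stale_peers` function name is hypothetical, and an idle peer can legitimately show an old handshake:

```shell
#!/bin/sh
# stale_peers: read `netbird status -d` output on stdin and print every peer
# whose last WireGuard handshake age is reported in hours or days.
# Assumption: peer blocks start with a "name:" line and contain a
# "Last WireGuard handshake:" line, as in the output quoted in this report.
stale_peers() {
    awk '
        # Remember the current peer: a line holding only "name:".
        /^[ \t]*[^ \t:]+:[ \t]*$/ { peer = $1; sub(/:$/, "", peer) }
        # Ages given in seconds or minutes are treated as fresh.
        /Last WireGuard handshake:/ && /(hour|day)/ { print peer }
    '
}

# Example (not executed here):
# netbird status -d | stale_peers
```

Fed the node-3 output above, this should print elk-huk-internal.netbird.selfhosted.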

Workaround

After I run:

netbird down
netbird up

the WireGuard connection is re-established.
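
Until the root cause is found, the workaround could be automated. The following is only a sketch, assuming Linux iputils `ping` and the same `netbird down` / `netbird up` pair as above; the function name `reconnect_if_unreachable` is hypothetical:

```shell
#!/bin/sh
# reconnect_if_unreachable PEER_IP
# Pings the peer over its NetBird address; if no reply arrives, cycles the
# local NetBird connection with the down/up pair from the workaround.
reconnect_if_unreachable() {
    peer_ip=$1
    # -c 3: send three probes; -W 2: wait up to 2 s for each reply.
    if ping -c 3 -W 2 "$peer_ip" >/dev/null 2>&1; then
        echo "ok: $peer_ip reachable"
        return 0
    fi
    echo "unreachable: $peer_ip, cycling netbird"
    netbird down && netbird up
}

# Example (hypothetical periodic check against the ELK peer):
# reconnect_if_unreachable 100.81.167.156
```

Checking reachability first avoids cycling a connection that is healthy, which a blind scheduled restart cannot do.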

Related issues

Peer link is being dropped

Can you please fix this?


pappz · 2025-01-28T08:05:35Z

Hello @netandreus

Could you reproduce the issue with verbose logging enabled and send me the logs?

pappz · 2025-01-28T11:24:41Z

@netandreus Could you send me the public keys of node-2 and node-3? The connection mechanism has some logic that depends on the public keys of the two peers. If we know which branch of the algorithm is running on your ELK side and on the node side, we can get closer to the root cause of the issue.

netandreus · 2025-01-28T12:29:44Z

@pappz sure, how can I fetch the public keys of my nodes?

pappz · 2025-01-28T12:32:48Z

> @pappz sure, how can I fetch the public keys of my nodes?

The netbird status -d command prints them. Look for the "Public key" field, as in the example from your original report:

 elk-huk-internal.netbird.selfhosted:
  NetBird IP: 100.81.167.156
  Public key: P0Xd+rb5EjqfaFXIL/KuQ0yGKHT4qTa99Mz4ABrshRA=

netandreus · 2025-01-28T18:49:12Z

@pappz here they are:

node-2:

 node-2.netbird.selfhosted:
  NetBird IP: 100.81.94.114
  Public key: A0/k9FWRkF+JspDjOVIhk0YaaRDvvTZo3C+kEL0feR0=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 10.30.200.47:51820/10.30.200.28:51820
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 9 hours, 45 minutes ago
  Last WireGuard handshake: 1 minute, 49 seconds ago
  Transfer status (received/sent) 97.1 KiB/40.5 KiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 493.405µs

node-3:

 node-3.netbird.selfhosted:
  NetBird IP: 100.81.65.41
  Public key: 9i5W38NvBSXk7oK+v0KeaXn0csEn5AOXIG2mg3TwOAo=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 172.30.1.97:51820/10.30.200.47:51820
  Relay server address: rels://gateway.xxx.com:443
  Last connection update: 9 hours, 47 minutes ago
  Last WireGuard handshake: 1 minute, 48 seconds ago
  Transfer status (received/sent) 25.2 KiB/93.4 KiB
  Quantum resistance: false
  Routes: -
  Networks: -
  Latency: 502.284µs

pappz · 2025-01-28T22:52:56Z

Great, thank you!

Do you know which restart solved the issue:

Restarting the Netbird agent on the ELK side
Restarting the Netbird agent on the node side
Or did it not matter?

netandreus · 2025-01-29T02:44:18Z

@pappz It does not matter what I do on the ELK side; only restarting the NetBird agent (netbird stop && netbird start) on the node side affects the situation.

netandreus · 2025-01-30T05:52:20Z

Good morning, @pappz! Is there anything I can do on my side that would help you?

pappz · 2025-01-30T09:40:19Z

Hello @netandreus,
Thank you for the information. I found a potential bug and I have prepared a fix for it. You can track the changes here. Nevertheless, I am not sure this is the root cause of your issue because I haven't been able to reproduce it on my machine.

How easy is it to reproduce the issue? Could you enable verbose logging on your agent and collect the logs?

netandreus · 2025-01-30T10:27:26Z

Thank you for your efforts, @pappz !
I can only disable the cron restart and wait for one week. What exactly should I do if (or when) it occurs?

pappz · 2025-01-30T11:30:53Z

With these commands, you can set the logging level:
netbird debug log level debug
or
netbird debug log level verbose
The verbose level may log too much data to your disk over a week, so choose the debug level based on your preference. When an issue occurs, please collect the log files and send them to me.

pappz · 2025-01-30T18:13:30Z

@netandreus
When you ping the unreachable server, what error message do you see in the ping output? Is it something like this, or just a plain Destination Host Unreachable?

ubuntu@machine1:~$ ping 100.108.186.247
PING 100.108.186.247 (100.108.186.247) 56(84) bytes of data.
From 100.108.200.99 icmp_seq=1 Destination Host Unreachable
ping: sendmsg: Required key not available
From 100.108.200.99 icmp_seq=2 Destination Host Unreachable
ping: sendmsg: Required key not available
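
The distinction matters because "Required key not available" is the system error text for ENOKEY, which WireGuard generally returns when it has no valid session key for the peer, that is, when the handshake has not completed. A bare "Destination Host Unreachable" does not say that much. A minimal sketch for telling the cases apart from captured ping output, assuming the Linux message wording shown above; the function name and the labels it prints are hypothetical:

```shell
#!/bin/sh
# classify_ping_failure: read ping output on stdin and name the failure mode.
classify_ping_failure() {
    awk '
        /Required key not available/   { key = 1 }      # ENOKEY from sendmsg
        /Destination Host Unreachable/ { unreach = 1 }
        /bytes from/                   { ok = 1 }       # at least one reply
        END {
            if (ok)           print "reachable"
            else if (key)     print "no-wireguard-session-key"
            else if (unreach) print "host-unreachable"
            else              print "no-reply"
        }
    '
}

# Example (not executed here):
# ping -c 3 100.81.167.156 2>&1 | classify_ping_failure
```

The `2>&1` in the example is there because ping writes the sendmsg error to standard error.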

pappz · 2025-01-30T18:15:16Z

I am working on additional logic that can better handle possible anomalies. Would it be an option for you to test a custom build with the patches?

netandreus · 2025-01-31T10:14:04Z

@pappz yes, I can. Should I deploy the custom build on both nodes? How can I roll back if something goes wrong? And how can I collect logs?

pappz · 2025-01-31T16:43:00Z

@netandreus

I prepared the test version. Here is the package for Linux. If you are using a different OS, I will send you different artifacts.

Default installation path is /usr/bin/netbird. Create a backup for easy rollback:

netbird down
cp -a /usr/bin/netbird /usr/bin/netbird.bkp
cat /path/to/downloaded/netbird > /usr/bin/netbird
netbird up
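
The rollback itself is the reverse of these steps. A minimal sketch, assuming the backup was created at /usr/bin/netbird.bkp as above; the `rollback_netbird` function is hypothetical, and depending on how NetBird is installed the daemon may also need a restart before it runs the restored binary:

```shell
#!/bin/sh
# rollback_netbird [BINARY] [BACKUP]
# Restores the binary saved by the backup step. The defaults match the paths
# used in the commands above.
rollback_netbird() {
    bin=${1:-/usr/bin/netbird}
    bkp=${2:-/usr/bin/netbird.bkp}
    if [ ! -f "$bkp" ]; then
        echo "no backup at $bkp" >&2
        return 1
    fi
    netbird down
    # Copy next to the target, then rename over it: the rename replaces the
    # file in one step and leaves the backup in place. cp -a keeps the mode
    # and ownership, as in the backup step.
    cp -a "$bkp" "$bin.new" && mv -f "$bin.new" "$bin" && netbird up
}

# Example (not executed here):
# rollback_netbird
```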

Don't forget to set the proper debug level!

Logs are in /var/log/netbird. Clean them before testing for easier handling.

If you are testing with the previous machines (node-2, node-3), there is no need to update the ELK peer.

And do not forget that this is just a test version, so be careful about using it in a production environment.

I hope this fix will solve your issue, but in the meantime I will dig deeper into this topic.

netandreus · 2025-02-03T06:07:58Z

Good morning, @pappz !
Could you please update the link? I can't download the test version.

pappz · 2025-02-03T11:33:50Z

Strange. Here is the updated link.

pappz · 2025-02-03T11:55:55Z

@netandreus could you send me a debug bundle? You can generate it with this command: netbird debug bundle -S

I would like to get a better picture of your network-related settings. This package contains all the necessary information.

netandreus · 2025-02-03T13:04:34Z

@pappz Sure, but I can't download it. Maybe there is a permissions issue on your side?

When I click on the file, I can only copy its name. Maybe you need my Google account or something else from my side?

pappz · 2025-02-03T13:15:25Z

The file that I uploaded is a ZIP archive. I think you opened it in the browser. It would be easier to download the full ZIP and work with it on your machine.

netandreus · 2025-02-03T14:08:20Z

> With these commands, you can set the logging level:
> netbird debug log level debug
> or
> netbird debug log level verbose
> The verbose level may log too much data to your disk over a week, so choose the debug level based on your preference. When an issue occurs, please collect the log files and send them to me.

Done.

> @netandreus could you send to me a debug bundle? You can generate it with this command: netbird debug bundle -S

Done.

You can find all the files (the logs from when the error occurred on the stable version, and the debug bundle for the test version) here:

https://drive.google.com/drive/folders/1sRO8GprHSPS5LYgUqWg5iP2wCTHm7PHa?usp=drive_link

> I hope this fix will solve your issue but meantime I will dig deeper into this topic.

I deployed the test version on both nodes.

node-2

node-3

Then I restarted netbird on elk node:

And see this one for node-2 at elk

and for node-3 at elk:

netandreus · 2025-02-03T14:26:44Z

@pappz and now I can ping neither node-3 nor node-2 from elk.

pappz · 2025-02-03T18:26:59Z

@netandreus
Thank you for the logs. Could we schedule a call to investigate the situation together? That would resolve the issue faster than going back and forth here.

netandreus · 2025-02-03T20:02:08Z

@pappz sure, we can schedule a call for tomorrow 2025-02-04 from 10:00 GMT+4. I can give you access to Anydesk / ssh to these nodes. Please find me on Telegram - https://t.me/netandreus

pappz · 2025-02-04T07:50:15Z

Thank you. I think you need to accept my messages. https://t.me/pzolinb

netandreus added the triage-needed label Jan 28, 2025

pappz self-assigned this Jan 28, 2025
