calico bird component source code optimization #115

xuchuan-666 · 2024-12-13T09:31:11Z

Expected Behavior

When a node in the cluster goes down abnormally or is isolated by a security group, and another node is then added to the cluster or recovers from a failure, calico bgp route establishment on that node takes a long time, about 4 minutes. I expect that adding a node to the cluster or recovering a failed node is not affected by the unreachable node, and that calico bgp routes are established without this delay.

Current Behavior

calico BGP route establishment takes 4 minutes

Possible Solution

bird source code modification

Steps to Reproduce (for bugs)

1. For example, modify the proto/bgp/bgp.c file with the following code:
`static void
bgp_sock_err(sock *sk, int err)
{
  struct bgp_conn *conn = sk->data;
  struct bgp_proto *p = conn->bgp;

  /*
   * This error hook may be called either asynchronously from main
   * loop, or synchronously from sk_send(). But sk_send() is called
   * only from bgp_tx() and bgp_kick_tx(), which are both called
   * asynchronously from main loop. Moreover, they end if err hook is
   * called. Therefore, we could suppose that it is always called
   * asynchronously.
   */

  bgp_store_error(p, conn, BE_SOCKET, err);

  if (err)
    BGP_TRACE(D_EVENTS, "Connection lost (%M)", err);
  else
    BGP_TRACE(D_EVENTS, "Connection closed");

  /* xc add code start */
  if (err == ECONNREFUSED || err == EHOSTUNREACH) {
    log(L_INFO "The link error message is Connection refused or No route to host, clear the host lock");
    proto_graceful_restart_unlock(&p->p);
  }
  /* xc add code end */

  if ((conn->state == BS_ESTABLISHED) && p->gr_ready)
    bgp_handle_graceful_restart(p);

  bgp_conn_enter_idle_state(conn);
}`
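
The errno check in the added code can be isolated into a small predicate, which makes it easier to test on its own and to extend later. The following is only a sketch: `peer_unreachable` is a hypothetical helper name that does not exist in bird, and limiting the set to `ECONNREFUSED` and `EHOSTUNREACH` simply mirrors the modification above.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of bird: returns true for socket errors
 * that indicate the peer cannot be reached at all, so waiting for it
 * during a graceful restart is unlikely to help.
 */
static bool
peer_unreachable(int err)
{
  switch (err)
  {
  case ECONNREFUSED:  /* "Connection refused" */
  case EHOSTUNREACH:  /* "No route to host" */
    return true;
  default:            /* 0 (clean close) and all other errors */
    return false;
  }
}
```

With such a helper, the condition in `bgp_sock_err()` would read `if (peer_unreachable(err))`. Other errors, for example `ETIMEDOUT`, are deliberately left out because the modification above does not handle them.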

Context

Your Environment

Calico version 3.29.1
Orchestrator version 1.32
Operating System and version: linux

MichalFupso · 2024-12-17T17:40:10Z

Hi @xuchuan-666, could you please share logs from calico-node and any bgp configuration you changed?

xuchuan-666 · 2024-12-23T08:50:00Z

I only modified the bird source code and did not change any bgp configuration. I added a log line to the proto_graceful_restart_unlock method; its output is shown in the images below. The effect of the modification is that when a node in the cluster is unreachable over the network, bird can still complete the graceful restart quickly, rather than waiting for the 240s timeout.
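
To illustrate why a single unreachable peer can hold up the whole restart, here is a deliberately simplified model of a lock-counted wait with a timeout. It is an assumption-laden sketch, not bird's implementation: the `gr_model` struct and the `gr_start`, `gr_unlock`, and `gr_tick` functions are hypothetical names, and only the 240s value comes from this issue.

```c
#include <stdbool.h>

#define GR_WAIT_TIMEOUT 240  /* seconds; the timeout mentioned in this issue */

/* Simplified, hypothetical model of a graceful restart wait (not bird code). */
struct gr_model {
  int locks;     /* peers the restart is still waiting for */
  int elapsed;   /* seconds since the restart began */
  bool done;     /* true once the wait has ended */
};

/* Begin a restart that waits for the given number of peers. */
static void
gr_start(struct gr_model *g, int peers)
{
  g->locks = peers;
  g->elapsed = 0;
  g->done = (peers == 0);
}

/* A peer finished its initial sync, or was given up on early. */
static void
gr_unlock(struct gr_model *g)
{
  if (g->locks > 0 && --g->locks == 0)
    g->done = true;
}

/* Advance time; the timeout ends the wait even if locks remain. */
static void
gr_tick(struct gr_model *g, int seconds)
{
  if (g->done)
    return;

  g->elapsed += seconds;
  if (g->elapsed >= GR_WAIT_TIMEOUT)
    g->done = true;
}
```

In this model, if two of three peers unlock but the third never connects, `done` only becomes true after 240 seconds of `gr_tick()`. Calling `gr_unlock()` for the unreachable peer as soon as its connection fails ends the wait immediately, which is the behavior the modification aims for.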

bird code before adjustment:

After code adjustment:
