ldpd: use a timer instead of sleeping in LM init #6274
Conversation
Continuous Integration Result: SUCCESSFUL
Congratulations, this patch passed basic tests.
Tested-by: NetDEF / OpenSourceRouting.org CI System
Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-11996/
This is a comment from an automated CI system.
Warnings generated during build: Debian 10 amd64 build successful with additional warnings; Debian Package lintian failed for the Debian 10 amd64 build.

Outdated results: 💚 Basic BGPD CI results: SUCCESS, 0 tests failed
For details, please contact louberger.
Thanks for doing this. I have just one small request inline.
ldpd/lde.c (outdated)

```c
/* Retry using a timer */
thread_add_timer(master, zclient_sync_retry,
		 (void *)(intptr_t)instance, 1, NULL);
```
Now that we have the session_id field, ldpd doesn't need the instance hack anymore. I'd appreciate it if you could remove it as part of this PR in order to simplify the code.
Do you mean "just send a zero" for the instance value?
Yes. ldpd doesn't support multiple instances and probably never will.
ok, got it - I saw that it does offer a command-line flag to set an instance value...
That was added specifically for the LM instance hack :)
Even bgpd has -I, --int_num, which clearly isn't necessary anymore either.
Sorry guys, but this will not solve the problem. There is another loop at line 1827, in the lde_label_list_init() function, which is called right after zclient_sync_init() at line 177. If zclient_sync_init() fails, then the call to lde_get_label_chunk() at line 1827 will fail and enter the same kind of loop.
In fact, as with OSPF-SR or IS-IS-SR, if we can't get a connection to the Label Manager, we can't start LDP. For OSPF-SR and IS-IS-SR, I removed the loop and return a failure without starting Segment Routing. I'm wondering if we should do the same here. Without a connection to the Label Manager, LDP has no valid label and thus can't start properly. The other solution is to move to an asynchronous connection with the Label Manager and trigger the real start of LDP once we get the first label chunk.
A loop remains at line 1827 that this PR doesn't solve.
Well, I think we are expecting that if we were able to get to zebra, we will be able to get the LM response too. That's been ... a reasonable expectation for a long time. I wasn't trying to re-design ldpd or change its basic assumptions about the system it's running in. What I was trying to do here was to show that it's possible to do some retrying if there's some initial delay in connecting with zebra. Like anything else in FRR, if zebra is really not present, or not responsive, everything is pretty much dead.
Stop sleeping if synchronous label-manager zapi session has trouble during init: retry using a timer instead. Move initial label-block request to a point where the LM zapi session is known to be running. Remove the use of the daemon 'instance' - we're using the session_id to distinguish the LM zapi session. Signed-off-by: Mark Stapp <mjs@voltanet.io>
@mjstapp Agree. But the timer introduced by this PR does not help with the loop at line 1827, so the connection to the Label Manager remains fragile. I think the call to lde_label_list_init() must be moved from line 177 to within zclient_sync_init(), on success, just before the return statement.
Oops! I just saw that you did make that change. So, I would suggest removing the loop at line 1827 in this case.
@mjstapp Finally, after digging around the code of lde_label_list_init(), if we want to avoid any problem, the best option is to change lde_label_list_init() by removing the loop and adding a return code, and then to surround the call to the function with a check of the return code and a retry in case of failure.
The remaining problem is detecting that the Label Manager is not ready, and thus delaying LDP start until the Label Manager is up and provides a first label chunk. In fact, I think the problem with the lde_label_list_init() loop can only occur with an external Label Manager. When using the internal zebra Label Manager, once connected, requesting a label chunk will only hang if there are no more available labels.
So, as I said above, I was only trying to offer a demonstration of replacing the init-time sleep() calls with a simple timer to do a retry in case of a timing problem at startup. I don't intend to solve any other problems in ldpd at this time.
Agree, but what do you think about my suggestion to remove the loop in the lde_label_list_init() function?
I'm sorry to say that I don't think it makes any difference at all. I think the only way that a failure occurs there is through loss of the session with zebra (or if zebra sends invalid data back), and ldpd is not prepared to handle that at all. It's a synchronous call: that's the problem, because it can block/pend the only thread the poor daemon has. The loop/sleep there is sort of just noise, IMO.
Force-pushed from 4b747fa to 9d694b0
Finally got through to GitHub to push the 'instance' change that Renato requested...
I've pushed a change to remove the use of 'instance' from this path...
@mjstapp OK, as you want. @rwestphal what do you think about removing the loop in lde_label_list_init()?
@mjstapp thanks for the update. LGTM.

"@rwestphal what do you think about removing the loop in lde_label_list_init()?"

I wouldn't worry about that at the moment. As Mark said, lde_get_label_chunk() will only fail if zebra dies while ldpd is running, and ldpd isn't prepared to handle that at all. So removing the sleep() call from lde_label_list_init() wouldn't make any practical difference.
Also, I have plans to refactor the LM API in order to make all label requests asynchronous. Once that work is done, the synchronous zclient will be gone and we won't need to bother about the reconnection issue anymore.
Continuous Integration Result: SUCCESSFUL
Congratulations, this patch passed basic tests.
Tested-by: NetDEF / OpenSourceRouting.org CI System
Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-12015/
This is a comment from an automated CI system.
Warnings generated during build: Debian 10 amd64 build successful with additional warnings; Debian Package lintian failed for the Debian 10 amd64 build.

💚 Basic BGPD CI results: SUCCESS, 0 tests failed
For details, please contact louberger.
OK. I'll approve this PR.
I'm looking to use the asynchronous version of the LM for OSPF-SR and IS-IS-SR too, like BGP does. An update to the OSPF-SR PR will come today.
LGTM; waiting on the use of the asynchronous version of the Label Manager.
Man, this synchronous thing is so fragile. Stop sleeping if synchronous label-manager zapi session has trouble during init; retry using a timer instead. Move initial label-block request to a point where the LM zapi session is known to be running.