ldpd: use a timer instead of sleeping in LM init #6274

Merged: 1 commit merged into FRRouting:master from fix_lde_blocking_sleep on Apr 24, 2020

Contributor

@mjstapp mjstapp commented Apr 22, 2020

Man, this synchronous thing is so fragile. Stop sleeping if synchronous label-manager zapi session has trouble during init; retry using a timer instead. Move initial label-block request to a point where the LM zapi session is known to be running.
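
The retry-by-timer idea described above can be sketched in isolation. This is a toy model under stated assumptions, not ldpd's code: `toy_loop`, `toy_add_timer`, `toy_try_connect`, and `sync_session` are hypothetical names, and time is a logical counter rather than FRR's event library. The point is only that a failed attempt re-arms a one-shot timer and returns, so the single thread is never parked in `sleep()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-shot timer on a logical clock (not FRR's thread API). */
typedef void (*timer_cb)(void *arg);

struct toy_timer {
	bool armed;
	long fire_at;
	timer_cb cb;
	void *arg;
};

struct toy_loop {
	long now;               /* logical seconds */
	struct toy_timer timer; /* a single slot is enough for this sketch */
};

static void toy_add_timer(struct toy_loop *loop, timer_cb cb, void *arg,
			  long delay)
{
	loop->timer.armed = true;
	loop->timer.fire_at = loop->now + delay;
	loop->timer.cb = cb;
	loop->timer.arg = arg;
}

/* Advance logical time by one second; run the timer callback if it is due.
 * Returns true when a callback ran. */
static bool toy_loop_tick(struct toy_loop *loop)
{
	loop->now++;
	if (loop->timer.armed && loop->timer.fire_at <= loop->now) {
		struct toy_timer t = loop->timer;

		loop->timer.armed = false; /* one-shot: disarm before running */
		t.cb(t.arg);
		return true;
	}
	return false;
}

struct sync_session {
	struct toy_loop *loop;
	int attempts;
	int succeed_on_attempt; /* simulates the peer becoming reachable */
	bool connected;
	bool labels_requested;
};

/* Stand-in for a connection attempt that may fail early during startup. */
static bool toy_try_connect(struct sync_session *s)
{
	s->attempts++;
	return s->attempts >= s->succeed_on_attempt;
}

static void sync_session_init(void *arg)
{
	struct sync_session *s = arg;

	if (!toy_try_connect(s)) {
		/* Do not sleep: hand control back and retry in 1 second. */
		toy_add_timer(s->loop, sync_session_init, s, 1);
		return;
	}

	s->connected = true;
	/* Only at this point is the session known to be running, so the
	 * first label-block request is issued here. */
	s->labels_requested = true;
}
```

Calling `sync_session_init()` once and then ticking the loop drives the retries; between ticks the loop is free to service other events.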

@NetDEF-CI
Collaborator

NetDEF-CI commented Apr 22, 2020

Continuous Integration Result: SUCCESSFUL

Congratulations, this patch passed basic tests

Tested-by: NetDEF / OpenSourceRouting.org CI System

CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-11996/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

Warnings Generated during build:

Debian 10 amd64 build: Successful with additional warnings

Debian Package lintian failed for Debian 10 amd64 build:
(see full package build log at https://ci1.netdef.org/browse/FRR-FRRPULLREQ-11996/artifact/DEB10BUILD/ErrorLog/log_lintian.txt)

W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr-pythontools: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200422-04-g4b747fad7-0 (missing) -> 7.4-dev-20200422-04-g4b747fad7-0~deb10u1
W: frr-doc: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200422-04-g4b747fad7-0 (missing) -> 7.4-dev-20200422-04-g4b747fad7-0~deb10u1
W: frr-rpki-rtrlib: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200422-04-g4b747fad7-0 (missing) -> 7.4-dev-20200422-04-g4b747fad7-0~deb10u1
W: frr-snmp: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200422-04-g4b747fad7-0 (missing) -> 7.4-dev-20200422-04-g4b747fad7-0~deb10u1
W: frr: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200422-04-g4b747fad7-0 (missing) -> 7.4-dev-20200422-04-g4b747fad7-0~deb10u1

@LabN-CI
Collaborator

LabN-CI commented Apr 22, 2020

Outdated results 💚

Basic BGPD CI results: SUCCESS, 0 tests failed

Result SUCCESS git merge/6274 4b747fa
Date 04/22/2020
Start 14:21:16
Finish 14:47:21
Run-Time 26:05
Total 1815
Pass 1815
Fail 0
Valgrind-Errors 0
Valgrind-Loss 0
Details vncregress-2020-04-22-14:21:16.txt
Log autoscript-2020-04-22-14:22:20.log.bz2
Memory 483 460 425

For details, please contact louberger

Member

@rwestphal rwestphal left a comment

Thanks for doing this. I have just one small request inline.

ldpd/lde.c Outdated

/* Retry using a timer */
thread_add_timer(master, zclient_sync_retry,
(void *)(intptr_t)instance, 1, NULL);
Member

Now that we have the session_id field, ldpd doesn't need the instance hack anymore. I'd appreciate it if you could remove it as part of this PR in order to simplify the code.

Contributor Author

Do you mean "just send a zero" for the instance value?

Member

Yes. ldpd doesn't support multiple instances and probably never will.

Contributor Author

ok, got it - I saw that it does offer a command-line flag to set an instance value...

Member

That was added specifically for the LM instance hack :)

Even bgpd has -I, --int_num, which clearly isn't necessary anymore either.
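
To illustrate why a session identifier makes the instance value redundant here, the following is a minimal sketch. It assumes the server tells client sessions apart by a (protocol, instance, session ID) tuple. The struct, the field widths, `TOY_PROTO_LDP`, and the ID values 0 and 1 are all hypothetical and chosen only for illustration; they are not taken from the FRR sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identity of one client session as seen by the server. */
struct session_key {
	uint8_t proto;       /* which daemon owns the session */
	uint16_t instance;   /* daemon instance; 0 when there is only one */
	uint32_t session_id; /* tells apart several sessions of one daemon */
};

static bool session_key_equal(const struct session_key *a,
			      const struct session_key *b)
{
	return a->proto == b->proto && a->instance == b->instance
	       && a->session_id == b->session_id;
}

enum { TOY_PROTO_LDP = 1 }; /* placeholder value, not a real route type */

/* Main (asynchronous) session: instance 0, session 0. */
static struct session_key toy_main_session(void)
{
	return (struct session_key){ TOY_PROTO_LDP, 0, 0 };
}

/* Synchronous label-manager session: same instance, different session. */
static struct session_key toy_label_session(void)
{
	return (struct session_key){ TOY_PROTO_LDP, 0, 1 };
}
```

With both sessions sending a zero instance, the keys still differ in `session_id`, which is all the server needs to keep them apart.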

Member

Sorry guys, but this will not solve the problem. There is another loop at line 1827 in the lde_label_list_init() function, which is called right after zclient_sync_init() at line 177. If zclient_sync_init() fails, then the call to lde_get_label_chunk() at line 1827 will fail too and enter the same kind of loop.

In fact, as with OSPF-SR or IS-IS-SR, if we can't connect to the Label Manager, we can't start LDP. For OSPF-SR and IS-IS-SR, I removed the loop and return a failure, so Segment Routing doesn't start. I'm wondering if we should do the same here. Without a connection to the Label Manager, LDP has no valid labels and thus can't start properly. The other solution is to move to an asynchronous connection with the Label Manager and trigger the real start of LDP once the first label chunk has been received.
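
The asynchronous alternative mentioned in this comment can be sketched as a small state machine. This is an illustration under assumptions, not the Label Manager API: `toy_daemon`, `toy_request_chunk`, `toy_on_chunk`, and `toy_alloc_label` are hypothetical names. The request returns immediately, and the daemon only moves to its running state when the reply handler delivers the first chunk.

```c
#include <stdbool.h>
#include <stdint.h>

struct label_chunk {
	uint32_t start;
	uint32_t end; /* inclusive */
};

enum toy_state { TOY_WAITING_FOR_LABELS, TOY_RUNNING };

struct toy_daemon {
	enum toy_state state;
	struct label_chunk chunk;
	uint32_t next_label;
	int requests_sent;
};

/* Send the request and return at once; the reply arrives later. */
static void toy_request_chunk(struct toy_daemon *d)
{
	d->requests_sent++;
}

/* Reply handler, invoked from the event loop when a chunk arrives. */
static void toy_on_chunk(struct toy_daemon *d, struct label_chunk c)
{
	d->chunk = c;
	d->next_label = c.start;
	if (d->state == TOY_WAITING_FOR_LABELS)
		d->state = TOY_RUNNING; /* the "real start" of the daemon */
}

/* Allocate one label; fails until a chunk is held and while it has room. */
static bool toy_alloc_label(struct toy_daemon *d, uint32_t *out)
{
	if (d->state != TOY_RUNNING || d->next_label > d->chunk.end)
		return false;
	*out = d->next_label++;
	return true;
}
```

Nothing blocks while the daemon waits: label allocation simply fails until the first chunk has been delivered.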

Member

@odd22 odd22 left a comment

There remains a loop at line 1827 that this PR doesn't solve.

@mjstapp
Contributor Author

mjstapp commented Apr 23, 2020

There remains a loop at line 1827 that this PR doesn't solve.

well, I think we are expecting that if we were able to get to zebra, we will be able to get the LM response too. that's been ... a reasonable expectation for a long time.

I wasn't trying to re-design ldpd or change its basic assumptions about the system it's running in. what I was trying to do here was to show that it's possible to do some retrying if there's some initial delay in connecting with zebra. like anything else in frr, if zebra is really not present, or not responsive, everything is pretty much dead.

Stop sleeping if synchronous label-manager zapi session
has trouble during init: retry using a timer instead. Move
initial label-block request to a point where the LM zapi
session is known to be running. Remove the use of the
daemon 'instance' - we're using the session_id to distinguish
the LM zapi session.

Signed-off-by: Mark Stapp <mjs@voltanet.io>
@odd22
Member

odd22 commented Apr 23, 2020

@mjstapp Agreed. But the timer introduced by this PR doesn't help with the loop at line 1827, so the connection to the Label Manager remains fragile. I think the call to lde_label_list_init() must be moved from line 177 into zclient_sync_init(), on the success path just before the return statement.

@odd22
Member

odd22 commented Apr 23, 2020

Oops! I just saw that you made that change. In that case, I would suggest removing the loop at line 1827.

@odd22
Member

odd22 commented Apr 23, 2020

@mjstapp Finally, after digging around the code of the lde_label_list_init() function: if we want to avoid any problem, the best option is to change lde_label_list_init() by removing the loop and adding a return code, then check that return code at the call site and goto retry in case of failure. Like this:

if (lde_label_list_init() < 0)
        goto retry;

with an lde_label_list_init() function like this:

static int
lde_label_list_init(void)
{
	if (!label_chunk_list) {
		label_chunk_list = list_new();
		label_chunk_list->del = lde_del_label_chunk;
	}

	/* get first chunk; report failure to the caller instead of looping */
	if (lde_get_label_chunk() != 0) {
		log_warnx("Error getting first label chunk!");
		return -1;
	}

	return 0;
}

The remaining problem is to detect whether the Label Manager is ready, and thus to delay the LDP start until the Label Manager is up and provides a first label chunk.

In fact, I think the problem with the lde_label_list_init() loop could occur only with an external Label Manager. When using the internal Zebra Label Manager, once connected, requesting a label chunk will only hang if there are no more labels available.

@mjstapp
Contributor Author

mjstapp commented Apr 23, 2020

@mjstapp Finally, after digging around the code of lde_label_list_init() function, if we would

So, as I said above, I was only trying to offer a demonstration of replacing the init-time sleep() calls with a simple timer to do a retry in case of a timing problem at startup. I don't intend to solve any other problems in ldpd at this time.

@odd22
Member

odd22 commented Apr 23, 2020

@mjstapp

So, as I said above, I was only trying to offer a demonstration of replacing the init-time sleep() calls with a simple timer to do a retry in case of a timing problem at startup. I don't intend to solve any other problems in ldpd at this time.

Agreed, but what do you think about my suggestion to remove the loop in the lde_label_list_init() function?

@mjstapp
Contributor Author

mjstapp commented Apr 23, 2020

Agreed, but what do you think about my suggestion to remove the loop in the lde_label_list_init() function?

I'm sorry to say that I don't think it makes any difference at all. I think the only way that a failure occurs there is through loss of the session with zebra (or if zebra sends invalid data back), and ldpd is not prepared to handle that at all. it's a synchronous call: that's the problem, because it can block/pend the only thread the poor daemon has. the loop/sleep there is sort of just noise, IMO.

@mjstapp mjstapp force-pushed the fix_lde_blocking_sleep branch from 4b747fa to 9d694b0 Compare April 23, 2020 16:28
@mjstapp
Contributor Author

mjstapp commented Apr 23, 2020

Finally got through to github to push the 'instance' change that Renato requested...

@mjstapp
Contributor Author

mjstapp commented Apr 23, 2020

Thanks for doing this. I have just one small request inline.

I've pushed a change to remove the use of 'instance' from this path...

@odd22
Member

odd22 commented Apr 23, 2020

@mjstapp OK. As you want.

@rwestphal what do you think about removing the loop in lde_label_list_init()?

Member

@rwestphal rwestphal left a comment

@mjstapp thanks for the update. LGTM.

@rwestphal what do you think about removing the loop in lde_label_list_init()?

I wouldn't worry about that at the moment. As Mark said, lde_get_label_chunk() will only fail if zebra dies while ldpd is running, and ldpd isn't prepared to handle that at all. So removing the sleep() call from lde_label_list_init() wouldn't make any practical difference.

Also, I have plans to refactor the LM API in order to make all label requests asynchronous. Once that work is done, the synchronous zclient will be gone and we won't need to bother about the reconnection issue anymore.

@NetDEF-CI
Collaborator

Continuous Integration Result: SUCCESSFUL

Congratulations, this patch passed basic tests

Tested-by: NetDEF / OpenSourceRouting.org CI System

CI System Testrun URL: https://ci1.netdef.org/browse/FRR-FRRPULLREQ-12015/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

Warnings Generated during build:

Debian 10 amd64 build: Successful with additional warnings

Debian Package lintian failed for Debian 10 amd64 build:
(see full package build log at https://ci1.netdef.org/browse/FRR-FRRPULLREQ-12015/artifact/DEB10BUILD/ErrorLog/log_lintian.txt)

W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr source: pkg-js-tools-test-is-missing
W: frr source: newer-standards-version 4.4.1 (current is 4.3.0)
W: frr: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200423-00-g9d694b0b0-0 (missing) -> 7.4-dev-20200423-00-g9d694b0b0-0~deb10u1
W: frr-rpki-rtrlib: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200423-00-g9d694b0b0-0 (missing) -> 7.4-dev-20200423-00-g9d694b0b0-0~deb10u1
W: frr-doc: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200423-00-g9d694b0b0-0 (missing) -> 7.4-dev-20200423-00-g9d694b0b0-0~deb10u1
W: frr-snmp: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200423-00-g9d694b0b0-0 (missing) -> 7.4-dev-20200423-00-g9d694b0b0-0~deb10u1
W: frr-pythontools: changelog-file-missing-explicit-entry 6.0-2 -> 7.4-dev-20200423-00-g9d694b0b0-0 (missing) -> 7.4-dev-20200423-00-g9d694b0b0-0~deb10u1

@LabN-CI
Collaborator

LabN-CI commented Apr 23, 2020

💚 Basic BGPD CI results: SUCCESS, 0 tests failed

Results table
Result SUCCESS git merge/6274 9d694b0
Date 04/23/2020
Start 14:23:51
Finish 14:49:49
Run-Time 25:58
Total 1815
Pass 1815
Fail 0
Valgrind-Errors 0
Valgrind-Loss 0
Details vncregress-2020-04-23-14:23:51.txt
Log autoscript-2020-04-23-14:24:49.log.bz2
Memory 458 483 426

For details, please contact louberger

@odd22
Member

odd22 commented Apr 24, 2020

I wouldn't worry about that at the moment. As Mark said, lde_get_label_chunk() will only fail if zebra dies while ldpd is running, and ldpd isn't prepared to handle that at all. So removing the sleep() call from lde_label_list_init() wouldn't make any practical difference.

OK. I'll approve this PR.

Also, I have plans to refactor the LM API in order to make all label requests asynchronous. Once that work is done, the synchronous zclient will be gone and we won't need to bother about the reconnection issue anymore.

I'm looking to use the asynchronous version of the LM for OSPF-SR and IS-IS-SR too, as BGP does. An update to the OSPF-SR PR will come today.

Member

@odd22 odd22 left a comment

LGTM, pending use of the asynchronous version of the Label Manager.

@odd22 odd22 merged commit 814f6fc into FRRouting:master Apr 24, 2020
@mjstapp mjstapp deleted the fix_lde_blocking_sleep branch June 12, 2020 16:19