[FEATURE] Optimization of linkcheck builder's workers (follow-up #11324) #11346
Comments
Also see the related suggestion for improvement in #4303.
There's an idea in #9592 that could be worth considering: the linkchecker could keep a record of hosts where TLS failures have occurred, and avoid re-checking those. (I think the key for such a lookup table would be the fully-qualified domain name plus the TCP port number, but please correct me if I'm wrong. Also, this wouldn't have to be implemented at the same time as any other optimizations, but if it seems worthwhile, then designing with the possibility of adding it later would be good.)
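Such a lookup table could look roughly like the following (a minimal sketch; the class and method names are hypothetical, not Sphinx's actual API):

```python
# Hypothetical cache of hosts with TLS failures, keyed by
# (fully-qualified domain name, TCP port) as suggested above.
from urllib.parse import urlsplit


class TLSFailureCache:
    def __init__(self):
        self._failed = set()  # {(hostname, port), ...}

    def _key(self, url):
        parts = urlsplit(url)
        # Fall back to the scheme's default port when none is given.
        port = parts.port or (443 if parts.scheme == "https" else 80)
        return (parts.hostname, port)

    def record_failure(self, url):
        self._failed.add(self._key(url))

    def should_skip(self, url):
        return self._key(url) in self._failed
```

After `record_failure("https://bad.example/x")`, any further URL on `bad.example:443` would be skipped, while the same host on another port would still be checked.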
Depending on the target server configuration, some links on a host may be accessible without triggering a TLS failure while others are not, so discarding all of them may be a bit too much; however, that situation is quite unlikely, so I would agree to blacklist the whole domain.
If we blacklist by DNS A or AAAA records, we may blacklist fallback IPs that could have been fine. I'd prefer not to, but I'm not sure how to actually force the DNS to switch to another IP if one fails except using an
Do you think that invalidating the session if an
By invalidating the session, do you mean closing it naturally? If so, I would say "yes", because I don't think it's good to have pending resources. It might be better to actually push failing links to the end of the process instead of trying to check them again and again. One reason is that it may help us deal in advance with links that would fail because of a blacklisted domain, instead of letting them be processed by another worker. One challenge is to have a shared object containing the blacklisted domains, but with some manager it should be easy to do (plus ensuring that operations are atomic).
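For thread-based workers, the shared blacklist with atomic operations mentioned above could be as simple as a set guarded by a lock (a sketch under that assumption; a `multiprocessing.Manager` would only be needed if the workers were separate processes):

```python
import threading


class DomainBlacklist:
    """Hypothetical thread-safe set of blacklisted domains."""

    def __init__(self):
        self._lock = threading.Lock()
        self._domains = set()

    def add(self, domain):
        # The lock makes add/lookup atomic across worker threads.
        with self._lock:
            self._domains.add(domain)

    def __contains__(self, domain):
        with self._lock:
            return domain in self._domains
```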
I think so, yes. My idea was probably slightly imprecise, even to me, but my goal is: let's find an approach that implements most of the behaviour and optimization that we want, while also being straightforward to implement and understand. If we imagine the domain-grouped hyperlinks as stacks on pallets that arrive via the entrance of a virtual factory floor, then I think what I'm suggesting is that if any individual pallet's stack doesn't pass an inspection check (TLS connection setup), then in v1 of the workflow we simply skip the rest of the workflow for that pallet's contents (all the items for that domain) and redirect it to one of the exits. That would avoid the need to share and/or communicate any context across the factory workflows (shared objects across threads). Maybe not optimal long-term, but a simple initial implementation (and hopefully easier to review).
Implementation-wise, it makes sense. The thing is, we could abstract your "stack of pallets" so that we get an elegant flow:
The splitting phase should be done by the main thread. The pallet validation can be done in a multi-threaded way because each pallet should be independent. Then, the actual validation would also be multi-threaded. The error callback would then aggregate the different bad pallets, deciding whether to resend them for a validation phase, possibly after changing them (pruning some parts, splitting them even further; anything can actually be done), and we repeat the process. Using a pallet approach should allow us to plug in a bunch of additional steps. Sharing a state can be done separately, and this would allow adding a dependency between pallets if needed, but as a v1 we don't necessarily need it. What I fear most is that, while it may really be good on paper, I am unsure how much improvement we will gain (maybe we won't gain any!).
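The split / validate / check / aggregate flow described above could be sketched like this (purely illustrative; `validate` and `check` are hypothetical callbacks standing in for the TLS probe and the actual link check):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def split_into_pallets(links):
    # Splitting phase, done by the main thread: one pallet per domain.
    pallets = defaultdict(list)
    for link in links:
        pallets[urlsplit(link).hostname].append(link)
    return pallets


def process_pallet(domain, links, validate, check):
    # Pallet validation: one cheap probe per domain; if it fails,
    # every link of the pallet is redirected to the exit unchecked.
    if not validate(domain):
        return [(link, "skipped: blacklisted domain") for link in links]
    return [(link, check(link)) for link in links]


def run(links, validate, check, workers=5):
    # Pallets are independent, so they can be processed concurrently;
    # the main thread then aggregates all results.
    pallets = split_into_pallets(links)
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_pallet, d, ls, validate, check)
                   for d, ls in pallets.items()]
        for fut in futures:
            results.extend(fut.result())
    return results
```

The aggregation step at the end is where an error callback could decide to reshape and resubmit bad pallets, as described above.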
This is a follow-up of #11324 (comment) to formalize #11324 (comment)
The `linkcheck` builder extracts all the links that should be checked and puts them in a queue so that a `HyperlinkAvailabilityCheckWorker` can check each link for availability. The number of workers (say n) is fixed at the start of the build and does not change. Since we are slowly moving to using session objects instead of standalone requests, at any point in time there should be at most n sessions running, one for each worker.

Assume that there are two workers and three links to check, say `bar.org/x`, `bar.org/y` and `foo.org`. The queue may look like `Q = ['bar.org/x', 'foo.org', 'bar.org/y']`. Then Worker 1 and Worker 2 check `Q[0]` and `Q[1]` simultaneously. If Worker 2 finishes before Worker 1, then Worker 2 checks `Q[2]`. Ideally, Worker 1 should have been the one to check that link, since it would not have needed to open a new session.

Since the links to check are known before checking anything, one can first pre-process them and reorganize the queue, e.g. by grouping the links of the same domain together.
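That pre-processing step could be sketched as follows (assuming, for illustration, scheme-less URLs as in the example above; the helper names are hypothetical):

```python
from itertools import groupby
from urllib.parse import urlsplit


def domain(url):
    # Use the parsed hostname when a scheme is present; otherwise take
    # the first path segment (covers bare examples like 'bar.org/x').
    return urlsplit(url).hostname or url.split("/")[0]


def group_by_domain(queue):
    # Stable sort keeps the original order of links within a domain.
    keyed = sorted(queue, key=domain)
    return [list(links) for _, links in groupby(keyed, key=domain)]
```

`group_by_domain(['bar.org/x', 'foo.org', 'bar.org/y'])` yields `[['bar.org/x', 'bar.org/y'], ['foo.org']]`, so one worker can check all of `bar.org` over a single session.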
Alternatively, we could assume that only one worker is responsible for processing a given domain and the others only help if there is nothing else to do. If there are 15 links for domain 1 and 5 links for domain 2, the flow is as follows:

- `t=0` --- Worker 1 processes links 1 to 5 of domain 1 and Worker 2 processes all links of domain 2.
- `t=1` --- Worker 1 processes links 6 to 10 of domain 1 and Worker 2 processes links 11 to 15 of domain 1.

The idea is to balance the blocks as much as possible so that the number of TCP connections to open is as small as possible and there is no waiting time between two checks.
This feature probably has a very low priority: the implementation will not be trivial, and we may need a PoC to know whether it is really worth implementing. For projects with many links across different domains, the current implementation is suitable, but for projects using `sphinx.ext.intersphinx` this may be of interest. Still, it is good to have an open issue where we can discuss it.

Related: