federation: parallel sending per instance #4623
Conversation
Force-pushed from 936469a to 5e986ef
I've split the instance worker into three separate files:
```diff
@@ -0,0 +1,149 @@
use crate::util::LEMMY_TEST_FAST_FEDERATION;
```
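Judging by the commit list, the split looks roughly like this (file roles assumed from the commit messages, not verbatim from the diff):

```rust
// crates/federate/src/ — approximate layout after the split (names assumed)
mod inboxes; // collects the inbox URLs an activity must be delivered to
mod send;    // SendActivityTask: loads and delivers one activity
mod worker;  // InstanceWorker: the per-instance sending loop
```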
As a note, I've not made any changes to this code, just moved it into a separate struct.
Update: the code now has minor (non-functional) modifications to allow testing it.
As mentioned in the dev chat, it would be very useful to have some unit tests here to ensure it works as expected.
Any ideas on how best to do that? The only clean way I can think of would be abstracting all DB and HTTP interactions (I guess around 5-10 functions?) into a trait so the whole federate crate code is pure, and then mocking the DB and HTTP interactions with data from memory.
Mocking would be too complicated and could introduce problems of its own. Look at how other tests are implemented, e.g. for db views. Basically write some test data to the db, then call functions and see if they behave as expected. You can start a local server with an inbox route to check that activities are received (with an instance like …)
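A minimal sketch of that approach — a throwaway local inbox that records everything it receives so a test can assert on it. This uses axum and a tokio channel purely for illustration; it is not Lemmy's actual test setup:

```rust
use axum::{extract::State, http::StatusCode, routing::post, Router};
use tokio::sync::mpsc;

// Hypothetical test helper: spawn a local server whose /inbox route
// forwards every received body onto a channel for later assertions.
async fn spawn_inbox() -> (std::net::SocketAddr, mpsc::UnboundedReceiver<String>) {
    let (tx, rx) = mpsc::unbounded_channel();
    let app = Router::new()
        .route(
            "/inbox",
            post(|State(tx): State<mpsc::UnboundedSender<String>>, body: String| async move {
                tx.send(body).ok();
                StatusCode::OK
            }),
        )
        .with_state(tx);
    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
    let addr = listener.local_addr().unwrap();
    tokio::spawn(async move { axum::serve(listener, app).await.unwrap() });
    (addr, rx)
}
```

A test would point an instance's inbox URL at `addr`, trigger some sends, and then drain `rx` to check which activities arrived and in what order.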
```rust
if fail_count > self.state.fail_count {
    // override fail count - if multiple activities are currently sending this value may get
    // conflicting info but that's fine
    self.state.fail_count = fail_count;
```
`self.state.fail_count += fail_count`

Though I guess that won't work because the same activity can report multiple failures. Bit awkward to handle...
This needs to be this way; all alternatives would be worse. The reason is that if 10 simultaneous requests fail within a 1 s period, we don't want the next retry to be exponentially 2**10 s later.
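A small sketch of why taking the max matters, assuming a hypothetical exponential backoff of `2^fail_count` seconds (the exact formula in the crate may differ):

```rust
use std::time::Duration;

// Hypothetical backoff: the delay doubles with every recorded failure.
fn retry_delay(fail_count: u32) -> Duration {
    Duration::from_secs(2u64.saturating_pow(fail_count))
}

// Suppose 10 concurrent sends fail at roughly the same moment.
// Summing the reports: fail_count jumps 0 -> 10, next retry in 2^10 s (~17 min).
// Taking the max of reported counts: fail_count goes 0 -> 1, next retry in 2 s,
// and only repeated *rounds* of failure escalate the delay further.
```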
Yep, I thought about it more and couldn't think of any better solution. It's just that the comment makes it sound like an ugly hack, so maybe it can be written differently.
```rust
        published: None,
        was_skipped: true,
    }))
    .ok();
```
Not sure why you're sending success in the error case.
An error in this location means there is some deeper internal issue with the activity, for example the actor can't be loaded or similar. These issues are probably not solvable by retrying and would cause the federation for this instance to be permanently stuck in a retry loop. So we log the error and skip the activity.
(added this as a comment in the code as well)
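In outline, the control flow looks something like the following sketch (names such as `SendSuccessInfo` and `handle_result` match the snippet above in spirit but are assumptions, not the exact surrounding code):

```rust
// Sketch: an unrecoverable internal error is logged and reported as a
// skipped "success" so the queue can still advance past this activity.
struct SendSuccessInfo {
    activity_id: i64,
    published: Option<String>,
    was_skipped: bool,
}

fn handle_result(
    activity_id: i64,
    result: Result<Option<String>, String>,
    report: impl Fn(SendSuccessInfo),
) {
    match result {
        Ok(published) => report(SendSuccessInfo { activity_id, published, was_skipped: false }),
        Err(e) => {
            // Not retryable (e.g. the actor can't be loaded): log and skip,
            // instead of leaving the whole instance queue stuck retrying.
            eprintln!("skipping activity {activity_id}: {e}");
            report(SendSuccessInfo { activity_id, published: None, was_skipped: true });
        }
    }
}
```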
Once you get tests passing, we can merge.
CI passes now. I've re-removed the double run of federation tests because it's too much of a chore to make them reliably run twice without resetting CI. @Nutomic still needs to approve, I guess.

After this merges, do you want me to do an alpha release to test?

It seems I don't have merge permission anymore, so could one of you merge it please?
Done, I'll try to fix your perms also. |
* federation: parallel sending
* federation: some comments
* lint and set force_write true when a request fails
* inbox_urls return vec
* split inbox functions into separate file
* cleanup
* extract sending task code to separate file
* move federation concurrent config to config file
* off by one issue
* improve msg
* fix both permanent stopping of federation queues and multiple creation of the same federation queues
* fix after merge
* lint fix
* Update crates/federate/src/send.rs

  Co-authored-by: dullbananas <dull.bananas0@gmail.com>
* comment about reverse ordering
* remove crashable, comment
* comment
* move comment
* run federation tests twice
* fix test run
* prettier
* fix config default
* upgrade rust to 1.78 to fix diesel cli
* fix clippy
* delay
* add debug to make localhost urls not valid in ap crate, add some debug logs
* federation tests: ensure server stop after test and random activity id
* ci fix
* add test to federate 100 events
* fix send 100 test
* different data every time so activities are distinguishable
* allow out of order receives in test
* lint
* comment about #4623 (comment)
* move sender for clarity, add comment
* move more things to members
* update test todo comment, use same env var as worker test but default to 1
* remove else below continue
* some more cleanup
* handle todo about smooth exit
* add federate inboxes collector tests
* lint
* actor max length
* don't reset fail count if activity skipped
* fix some comments
* reuse vars
* format
* Update .woodpecker.yml
* fix recheck time
* fix inboxes tests under fast mode
* format
* make i32 and ugly casts
* clippy

---------

Co-authored-by: dullbananas <dull.bananas0@gmail.com>
Currently, with the federation queue implementation from #3605 that was enabled in 0.19, federation is parallel across instances but sequential per receiving instance.
This means that the maximum throughput of activities is limited by the network latency between instances as well as the internal latency for processing a single activity.
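For a rough sense of scale (illustrative numbers, not measurements from this PR): if a single send takes ~250 ms end to end, a sequential queue tops out around 4 activities/s per receiving instance, while 8 concurrent sends raise that ceiling to roughly 32/s.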
There is an extensive discussion here: #4529 (comment), though the issue itself is only about one sub-problem.
This PR changes the federation sending component to send activities in parallel, with a configurable maximum concurrency (e.g. 8). The implementation is more complex than I expected: we need to keep track of the last_successful_id (which must be the highest activity id below which every single activity has been successfully sent), and we need to keep track of failure counts without immediately jumping to hour-long retry delays when 8 concurrent sends fail simultaneously.
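To make the first constraint concrete, here is a minimal sketch (names hypothetical, not the PR's actual code) of advancing last_successful_id through out-of-order completions:

```rust
use std::collections::BTreeSet;

// Tracks which activity ids have completed, and advances the
// "everything at or below this id is done" watermark.
struct SuccessTracker {
    last_successful_id: i64,
    // Completed ids above the watermark, waiting for gaps to fill.
    done_out_of_order: BTreeSet<i64>,
}

impl SuccessTracker {
    fn mark_done(&mut self, id: i64) {
        self.done_out_of_order.insert(id);
        // Advance the watermark while the next id is already done.
        while self.done_out_of_order.remove(&(self.last_successful_id + 1)) {
            self.last_successful_id += 1;
        }
    }
}
```

With this shape, completions can arrive in any order, but last_successful_id only moves once every lower id has actually been sent, which matches the requirement described above.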
The implementation roughly works as follows:
In order for the results of this to still be correct, fixes need to be applied to make all activities commutative (as discussed above).
It should also be possible to apply concurrency only when necessary, since for most instance-to-instance connections it is not needed; that would reduce the ordering issue. This is not implemented here, though.
Currently, this PR fails the federation tests. I think this is due both to a bug somewhere and to the ordering problem.