
federation: parallel sending per instance #4623

Merged (60 commits) Jul 21, 2024
Conversation

@phiresky (Collaborator) commented Apr 13, 2024

Currently, with the implementation of the federation queue from #3605 that was enabled in 0.19, the federation is parallel across instances but sequential per receiving instance.

This means that the maximum throughput of activities is limited by the network latency between instances as well as the internal latency for processing a single activity.

There is an extensive discussion in #4529 (comment), though that issue itself is only about one sub-problem.

This PR changes the federation sending component to send activities in parallel, with a configurable maximum concurrency (e.g. 8). The implementation is more complex than I expected: we need to keep track of the last_successful_id (which must be the highest activity id below which every single activity has been successfully sent), and we need to keep track of failure counts without immediately jumping to hour-long retry delays when 8 concurrent sends fail simultaneously.

The implementation roughly works as follows:

  1. We have a main loop that waits for new activities and spawns send tasks for them when the conditions are right
  2. We create a multi-producer, single-consumer channel; each send task reports its result to the main loop through this channel:
    • If the send succeeded, the activity id is added to an in-memory priority queue; whenever the lowest ID in the queue equals the previous ID + 1, we increment the last_successful_id counter
    • If the send failed, the fail_count and last retry time are updated. We completely ignore the activity id here and always take the maximum (but not the sum) of the fail_count across all in-flight sends
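The success-tracking part of step 2 can be sketched roughly like this. This is a simplified, std-only illustration; the struct and method names are hypothetical and do not match the PR's actual code:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Tracks the highest activity id below which every send has succeeded.
/// (Hypothetical sketch; names are illustrative, not from the PR.)
struct SuccessTracker {
    last_successful_id: i64,
    // min-heap of ids that completed out of order
    pending: BinaryHeap<Reverse<i64>>,
}

impl SuccessTracker {
    fn new(last_successful_id: i64) -> Self {
        Self { last_successful_id, pending: BinaryHeap::new() }
    }

    /// Record a successful send and advance the counter as far as possible.
    fn report_success(&mut self, id: i64) {
        self.pending.push(Reverse(id));
        // pop ids while they are exactly the next expected one
        while let Some(&Reverse(lowest)) = self.pending.peek() {
            if lowest == self.last_successful_id + 1 {
                self.pending.pop();
                self.last_successful_id = lowest;
            } else {
                break;
            }
        }
    }
}

fn main() {
    let mut t = SuccessTracker::new(0);
    t.report_success(2); // out of order: counter must not advance yet
    assert_eq!(t.last_successful_id, 0);
    t.report_success(1); // fills the gap: counter advances past both
    assert_eq!(t.last_successful_id, 2);
    t.report_success(3);
    assert_eq!(t.last_successful_id, 3);
}
```

The min-heap makes the "lowest ID equals previous ID + 1" check cheap even when many sends complete out of order.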

In order for the results of this to still be correct, fixes need to be applied to make all activities commutative (as discussed above).

It should also be possible to make the concurrency kick in only when necessary, since for most instance-to-instance connections it is not needed; that would reduce the ordering issue. This is not implemented here though.

Currently, this PR fails the federation tests. I think this is both due to a bug somewhere as well as due to the ordering problem.

(10 review threads on crates/federate/src/worker.rs, resolved)
@phiresky force-pushed the federation-send-parallel branch from 936469a to 5e986ef on April 15, 2024
@phiresky (Collaborator, Author):
I've split the instance worker into three separate files:

  • worker.rs - main code
  • inboxes.rs - code that makes sure the community lists are up to date and calculates the inboxes
  • send.rs - a single send-retry-loop task that runs concurrently

@@ -0,0 +1,149 @@
use crate::util::LEMMY_TEST_FAST_FEDERATION;
@phiresky (Collaborator, Author):
As a note, I've not made any changes to this code, just moved it into a separate struct.

@phiresky (Collaborator, Author):
Update: the code now has minor (non-functional) modifications to allow testing it.

@Nutomic (Member) commented Apr 16, 2024

As mentioned in the dev chat it would be very useful to have some unit tests here, to ensure it works as expected.

@phiresky (Collaborator, Author):
As mentioned in the dev chat it would be very useful to have some unit tests here, to ensure it works as expected.

Any ideas on how best to do that? The only clean way I can think of would be abstracting all DB and HTTP interactions (I guess it would be like 5-10 functions?) into a trait so the whole federate crate code is pure, and then mocking the DB interactions and HTTP interactions with data from memory.
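The trait-based abstraction proposed here could look roughly like the following sketch. All names are purely illustrative assumptions, not the actual Lemmy API:

```rust
/// Hypothetical trait abstracting the DB/HTTP side effects of the
/// federate crate, so the queue logic can be tested against memory.
trait FederationIo {
    fn read_next_activities(&mut self, after_id: i64, limit: usize) -> Vec<(i64, String)>;
    fn send_activity(&mut self, inbox_url: &str, body: &str) -> Result<(), String>;
}

/// In-memory mock: activities come from a Vec, sends are recorded.
struct MemoryIo {
    activities: Vec<(i64, String)>,
    sent: Vec<(String, String)>,
}

impl FederationIo for MemoryIo {
    fn read_next_activities(&mut self, after_id: i64, limit: usize) -> Vec<(i64, String)> {
        self.activities
            .iter()
            .filter(|&&(id, _)| id > after_id)
            .take(limit)
            .cloned()
            .collect()
    }

    fn send_activity(&mut self, inbox_url: &str, body: &str) -> Result<(), String> {
        // record instead of doing a real HTTP request
        self.sent.push((inbox_url.to_string(), body.to_string()));
        Ok(())
    }
}

fn main() {
    let mut io = MemoryIo {
        activities: vec![(1, "a".into()), (2, "b".into())],
        sent: vec![],
    };
    let batch = io.read_next_activities(0, 10);
    assert_eq!(batch.len(), 2);
    for (_, body) in &batch {
        io.send_activity("https://example.com/inbox", body).unwrap();
    }
    assert_eq!(io.sent.len(), 2);
}
```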

@Nutomic (Member) commented Apr 17, 2024

Mocking would be too complicated and could introduce problems of its own. Look at how other tests are implemented, e.g. for db views. Basically write some test data to the db, then call functions and check that they behave as expected. You can start a local server with an inbox route to check that activities are received (with an instance like localhost:8123). For start_stop_federation_workers() there should be some way to check the contents of the workers variable, e.g. by moving it into a struct and moving the function into an impl block.

@dessalines dessalines added this to the 0.19.5 milestone Apr 29, 2024
if fail_count > self.state.fail_count {
// override fail count - if multiple activities are currently sending this value may get
// conflicting info but that's fine
self.state.fail_count = fail_count;
@Nutomic (Member) commented Jul 16, 2024
self.state.fail_count += fail_count

Though I guess that won't work because the same activity can report multiple failures. Bit awkward to handle...

@phiresky (Collaborator, Author):
This needs to be this way, all alternatives would be worse. The reason is that if 10 simultaneous requests fail within a 1s period, we don't want the next retry to be exponentially 2**10 s later.

@Nutomic (Member):
Yep, I thought about it more and couldn't think of any better solution. It's just that the comment makes it sound like an ugly hack, so maybe it can be written differently.
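The max-not-sum merge discussed above can be sketched as follows. This is a hypothetical simplification (function names and the exact backoff formula are assumptions, not the PR's code), but it shows why taking the maximum keeps the retry delay sane when many concurrent sends fail at once:

```rust
use std::time::Duration;

/// Exponential backoff: 2^fail_count seconds, capped at one hour.
/// (Assumed formula for illustration.)
fn retry_delay(fail_count: u32) -> Duration {
    let secs = 2u64.saturating_pow(fail_count).min(60 * 60);
    Duration::from_secs(secs)
}

/// Merge a fail_count reported by one send task into the shared state:
/// take the max, never the sum, so concurrent failures don't compound.
fn merge_fail_count(current: u32, reported: u32) -> u32 {
    current.max(reported)
}

fn main() {
    // 10 sends fail at roughly the same time, each reporting fail_count = 1
    let mut state = 0;
    for _ in 0..10 {
        state = merge_fail_count(state, 1);
    }
    assert_eq!(state, 1);
    assert_eq!(retry_delay(state), Duration::from_secs(2));
    // summing instead would have given fail_count = 10,
    // i.e. a 2^10 s (~17 minute) delay after a single transient outage
}
```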

published: None,
was_skipped: true,
}))
.ok();
@Nutomic (Member):
Not sure why you're sending success in the error case.

@phiresky (Collaborator, Author):
An error in this location means there is some deeper internal issue with the activity, for example the actor can't be loaded or similar. These issues are probably not solvable by retrying and would cause the federation for this instance to be permanently stuck in a retry loop. So we log the error and skip the activity.

(added this as a comment in the code as well)
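The skip behavior described above can be sketched as follows (a hypothetical simplification with invented names, not the actual code): preparation errors are treated as permanent and reported as a skipped "success" so the queue can move on.

```rust
/// Outcome of one activity, as reported back to the main loop.
/// `Skipped` corresponds to the success notification with was_skipped = true.
#[derive(Debug, PartialEq)]
enum Outcome {
    Sent,
    Skipped,
}

/// Errors while *preparing* an activity (e.g. its actor cannot be loaded)
/// will not be fixed by retrying, so skip instead of retrying forever.
fn send_or_skip(prepare: Result<String, String>) -> Outcome {
    match prepare {
        Ok(_body) => {
            // ...the real HTTP send with retries would happen here...
            Outcome::Sent
        }
        Err(e) => {
            // log and skip so this instance's queue is not stuck permanently
            eprintln!("skipping activity: {e}");
            Outcome::Skipped
        }
    }
}

fn main() {
    assert_eq!(send_or_skip(Ok("activity".into())), Outcome::Sent);
    assert_eq!(send_or_skip(Err("actor not found".into())), Outcome::Skipped);
}
```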

@dessalines (Member) left a review comment:

Once you get tests passing, we can merge.

@phiresky (Collaborator, Author) commented Jul 19, 2024

CI passes now. I've re-removed the double run of federation tests because it's too much of a chore to make them reliably run twice without resetting CI.

@Nutomic still needs to approve I guess.

@Nutomic (Member) commented Jul 19, 2024

Good to see that the duration of tests and CI are back to normal. Only the int type needs to be fixed, then it can be merged. Great work @phiresky!

Edit: If we're running api tests only once like before, you can also revert the changes there.

@dessalines (Member):
After this merges, you want me to do an alpha release to test?

@phiresky (Collaborator, Author):
It seems I don't have merge permission anymore, so could one of you merge it please?

@dessalines dessalines merged commit a08642f into main Jul 21, 2024
2 checks passed
@dessalines (Member):
Done, I'll try to fix your perms also.

Nutomic pushed a commit that referenced this pull request Jul 23, 2024
* federation: parallel sending

* federation: some comments

* lint and set force_write true when a request fails

* inbox_urls return vec

* split inbox functions into separate file

* cleanup

* extract sending task code to separate file

* move federation concurrent config to config file

* off by one issue

* improve msg

* fix both permanent stopping of federation queues and multiple creation of the same federation queues

* fix after merge

* lint fix

* Update crates/federate/src/send.rs

Co-authored-by: dullbananas <dull.bananas0@gmail.com>

* comment about reverse ordering

* remove crashable, comment

* comment

* move comment

* run federation tests twice

* fix test run

* prettier

* fix config default

* upgrade rust to 1.78 to fix diesel cli

* fix clippy

* delay

* add debug to make localhost urls not valid in ap crate, add some debug logs

* federation tests: ensure server stop after test and random activity id

* ci fix

* add test to federate 100 events

* fix send 100 test

* different data every time so activities are distinguishable

* allow out of order receives in test

* lint

* comment about #4623 (comment)

* move sender for clarity, add comment

* move more things to members

* update test todo comment, use same env var as worker test but default to 1

* remove else below continue

* some more cleanup

* handle todo about smooth exit

* add federate inboxes collector tests

* lint

* actor max length

* don't reset fail count if activity skipped

* fix some comments

* reuse vars

* format

* Update .woodpecker.yml

* fix recheck time

* fix inboxes tests under fast mode

* format

* make i32 and ugly casts

* clippy

---------

Co-authored-by: dullbananas <dull.bananas0@gmail.com>
Nutomic pushed a commit to sunaurus/lemmy that referenced this pull request Sep 20, 2024
(same commit message as above)