
federation: parallel sending per instance #4623

Merged (60 commits) Jul 21, 2024
Conversation

@phiresky (Collaborator) commented Apr 13, 2024

Currently, with the implementation of the federation queue from #3605 that was enabled in 0.19, the federation is parallel across instances but sequential per receiving instance.

This means that the maximum throughput of activities is limited by the network latency between instances as well as the internal latency for processing a single activity.

There is an extensive discussion in #4529 (comment), though that issue itself is only about one sub-problem.

This PR changes the federation sending component to send activities in parallel, with a configurable maximum concurrency (e.g. 8). The implementation is more complex than I expected: we need to keep track of the last_successful_id (which must be the highest activity id below which every single activity has been successfully sent), and we need to keep track of failure counts without immediately jumping to hour-long retry delays when 8 concurrent sends fail simultaneously.

The implementation roughly works as follows:

  1. We have a main loop that waits for new activities and spawns send tasks for them when the conditions are right
  2. We create a multi-producer, single-consumer channel; each send task reports its result to the main loop through this channel:
    • If the send succeeded, the activity id is added to an in-memory priority queue; whenever the lowest ID in the queue equals the previous ID + 1, we increment the last_successful_id counter
    • If the send failed, the fail_count and last retry time are updated. We completely ignore the activity id here and always take the maximum (but not the sum) of the fail_count across all in-flight sends
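The success-tracking part of step 2 can be sketched roughly like this. This is a simplified, std-only illustration; the struct and method names are hypothetical and do not match the PR's actual code:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Tracks the highest activity id below which every send has succeeded.
/// (Hypothetical sketch; names are illustrative, not from the PR.)
struct SuccessTracker {
    last_successful_id: i64,
    // min-heap of ids that completed out of order
    pending: BinaryHeap<Reverse<i64>>,
}

impl SuccessTracker {
    fn new(last_successful_id: i64) -> Self {
        Self { last_successful_id, pending: BinaryHeap::new() }
    }

    /// Record a successful send and advance the counter as far as possible.
    fn report_success(&mut self, id: i64) {
        self.pending.push(Reverse(id));
        // pop ids while they are exactly the next expected one
        while let Some(&Reverse(lowest)) = self.pending.peek() {
            if lowest == self.last_successful_id + 1 {
                self.pending.pop();
                self.last_successful_id = lowest;
            } else {
                break;
            }
        }
    }
}

fn main() {
    let mut t = SuccessTracker::new(0);
    t.report_success(2); // out of order: counter must not advance yet
    assert_eq!(t.last_successful_id, 0);
    t.report_success(1); // fills the gap: counter advances past both
    assert_eq!(t.last_successful_id, 2);
    t.report_success(3);
    assert_eq!(t.last_successful_id, 3);
}
```

The min-heap makes the "lowest ID equals previous ID + 1" check cheap even when many sends complete out of order.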

In order for the results of this to still be correct, fixes need to be applied to make all activities commutative (as discussed above).

It should also be possible to make the concurrency kick in only when necessary, since for most instance-to-instance connections it is not needed; that would reduce the ordering issue. This is not implemented here though.

Currently, this PR fails the federation tests. I think this is both due to a bug somewhere as well as due to the ordering problem.

(10 review threads on crates/federate/src/worker.rs, resolved)
@phiresky force-pushed the federation-send-parallel branch from 936469a to 5e986ef on April 15, 2024
@phiresky (Collaborator, Author):
I've split the instance worker into three separate files:

  • worker.rs - main code
  • inboxes.rs - code that makes sure the community lists are up to date and calculates the inboxes
  • send.rs - a single send-retry-loop task that runs concurrently

@@ -0,0 +1,149 @@
use crate::util::LEMMY_TEST_FAST_FEDERATION;
@phiresky (Collaborator, Author):
As a note, I've not made any changes to this code, just moved it into a separate struct.

@phiresky (Collaborator, Author):
Update: the code now has minor (non-functional) modifications to allow testing it.

@Nutomic (Member) commented Apr 16, 2024

As mentioned in the dev chat it would be very useful to have some unit tests here, to ensure it works as expected.

@phiresky (Collaborator, Author):
As mentioned in the dev chat it would be very useful to have some unit tests here, to ensure it works as expected.

Any ideas on how best to do that? The only clean way I can think of would be abstracting all DB and HTTP interactions (I guess it would be like 5-10 functions?) into a trait so the whole federate crate code is pure, and then mocking the DB interactions and HTTP interactions with data from memory.
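The trait-based abstraction proposed here could look roughly like the following sketch. All names are purely illustrative assumptions, not the actual Lemmy API:

```rust
/// Hypothetical trait abstracting the DB/HTTP side effects of the
/// federate crate, so the queue logic can be tested against memory.
trait FederationIo {
    fn read_next_activities(&mut self, after_id: i64, limit: usize) -> Vec<(i64, String)>;
    fn send_activity(&mut self, inbox_url: &str, body: &str) -> Result<(), String>;
}

/// In-memory mock: activities come from a Vec, sends are recorded.
struct MemoryIo {
    activities: Vec<(i64, String)>,
    sent: Vec<(String, String)>,
}

impl FederationIo for MemoryIo {
    fn read_next_activities(&mut self, after_id: i64, limit: usize) -> Vec<(i64, String)> {
        self.activities
            .iter()
            .filter(|&&(id, _)| id > after_id)
            .take(limit)
            .cloned()
            .collect()
    }

    fn send_activity(&mut self, inbox_url: &str, body: &str) -> Result<(), String> {
        // record instead of doing a real HTTP request
        self.sent.push((inbox_url.to_string(), body.to_string()));
        Ok(())
    }
}

fn main() {
    let mut io = MemoryIo {
        activities: vec![(1, "a".into()), (2, "b".into())],
        sent: vec![],
    };
    let batch = io.read_next_activities(0, 10);
    assert_eq!(batch.len(), 2);
    for (_, body) in &batch {
        io.send_activity("https://example.com/inbox", body).unwrap();
    }
    assert_eq!(io.sent.len(), 2);
}
```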

@Nutomic (Member) commented Apr 17, 2024

Mocking would be too complicated and could introduce problems of its own. Look at how other tests are implemented, e.g. for db views. Basically write some test data to the db, then call functions and check that they behave as expected. You can start a local server with an inbox route to check that activities are received (with an instance like localhost:8123). For start_stop_federation_workers() there should be some way to check the contents of the workers variable, e.g. by moving it into a struct and moving the function into an impl block.

@dessalines dessalines added this to the 0.19.5 milestone Apr 29, 2024
if fail_count > self.state.fail_count {
// override fail count - if multiple activities are currently sending this value may get
// conflicting info but that's fine
self.state.fail_count = fail_count;
@Nutomic (Member) commented Jul 16, 2024
self.state.fail_count += fail_count

Though I guess that won't work because the same activity can report multiple failures. Bit awkward to handle...

@phiresky (Collaborator, Author):
This needs to be this way, all alternatives would be worse. The reason is that if 10 simultaneous requests fail within a 1s period, we don't want the next retry to be exponentially 2**10 s later.

@Nutomic (Member):
Yep, I thought about it more and couldn't think of any better solution. It's just that the comment makes it sound like an ugly hack, so maybe it can be written differently.
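The max-not-sum merge discussed above can be sketched as follows. This is a hypothetical simplification (function names and the exact backoff formula are assumptions, not the PR's code), but it shows why taking the maximum keeps the retry delay sane when many concurrent sends fail at once:

```rust
use std::time::Duration;

/// Exponential backoff: 2^fail_count seconds, capped at one hour.
/// (Assumed formula for illustration.)
fn retry_delay(fail_count: u32) -> Duration {
    let secs = 2u64.saturating_pow(fail_count).min(60 * 60);
    Duration::from_secs(secs)
}

/// Merge a fail_count reported by one send task into the shared state:
/// take the max, never the sum, so concurrent failures don't compound.
fn merge_fail_count(current: u32, reported: u32) -> u32 {
    current.max(reported)
}

fn main() {
    // 10 sends fail at roughly the same time, each reporting fail_count = 1
    let mut state = 0;
    for _ in 0..10 {
        state = merge_fail_count(state, 1);
    }
    assert_eq!(state, 1);
    assert_eq!(retry_delay(state), Duration::from_secs(2));
    // summing instead would have given fail_count = 10,
    // i.e. a 2^10 s (~17 minute) delay after a single transient outage
}
```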

published: None,
was_skipped: true,
}))
.ok();
@Nutomic (Member):
Not sure why you're sending success in the error case.

@phiresky (Collaborator, Author):
An error in this location means there is some deeper internal issue with the activity, for example the actor can't be loaded or similar. These issues are probably not solvable by retrying and would cause the federation for this instance to be permanently stuck in a retry loop. So we log the error and skip the activity.

(added this as a comment in the code as well)
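The skip behavior described above can be sketched as follows (a hypothetical simplification with invented names, not the actual code): preparation errors are treated as permanent and reported as a skipped "success" so the queue can move on.

```rust
/// Outcome of one activity, as reported back to the main loop.
/// `Skipped` corresponds to the success notification with was_skipped = true.
#[derive(Debug, PartialEq)]
enum Outcome {
    Sent,
    Skipped,
}

/// Errors while *preparing* an activity (e.g. its actor cannot be loaded)
/// will not be fixed by retrying, so skip instead of retrying forever.
fn send_or_skip(prepare: Result<String, String>) -> Outcome {
    match prepare {
        Ok(_body) => {
            // ...the real HTTP send with retries would happen here...
            Outcome::Sent
        }
        Err(e) => {
            // log and skip so this instance's queue is not stuck permanently
            eprintln!("skipping activity: {e}");
            Outcome::Skipped
        }
    }
}

fn main() {
    assert_eq!(send_or_skip(Ok("activity".into())), Outcome::Sent);
    assert_eq!(send_or_skip(Err("actor not found".into())), Outcome::Skipped);
}
```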

@dessalines (Member) left a review comment:

Once you get tests passing, we can merge.

@phiresky (Collaborator, Author) commented Jul 19, 2024

CI passes now. I've re-removed the double run of federation tests because it's too much of a chore to make them reliably run twice without resetting CI.

@Nutomic still needs to approve I guess.

@Nutomic (Member) commented Jul 19, 2024

Good to see that the duration of tests and CI are back to normal. Only the int type needs to be fixed, then it can be merged. Great work @phiresky!

Edit: If we're running api tests only once like before, you can also revert the changes there.

@dessalines (Member):
After this merges, you want me to do an alpha release to test?

@phiresky (Collaborator, Author):
It seems I don't have merge permission anymore, so could one of you merge it please?

@dessalines dessalines merged commit a08642f into main Jul 21, 2024
2 checks passed
@dessalines (Member):
Done, I'll try to fix your perms also.

Nutomic pushed a commit that referenced this pull request Jul 23, 2024
* federation: parallel sending

* federation: some comments

* lint and set force_write true when a request fails

* inbox_urls return vec

* split inbox functions into separate file

* cleanup

* extract sending task code to separate file

* move federation concurrent config to config file

* off by one issue

* improve msg

* fix both permanent stopping of federation queues and multiple creation of the same federation queues

* fix after merge

* lint fix

* Update crates/federate/src/send.rs

Co-authored-by: dullbananas <dull.bananas0@gmail.com>

* comment about reverse ordering

* remove crashable, comment

* comment

* move comment

* run federation tests twice

* fix test run

* prettier

* fix config default

* upgrade rust to 1.78 to fix diesel cli

* fix clippy

* delay

* add debug to make localhost urls not valid in ap crate, add some debug logs

* federation tests: ensure server stop after test and random activity id

* ci fix

* add test to federate 100 events

* fix send 100 test

* different data every time so activities are distinguishable

* allow out of order receives in test

* lint

* comment about #4623 (comment)

* move sender for clarity, add comment

* move more things to members

* update test todo comment, use same env var as worker test but default to 1

* remove else below continue

* some more cleanup

* handle todo about smooth exit

* add federate inboxes collector tests

* lint

* actor max length

* don't reset fail count if activity skipped

* fix some comments

* reuse vars

* format

* Update .woodpecker.yml

* fix recheck time

* fix inboxes tests under fast mode

* format

* make i32 and ugly casts

* clippy

---------

Co-authored-by: dullbananas <dull.bananas0@gmail.com>
Nutomic pushed a commit to sunaurus/lemmy that referenced this pull request Sep 20, 2024
(same commit message as above)