-
-
Notifications
You must be signed in to change notification settings - Fork 886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Some Lemmy instances sending every activity twice #4609
Comments
This seems like a bug in Lemmy so youre definitely right to report it here. Doesnt matter if you observe it from another platform. It looks like all requests go to @phiresky Do you have any idea about this? All I can think of is that theres more than one federation worker active for a single target instance, but the code looks correct. Edit: It could help to write unit tests for the outgoing federation queue, to ensure that it starts and stops workers as expected, and sends activities correctly. |
I know that lemmy.world definitely had a misconfiguration for a bit where they had multiple instances of Lemmy running without the appropriate -federation-worker flags set (since those where not required before 0.19). If this issue doesn't happen always and for every instance then it's likely a misconfiguration in those instances. That doesn't mean we shouldn't do something to improve the situation. For example we could add a heuristic to the federation workers that detects and logs an error when the federation_queue_state table has been updated by someone else which would mean overlapping federation workers running. The best solution would be a horizontal scaling method that does not require manual configuration of federation instance ranges but I can't think of a way to do that without huge complexity. |
The bug also seems to happen for instances that are running only a single Lemmy backend, eg mander.xyz. |
Have done some more testing. Test 1Sent a Follow from my server to a community on mander.xyz using a Ruby script. Script in here #!/usr/bin/env ruby
require 'securerandom'
require 'openssl'
require 'http' # needs 'gem install http'
# change these
my_site = "lemmon.website"
my_dir = "u"
my_actor = "freamon"
my_private_key_location = "/home/andrew/keys/#{my_actor}/private.pem"
my_public_key_location = "https://#{my_site}/#{my_dir}/#{my_actor}"
target_site = "mander.xyz"
target_dir = "c"
target_actor = "printforgood"
uuid = SecureRandom.uuid
json = <<~EOT
{
"@context": [
"https://w3id.org/security/v1",
"https://www.w3.org/ns/activitystreams"
],
"id":"https://#{my_site}/follows/#{uuid}",
"type":"Follow",
"actor":"https://#{my_site}/#{my_dir}/#{my_actor}",
"object":"https://#{target_site}/#{target_dir}/#{target_actor}"
}
EOT
date = Time.now.utc.httpdate
digest = Base64.strict_encode64(OpenSSL::Digest.digest('SHA256', json))
digest.prepend("SHA-256=")
keypair = OpenSSL::PKey::RSA.new(File.read("#{my_private_key_location}"))
signed_string = "(request-target): post /#{target_dir}/#{target_actor}/inbox\ncontent-type: application/activity+json\ndate: #{date}\ndigest: #{digest}\nhost: #{target_site}"
signature = Base64.strict_encode64(keypair.sign(OpenSSL::Digest::SHA256.new, signed_string))
header = 'keyId="' + my_public_key_location + '#main-key",algorithm="rsa-sha256",headers="(request-target) content-type date digest host",signature="' + signature + '"'
response = HTTP.headers({ 'Host': target_site, 'User-Agent': 'ActivityPub +https://' + my_site, 'Content-Type': 'application/activity+json', 'Digest': digest, 'Date': date, 'Signature': header })
.post("https://#{target_site}/#{target_dir}/#{target_actor}/inbox", body: json)
p response # header
p response.to_s # body (if any) Received 3 duplicate Accepts in response. ip.address - - [09/Apr/2024:18:30:27 +0000] "POST /inbox HTTP/1.1" 200 0 "-" "Lemmy/0.19.3; +https://mander.xyz"
ip.address - - [09/Apr/2024:18:30:27 +0000] "POST /inbox HTTP/1.1" 200 0 "-" "Lemmy/0.19.3; +https://mander.xyz"
ip.address - - [09/Apr/2024:18:30:27 +0000] "POST /inbox HTTP/1.1" 200 0 "-" "Lemmy/0.19.3; +https://mander.xyz" My backend server just gives each request a UUID and dumps the bodies into a file: -rw-r--r-- 1 andrew andrew 475 Apr 9 18:30 activity_from_printforgood_737b341e-54f4-42fe-9bdb-ae89c7f89222.json
-rw-r--r-- 1 andrew andrew 475 Apr 9 18:30 activity_from_printforgood_38b15127-ba42-4b4a-b552-946a6707d2e9.json
-rw-r--r-- 1 andrew andrew 475 Apr 9 18:30 activity_from_printforgood_f8a6f4ed-376c-4ade-9abe-87d9035435fc.json Each file contains the same Accept JSON. JSON in here{
"@context": [
"https://join-lemmy.org/context.json",
"https://www.w3.org/ns/activitystreams"
],
"actor": "https://mander.xyz/c/printforgood",
"to": [
"https://lemmon.website/u/freamon"
],
"object": {
"actor": "https://lemmon.website/u/freamon",
"to": null,
"object": "https://mander.xyz/c/printforgood",
"type": "Follow",
"id": "https://lemmon.website/follows/194d816b-1599-4587-95ae-d274c1d4209c"
},
"type": "Accept",
"id": "https://mander.xyz/activities/accept/b0f94ef2-f411-4b09-8548-9d81b98b01c7"
} Test 2Sent a Follow from a Lemmy instance running inside a VM and tunnelled through to via ngrok.com, to a community on mander.xyz, using the lemmy-ui Received only 1 Accept in response: Test 3Still on the VM. Extracted the private key for my lemmy user, and sent a Follow to a different community on mander.xyz (that was conveniently stuck at 'Pending'), using the Ruby script from Test 1. Received 2 Accepts in response, one of which lemmy dutifully rejected Test 4Used the lemmy-ui again to subscribe to a different community on mander.xyz. Also received 2 Accepts in response, one of which lemmy rejected Test 5Same as test 2-5, but for communities on lemmy.ml instead. Follows were sent using the lemmy-ui, and the Ruby script, but only 1 Accept was ever sent back Test 6Sent a Follow from my server to a community on lemmy.ml using the Ruby script. Received only 1 Accept back in response ThoughtsI've considered the possibility that me using lemmon.website to experiment with ActivityPub is a factor, and Lemmy instances have more than 1 idea of what the site is, and are sending activity for each version. The argument against this, is that I've interacted way more with lemmy.ml than mander.xyz, and there's no duplications coming from lemmy.ml. Also, when lemmy.world was on 0.18.5, it likewise never sent me dupes. I even tried to insert another record for 'lemmon.website' into the lemmy DB on the VM I was using, and - unsurprisingly - it rejected it for not being unique. Likewise, the Ruby script I've used - it works without issue for lemmy.ml (and Mastodon instances, and Pixelfed instances, etc), but me using it in Test 3 seemed to be the trigger that tripped mander.xyz up into starting sending me activity twice. I haven't been able to form anything conclusive from this, but maybe something will stand out to people more familiar with Lemmy's code. |
I've seen duplicate activities on Lemmy <-> Lemmy before, it seems unlikely to be related to the software on the receiving side. |
It would be good to get a reproduction with an instance that someone actually controls because otherwise it's hard to say whether there is a misconfiguration or not and to get the relevant logs / db entries (which are on the sending side) |
It might be relevant since in general every actor has their own inbox, and the merging of these inboxes into a single site-wide inbox might not work correctly depending on how other end presents their inboxes. If lemmy sees inboxes as separate then it may send events to multiple inboxes |
I also thought about that, but based on the logs posted by @freamon, all activities were sent to I messaged the admin of mander.xyz to help investigate, but havent received a response so far. |
Some good news / bad news for you - lemmy.ml is duplicating activities. I subscribed to asklemmy@lemmy.ml from a test Lemmy instance tunnelled through via ngrok, so I could screenshot the 'inspect' window - you can see a pattern of 200 OK, 400 Bad Request for every activity. I've included two images to demonstrate that they are the same activity. (I don't imagine it's a config issue for you guys, but LW changed their config to fix the problem when they were doing it) |
Note that the duplication has returned: https://lemmy.ml/comment/11010288 Anyone know if lemmy.world is testing that config change again or something? |
@freamon brought this up in their matrix chat the other day and the answer was nothing changed on lemmy.world's end |
@jlandahl - LW had to reset a container, so you shouldn't be getting any more duplicate activity from them. It was the same with lemmy.ml incidentally - a few comments up I was saying that they were duplicating too, but then they reset as part of testing an upgrade to 0.19.4, and that stopped it. As a Issue that seems to start randomly, and is stopped by 'turning it off and on again', I suspect this will be a hard problem to solve. I noticed when I was recording the video the other day (of which this is an excerpt), that there was a recurring pattern: 30 seconds of nothing, then a flood of activity, then back to 30 seconds of nothing, etc. I imagine this is by design, and my instance has to wait its turn to get a bunch in one, rather than get activities as they are created. It made me wonder whether 'version 1' of an activity was being generated but not actually sent, just queued up and stuck somewhere. Then Lemmy was generating 'version 2' because it thought the first version wasn't delivered successfully, and then both versions were being sent at the same time. |
Hello, I'm part of the Lemmy.World infra team. I'd be happy to help finding the root cause of this issue, but it's currently not clear what information we should be looking for? We're currently running with the default log level for our federation container and there are no logs even mentioning lemmon.website, except for some unrelated html error pages from a kbin instance. |
To fix this we have to narrow down what exactly is causing the duplicate activities. My understanding is that the bug doesnt happen with the default ansible configuration, and affected instances have some different config. Some things worth trying on a test instance:
|
We've previously been using multiple federation containers and now we are using a single federation container.
It seems to be random. Restarting fixes it according to @freamon's feedback after we restart the container.
This is probably something @freamon can answer. |
Yeah - restarting seems to fix it (at least from my perspective). It's for all activities - anything a remote community might Announce, and - for my one local community - anything a user might send. (I just verified that last part by upvoting a random post on endlesstalk.org. You can tell how broke that instance is anyway by how out-of-sync it is with votes generally (e.g. compare https://endlesstalk.org/post/34649989 with the fedi-link), but the upvote I gave for a post on !tails@lemmon.website was sent to me a sweet 7 times. There's some info here about his container setup, with a comment that also suggests that a restart fixes it.) |
I added some extra logging in #4726 to figure out where the problem is coming from. Once that is merged and released, run Lemmy with |
We also noticed some very strange behavior where it seems that our outbound federation progress for some instances seems to be jumping between two very different states: federation log
This can also be observed when watching https://phiresky.github.io/lemmy-federation-state/site?domain=lemmy.world or dashboards from grafana.lem.rocks. |
…myNet#4726) * Extra logging to debug duplicate activities (ref LemmyNet#4609) * Fix logging for api tests * fmt This is cherry-picked from 6b46a70 LemmyNet@6b46a70 LemmyNet#4726
…myNet#4726) * Extra logging to debug duplicate activities (ref LemmyNet#4609) * Fix logging for api tests * fmt This is cherry-picked from 6b46a70 LemmyNet@6b46a70 LemmyNet#4726
@MrKaplan-lw Yes that definitely looks like there is more than one worker active. So far its not clear where exactly in the code this is caused, but maybe my changes in #4733 and further testing/refactoring will help. Does this jumping between different states happen only for a single remote instance at a time, or are multiple/all of them affected at once? |
I think that is normal as Lemmy processes a batch of outgoing activities, then goes to sleep for 30s, new activities get generated and then sent out after the sleep is over. |
From the most recent occurrence for us, we had a short DB outage, which resulted in messages like
being logged, and afterwards when a new worker was started we started sending duplicate activities. |
That particular logic was removed here. So if were lucky the problem is already fixed in 0.19.4-rc.3. |
That error is actually an example why that loop you removed is necessary, because the init_and_loop function exits the loop and returns an error when it errors unexpectedly. So when this kind of error happens, the worker exits the loop, and the loop with "task exited, restarting:" restarts the task. At least, that's how it was supposed to work. I don't know why it's sending activities twice and this is a good hint, but logically your change isn't right / we need a better explanation |
I think I know what the issue is! When there is some internal error (e.g. database pool full) during a query, that query throws. If this throw happens within an InstanceWorker, the worker exits its loop, that is logged and is restarted by the CancellableTask it is wrapped in. If this throw happens within the SendManager, the SendManager exits and is restarted by the CancellableTask it is wrapped in. The SendManager then starts creating all the InstanceWorkers for each instance. The problem is that when the SendManager dies due to this, it does not cleanup all the InstanceWorkers it has created. So the InstanceWorkers of the old SendManager are still running (orphaned) and are never cancelled / cleaned up. So either the SendManager needs to cleanup it's dependents on issues, or it needs to never die. This CancellableTask restart code was wrongly removed here: https://github.com/LemmyNet/lemmy/pull/4733/files/451d1fd24fdabda6556b044fc3e601ccea05a6e3#diff-899cee276e9713c57bb5ff473ec9bd72df7831111780c0d2de7a543544667b7e and needs to be readded in some form - without it neither SendManager nor InstanceWorker will restart when they fail which will cause federation to stop working randomly |
…myNet#4726) * Extra logging to debug duplicate activities (ref LemmyNet#4609) * Fix logging for api tests * fmt This is cherry-picked from 6b46a70 LemmyNet@6b46a70 LemmyNet#4726
…myNet#4726) * Extra logging to debug duplicate activities (ref LemmyNet#4609) * Fix logging for api tests * fmt This is cherry-picked from 6b46a70 LemmyNet@6b46a70 LemmyNet#4726
Requirements
Summary
This is discussed this post on on Lemmy - you can see me protesting that it should be me filing the report, but I've slept since then, and Nutomic said please (?!). There's a question of whether it's due to instance configuration, Lemmy itself, or some combination of the two (it only stated with 0.19.x, but not all instances do it, and there isn't always a correlation between it happening and the amount of containers an instance is using). I was reluctant to make this Issue because there's information you'll need about instance config that I'm not going to be able to give to you, because I'm not running Lemmy, so this is all second-hand info and observation.
The post on Lemmy has a video. The background to it is that I wanted to use ActivityPub for an Automod on lemmy.world/c/starwarsmemes, so I sent a Follow request to that community (using a Ruby script). When anything happens lemmy.world sends an Announce activity to my server, and the Automod can, amongst other things, send a notification to my phone when a new post is made. In the video, the red terminal is showing a
tail -f
on the nginx log; the green terminal is showing the.object.id
,.object.type
, and.object.actor
from the JSON contained inside each request. starwarsmemes is pretty quiet at the minute, so I was able to monitor the results of activity that I myself was generating as a user on lemmy.world. Hopefully, the video makes it clear that duplicate activity is being sent every time.I've seen similar behaviour before: post on lemmy.ca and post on endlesstalk.org. The background to those posts is that I was running a fake lemmy community (best viewed from here), that was using a basic ActivityPub server to federate comments and votes, and I could tell from that that those instances (among others) were duplicating everything.
I saw it first with lemmy.world when they upgraded to 0.19.3, and my dumb AutoMod pinged my phone 4 times to tell me about the same new post (I mentioned it to the admins in this comment).
Outside of my server, you can also see duplicate activity trip other recipients up. One place is Lemmy modlogs. For example, this modlog shows me manually restoring a comment that the AutoMod removed. I did this with my own fair hand, using the lemmy-ui, so I know that whilst I didn't do the same thing twice, activity was sent twice to endlesstalk.org about it. Another is PieFed. They've tightened up their ability to filter out duplicates now, but - if they've not been deleted since - you can still see what happened here and here, as well as here and here (the originals weren't updated, btw). In both circumstances (modlogs and piefed), I'm guessing it's hit-and-miss because it depends on how much time there was between the original activity and its duplicate.
It's entirely possible that is down to instance mis-configuration (in which, quite a few instances are making the same mistake, which is a problem in itself). It's only slightly vague suggestions from admins that the software itself is at fault. That's why I'd prefer it was them making this Issue instead of me, but I've been requested to do it, so here it is.
Steps to Reproduce
Technical Details
(I can't you much here, for the aforementioned I'm-not-running-Lemmy reasons)
Version
0.19.3
Lemmy Instance URL
lemmy.world
The text was updated successfully, but these errors were encountered: