
Presence over federation should be less aggressive when trying to send transactions #11383

Closed
zaibon opened this issue Nov 18, 2021 · 11 comments

Comments

@zaibon

zaibon commented Nov 18, 2021

Description

Starting a couple of weeks ago, my federation sender worker started to completely flood my LAN, to the point where no traffic can flow inside it anymore.

Steps to reproduce

I'm running my homeserver with workers. I have:

  • main synapse process
  • 2 generic workers
  • 1 federation sender worker (a config sketch follows this list)
  • 1 appservice worker
  • 1 media worker
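
For reference, the federation sender runs from its own worker config file, roughly along these lines (the worker name, port and paths below are illustrative placeholders rather than my exact config):

# federation_sender.yaml (sketch)
worker_app: synapse.app.federation_sender
worker_name: federation_sender1
worker_replication_host: synapse          # hostname of the main synapse container
worker_replication_http_port: 9093
worker_log_config: /data/federation_sender.log.config

# and in homeserver.yaml, federation sending is delegated away from the main process:
send_federation: false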

Everything goes well until I start the federation sender worker. After a few minutes my LAN is completely unusable: no connection can be made to remote servers or between devices in the LAN. As soon as I stop the federation worker, the storm stops and everything comes back to normal.

Here is a snippet of the worker logs:
sender.log

Version information

  • Homeserver: synapse.yarnoush.be

If not matrix.org:

  • Version: server_version 1.47.0, python_version 3.8.12

  • Install method: Everything runs in docker containers

  • Platform: docker running on linux (I'm using arch btw ;-) )
    All synapse processes run on the same host.

I do have monitoring in place, so I can also provide grafana screenshots if that would be useful.

@DMRobertson
Contributor

Do you have presence enabled? I think that's known to be particularly heavyweight on federation traffic.

Starting a couple of weeks ago,

Are you able to narrow down when this started, e.g. via a git bisect?

@DMRobertson
Contributor

Reminds me a little bit of #11049, but I'm not convinced that's the same scenario (and besides, that should be fixed in 1.47)

@DMRobertson
Contributor

Some more thoughts:

  • there are known performance problems when DNS resolution is slow. A good way to mitigate this is to have a fast, local DNS cache (see the example after this list). I would think most OSes provide this---certainly systemd-resolved does. But I'm not sure how this will work under docker---it will probably depend on which particular image you're running.
  • is it possible your network bandwidth is saturated?
  • Is there any evidence of a hardware problem, e.g. an overwhelmed router?
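
For example (just a sketch, assuming a systemd-based host), you can check whether systemd-resolved is running and actually answering from its cache with:

systemctl status systemd-resolved
resolvectl statistics    # reports the current cache size and cache hit/miss counters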

@DMRobertson DMRobertson added the X-Needs-Info (This issue is blocked awaiting information from the reporter) label on Nov 18, 2021
@reivilibre
Contributor

reivilibre commented Nov 18, 2021

A little local experimentation seems to show that Docker is prone to bypassing your host's DNS cache.
On my system, I'm using systemd-resolved as a resolver on the host (this service provides DNS caching, if you're not familiar).

My /etc/resolv.conf looks like this on the host:

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search .

Doing a little bit of research, it seems that Docker basically copies /etc/resolv.conf from the host to the guest (container), but if it finds a localhost/loopback DNS resolver, it replaces it with some other DNS resolver (in my case: it seems to know what resolver I have set in NetworkManager and systemd-resolved, but I've found reports of it defaulting to Google's DNS servers 8.8.8.8 and 8.8.4.4).

Trying in a Docker image I see:

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 11.42.36.78  # <--- this is my DNS server on the host
search .

In other words, it seems like Docker will bypass the DNS cache I have running on 127.0.0.53 and instead go straight to an external resolver.
I don't know if there are workarounds — I thought Docker provided its own DNS to containers so they could resolve each other and use DNS from the host, but maybe this is something you have to use docker networks for or otherwise enable? (would need to research, sorry).
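
One possible workaround (untested by me, so treat it as a sketch): point the Docker daemon, or an individual container, at a caching resolver listening on an address that containers can actually reach (i.e. not 127.0.0.1 on the host). The address below is just a placeholder for wherever your cache listens. In /etc/docker/daemon.json, which applies to all containers started by the daemon:

{
  "dns": ["192.168.1.53"]
}

Or per container:

docker run --dns 192.168.1.53 ...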

Another solution might be for the Synapse docker container to come with a built-in DNS cache.. :/

@zaibon
Author

zaibon commented Nov 18, 2021

Do you have presence enabled? I think that's known to be particularly heavyweight on federation traffic.

I do, but I've had the same config for months and only noticed this recently.

Are you able to narrow down when this started, e.g. via a git bisect?

Didn't try that yet. I did try rolling back to 1.46 and 1.45 and saw the same behavior.

there are known performance problems when DNS resolution is slow

I had some doubts about DNS too; we might be onto something. I will try to set up a local DNS cache and see if that improves things.

is it possible your network bandwidth is saturated?

Very unlikely. I have fiber and everything is cable connected.

@zaibon
Author

zaibon commented Nov 18, 2021

I think we can exclude DNS. I have set up my federation worker outside of docker and run it directly on my host with systemd-resolved (so with a DNS cache), and I have observed exactly the same behavior.
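
(For reference, running it "directly on my host" means launching the worker with the synapse module itself, roughly like this; the paths are placeholders, not my exact ones:)

python -m synapse.app.federation_sender \
    --config-path /etc/synapse/homeserver.yaml \
    --config-path /etc/synapse/federation_sender.yaml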

@zaibon
Author

zaibon commented Nov 19, 2021

@DMRobertson I've made some more experiments...
I have whitelisted only matrix.org for federation to see how this would impact my network. So far so good: I don't see anything going wrong, which I guess makes sense.
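
(Concretely, that means something like this in homeserver.yaml; a minimal sketch showing only the relevant key:)

federation_domain_whitelist:
  - matrix.org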

So this would mean my problem comes from the number of homeservers I'm federated with.
I can understand that after a long period with my federation worker turned off there would be a lot to catch up on, and that could overwhelm my network. But I've already tried letting it run for a full night, and in the morning my network was still dying. So I'm wondering: shouldn't synapse try to be less aggressive in the way it sends federation transactions?
Is there any configuration I can try to reduce the amount of traffic the federation worker sends at a time?

@zaibon
Author

zaibon commented Nov 23, 2021

Seems like #5373 could also be related to my problem.

@zaibon
Author

zaibon commented Nov 23, 2021

I have some more findings.
I restarted my federation worker and ran it for 2 days with presence disabled. Things ran smoothly for the 2 full days.

Today I tried to re-enable presence, and things just exploded right away. So I think it is safe to say that the problem comes from there.
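
(For reference, presence is toggled in homeserver.yaml; a minimal sketch of the relevant setting:)

presence:
  enabled: false    # set back to true to re-enable presence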

@zaibon zaibon changed the title from "Federation is flooding my network" to "Presence over federation should be less aggressive when trying to send transactions" on Nov 23, 2021
@DMRobertson
Contributor

Sounds like it falls under the umbrella of #9478.

It's interesting that you mention this has only got worse recently. (But perhaps you only recently joined a room that's federated across multiple homeservers?)

@DMRobertson DMRobertson added the A-Presence label and removed the X-Needs-Info (This issue is blocked awaiting information from the reporter) label on Nov 24, 2021
@zaibon
Author

zaibon commented Nov 24, 2021

But perhaps you only recently joined a room that's federated across multiple homeservers?

Could be, I can't tell for sure.

I guess I can close this issue, since #9478 already tracks all the work to be done here.

Thanks for the help @DMRobertson and @reivilibre

@zaibon zaibon closed this as completed Nov 24, 2021