Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

disable federation if sqlite is used #2917

Open
uhoreg opened this issue Feb 27, 2018 · 23 comments
Open

disable federation if sqlite is used #2917

uhoreg opened this issue Feb 27, 2018 · 23 comments
Labels
A-Performance Performance, both client-facing and admin-facing O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@uhoreg
Copy link
Member

uhoreg commented Feb 27, 2018

unless the admin set a yes_i_want_my_server_to_catch_on_fire option. This preserves the ability to get something up and running for testing, but reduce the surprise of peoples' servers melting when joining a large federated room like HQ. Of course, there will need to be a big warning somewhere so that people know that federation is disabled.

@neilisfragile neilisfragile added enhancement z-p2 (Deprecated Label) A-Docs things relating to the documentation labels Feb 28, 2018
@neilisfragile
Copy link
Contributor

Agreed on need for big warning - we'll be revamping our docs and onboarding rsn so I would see this as a good thing to include.

(Not really docs, but tagging since it would need clear doc'ing)

@ShellCode33
Copy link

This feature would be great even if using postgresql ! I love Synapse and the Matrix environment, but I don't have a big server and being able to disable federation just in order to be able to chat with my friends on my private homeserver would be wonderful ! Synapse is regularly killed by the OOM Killer because of its RAM consumption and having to restart it every two or three days is a real PITA...

@ara4n
Copy link
Member

ara4n commented Feb 28, 2018

disabling federation in general landed on develop a few weeks ago and will be in synapse 0.27

@krombel
Copy link
Contributor

krombel commented Mar 13, 2018

Just as a reference: #2619

// EDIT as question got answered in the meantime

@ara4n ara4n removed the A-Docs things relating to the documentation label Apr 5, 2018
@ara4n
Copy link
Member

ara4n commented Apr 5, 2018

see also #2889

@ara4n ara4n added p1 and removed z-p2 (Deprecated Label) labels Apr 9, 2018
@ara4n
Copy link
Member

ara4n commented Apr 9, 2018

bumping priority as more and more people are getting a horrible experience due to this :(

@neilisfragile
Copy link
Contributor

The problem with disabling/refusing to federate large rooms is that we end up breaking sytest and existing deployments. So conclusion is that we will

  • Investigate sqlite regressions to see if anything obvious exists
  • Add a health warning logs on synctl start up
  • Make set up guide default to postgres and extract installation notes into it's own page

@Valodim
Copy link
Contributor

Valodim commented Dec 18, 2018

So. What's the actual point of even supporting sqlite? It's not testing - there are working tests with postgres. It's not for trying out matrix - no need to run a homeserver for that. If it's for "running small instances" - if running a postgres instance is the relevant hurdle for some sysadmin on whether to run a synapse homeserver or not, it is frankly a question of time until that goes sideways and they'll walk away burnt. Is there a good reason to keep sqlite support around?

@neilisfragile
Copy link
Contributor

It's mainly there to help people get going quickly with a view to upgrading to postgres as soon as their usage gets more serious.

Having postgres in CI definitely reduces the case for retaining sqlite support and supporting two db engines doesn't come for free.

@neilisfragile
Copy link
Contributor

To summarise.
Disable federation by default for sqlite, allow a i-know-this-is-slow flag in config, ensure that the reason that federation fails is clear to the admin.

@Half-Shot
Copy link
Collaborator

This needs to happen pre 1.0, and ideally now-ish as people get excited for 1.0 and start installing their homeservers.

@richvdh
Copy link
Member

richvdh commented Apr 24, 2019

I am not sure I agree that it should be pre-1.0, for the reasons set out at #5078 (review).

d3m3vilurr added a commit to d3m3vilurr/matrix-dockers that referenced this issue Jun 26, 2019
this option is stressful & reduce server performance.

ref:
- matrix-org/synapse#2619
- matrix-org/synapse#2917
@richvdh
Copy link
Member

richvdh commented Feb 12, 2020

My current thinking on this is:

  • there should be a configuration option federation_enabled, which is set to false in the generated config (I'd love it to be false by default, but that will break everybody's existing setup)

  • synapse ignores attempts to enable federation listeners (and make outbound federation connections) if federation_enabled is set to false. (It should instead warn. This approach means that federation can be temporarily disabled without editing half the config file).

  • when you enable federation_enabled, we can do some sanity checks including refusing to start if the database engine is sqlite and you haven't set i_want_my_server_to_catch_on_fire: true.

@richvdh
Copy link
Member

richvdh commented Jul 30, 2020

there was some discussion about whether this should apply to existing servers, or not.

On the one hand, it does seem quite harsh to break existing, federating setups. On the other hand: how do we differentiate between:

  • "existing, federating setup"
  • "existing setup which has not yet done any federation"

Doing so is tricky, and I feel pretty strongly that we want to catch the latter class, particularly given that the recommended installation procedure is to get things working locally first and then worry about federation.

I think I'm more in favour of the nuclear option of just doing it. We can always tell people about i_want_my_server_to_catch_on_fire in UPGRADE.rst.

@richvdh
Copy link
Member

richvdh commented Jul 30, 2020

I also wonder how important this is now that INSTALL.md has been clarified thanks to https://github.com/matrix-org/synapse/pull/7899/files#diff-7d442b7eb49f5fc377f51e74b291cfc1R32.

@aaronraimist
Copy link
Contributor

aaronraimist commented Jul 31, 2020

I still believe this is very important for several reasons.

  1. Admins of existing installs won't notice the change to the documentation but they will notice federation not working
  2. Many people use random old tutorials from around the internet rather than using the official install documentation
  3. Even if they do read the official documentation they might miss it, think it doesn't apply to them ("I'm sure my server will have a light workload"), or just not feel like setting up Postgres right now

If you want to lessen the impact on existing installs, the change could happen in a couple of phases. The first phase could be a mention in UPGRADE.rst that federation will be disabled for SQLite installs in an upcoming version and log a bunch of warnings/errors each time Synapse starts. Then a few versions later you actually disable it. This issue has been open for years so another couple of months delay in actually disabling it isn't a big deal.

@aaronraimist
Copy link
Contributor

aaronraimist commented Jul 31, 2020

Also fixing #2317 / #4270 would be a good first step.

@richvdh
Copy link
Member

richvdh commented Aug 17, 2020

it might (also) be worth investigating if fixing #8105 makes this less of a problem.

@MadLittleMods MadLittleMods added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Jul 7, 2021
@MadLittleMods
Copy link
Contributor

Related to #6401

@thestinger
Copy link

thestinger commented Jul 9, 2021

At the moment, the issue is not that SQLite is slow but that Synapse is using SQLite incorrectly. It's not simply a performance issue with SQLite and it won't be fixed simply by enabling WAL mode and allowing multiple threads. I do genuinely mean that it's using SQLite incorrectly rather than simply in a mode that has low performance. Until this is fixed, you should really not be supporting SQLite as a backend at all because it's broken even for small deployments right now.

First of all, WAL mode should be enabled unconditionally and should be the only supported way to use it. WAL mode supports N concurrent readers with 1 concurrent writer. Non-WAL mode only supports either N concurrent readers or 1 writer, so writes can indefinitely block reads which makes things drastically worse. However, you'll still get busy errors before the timeout elapses unless you fix your code to do transactions properly.

If you're using WAL mode, the database won't ever be busy for reads. It can still be busy for writes, but if you do things properly it will only report that the database is busy once the transaction attempting to perform the write has waited for the entire busy timeout.

You should configure the busy timeout appropriately so that the write transactions will wait for a potentially long time if a lot of writes are being done, particularly via long transactions.

A deferred write transaction will immediately fail as the database being busy when it enters write mode if a concurrent write occurred which could have invalidated earlier reads. You should completely avoid deferred write transactions to avoid this.

Any transaction where writes MAY be performed needs to be started with BEGIN IMMEDIATE rather than BEGIN in order to obtain the writer lock as part of starting the transaction. This makes sure that the transaction will be able to proceed. Using BEGIN implies using a deferred writer lock which means that the transaction needs to fail as the database being busy if a write occurred which may have modified the state of the database seen by a read in the transaction.

Once you fix these things by using WAL mode unconditionally and avoiding deferred write transactions, you should permit many concurrent connections rather than restricting it to 1 thread. You should really permit at least 64 concurrent connections to optimize for machines with an SSD. Using an HDD isn't viable once the working set of the database stops fitting in memory anyway.

@ShadowJonathan
Copy link
Contributor

There is one caveat to using WAL which i ran into recently, which is that the WAL file itself can grow unconditionally if readers are constantly accessing it. The SQLite documentation says;

A checkpoint operation takes content from the WAL file and transfers it back into the original database file. A checkpoint can run concurrently with readers, however the checkpoint must stop when it reaches a page in the WAL that is past the end mark of any current reader. The checkpoint has to stop at that point because otherwise it might overwrite part of the database file that the reader is actively using. The checkpoint remembers (in the wal-index) how far it got and will resume transferring content from the WAL to the database from where it left off on the next invocation.

Thus a long-running read transaction can prevent a checkpointer from making progress. But presumably every read transaction will eventually end and the checkpointer will be able to continue.

Whenever a write operation occurs, the writer checks how much progress the checkpointer has made, and if the entire WAL has been transferred into the database and synced and if no readers are making use of the WAL, then the writer will rewind the WAL back to the beginning and start putting new transactions at the beginning of the WAL. This mechanism prevents a WAL file from growing without bound.

I ran into this issue when adding support for a sqlite backend for conduit, SQLite is - by default - in PRAGMA wal_checkpoint=PASSIVE;, meaning that it will try to do as much as it can, but unless nothing is reading from the file, it'll grow without bounds. (see also the PR description of above link)

SQLite documentation on wal_checkpoint=PASSIVE;

Checkpoint as many frames as possible without waiting for any database readers or writers to finish. Sync the db file if all frames in the log are checkpointed.


In conclusion, I think a core reason why WAL might've not been turned in the past (and why I can see the hesitance to enable it now) would be because, under load, that file can grow without bounds. The solution to this is to rwlock all database access and perform noop with wal_checkpoint=TRUNCATE, which'll force the WAL file to shrink, but that introduces extra logic to the storage part of the codebase.

@richvdh
Copy link
Member

richvdh commented Sep 10, 2021

Honestly the only reason that we haven't investigated things like WAL (or indeed just using multiple threads) is that sqlite performance isn't really a priority for us - Postgres is unquestionably a more appropriate solution for deployment at scale (apart from anything else, it makes it much easier to separate your application servers from your database servers), so we've ended up focussing on that.

If any enthusiastic contributors would like to investigate supporting WALs and creating a PR to support them, that would be great, but it's unlikely the Synapse core team will find time to prioritise that sort of work in the near future.

@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. O-Occasional Affects or can be seen by some users regularly or most users rarely A-Performance Performance, both client-facing and admin-facing and removed z-p1 labels Sep 7, 2022
@asymmetric
Copy link
Contributor

asymmetric commented Sep 24, 2022

I do genuinely mean that it's using SQLite incorrectly rather than simply in a mode that has low performance.

@thestinger we have been running a very small (~20 users), federated instance and have never had any issues, performance- or otherwise.

Could you or someone elucidate how specifically this incorrectness would manifest in an SQLite setup?

Because given our experience, trading off SQLite for PostgreSQL doesn't really seem worth the additional overhead.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Performance Performance, both client-facing and admin-facing O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

No branches or pull requests