Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Rework the storage framework to work nicely with PEP 249 #9295

Closed
ShadowJonathan opened this issue Feb 2, 2021 · 12 comments
Closed

Rework the storage framework to work nicely with PEP 249 #9295

ShadowJonathan opened this issue Feb 2, 2021 · 12 comments
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@ShadowJonathan
Copy link
Contributor

This is a "maybe" issue, but some elements of PEP 249 have already been implicitly coded into the codebase, such as synapse.storage.types, however, it could be easier to directly work with PEP 249's DB-API-v2 schema and approach to simplify the storage framework even further, and make (potential) future database engines easier to support.

(Things such as module .threadsafety properties can maybe help futher granularize database pooling mechanisms dynamically.)

@richvdh
Copy link
Member

richvdh commented Feb 2, 2021

what would this actually involve, in your mind? The codebase assumes database APIs which comply PEP249, as both psycopg2 and sqlite do.

@ShadowJonathan
Copy link
Contributor Author

More directly work with PEP249 interfaces which's quirks are caught in the database engine modules, instead of embedding the quirks all over the place.

Basically: Only consume PEP249 APIs, specialise on database type, and the database driver needs to be invisible.

Ofc only if this is possible with the way the framework is set up.

The pros of this will be that it's incredibly easy to work with multiple database engine types, and possibly multiple SQL database types, but that's far-off.

@clokep
Copy link
Member

clokep commented Feb 2, 2021

I'm still not sure what concrete changes are proposed here -- can you clarify what "storage framework" means to you? (Is it a particular class? A particular module?)

@ShadowJonathan
Copy link
Contributor Author

ShadowJonathan commented Feb 2, 2021

Ah sorry, it's quite ambiguous, but i mean the target area of every part of synapse that either;

  • Holds a reference to an *Engine, Connection, Cursor or Transaction object (to keep those parts of the codebase in mind)
  • Directly works on aforementioned objects with SQL statements
  • Needs to use or utilises certain quirks from database drivers and directly or indirectly invokes them by passing values that end up being used by those drivers.

With "storage framework" i meant the entirety of synapse.storage, but i have a feeling even that module-based scope might not be good enough, but we'll see.

@richvdh
Copy link
Member

richvdh commented Feb 2, 2021

can you give us some examples of what would change here? As far as I'm aware almost all of the code you mention already uses PEP-249-compliant interfaces.

@ShadowJonathan
Copy link
Contributor Author

ShadowJonathan commented Feb 2, 2021

I saw quite a lot of .set_session usage when going over changes needing to be made for #9299, .set_session is a psycopg2-specific method on Connection objects.

It's about the prospect these were used regardless of aforementioned supposed compliance in the codebase, which makes me think there's much more uncareful usage of adapter quirks like that, i think much more subtle usages like these exist such as in MultiWriterIdGenerator or the like, and postgres-specific replication workers (which have had the privilege of never worrying about another database adapter than psycopg2), that's what im referring to.

This is all in anticipation for pg8000, yes, but i think it's also helpful to clean up the codebase in general to prevent bugs from popping up in the future due to a driver-implementation-specific feature, regardless.

@erikjohnston
Copy link
Member

erikjohnston commented Feb 4, 2021

Right, yes we do use some extensions when using psycopg2 for performance reasons, and those tend to be gated behind a check to make sure we're using the PostgresEngine, rather than the SQLiteEngine. At the moment there isn't a distinction between connecting to a particular database (and thus being able to use non standard SQL features) vs using a particular connector (and thus being able to use particular connector features). It's absolutely possible to split those concepts up, but that obviously adds complexity.

Could this issue be better described as adding support for pg8000 library and/or other sqlite and postgres adaptors? Ah, I've just seen #9294, which is basically that.

@erikjohnston erikjohnston added T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. X-Needs-Discussion and removed X-Needs-Discussion labels Feb 4, 2021
@ShadowJonathan
Copy link
Contributor Author

ShadowJonathan commented Feb 4, 2021

It's absolutely possible to split those concepts up, but that obviously adds complexity.

I can counter this with 4 lines from python zen;

Simple is better than complex.
Complex is better than complicated.

Readability counts.

In the face of ambiguity, refuse the temptation to guess.

While i'm not going to hold this up as the SSOT in regards to how we should approach this, it is how I look at most projects, and right now there is both ambiguity with PostgresEngine as both an engine for psycopg2 and "the" postgres engine, so I want to split these two concepts up and make both explicit, so that in the future it is clear what is happening at any one time.

psycopg2-speedups can stay psycopg2-speedups, but it should be clear that they're psycopg2-specific, not postgres-specific, with which there is ambiguity in many places in the codebase at the moment.

@ShadowJonathan
Copy link
Contributor Author

ShadowJonathan commented Feb 4, 2021

Extra note; Don't just think i'm just doing this for pg8000, I just found asyncpg, an asyncio-based postgres database driver which claims to be 3x faster than psycopg2, which might be worth looking into, so a generic database interface would help wonders maintaining compatibility while testing drivers like these for speedups and extra performance.

@callahad
Copy link
Contributor

I had a chance to discuss this with the Synapse platform team today.

We concluded that we are unlikely to take up this proposal: the necessary refactoring would be fairly invasive, and thus it would only be justified if it would address a specific, identified need. Until then, we are unfortunately resigned to conflating "Postgres" with "psycopg2".

We wouldn't immediately reject a pull request which provides better abstraction, but it would face a particularly skeptical review.

Example factors which could affect this conclusion:

  • The psycopg2 driver entering end-of-life
  • Explicit measurements showing that the database driver was a performance bottleneck and that another driver would solve this
  • Element determining that supporting another RDBMS is necessary for its viability as a business

@callahad
Copy link
Contributor

(I'm happy to leave this open as a P5, but I'd need a new title which more clearly encapsulates a specific task that this issue represents)

@ShadowJonathan
Copy link
Contributor Author

Explicit measurements showing that the database driver was a performance bottleneck and that another driver would solve this

I currently believe this is the case, as you said in #9294 (comment), before re-opening this - once i have enough time to get around to it - i'll try to prove it, maybe with the load test. Thanks for taking the time to think about this, though.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

No branches or pull requests

5 participants