Remove id column and use different primary key on some tables #4093
Conversation
```sql
ALTER TABLE post_saved
DROP COLUMN id,
ADD PRIMARY KEY (post_id, person_id);
```
Is it possible that old databases could have duplicate entries?
If there's already a unique constraint on the table, it's fine. But if a given table doesn't have one, we need to find and delete duplicates before adding the new primary key.
edit: also, I'm not sure, but you might need to remove the previous unique constraint manually, probably after adding the new PK.
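The dedup step described above could look roughly like this. This is a hedged sketch, not the PR's actual migration: it assumes `post_saved` still has its `id` column at that point and that keeping the lowest `id` per pair is acceptable.

```sql
-- Assumed approach: delete duplicate (post_id, person_id) rows,
-- keeping the row with the lowest id. Run before dropping id
-- and adding the composite primary key.
DELETE FROM post_saved a
USING post_saved b
WHERE a.post_id = b.post_id
  AND a.person_id = b.person_id
  AND a.id > b.id;
```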
The table has `UNIQUE` and `NOT NULL` constraints. I changed the migration to remove them after adding the primary key.
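A sketch of that ordering, under the assumption that the old unique constraint has PostgreSQL's default generated name (check `\d post_saved` for the real one):

```sql
-- Add the composite primary key first...
ALTER TABLE post_saved
DROP COLUMN id,
ADD PRIMARY KEY (post_id, person_id);

-- ...then drop the now-redundant unique constraint.
-- Constraint name is an assumption (PostgreSQL's default naming).
ALTER TABLE post_saved
DROP CONSTRAINT post_saved_post_id_person_id_key;
```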
`NOT NULL` is no longer removed, because removing it shows no effect on the schema when inspecting it with pg_dump.
```sql
ALTER TABLE post_saved
DROP COLUMN id,
ADD PRIMARY KEY (post_id, person_id),
```
I think this index would be more efficient reversed: with person_id as the first index column, it's quick to load all of your saved posts. But maybe there's already an index on (person_id, published)?
There's already an index on just person_id.
I changed the primary key to (person_id, post_id) and removed the old person_id index, so now there's only one index.
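A sketch of the resulting migration. The old index name is an assumption, not taken from the PR:

```sql
ALTER TABLE post_saved
DROP COLUMN id,
ADD PRIMARY KEY (person_id, post_id);

-- The composite PK's index already covers lookups on person_id
-- alone (leftmost prefix), so the single-column index is redundant.
-- Index name is an assumption.
DROP INDEX idx_post_saved_person_id;
```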
Thanks, this is definitely a lot cleaner, especially for bridge tables. Before we merge I want to test that the migrations work with production data though, so give me a few days on that.
@dullbananas Check this one out: dullbananas#17 I merged from main, fixed a few lints, and the scheduled jobs.
* Also order reports by oldest first (ref LemmyNet#4123) (LemmyNet#4129)
* Support signed fetch for federation (fixes LemmyNet#868) (LemmyNet#4125)
* Support signed fetch for federation (fixes LemmyNet#868)
* taplo
* add federation queue state to get_federated_instances api (LemmyNet#4104)
* add federation queue state to get_federated_instances api
* feature gate
* move retry sleep function
* move stuff around
* Add UI setting for collapsing bot comments. Fixes LemmyNet#3838 (LemmyNet#4098)
* Add UI setting for collapsing bot comments. Fixes LemmyNet#3838
* Fixing clippy check.
* Only keep sent and received activities for 7 days (fixes LemmyNet#4113, fixes LemmyNet#4110) (LemmyNet#4131)
* Only check auth secure on release mode. (LemmyNet#4127)
* Only check auth secure on release mode.
* Fixing wrong js-client.
* Adding is_debug_mode var.
* Fixing the desktop image on the README. (LemmyNet#4135)
* Delete dupes and add possibly missing unique constraint on person_aggregates.
* Fixing clippy lints.

Co-authored-by: Nutomic <me@nutomic.com>
Co-authored-by: phiresky <phireskyde+git@gmail.com>
Looks good to me. Had to silence all the database administrators in my head saying "always use autoincrement integer primary keys", but I don't think that's necessary, especially for aggregate tables.
Let's have @phiresky take a look also, since this is a bigger DB change.
This is definitely a good idea for the aggregate tables.
For the other things, I'd be a bit more cautious:
Having an incrementing integer primary key has the advantage of ensuring locality, improving the access pattern for large indexes. Using something random like a UUID makes this worse, which hurts as soon as parts of an index are evicted from RAM.
For example, for posts there is always a hot set that around 90% of accesses go to. When the rows are keyed by an incrementing int, those rows are all close together, so only a small part of the index is needed to find them. With a random key like a UUID or the ap_id they are scattered everywhere. Here's a random article about that. In addition, it makes future partitioning and sharding changes harder (e.g. we might want to partition the posts table by post_id/1e6 for index size and performance, with older partitions living on cheaper / compressed storage and falling out of RAM cache).
That said, this is really only relevant for the largest tables, those with at least ~10 million rows that are expected to grow linearly; in Lemmy that would be posts, comments, votes. E.g. for captcha_answer, it shouldn't matter much. It's also only relevant for the primary lookup method: whatever is used to primarily look up a row should have good locality. (This is something we should consider if we switch to more readable URLs in the future.)
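The partitioning idea mentioned above could look roughly like this. This is a hypothetical sketch, not an existing Lemmy schema: the table name, columns, and partition boundaries are all assumptions.

```sql
-- Hypothetical: range-partition posts by id so that hot (recent)
-- rows share partitions, and older partitions can be moved to
-- cheaper or compressed storage.
CREATE TABLE post_partitioned (
    id    bigint NOT NULL,
    ap_id text   NOT NULL,
    -- ...remaining post columns would go here...
    PRIMARY KEY (id)
) PARTITION BY RANGE (id);

-- One partition per 1e6 ids, matching the post_id/1e6 idea above.
CREATE TABLE post_p0 PARTITION OF post_partitioned
    FOR VALUES FROM (0) TO (1000000);
CREATE TABLE post_p1 PARTITION OF post_partitioned
    FOR VALUES FROM (1000000) TO (2000000);
```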
Compound indexes also increase the index size but this should be irrelevant since all the primary keys added should have a corresponding unique index that's being dropped.
For the 1:1 tables (mostly aggregates), using the same id as the main table is great all around. For most of the other changed tables, it looks good as well.
These are the tables where it's not 100% obvious whether it's good:
- comment_like, post_like: the id is replaced with a compound key (person_id, comment_id). Since it looks like comment_likes were never actually looked up by their id, this is fine.
- image_upload: these seem to be mostly queried by alias rather than id anyway, so it's fine. I'm guessing the primary lookup (when an image is viewed) happens inside pictrs, not Lemmy?
For future changes, some caution is warranted: what I'm saying is mainly relevant if you change which keys rows are looked up by. Removing id columns that aren't really used anyway is fine from a perf perspective.
I didn't check the down migration and I didn't exactly verify whether the triggers/tasks do what they are supposed to. But LGTM
Yes, image requests are passed to pictrs; Lemmy doesn't perform any db queries for them. https://github.com/LemmyNet/lemmy/blob/main/crates/routes/src/images.rs
Fixes #3933
This will also allow more concise filters using the diesel `find` method. For db views, I will do that in other pull requests. Some `is_not_null` checks that previously used `id` are now a little more confusing, but that will be fixed when I refactor all db views.
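With a composite primary key, looking up a row by its full key is a single point query. In SQL terms (a sketch; the placeholder parameters stand in for what a diesel `find((person_id, post_id))` call would bind):

```sql
-- Point lookup by the composite primary key (person_id, post_id).
SELECT *
FROM post_saved
WHERE person_id = $1
  AND post_id = $2;
```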