Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove id column and use different primary key on some tables #4093

Merged
merged 46 commits into from
Nov 13, 2023

Conversation

dullbananas
Copy link
Collaborator

@dullbananas dullbananas commented Oct 24, 2023

Fixes #3933

This will also allow more concise filters using the diesel find method. For db views, I will do that in other pull requests.

Some is_not_null checks that previously used id are now a little more confusing, but that will be fixed when I refactor all db views.

@@ -0,0 +1,4 @@
ALTER TABLE post_saved
DROP COLUMN id,
ADD PRIMARY KEY (post_id, person_id);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that old databases could have duplicate entries?

Copy link
Member

@dessalines dessalines Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a unique constraint on the table already, it's fine. But if there isn't one on that given table, we need to find and delete duplicates before adding that new primary key.

edit: also, I'm not sure, but you might need to remove the previous unique constraint manually, probably after adding the new PK.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table has UNIQUE and NOT NULL constraints. I changed the migration to remove them after adding the primary key.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOT NULL is no longer removed because it does not show any effect on the schema when using pg_dump.

@@ -0,0 +1,7 @@
ALTER TABLE post_saved
DROP COLUMN id,
ADD PRIMARY KEY (post_id, person_id),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this index would be more efficient backwards? person_id as the first index element means it's quick to load all your saved posts, but maybe there's already an index for (person_id, published)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already an index on just person_id.

I changed the primary key to (person_id, post_id) and removed the old person_id index so now there's only 1 index.

@dullbananas dullbananas marked this pull request as ready for review November 4, 2023 16:15
Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, this is definitely a lot cleaner, esp for bridge tables. Before we merge I wanna test that the migrations work with production data tho, so gimme a few days on that.

@dessalines
Copy link
Member

@dullbananas Check this one out: dullbananas#17

I merged from main, fixed a few lints, and the scheduled jobs.

dessalines and others added 3 commits November 7, 2023 14:49
* Also order reports by oldest first (ref LemmyNet#4123) (LemmyNet#4129)

* Support signed fetch for federation (fixes LemmyNet#868) (LemmyNet#4125)

* Support signed fetch for federation (fixes LemmyNet#868)

* taplo

* add federation queue state to get_federated_instances api (LemmyNet#4104)

* add federation queue state to get_federated_instances api

* feature gate

* move retry sleep function

* move stuff around

* Add UI setting for collapsing bot comments. Fixes LemmyNet#3838 (LemmyNet#4098)

* Add UI setting for collapsing bot comments. Fixes LemmyNet#3838

* Fixing clippy check.

* Only keep sent and received activities for 7 days (fixes LemmyNet#4113, fixes LemmyNet#4110) (LemmyNet#4131)

* Only check auth secure on release mode. (LemmyNet#4127)

* Only check auth secure on release mode.

* Fixing wrong js-client.

* Adding is_debug_mode var.

* Fixing the desktop image on the README. (LemmyNet#4135)

* Delete dupes and add possibly missing unique constraint on person_aggregates.

* Fixing clippy lints.

---------

Co-authored-by: Nutomic <me@nutomic.com>
Co-authored-by: phiresky <phireskyde+git@gmail.com>
Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Had to kill all the database administrators in my head saying "always use autoincrement integer primary keys", but I don't think that's necessary, especially for aggregate tables.

Lets have @phiresky take a look also, since this is a bigger DB change.

Copy link
Collaborator

@phiresky phiresky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a good idea for the aggregate tables.
For the other things, I'd be a bit more cautious:

Having an incrementing integer primary key has the advantage of ensuring locality / improving the access pattern for large indexes. Using something random like UUID makes this worse which is bad as soon as parts of an index are evicted from RAM.

For example, for posts you always have a hot set which is what 90% of access goes to - when the rows are keyed by an int these are all close together and thus need only a small part of the index to find. With a random key like UUID or the ap_id they are scattered everywhere. Here's a random article about that. In addition, it makes future partitioning and sharding changes harder (e.g. we might want to partition the posts table by post_id/1e6 for index size and performance, with older partitions living on cheaper / compressed storage and falling out of ram cache).

That said, this is really only relevant for the largest tables - tables that have at least like 10 million rows and are expected to grow linearily - in Lemmy that would be posts, comments, votes. E.g. for captcha_answer, it shouldn't matter much. It's also only relevant when you look at the primary look up method for something - whatever is used to primarily look up a row should have good locality. (this is something we should consider if we switch to more readable urls in the future)

Compound indexes also increase the index size but this should be irrelevant since all the primary keys added should have a corresponding unique index that's being dropped.

For the 1:1 tables (mostly aggregates), using the same id as the main table is great all around. For most of the other changed tables, it looks good as well.

These are the tables where it's not 100% obvious if it's good:

  • comment_like,post_like: the id is replaced with a compound id (person_id, comment_id). Since it looks like comment_likes are never actually were looked up based on their id it's fine
  • image_upload: these seem to be mostly queried by alias not id anyways so it's fine. I'm guessing the primary lookup (when an image is viewed) happens inside pictrs not lemmy?

For future changes, be careful maybe. What I'm saying is mainly relevant if you change what keys rows are looked up by - removing id columns that aren't really used anyways is fine from a perf perspective.

I didn't check the down migration and I didn't exactly verify whether the triggers/tasks do what they are supposed to. But LGTM

@dessalines dessalines enabled auto-merge (squash) November 8, 2023 14:07
@Nutomic
Copy link
Member

Nutomic commented Nov 8, 2023

Yes image requests are passed to pictrs, Lemmy doesnt perform any db queries for them.

https://github.com/LemmyNet/lemmy/blob/main/crates/routes/src/images.rs

auto-merge was automatically disabled November 9, 2023 12:27

Head branch was pushed to by a user without write access

@dessalines dessalines merged commit 8e2cbc9 into LemmyNet:main Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

id column is redundant in some tables
5 participants