
feat(server): library refresh go brrr #14456

Draft · etnoy wants to merge 3 commits into main from feat/inline-offline-check

Conversation

@etnoy (Contributor) commented on Dec 2, 2024

This PR significantly improves library scanning performance. Wherever suitable, jobs now run in batches, and many looped database interactions are replaced with single SQL queries.

A library scan with 22k items where nothing has changed since the last scan used to take 1m 22s; now it takes under 10 seconds, an improvement of about 87 percent!

Highlights:

  • File paths crawled on disk are compared in SQL to discard already-imported files (see the sketch after these lists)
  • Modified files are scanned in batches, then a single database call updates all of them
  • Missing files are identified in batches, then a single database call marks all of them as offline
  • Import paths and exclusion patterns are matched against library assets in a single SQL query

Bonus:

  • Greatly improved log messages related to library scans
  • More e2e tests covering offline files coming back online, which led to one major bug being fixed
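
As a rough illustration of the first highlight (comparing crawled paths in SQL rather than querying once per file), a minimal sketch is below. It assumes a TypeORM-style `assetRepository` and an illustrative batch size and column names; it is not the PR's actual code.

```ts
const BATCH_SIZE = 5000; // assumed batch size, not taken from the PR

async function filterNewPaths(crawledPaths: string[], libraryId: string): Promise<string[]> {
  const newPaths: string[] = [];

  for (let i = 0; i < crawledPaths.length; i += BATCH_SIZE) {
    const batch = crawledPaths.slice(i, i + BATCH_SIZE);

    // One query per batch: which of these paths does the library already know about?
    const existing = await assetRepository
      .createQueryBuilder('asset')
      .select('asset.originalPath', 'originalPath')
      .where('asset.libraryId = :libraryId', { libraryId })
      .andWhere('asset.originalPath IN (:...batch)', { batch })
      .getRawMany<{ originalPath: string }>();

    const known = new Set(existing.map((row) => row.originalPath));
    newPaths.push(...batch.filter((path) => !known.has(path)));
  }

  return newPaths; // only these paths are queued for import
}
```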

@etnoy force-pushed the feat/inline-offline-check branch from 80aa615 to 8ecde3b on December 2, 2024 at 21:46

@mertalev (Contributor) left a comment


Nice start! I think there are still a lot of untapped potential improvements here.

Two resolved review comments on server/src/services/library.service.ts (outdated)
@mertalev (Contributor) commented on Dec 4, 2024

The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.

@etnoy (Contributor, Author) commented on Dec 8, 2024

Thanks for your comments, @mertalev! I'll first attempt to do the import path and exclusion pattern checks in SQL and then move on to your suggestions.

@etnoy force-pushed the feat/inline-offline-check branch 2 times, most recently from d394654 to 8b2a48c on December 9, 2024 at 21:34
@etnoy force-pushed the feat/inline-offline-check branch 3 times, most recently from 6d69307 to c26f6aa on December 10, 2024 at 16:41
@etnoy force-pushed the feat/inline-offline-check branch from c26f6aa to a3be620 on December 10, 2024 at 20:39
@etnoy changed the title from "feat(server): run all offline checks in a single job" to "feat(server): library refresh go brrr" on Dec 10, 2024
.where({ isOffline: false })
.andWhere(
  new Brackets((qb) => {
    qb.where('originalPath NOT SIMILAR TO :paths', {

Inline review comment (Contributor) on the snippet above:

Use LIKE instead of SIMILAR TO.

The exclusions and import paths are also specific to a particular library, right? So you need to specify the library in the query.

Also, can you generate SQL for this and confirm with EXPLAIN ANALYZE that it uses an index?
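
A rough sketch of what that suggestion could look like is below, assuming a TypeORM-style query builder; `assetRepository`, `importPaths`, and `libraryId` are illustrative names, not the PR's code. The generated SQL can then be run in psql prefixed with EXPLAIN ANALYZE.

```ts
import { Brackets } from 'typeorm';

const query = assetRepository
  .createQueryBuilder('asset')
  .where('asset.libraryId = :libraryId', { libraryId })
  .andWhere('asset.isOffline = false')
  .andWhere(
    new Brackets((qb) => {
      // An asset is outside the library if it is under none of the import paths.
      for (const [i, importPath] of importPaths.entries()) {
        qb.andWhere(`asset.originalPath NOT LIKE :prefix${i}`, { [`prefix${i}`]: `${importPath}/%` });
      }
    }),
  );

// Print the generated statement and parameters, then check the plan with
// EXPLAIN ANALYZE to confirm an index is used.
const [sql, parameters] = query.getQueryAndParameters();
console.log(sql, parameters);
```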

.update()
.set({
  isOffline: true,
  deletedAt: new Date(),

Inline review comment (Contributor) on the snippet above:

The status also needs to be set. This is why I don't really like the status field: the same info is stored in multiple places, so it's easy for it to go out of sync like this.
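
As an aside, a minimal sketch of the batched update with the status column included so it cannot drift from isOffline and deletedAt; AssetStatus.TRASHED and the surrounding names are assumptions, not the PR's actual code.

```ts
await assetRepository
  .createQueryBuilder()
  .update()
  .set({ isOffline: true, deletedAt: new Date(), status: AssetStatus.TRASHED }) // status value assumed
  .where('id IN (:...ids)', { ids: offlineAssetIds })
  .execute();
```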

@etnoy force-pushed the feat/inline-offline-check branch 4 times, most recently from 4dafcc9 to 775b817 on December 12, 2024 at 07:48
@etnoy force-pushed the feat/inline-offline-check branch from 775b817 to 69b273d on December 12, 2024 at 20:59
@etnoy (Contributor, Author) commented on Dec 12, 2024

> The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.

Never thought of that, I've implemented your suggestion. I'm also considering changing the initial import code to ignore file mtime, this allows us to not do any file system calls except for the crawl. Metadata extraction will have to do the heavy lifting instead

@mertalev (Contributor) commented on Dec 12, 2024

> > The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.
>
> Never thought of that, I've implemented your suggestion. I'm also considering changing the initial import code to ignore file mtime, this allows us to not do any file system calls except for the crawl. Metadata extraction will have to do the heavy lifting instead

Would that mean you queue them for metadata extraction even if they're unchanged? You can test it but I think it'd be more overhead than the stat calls.

Edit: also if you do this with the source set to upload, it would definitely be worse because it would queue a bunch of other things after metadata extraction.

@etnoy (Contributor, Author) commented on Dec 12, 2024

> > > The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.
> >
> > Never thought of that, I've implemented your suggestion. I'm also considering changing the initial import code to ignore file mtime, this allows us to not do any file system calls except for the crawl. Metadata extraction will have to do the heavy lifting instead
>
> Would that mean you queue them for metadata extraction even if they're unchanged? You can test it but I think it'd be more overhead than the stat calls.
>
> Edit: also if you do this with the source set to upload, it would definitely be worse because it would queue a bunch of other things after metadata extraction.

I was referring to new imports, i.e. files that are new to Immich. I hoped to improve ingest performance by removing the stat call. After testing, there are two issues:

  • assetRepository.create requires mtime, which we can only get from stat. We could work around that by setting it to new Date(), but ideally it should be undefined
  • We still check for the existence of a sidecar, and this complicates things

If we can mitigate the two issues above, I can rewrite the library import feature and do that in batches as well!

@mertalev (Contributor) commented on Dec 12, 2024

I don't see why fileModifiedAt needs a non-null constraint in the DB. Might just be an oversight that didn't matter because it didn't affect our usage. I think you can change the asset entity and generate a migration to remove that constraint.
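
A hedged sketch of that migration is below; the table and column names ("assets", "fileModifiedAt") and the migration class name are assumptions about the schema, not the PR's code.

```ts
import { MigrationInterface, QueryRunner } from 'typeorm';

// Drop the NOT NULL constraint on fileModifiedAt so imports can leave it unset.
export class MakeFileModifiedAtNullable1734000000000 implements MigrationInterface {
  public async up(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`ALTER TABLE "assets" ALTER COLUMN "fileModifiedAt" DROP NOT NULL`);
  }

  public async down(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`ALTER TABLE "assets" ALTER COLUMN "fileModifiedAt" SET NOT NULL`);
  }
}
```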

For sidecar files, maybe you could add .xmp to the glob filter and enable the option to make the files come in sorted order? That way you could make sure they're in the same batch.
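
A rough sketch of that idea, assuming fast-glob for the crawl and image.jpg.xmp-style sidecar naming; instead of a sort option on the globber, this simply sorts the crawl results itself. All names here are illustrative.

```ts
import fg from 'fast-glob';

async function crawlWithSidecars(importPath: string): Promise<Map<string, string | undefined>> {
  // Include .xmp in the pattern so sidecars are crawled alongside their assets.
  const paths = await fg(['**/*.{jpg,jpeg,png,heic,mp4,xmp}'], { cwd: importPath, absolute: true });
  paths.sort(); // sorted order keeps image.jpg and image.jpg.xmp adjacent, i.e. in the same batch

  const sidecars = new Set(paths.filter((path) => path.endsWith('.xmp')));
  const assets = new Map<string, string | undefined>(); // asset path -> sidecar path (if any)

  for (const path of paths) {
    if (path.endsWith('.xmp')) {
      continue;
    }
    const sidecarPath = `${path}.xmp`; // image.jpg.xmp naming is assumed
    assets.set(path, sidecars.has(sidecarPath) ? sidecarPath : undefined);
  }

  return assets;
}
```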
