Roadmap for fixing cover archival #7257

Closed
9 tasks
Tracked by #6822
mekarpeles opened this issue Dec 9, 2022 · 1 comment · Fixed by #8208
Labels
Affects: Operations Affects the IA DevOps folks Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Module: Cover Service Cover Store (book covers service) Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Priority: 2 Important, as time permits. [managed] Type: Epic A feature or refactor that is big enough to require subissues. [managed]

Comments

@mekarpeles
Member

mekarpeles commented Dec 9, 2022

Part of #6822

Background

Open Library runs a coverserver with ~1TB of space. Re: #3386, at least monthly, the coverstore is backed up at the disk level.

Originally posted by @mekarpeles in #3386 (comment)

Disk-level backups don't prevent the drive from filling up. Typically, when new covers are added to Open Library, they end up (a) on the coverstore's mounted storage drive and (b) tracked in the coverstore database.

Since landing #7230, we now have documentation and a tested/understood process for moving covers off disk into archives that can be uploaded to archive.org -- See README.

Problem(s)

There are 5 problems with the current state of manual cover archival:

  1. When we move covers off disk into .tar archives and upload them to archive.org, we don't update the cover's location in the db to point to the resulting archive.org path. As a result, we have to guess whether a given file is in a zip or a tar.
    • Presently, all covers with IDs between 8M and 8.81M are in .tar format on archive.org items. These db rows can easily be updated so their location is correct in the db
    • Moving forward, we should update the archival script so each row of the db is updated with the right archive.org path
  2. Investigate migrating cover tars -> zips (#7478): the current archival process results in .tars, which are much slower to access from archive.org than .zips
    • We want to update our existing archive process to make .zips instead of .tars
    • A cleanup job is desired to go back and convert our hundreds of existing .tar dumps (e.g. every tar and index partial in https://archive.org/download/covers_0008) to zip, and then to update the database references, as specified in item 1.
  3. Monthly Cron to archive book cover files: create a daily cron job that:
    • Moves covers off of disk into .zip files
    • Uploads these zips into archive.org items
    • Confirms (using the audit mechanism) that the zip partials have been uploaded
    • Updates the cover db paths to the corresponding archived .zip url
  4. Basic internal cover archival documentation & troubleshooting
  5. Cover archival tests
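
As a rough illustration of items 1 and 2, the sketch below repacks a tar archive as a zip in memory and derives an archive.org path from a cover ID. It is not the coverstore's actual code. The only naming detail taken from this issue is the `covers_0008` item for IDs in the 8M range; the one-item-per-million-IDs rule, the 10,000-ID batch size, the inner file naming, and the function names `archive_location` and `tar_to_zip` are all assumptions made for the example.

```python
import io
import tarfile
import zipfile

# Assumption: one archive.org item per 1,000,000 cover IDs (e.g. covers_0008)
# and one archive per 10,000 IDs inside it. Only the covers_0008 item name
# appears in the issue; the batch naming below is hypothetical.
ITEM_SIZE = 1_000_000
BATCH_SIZE = 10_000


def archive_location(cover_id: int, ext: str = "zip") -> str:
    """Return a hypothetical archive.org path for the archive holding a cover."""
    item = cover_id // ITEM_SIZE
    batch = (cover_id % ITEM_SIZE) // BATCH_SIZE
    return f"covers_{item:04d}/covers_{item:04d}_{batch:02d}.{ext}"


def tar_to_zip(tar_bytes: bytes) -> bytes:
    """Repack every regular file in a tar archive into a zip archive."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as zf:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            data = tar.extractfile(member).read()
            zf.writestr(member.name, data)
    return out.getvalue()
```

`ZIP_STORED` is used here on the assumption that cover images are already compressed, so recompressing them would cost time for little gain. A db update step would then write the value of `archive_location(cover_id)` into each row once the upload is confirmed.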
@mekarpeles mekarpeles added Module: Cover Service Cover Store (book covers service) Priority: 2 Important, as time permits. [managed] Affects: Operations Affects the IA DevOps folks labels Dec 9, 2022
@mekarpeles mekarpeles added this to the Next (proposed) milestone Dec 9, 2022
@anandology
Collaborator

The challenge with time-based backups was that the cover archive expects a
fixed number of images per archive.

It may be a good idea to separate the backup and the archive.

A nightly job backs up all the covers uploaded in the previous day as a
tarball, with the date in the filename, to the ol-covers-backup item.

A separate monthly job would archive the covers on the machine, only when
enough covers are available for an item.

What do you think @mekarpeles ?
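
A minimal sketch of the split proposed above, under stated assumptions: the nightly backup gets a date-stamped tarball name, and the monthly archive step only picks up ID ranges that are already full. The filename pattern, the batch size of 10,000, and the function names `backup_filename` and `ready_batches` are hypothetical; only the ol-covers-backup item name and the fixed-size-archive constraint come from this thread.

```python
from datetime import date

# Assumption: an archive holds a fixed number of covers; 10,000 is illustrative.
BATCH_SIZE = 10_000


def backup_filename(day: date) -> str:
    """Name a nightly backup tarball after the day it covers (hypothetical scheme)."""
    return f"covers-backup-{day.isoformat()}.tar"


def ready_batches(first_unarchived_id: int, newest_id: int,
                  batch_size: int = BATCH_SIZE) -> list[tuple[int, int]]:
    """Return inclusive (start, end) ID ranges that are full and can be archived.

    Assumes cover IDs are assigned sequentially and that first_unarchived_id
    is aligned to a batch boundary.
    """
    batches = []
    start = first_unarchived_id
    while start + batch_size - 1 <= newest_id:
        batches.append((start, start + batch_size - 1))
        start += batch_size
    return batches
```

With this split, a partially filled batch simply waits for the next monthly run, while the nightly tarball still protects its covers in the meantime.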

@mekarpeles mekarpeles modified the milestones: Next (proposed), 2023 Jan 26, 2023
@mekarpeles mekarpeles changed the title Monthly Cron to archive book cover files Roadmap for fixing cover archival Mar 21, 2023
@mekarpeles mekarpeles added Type: Epic A feature or refactor that is big enough to require subissues. [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Lead: @scottbarnes Issues overseen by Scott (Community Imports) labels Mar 21, 2023
@cdrini cdrini modified the milestones: 2023, Sprint 2023-04 Mar 27, 2023
@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] and removed Priority: 2 Important, as time permits. [managed] labels May 1, 2023
@mekarpeles mekarpeles added Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] and removed Lead: @scottbarnes Issues overseen by Scott (Community Imports) labels May 22, 2023
@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] and removed Priority: 1 Do this week, receiving emails, time sensitive, . [managed] labels Jul 17, 2023
@mekarpeles mekarpeles removed this from the Sprint 2023-07 milestone Aug 7, 2023