Make retain_package_versions more useful at sync time #2479

Closed
dralley opened this issue Apr 9, 2022 · 8 comments · Fixed by #2547

Comments

@dralley
Contributor

dralley commented Apr 9, 2022

Provide a way to download only the latest versions of each package during an immediate sync.

The "retain_package_versions" feature currently only applies at the repository version level. This is good because it ensures that uploads will properly kick out old content, but it's bad because in immediate mode, it will download a tremendous amount of unnecessary content and then immediately orphan it.

To give an idea of how much waste there potentially is, I've updated my rpmrepo tool to select the newest package of each name from the repo.

(pulp) [vagrant@pulp3-source-fedora35 devel]$ rpmrepo stats ~/devel/repos/rhel7/

Number of packages:                              32797   
└─ Number of unique packages (latest versions):  4182    
Packages total size:                             59.82 GB
└─ Size of unique packages (latest versions):    4.11 GB 
Metadata total size:                             2.16 GB 
└─ Main metadata total size:                     1.14 GB 
Metadata total size (decompressed):              11.01 GB
└─ Main metadata total size (decompressed):      5.57 GB 

TL;DR - If we only downloaded the latest version of every RHEL7 package and kept the others on-demand, a full immediate sync with the retain_package_versions feature enabled would use only 4.1 GB of disk / network IO instead of 60 GB.
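To make the arithmetic behind that estimate explicit, here is a minimal sketch that only re-derives ratios from the `rpmrepo stats` figures quoted above. The variable names are made up for illustration; nothing here measures a real sync.

```python
# Figures copied from the `rpmrepo stats` output above (sizes in GB).
total_packages = 32797
latest_packages = 4182
total_size_gb = 59.82
latest_size_gb = 4.11

# Bytes that would not need to be downloaded up front.
saved_gb = total_size_gb - latest_size_gb
saved_fraction = saved_gb / total_size_gb

# Rough average number of versions per unique package.
versions_per_package = total_packages / latest_packages

print(f"saved: {saved_gb:.2f} GB ({saved_fraction:.1%} of package bytes)")
print(f"average versions per package: {versions_per_package:.1f}")
```

In other words, roughly 93% of the package bytes in this repository belong to versions other than the latest one.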

You could also conceivably do things like a full mirror sync but with N-3 older packages synced on-demand.

Upsides

Large disk, time and network bandwidth savings for immediate mode syncs with additive sync policy (mirroring is not compatible with retain_package_versions)

Smaller savings for on_demand + additive syncs due to not needing to process as many packages - a 5 minute sync might take 1 minute instead.

Downsides

I think you'd have to process the primary.xml metadata once to get the list of pkgids that you want to keep before proceeding with the sync process. This is easy and cheap time-wise but adds a little bit of complication to the code.
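A rough sketch of what such a pre-pass could look like, assuming the primary.xml entries have already been parsed into simple records. `PackageRecord`, `evr_key` and `select_pkgids_to_keep` are hypothetical names for illustration, not pulp_rpm APIs. Grouping by (name, arch) is an assumption, and the version ordering is deliberately simplified; real RPM comparison (rpmvercmp) has more rules.

```python
import re
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class PackageRecord(NamedTuple):
    """Hypothetical stand-in for one package entry parsed from primary.xml."""
    pkgid: str
    name: str
    arch: str
    epoch: int
    version: str
    release: str


def _segments(text: str) -> tuple:
    # Simplified ordering key: split into numeric and alphabetic runs.
    # Numeric runs sort above alphabetic ones. This is NOT full rpmvercmp
    # (it ignores '~' and '^', for example).
    return tuple(
        (1, int(run)) if run.isdigit() else (0, run)
        for run in re.findall(r"\d+|[A-Za-z]+", text)
    )


def evr_key(pkg: PackageRecord) -> tuple:
    return (pkg.epoch, _segments(pkg.version), _segments(pkg.release))


def select_pkgids_to_keep(packages: Iterable[PackageRecord], retain: int) -> Set[str]:
    """Return the pkgids of the `retain` newest versions per (name, arch)."""
    if retain < 1:
        raise ValueError("this sketch requires retain >= 1")
    groups = defaultdict(list)
    for pkg in packages:
        groups[(pkg.name, pkg.arch)].append(pkg)
    keep: Set[str] = set()
    for group in groups.values():
        group.sort(key=evr_key, reverse=True)  # newest first
        keep.update(pkg.pkgid for pkg in group[:retain])
    return keep


# Made-up sample data; the pkgids are placeholders, not real checksums.
packages = [
    PackageRecord("aaa", "bash", "x86_64", 0, "4.2.46", "34.el7"),
    PackageRecord("bbb", "bash", "x86_64", 0, "4.2.46", "35.el7_9"),
    PackageRecord("ccc", "bash", "x86_64", 0, "4.2.45", "5.el7"),
    PackageRecord("ddd", "zlib", "x86_64", 0, "1.2.7", "18.el7"),
]
keep_one = select_pkgids_to_keep(packages, retain=1)
keep_two = select_pkgids_to_keep(packages, retain=2)
```

The rest of the sync would then skip any package whose pkgid is not in the returned set.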

Users who use mirroring cannot benefit from this, at least not without a more complicated implementation (like downloading N packages immediate and the rest on-demand).
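For illustration only, the "N immediate and the rest on-demand" variant could be sketched as a split rather than a filter: every package stays in the repository, and only the download timing differs. `split_by_download_policy` is a hypothetical helper, and the input is assumed to be already grouped per package and sorted newest first.

```python
from typing import Dict, List, Tuple


def split_by_download_policy(
    versions_newest_first: Dict[str, List[str]], immediate_count: int
) -> Tuple[List[str], List[str]]:
    """Split pkgids into (download now, leave on-demand) without dropping any."""
    immediate: List[str] = []
    on_demand: List[str] = []
    for pkgids in versions_newest_first.values():
        immediate.extend(pkgids[:immediate_count])
        on_demand.extend(pkgids[immediate_count:])
    return immediate, on_demand


# Made-up sample: pkgids per package name, newest first.
repo = {
    "bash": ["bash-3", "bash-2", "bash-1"],
    "zlib": ["zlib-1"],
}
immediate, on_demand = split_by_download_policy(repo, immediate_count=1)
```

Because nothing is removed from the repository, this shape would not conflict with mirroring the way dropping old versions does.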

Alternative implementation

If the artifact downloads didn't start until after the repository version was created, this wouldn't be necessary. The single existing mechanism would handle everything (although you wouldn't get any on_demand sync improvements).

Other benefits:

  • We would need this anyway to implement "immediate download of everything in the repository version".
  • Syncs would be more resilient to download problems. The repository version would get created first and if the download fails you don't have to repeat as much work.

Downside:

Artifact download has to wait a while - think of the python or container plugins, where it might be hours before the downloads start.
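A toy sketch of the ordering this alternative implies, under the assumption that finalization (retention, validation) runs before any download. `RepoVersionSketch`, `sync_then_download`, `finalize` and `fetch` are hypothetical names for illustration; this is not how pulpcore's pipeline is structured today.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RepoVersionSketch:
    """Hypothetical stand-in for a finalized repository version."""
    content: List[str]
    downloaded: Dict[str, bytes] = field(default_factory=dict)

    def on_demand(self) -> List[str]:
        # Units that are part of the version but have no local artifact yet.
        return [unit for unit in self.content if unit not in self.downloaded]


def sync_then_download(
    candidates: List[str],
    finalize: Callable[[List[str]], List[str]],
    fetch: Callable[[str], bytes],
) -> RepoVersionSketch:
    # Phase 1: finalize first, so retention can drop content before
    # any bytes are fetched for it.
    version = RepoVersionSketch(content=finalize(candidates))
    # Phase 2: download only what survived finalization.
    for unit in version.content:
        try:
            version.downloaded[unit] = fetch(unit)
        except OSError:
            # A failed download leaves the unit on-demand instead of
            # failing the whole sync.
            continue
    return version


fetched = []


def fake_fetch(unit: str) -> bytes:
    fetched.append(unit)
    if unit == "flaky":
        raise OSError("simulated download failure")
    return b"data"


version = sync_then_download(
    ["old", "new", "flaky"],
    finalize=lambda units: [u for u in units if u != "old"],
    fetch=fake_fetch,
)
```

Here "old" is never fetched at all, and "flaky" remains in the version as on-demand content after its download fails.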

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2109260

@bmbouter
Member

Conceptually this sounds good, but I'm unclear on what the actual changes in pulpcore are. I'm also confused because the repository versions are created prior to any downloading (or even pipeline execution).

@bmbouter
Member

Oh I was reading this from the context of pulpcore, but I see this is entirely pulp_rpm! Is there any change in pulpcore needed?

@dralley
Contributor Author

dralley commented Apr 11, 2022

> Oh I was reading this from the context of pulpcore, but I see this is entirely pulp_rpm! Is there any change in pulpcore needed?

Created but not completed. That would be the change - downloads would happen after the repository versions are completed, which means downloading would happen after all the finalization / verification steps that might kick out content from the version. It would also mean download failures don't immediately destroy a sync operation: if it fails during the download phase, you still have the repository version, just with on-demand content. That could be a double-edged sword though, I suppose.

But all this only applies to the alternate proposal, the primary one doesn't suggest any changes to pulpcore.

@bmbouter
Member

I could imagine a situation where a mixture of on_demand and immediate content was being used for a repo version, but if a sync has N content units that are immediate I think we should not call it complete until those N are downloaded. I guess for most workflows it would be ok to call it complete without those N, but then consider things like export which has to have the content locally. How did you imagine that type of workflow going?

@dralley
Contributor Author

dralley commented Apr 11, 2022

I think we're getting off track, which is my fault, really. Don't focus on the "N immediate + M on_demand" part. My question is: is the "alternative proposal" compelling at all, or should we just implement filtering in the first stage (main proposal)?

The main reasons I bring up the "alternative proposal" are that we would already have to implement something like this if we want to "download all remote artifacts in a repository" for the sake of export (which is the inverse of the "revert this repository to on_demand for space savings" feature we currently have), and that the design seems cleaner. Otherwise, doing everything in the first stage is more flexible and less invasive.

@dralley
Contributor Author

dralley commented Apr 12, 2022

@bmbouter See also: pulp/pulpcore#1843 and pulp/pulpcore#1976

@ipanova
Member

ipanova commented May 25, 2022

This might be handy #2515 (comment)

@dralley dralley self-assigned this May 26, 2022
dralley added 19 commits to dralley/pulp_rpm that referenced this issue between May 26 and Jun 3, 2022
@dralley dralley added this to the 3.18 milestone Jun 16, 2022
dralley added 8 commits to dralley/pulp_rpm that referenced this issue between Jun 24 and Jul 18, 2022
dralley added a commit that referenced this issue Jul 20, 2022
dralley added 7 commits to dralley/pulp_rpm that referenced this issue between Jul 26 and Aug 2, 2022
dralley added a commit that referenced this issue Aug 3, 2022
@pulpbot
Member

pulpbot commented Aug 29, 2022
