Make retain_package_versions more useful at sync time #2479
Comments
Conceptually this sounds good, but I'm unclear on what the actual changes in pulpcore are. Also, I'm confused because the repository versions are created prior to any downloading (or even pipeline execution), so I'm doubly confused.
Oh, I was reading this from the context of pulpcore, but I see this is entirely pulp_rpm! Is there any change in pulpcore needed?
Created but not completed. That would be the change: downloads would happen after the repository versions are completed, which means downloading would happen after all the finalization / verification steps that might kick out content from the version. It would also mean download failures don't immediately destroy a sync operation; if it fails during the download phase you still have the repository version, just with on-demand content. That could be a double-edged sword though, I suppose. But all this only applies to the alternate proposal; the primary one doesn't suggest any changes to pulpcore.
I could imagine a situation where a mixture of on_demand and immediate content was being used for a repo version, but if a sync has N content units that are immediate, I think we should not call it complete until those N are downloaded. I guess for most workflows it would be ok to call it complete without those N, but then consider things like export, which has to have the content locally. How did you imagine that type of workflow going?
I think we're getting off track, which is my fault, really. Don't focus on the "N immediate + M on_demand" part. My question is: is the "alternative proposal" compelling at all, or should we just implement filtering in the first stage (main proposal)? The main reason I bring up the "alternative proposal" is that we would already have to implement something like this if we want to "download all remote artifacts in a repository" for the sake of export (which is the inverse of the "revert this repository to on_demand for space savings" feature we currently have), and because the design seems cleaner. Otherwise, doing everything in the first stage is more flexible and less invasive.
@bmbouter See also: pulp/pulpcore#1843 and pulp/pulpcore#1976 |
This might be handy #2515 (comment) |
Provide a way to download only the latest versions of each package during an immediate sync.
The "retain_package_versions" feature currently only applies at the repository version level. This is good because it ensures that uploads will properly kick out old content, but it's bad because in immediate mode, it will download a tremendous amount of unnecessary content and then immediately orphan it.
To give an idea of how much waste there potentially is, I've updated my `rpmrepo` tool to select the newest package of each `name` from the repo.

TL;DR - If we were to only download the latest version of every RHEL7 package and kept the others on-demand, a full immediate sync with the `retain_package_versions` feature enabled would only use 4.1 GB of disk / network IO instead of 60 GB. You could also conceivably do things like a full mirror sync, but with N-3 older packages synced on-demand.
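The "newest package of each name" selection could be sketched roughly like this. This is a toy illustration, not Pulp's or rpmrepo's actual code: the XML snippet is a made-up fragment in the createrepo `primary.xml` common-metadata shape, and the version comparison is a deliberately naive simplification (real RPM ordering uses `rpmvercmp`, which handles many edge cases this doesn't).

```python
# Sketch: pick the pkgid of the newest package per name from
# createrepo-style primary.xml metadata. Illustrative only; the
# version comparison below is NOT a correct rpmvercmp.
import xml.etree.ElementTree as ET

NS = {"c": "http://linux.duke.edu/metadata/common"}

# Hypothetical minimal primary.xml fragment (two bash builds, one kernel).
PRIMARY_XML = """\
<metadata xmlns="http://linux.duke.edu/metadata/common" packages="3">
  <package type="rpm">
    <name>bash</name>
    <version epoch="0" ver="4.2.46" rel="31.el7"/>
    <checksum type="sha256" pkgid="YES">aaa</checksum>
  </package>
  <package type="rpm">
    <name>bash</name>
    <version epoch="0" ver="4.2.46" rel="34.el7"/>
    <checksum type="sha256" pkgid="YES">bbb</checksum>
  </package>
  <package type="rpm">
    <name>kernel</name>
    <version epoch="0" ver="3.10.0" rel="1160.el7"/>
    <checksum type="sha256" pkgid="YES">ccc</checksum>
  </package>
</metadata>
"""

def naive_evr_key(pkg):
    # Simplified (epoch, version, release) sort key: split on dots and
    # compare numerically where possible. Fine for a sketch, wrong for
    # real RPM corner cases (tildes, mixed alnum segments, etc.).
    ver = pkg.find("c:version", NS)
    def parts(s):
        return tuple(int(p) if p.isdigit() else p for p in s.split("."))
    return (int(ver.get("epoch", "0")), parts(ver.get("ver")), parts(ver.get("rel")))

def latest_pkgids(xml_text):
    newest = {}  # name -> (evr_key, pkgid)
    for pkg in ET.fromstring(xml_text).findall("c:package", NS):
        name = pkg.find("c:name", NS).text
        pkgid = pkg.find("c:checksum", NS).text
        key = naive_evr_key(pkg)
        if name not in newest or key > newest[name][0]:
            newest[name] = (key, pkgid)
    return {pkgid for _, pkgid in newest.values()}

print(sorted(latest_pkgids(PRIMARY_XML)))  # → ['bbb', 'ccc']
```

Keeping it to the top N versions per name (for the N-3 idea above) would just mean retaining a small sorted list per name instead of a single winner.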
Upsides

- Large disk, time, and network bandwidth savings for `immediate` mode syncs with the `additive` sync policy (mirroring is not compatible with `retain_package_versions`)
- Smaller savings for `on_demand` + `additive` syncs due to not needing to process as many packages - a 5 minute sync might take 1 minute instead.

Downsides
- I think you'd have to process the `primary.xml` metadata once to get the list of `pkgids` that you want to keep before proceeding with the sync process. This is easy and cheap time-wise, but adds a little bit of complication to the code.
- Users who use mirroring cannot benefit from this, at least not without a more complicated implementation (like downloading N packages `immediate` and the rest on-demand).

Alternative implementation
If the artifact downloads didn't start until after the repository version was created, this wouldn't be necessary. The single existing mechanism would handle everything (although you wouldn't get any `on_demand` sync improvements).

Other benefits:
Downside:
Artifact download has to wait a while - think of the python or container plugins; it might be hours before the downloads start.
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2109260