
image-pruner: prune images in their own jobs #19468

Merged
merged 3 commits into openshift:master on Jun 22, 2018

Conversation


@miminar miminar commented Apr 23, 2018

Instead of pruning in phases:

all streams -> all layers -> all blobs -> manifests -> images

Prune individual images in parallel jobs:

all streams -> parallel [
   image1's layers -> image1's blobs -> ... -> image1,
   image2's layers -> image2's blobs -> ... -> image2,
   ...
]

A failure in the streams prune phase is no longer fatal.

Resolves: rhbz#1567657

Additionally, manifest blobs previously weren't removed from the blob store. This PR removes the manifests of deleted images from the blob store as well.
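For a quick feel of the model, below is a minimal, self-contained Go sketch of the job-per-image worker pool described above. Job and JobResult loosely mirror the types added in this PR; the worker logic is a stand-in, not the actual implementation.

package main

import (
    "fmt"
    "sync"
)

// Job asks a worker to prune everything belonging to a single image.
type Job struct{ ImageName string }

// JobResult reports what happened while pruning that image.
type JobResult struct {
    Job      Job
    Failures []error
}

// worker stands in for the real pruning worker in this PR.
type worker struct{ id int }

// prune would remove the image's layers, blobs and manifest, then the image
// object itself; here it only reports success.
func (w *worker) prune(job Job) *JobResult {
    return &JobResult{Job: job}
}

func main() {
    images := []string{"image1", "image2", "image3"}

    in := make(chan Job)
    out := make(chan JobResult)

    // A few parallel workers, each pruning whole images one at a time,
    // instead of the old global layers -> blobs -> manifests -> images pass.
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        w := &worker{id: i}
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range in {
                out <- *w.prune(job)
            }
        }()
    }

    // One job per image; closing the channel tells the workers to stop.
    go func() {
        for _, name := range images {
            in <- Job{ImageName: name}
        }
        close(in)
    }()

    // Close the result channel once every worker has exited.
    go func() {
        wg.Wait()
        close(out)
    }()

    for res := range out {
        fmt.Printf("pruned %s (failures: %d)\n", res.Job.ImageName, len(res.Failures))
    }
}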

Remarks:

  • Parallel execution makes the code a bit more complicated.
  • To keep this legible and reviewable, the parallelism, together with the handling of blocked images, could be moved to a follow-up PR.
  • This doesn't aim to make image pruning 100% safe; it just attempts to make it error tolerant while maintaining as much consistency as possible.

TODOs:

  • pretty-print the result summary
  • avoid potential races caused by shared API clients by instantiating them separately for each worker
  • resolve races in tests where counters are shared between goroutines
  • detect changes to image streams and update the graph accordingly on the fly
  • detect image creations/deletions and update the graph accordingly

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 23, 2018
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 23, 2018
@miminar
Author

miminar commented Apr 23, 2018

Publishing this in an early state to gather some feedback.

@dmage, @legionus, @bparees, @coreydaley PTAL

@openshift openshift deleted a comment from michojel Apr 23, 2018
@legionus
Contributor

A dynamically updated graph will not help to achieve greater consistency. You have made the code very complicated, and it will speed up the work of the pruner, but it does not help with maintaining the integrity of the database. Removing an image is not atomic, and in case of an error you will break the database.

For that reason, I believe this approach is not effective. I would suggest a different approach to pruning: mark-and-sweep. In the first iteration we mark objects with an annotation holding the date of the planned deletion (for example, three days out). In the next iteration, we delete the objects whose mark is still present. In this case, the graph is not updated dynamically.
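For context, here is a rough Go sketch of the mark-and-sweep idea proposed above. It was not implemented in this PR; the annotation key, the Image type, and the helpers are all hypothetical.

package prunesketch

import "time"

// Image is a minimal stand-in for the real image API object.
type Image struct {
    Name        string
    Annotations map[string]string
    Referenced  bool // whether anything still references the image
}

// plannedDeletionAnnotation is a made-up annotation key for this sketch.
const plannedDeletionAnnotation = "images.openshift.io/planned-deletion"

// retention is the example "3 days" grace period mentioned above.
const retention = 3 * 24 * time.Hour

// markPhase annotates unreferenced images with a planned deletion date and
// clears the mark from anything that became referenced again.
func markPhase(images []*Image, now time.Time) {
    for _, img := range images {
        if img.Annotations == nil {
            img.Annotations = map[string]string{}
        }
        if img.Referenced {
            delete(img.Annotations, plannedDeletionAnnotation)
            continue
        }
        if _, marked := img.Annotations[plannedDeletionAnnotation]; !marked {
            img.Annotations[plannedDeletionAnnotation] = now.Add(retention).Format(time.RFC3339)
        }
    }
}

// sweepPhase returns the images whose mark is still present and has expired;
// a later step would delete them without consulting a live graph.
func sweepPhase(images []*Image, now time.Time) []*Image {
    var expired []*Image
    for _, img := range images {
        deadline, err := time.Parse(time.RFC3339, img.Annotations[plannedDeletionAnnotation])
        if err != nil {
            continue // never marked, or the mark was cleared
        }
        if now.After(deadline) {
            expired = append(expired, img)
        }
    }
    return expired
}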

return
}
out <- *w.prune(job)
}
Contributor

for job := range in {
  out <- *w.prune(job)
}

In this case you don't need nil sentinels; you just close the channel to stop processing.

Author

@miminar miminar Apr 23, 2018

Good idea, I'll check that out.

Update: rewritten


return true
select {
Contributor

Why do you need select there?

Author

It's supposed to handle events from the workers, the image stream listener, or the image listener (the listeners aren't there yet).


// UnreferencedImageComponentEdgeKind is an edge from an ImageNode to an ImageComponentNode denoting that
// the component is currently being unreferenced in a running job.
UnreferencedImageComponentEdgeKind = "UnreferencedImageComponentToDelete"
Contributor

This feels very artificial. Why do we need these "negative" references?

Can we structure the pruner in such a way that we remove only nodes which don't have any references?

Author

Yeah, I see now that it makes the algorithm more complex without adding value. I wanted to somehow track the blobs being deleted (by the jobs running right now) inside the graph, but it just adds inefficiency and complexity. The tracking can easily be done outside of the graph.
If it weren't for the parallelism, the tracking wouldn't be necessary at all.

Can we structure the pruner in such a way that we remove only nodes which don't have any references?

To answer that, we need to first answer https://github.com/openshift/origin/pull/19468/files/99829b95497e6c39d0bdafc4fa00b6f017e23a6e#r183420937. If we continue to stick with the current behaviour of keeping the image if any error occurs for its components, then the unreferencing won't happen until the objects are deleted.

But I agree it would be more natural and it would simplify the algorithm.
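A minimal sketch of what tracking in-flight deletions outside the graph could look like: a mutex-guarded set of component IDs claimed by running jobs. The type and method names below are made up for illustration.

package prunesketch

import "sync"

// inFlightComponents tracks blobs/layers that some running job is already
// deleting, so other jobs don't schedule them again. This lives next to the
// graph rather than as "negative" edges inside it.
type inFlightComponents struct {
    mu  sync.Mutex
    set map[string]struct{}
}

func newInFlightComponents() *inFlightComponents {
    return &inFlightComponents{set: make(map[string]struct{})}
}

// claim marks the component as being deleted and reports whether the caller
// won the claim (true) or someone else is already deleting it (false).
func (c *inFlightComponents) claim(id string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    if _, busy := c.set[id]; busy {
        return false
    }
    c.set[id] = struct{}{}
    return true
}

// release forgets the component once its deletion attempt finished.
func (c *inFlightComponents) release(id string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.set, id)
}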

Contributor

+1 on tracking this elsewhere.

Author

Reworked and simplified a bit.

return
}

func strenghtenReferencesFromFailedImageStreams(g genericgraph.Graph, failures []Failure) {
Contributor

s/strenghten/strengthen/

Author

You're not going to believe it, but my offline dictionary contains both 😄. I can't find it online, though, so I'm going to trust you.

Author

Fixed

p.g.AddEdge(imageStreamNode, s, ReferencedImageManifestEdgeKind)
break
default:
panic(fmt.Sprintf("unhandeled image component type %q", cn.Type))
Contributor

Can we return this error?

Author

Well, I don't expect this to ever fire. But sure.

Contributor

Also, it should be "unhandled".

But yeah, I don't think we ever want to panic, even in the CLI.
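For illustration, roughly what returning the error instead of panicking amounts to; the component-type constants and edge-kind strings below are stand-ins, not the pruner's real identifiers.

package prunesketch

import "fmt"

// componentType is a stand-in for the image component type enum used by the
// pruner's graph nodes.
type componentType string

const (
    componentLayer    componentType = "layer"
    componentConfig   componentType = "config"
    componentManifest componentType = "manifest"
)

// edgeKindForComponent mirrors the shape of the switch under discussion:
// every known type maps to an edge kind, and an unknown type is returned to
// the caller as an error instead of panicking.
func edgeKindForComponent(t componentType) (string, error) {
    switch t {
    case componentLayer:
        return "ReferencedImageLayer", nil
    case componentConfig:
        return "ReferencedImageConfig", nil
    case componentManifest:
        return "ReferencedImageManifest", nil
    default:
        return "", fmt.Errorf("unhandled image component type %q", t)
    }
}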

Author

Fixed.

res := &JobResult{Job: job}

// If namespace is specified prune only ImageStreams and nothing more.
if len(w.algorithm.namespace) > 0 {
Contributor

When may this condition become true?

Author

Good catch, the statement was misplaced. It's removed now.

@michojel

@legionus I don't see how the mark-and-sweep approach makes this safer or simpler. What prevents users from making new references to marked images during the sweep phase?
I know we discussed this before and could not agree on anything 100% bulletproof, even when using locks in etcd. And introducing a locking mechanism for the sake of safe image pruning would make the subject even more complex.

The aim of this PR is not to make the pruning safer. The purpose is to tolerate errors that currently prevent customers from pruning anything, by pruning iteratively, image after image, which brings a rough per-image atomicity to the process.

I don't think this change prevents us from making the pruning safer in the future. It does not prevent us from implementing mark-and-sweep, or etcd or storage locking. All it does is introduce job-based pruning while building on top of what we already have.

@@ -333,6 +356,7 @@ func (p *pruner) addImagesToGraph(images *imageapi.ImageList) []error {
//
// addImageStreamsToGraph also adds references from each stream to all the
// layers it references (via each image a stream references).
// TODO: identify streams with non-existing images for later cleanup
func (p *pruner) addImageStreamsToGraph(streams *imageapi.ImageStreamList, limits map[string][]*kapi.LimitRange) []error {
Contributor

This function always returns nil. In case of an error, it panics.

Author

Good catch, I hadn't even realized that. I lean towards panics for errors aimed not at the end user but at a developer (saying: your code is buggy, go fix it), such as this one. But I don't feel that strongly about it, so I'll look into returning it as an error.

Author

Error reported now without a panic.

case *imagegraph.ImageStreamNode:
// ignore
default:
panic(fmt.Sprintf("unhandeled graph node %t", d.Node))
Contributor

"it just attempts to make it error tolerant"? :)

Author

I can return the error, for sure. This was meant more as a debug statement until all the types were handled.

Author

Not panicking any more.


if len(res.Failures) > 0 {
// TODO: include image as a failure as well for the sake of summary's completness
return res
Contributor

If w.algorithm.pruneRegistry == true, then at this point you will end up with a broken image: the image will not have its blobs, but the registry will assume those objects are there. This is a very bad condition.

It would be more correct to delete the image first and then, if there are no errors, try to delete the objects.

Author

True. I'm keeping the current behaviour that preserves image objects for future prunes so that blobs that failed to be pruned are recognized on subsequent prunes. If we remove the image, we lose the option to prune its blobs using this kind of pruner.
But since we have a hard-pruner now, we could do the removal. I just hesitate to make it the default behaviour. Maybe a flag like --keep-broken-images could be useful here. WDYT?

Author

A scenario speaking against deleting the image regardless of prior failure: the registry loses write permission to the storage and the deletion of every blob fails (this has already happened to some customers). In this case all images are deleted but all the blobs remain on the storage.
Is it worse than broken images? Hard to tell. It depends on what the customer wants to achieve: either they're running out of storage space, or etcd is too big and slow, or both.
Therefore I'm more inclined to have a flag like --keep-broken-images-on-failure defaulting to true (the current behaviour) so that the customer can choose what they prefer.

@bparees thoughts?

Contributor

As you note, hard-prune can be used to resolve that scenario (blobs being left in storage). I don't think we need to introduce new flags for something that I would hope is not a typical scenario.

Author

I'm just afraid that hard-prune is not a very popular solution to this problem, and I am reluctant to make it a mandatory follow-up routine.

Contributor

It won't be mandatory unless there are other issues in their registry. If we can't delete layers due to storage issues, they're going to have to fix their storage issues anyway... running hard prune again after doing so seems like a small additional burden.

Is there some other case where you think it would be better to leave the image data in place if we cannot remove the blob data?

Author

Is there some other case where you think it would be better to leave the image data in place if we cannot remove the blob data?

Just a few similar scenarios where the user specifically wants to prune storage: connection issues to the registry, or the registry not recognizing the user as authorized to do the prune.

What about keeping the image only if there were no successful blob deletions? That way broken images get pruned, but healthy images can either be reused or pruned next time.

Contributor

In this case all images are deleted but all the blobs remain on the storage.
Is it worse than broken images?

@miminar No, broken images are much worse. We have a few ways to remove blobs completely, or even to restore images in etcd, but we do not have tools to restore blobs in the storage to fix images. We only have a diagnostic tool which can help find such broken images. The user has to remove the image from etcd and re-push it by hand to fix it. Until they do, those images stay broken and pushes and pulls will fail. It seems to me that this is a much worse scenario, because the cluster goes into an inconsistent state and the fix requires manual intervention.

Author

What about keeping the image only if there were no successful blob deletions?

Went for this one. Please let me know if you have any objections.
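A small sketch of the behaviour settled on here, with made-up parameter names: the image object is kept for a retry only when none of its blobs were actually deleted.

package prunesketch

// keepImageForRetry sketches the agreed behaviour: keep the image object for
// a future prune only when none of its blobs were actually deleted; once any
// blob is gone, the image is pruned as well so it is not left half-referenced.
// The parameters are illustrative, not the PR's exact API.
func keepImageForRetry(blobFailures, blobSuccesses int) bool {
    // All blob deletions failed: nothing in the registry was touched, so the
    // image is still intact and a later prune can retry it.
    if blobFailures > 0 && blobSuccesses == 0 {
        return true
    }
    // Everything succeeded, or the image is already partially pruned; in both
    // cases the image object should be deleted too.
    return false
}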

return
}

func strenghtenReferencesFromFailedImageStreams(g genericgraph.Graph, failures []Failure) {
Contributor

needs godoc

Author

Godoc'd

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2018
@miminar miminar closed this Apr 26, 2018
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 26, 2018
@miminar
Author

miminar commented Apr 26, 2018

Rebased and added imagestream event handling.

@miminar miminar reopened this Apr 26, 2018
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2018
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 26, 2018
@miminar
Author

miminar commented Apr 26, 2018

Added image event handling.

@miminar
Author

miminar commented May 11, 2018

/retest

@miminar
Author

miminar commented May 11, 2018

Ready for review.

g.internal.SetEdge(t, 1.0)
g.internal.SetEdge(t)
}
case simple.Edge:
Contributor

It seems that before these changes this function was able to handle only genericgraph.Edge. Is this case ever executed?

Author

concrete.WeightedEdge used to be handled before. simple.Edge is its equivalent. But no, we don't use upstream edges in our code directly. Nothing prevents us from doing that in the future though.

resultChan <-chan JobResult,
) (deletions []Deletion, failures []Failure) {
imgUpdateChan := p.imageWatcher.ResultChan()
isUpdateChan := p.imageStreamWatcher.ResultChan()
Contributor

What's the purpose of adding these watches? Trying to handle the case where a reference to a layer is added to an image or imagestream while we're in the middle of pruning? It seems unrelated to the fundamental purpose of the PR (to parallelize the image pruning operations).

Author

What's the purpose of adding these watches? Trying to handle the case where a reference to a layer is added to an image or imagestream while we're in the middle of pruning?

exactly

It seems unrelated to the fundamental purpose of the PR

Yes, it's unrelated, but the way the code is structured now, it's easy to add that safety mechanism. From past experience we know that pruning can take hours to complete, and that's far too big a window for changes and inconsistencies to happen.
For the sake of simplicity (or rather, less complexity), I can extract that and move it to a follow-up if desired.
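For readers skimming the thread, a condensed sketch of the event loop with the added watches; the channel roles follow the quoted hunk above, while the types and handling are simplified stand-ins.

package prunesketch

// jobResult and event are minimal stand-ins for the real result and watch
// event types.
type jobResult struct{ failures []error }
type event struct{ kind, name string }

// runLoop sketches the shape of the pruner's event loop: job results drive
// progress, while image and imagestream watch events let the pruner update
// its graph on the fly during a prune that may run for hours.
func runLoop(resultChan <-chan jobResult, imgUpdateChan, isUpdateChan <-chan event) []jobResult {
    var results []jobResult
    for {
        select {
        case res, ok := <-resultChan:
            if !ok {
                // All workers finished; the prune is done.
                return results
            }
            // Real code would record deletions/failures and schedule the
            // next job here.
            results = append(results, res)
        case <-imgUpdateChan:
            // An image was created, updated or deleted mid-prune: refresh
            // the graph so new references are respected.
        case <-isUpdateChan:
            // Same for image stream changes.
        }
    }
}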

Contributor

It's fine, I just wanted to understand why it was being done here. Thanks.

imagegraph "github.com/openshift/origin/pkg/oc/graph/imagegraph/nodes"
)

// ComponentRetention knows all the places where image componenet needs to be pruned (e.g. global blob store
Contributor

s/componenet/component/

Author

fixed

@bparees
Contributor

bparees commented May 15, 2018

/approve
/hold
(hold until 3.11)

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 15, 2018
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2018
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2018
@miminar
Author

miminar commented May 30, 2018

Rebased.

/test extended_image_registry

@bparees
Contributor

bparees commented Jun 19, 2018

/hold cancel
/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jun 19, 2018
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

Michal Minář added 3 commits June 20, 2018 09:47
Bumped to the kube level.

Signed-off-by: Michal Minář <miminar@redhat.com>
Signed-off-by: Michal Minář <miminar@redhat.com>
Instead of pruning in phases:

all streams -> all layers -> all blobs -> manifests -> images

Prune individual images in parallel jobs:

all streams -> parallel [
   image1's layers -> image1's blobs -> ... -> image1,
   image2's layers -> image2's blobs -> ... -> image2,
   ...
]

A failure in streams prune phase is not fatal anymore.

Signed-off-by: Michal Minář <miminar@redhat.com>
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 20, 2018
@miminar
Author

miminar commented Jun 20, 2018

Rebased

@bparees
Contributor

bparees commented Jun 20, 2018

@deads2k any concerns w/ the bump commit here?

@bparees
Contributor

bparees commented Jun 22, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, miminar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit dabec00 into openshift:master Jun 22, 2018