-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplicate layer IDs to handle cases where duplicate layers are in the layer tree #1172
Deduplicate layer IDs to handle cases where duplicate layers are in the layer tree #1172
Conversation
I'm okay with this implementation, but also want to point out that we could also remove the empty layers before we build the maps/associations (ie before we iterate over the layers). Though, with that type of implementation Im not sure what the cutoff point for It would be nice to know others thoughts on various approaches as well. |
+1 @acornett21 I briefly tested skipping empty layers based on size - and just ended up at "how do I confidently determine that the layer is truly empty" aside from looking at content. The smallest value I saw returned from |
This is where I landed as well, I thought I was missing something as well with the size calculation. The one advantage of this I see, is it could speed up our processing by spiking some layers. But if don't have confidence on what the min size should be, they the saying cycles point is probably moot. |
from change #1172: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
I hope this will get reviewed soon - have a partner waiting on the new Preflight version with this fix... |
@jfrancin Yep, sorry for the delay - I've been trying to replicate the failing image with little success, and finally cracked it just a bit ago. Whipping up a test case, and we should be set. |
@jfrancin we are well aware of the urgency, like mentioned we've been trying to understand how the partner built an image with multiple of the same layer (and empty ones at that), and come with with a repeatable test so we do not have an regressions. Figuring out how they got an image in this state, has been extremely difficult to replicate. |
2ef1122
to
8477aa7
Compare
8477aa7
to
542630e
Compare
@acornett21 @bcrochet A note - This PR reconfigures HasModifiedFiles such that the packageDist value is extracted in a separate function from the package file map building process. This simplifies testing of the file mapping process by allowing us to build an As a result, several return signatures have changed, dropping the returned packageDist. Also note that the test written here uses a static fixture. The containerfile used to build it is included - but using a static fixture saves us storing the image in-repo (~110m). It's been documented as a TODO item to optimize at a later date. |
542630e
to
64272ca
Compare
…he layer tree Signed-off-by: Jose R. Gonzalez <komish@flutes.dev>
64272ca
to
517a88b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of comments, nothing blocking.
from change #1172: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: acornett21, bcrochet, komish The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
In certain edge cases, a container builder may produce multiple "empty" layers that may have the same digest value. When parsing those layers, we fail to find an RPM database (which is expected) and therefore carry the previous layer's package list forward.
We run into issues when layers with the same ID and no modified files (e.g. "empty" layers) are separated by layers WITH package changes, such that when we carry forward file lists from previous layers, we would be impacting an earlier entry's file list because we use a map where the key is the layer ID. Concrete example below[1].
This PR changes our mapping to glue together a layer's ID and its relative position in the image, which prevents overwriting previous layer values and fixes a bug in how we detect modified files layer-over-layer.
It also adds logging to map DiffID, LayerID, and the deduped/calculated value in trace logs to assist with debugging.
[1]
Assume we have 4 layers with these modified files.
We parse layer 0 and see it doesn't have any files (nor RPMDB), so we store a mapping of that layer's ID ("sha256:1111")to a list of packages that might be associated (which will be empty in this example because the previous, omitted layer is the base layer.
As soon as layer 1 (first layer to contain an RPMDB) is evaluated, that becomes our source of truth. We store the package lists associated with it and use those to inform the "do not modify" list. In our mapping, the package information is mapped to this layer's ID as the map key ("sha256:2222").
When layer 2 is parsed and found to be empty, we map its package list to the previous layer (1) package list. Because the layer ID of layer 2 matches that of layer 0, we've now inadvertently populated layer 0's package list. This is the problem.
Later, when we validate each of these layers, we detect that layer 0 has packages (the same as layer 1), and treat that as our source of truth. Layer 1, then, is considered to have modified all of the same files as layer 0 - which fails the check.
To solve this, I bind the layer ID with its index position (e.g. "00-sha256:1111"). It makes things a little bit heavier, but barring refactoring to remove the map AND array usage and/or change the way this works - this seemed simpler and effective enough.