Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash copied files #2055

Merged
merged 13 commits into from
Nov 27, 2019
Merged

Hash copied files #2055

merged 13 commits into from
Nov 27, 2019

Conversation

bsideup
Copy link
Member

@bsideup bsideup commented Nov 9, 2019

depends on #2051

# Conflicts:
#	core/src/test/java/org/testcontainers/containers/ReusabilityUnitTests.java
@bsideup bsideup requested a review from rnorth November 10, 2019 20:19
checksum.update(MountableFile.getUnixFileMode(file.toPath()));
if (file.isDirectory()) {
try (Stream<Path> stream = Files.walk(file.toPath())) {
stream.filter(it -> !Files.isDirectory(it)).forEach(path -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the file mode included in the checksum at line 506 which is cool...

But here, when we walk the contents, are we only walking sub_files_ (i.e. direct child files and child files of subdirectories)?

If that's the case then we'd not capture the file mode of subdirectories - is that the correct interpretation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Fixed.

try {
return (int) Files.getAttribute(path, "unix:mode");
return (int) Files.readAttributes(path, "unix:mode").get("mode");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious as to why we'd need to use readAttributes here - we're still only fetching the mode, so getAttribute perhaps doesn't really have a performance disadvantage.

However, now this makes me wonder if we should also be hashing basic file attributes like created/modified timestamps etc 😭

WDYT? I don't mind deferring this, TBH...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we checksum content & file mode (the only bits that actually go into the tar archive), I guess we can defer the basic file attributes until someone reports that the hashing works incorrectly for them.

I will also check what Dockerfile builder is using for hashing COPY

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, sounds sensible 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://stackoverflow.com/a/59073724/1826422

For the ADD and COPY instructions, the contents of the file(s) in the image are examined and a checksum is calculated for each file. The last-modified and last-accessed times of the file(s) are not considered in these checksums. During the cache lookup, the checksum is compared against the checksum in the existing images. If anything has changed in the file(s), such as the contents and metadata, then the cache is invalidated.

okay, looks pretty aligned with what we're doing

@bsideup bsideup merged commit 3cb3379 into master Nov 27, 2019
@delete-merged-branch delete-merged-branch bot deleted the copied_files_hash branch November 27, 2019 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants