Add checksum cache #102

Merged: allenh1 merged 14 commits into master from add-checksum-cache on Dec 22, 2017.
Conversation

@allenh1 (Contributor) commented on Dec 19, 2017:

This is an experiment to speed up the generation of OE recipes.

@tfoote (Member) left a review comment:

I have a few comments.

Does computing the hashes really become the bottleneck for the builds? And to confirm: does the archive name fully bake in the version, etc.?

        md5_cache = pickle.load(md5_file)
        md5_file.close()
    except IOError:
        md5_cache = dict()
@tfoote (Member):

Should this be None so it's the same as if the option wasn't passed?

@allenh1 (Contributor, Author):

I think my line of thought there was: they want to use a cache, so we initialize a dictionary to store that option, and we can check later that it's not None.

Sorry, this branch was far from ready for review... I was just opening this so you could see it.

        sha256_cache = pickle.load(sha256_file)
        sha256_file.close()
    except IOError:
        sha256_cache = dict()
@tfoote (Member):

Same None here as above?

        md5_cache = dict()
    try:
        sha256_file = open('%s/md5_cache.pickle', 'rb')
        sha256_cache = pickle.load(sha256_file)
@tfoote (Member):

Are there matching write/dump calls to update the cache with the new hashes somewhere?

@allenh1 (Contributor, Author):

Not yet -- this is still a WIP. Still deciding where to put those.
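
For context, the matching write-back could look something like this (a minimal sketch; where exactly it runs is the open question above, and tar_dir and the cache dicts are taken from the surrounding quoted code):

```python
import pickle

# Persist the updated caches next to the tar archives so the next
# run can reuse the already-computed checksums.
with open('%s/md5_cache.pickle' % tar_dir, 'wb') as md5_file:
    pickle.dump(md5_cache, md5_file)
with open('%s/sha256_cache.pickle' % tar_dir, 'wb') as sha256_file:
    pickle.dump(sha256_cache, sha256_file)
```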

@allenh1 (Contributor, Author) commented on Dec 19, 2017:

> Does computing the hashes really become the bottleneck for the builds?

I'm not sure. OE recipe generation is significantly slower than ebuild generation, so I'm trying to figure out the source of the slowdown. This is really the only extra step that isn't in the ebuild generator, so I'm trying to speed things up here. It could be that this does nothing.

But this also gives us the ability to remove the tar archives (since only the checksums are needed).

@allenh1 (Contributor, Author) commented on Dec 21, 2017:

@tfoote Ok, this is the full changeset. Let me know what you think. I'm going to do a benchmark to see if this helps at all.

This holds the full path for the file, including the distro and version.

@allenh1 (Contributor, Author) commented on Dec 21, 2017:

Ok, here's the benchmark.

Without cache:

  • real: 9m 3s
  • user: 17.7s
  • sys: 3.304s

With cache:

  • real: 6m 37s
  • user: 15.5s
  • sys: 2.640s

That's somewhat significant, especially in the real time. This also has the added advantage that we can remove the tars after the pickle cache is generated.

@tfoote (Member) left a review comment:

Sorry, this ended up with a lot of suggestions: I've suggested a new approach with a HashCache class to allow code reuse, increase readability, and decrease branching.

Let me know if my suggestion is clear enough.

sha256_cache = None
if args.tar_archive_dir:
    try:
        md5_file = open('%s/md5_cache.pickle' % args.tar_archive_dir, 'rb')
@tfoote (Member):

I recommend using a context manager here to open and close the file automatically.

https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/
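
A minimal sketch of that recommendation, applied to the load above (same path and fallback as the quoted code):

```python
import pickle

md5_cache = dict()
if args.tar_archive_dir:
    try:
        # The with-block closes the file even if pickle.load() raises.
        with open('%s/md5_cache.pickle' % args.tar_archive_dir, 'rb') as md5_file:
            md5_cache = pickle.load(md5_file)
    except IOError:
        md5_cache = dict()
```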

    except IOError:
        md5_cache = dict()
    try:
        sha256_file = open(
@tfoote (Member):

context manager here too

self.downloadArchive()
md5_cache[self.getArchiveName()] = hashlib.md5(
    open(self.getArchiveName(), 'rb').read()).hexdigest()
md5_file = open('%s/md5_cache.pickle' % tar_dir, 'wb')
@tfoote (Member):

context manager here too

if self.getArchiveName() not in sha256_cache:
    sha256_cache[self.getArchiveName()] = hashlib.sha256(
        open(self.getArchiveName(), 'rb').read()).hexdigest()
    sha256_file = open('%s/sha256_cache.pickle' % tar_dir, 'wb')
@tfoote (Member):

context manager here too

    open(self.getArchiveName(), 'rb').read()).hexdigest()
if md5_cache is not None:
    if self.getArchiveName() not in md5_cache:
        self.downloadArchive()
@tfoote (Member):

Ahh, this is likely the performance bottleneck. It has to download the archive before computing the hash.

    open(self.getArchiveName(), 'rb').read()).hexdigest()
if md5_cache is not None:
    if self.getArchiveName() not in md5_cache:
        self.downloadArchive()
@tfoote (Member):

Also, you download the archive in both the if and the else branch, but then don't download it in the sha256 path since it's already there.

Also, if you load the cache at the higher level, you should write it out again at the higher level instead of each time you process a new package: just update the internal (md5)_cache object for each computation, and write the new cache values out on exit.

It might be worth making a dedicated class that loads, holds, and computes the hashes conditionally. It can be parameterized on the hash function. Then you just call md5_cache.get_hash(filename) and it will return the stored hash, or else compute it. If md5_cache is a context manager, you can pass it the cache filename and hash function, and it will load the file going in and dump the cache on the way out automatically.
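
A sketch of that idea (the HashCache name, constructor parameters, and get_hash method are illustrative, not the final implementation):

```python
import hashlib
import pickle


class HashCache(object):
    """Pickle-backed hash cache, parameterized on the hash function
    (e.g. hashlib.md5 or hashlib.sha256)."""

    def __init__(self, cache_file, hash_fn):
        self.cache_file = cache_file
        self.hash_fn = hash_fn
        self.cache = dict()

    def __enter__(self):
        # Load previously computed hashes; start empty if no cache exists.
        try:
            with open(self.cache_file, 'rb') as f:
                self.cache = pickle.load(f)
        except IOError:
            self.cache = dict()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Dump the (possibly updated) cache on the way out.
        with open(self.cache_file, 'wb') as f:
            pickle.dump(self.cache, f)

    def get_hash(self, filename):
        # Return the stored hash, or compute and remember it.
        if filename not in self.cache:
            with open(filename, 'rb') as f:
                self.cache[filename] = self.hash_fn(f.read()).hexdigest()
        return self.cache[filename]
```

Usage would then be, e.g., `with HashCache('md5_cache.pickle', hashlib.md5) as md5_cache: checksum = md5_cache.get_hash(archive_name)`.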

@allenh1 (Contributor, Author):

> Also, if you load the cache at the higher level, you should write it out again at the higher level instead of each time you process a new package: just update the internal (md5)_cache object for each computation, and write the new cache values out on exit.

That definitely sounds like a more pythonic way to do this. I'll get right on that.

@allenh1 (Contributor, Author) commented on Dec 21, 2017:

> Sorry, this ended up with a lot of suggestions: I've suggested a new approach with a HashCache class to allow code reuse, increase readability, and decrease branching.

@tfoote Upon implementing this, it turned into the CacheManager class (which is just a bit more generic).
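
For illustration, a more generic manager might be used roughly like this (a hypothetical sketch; the actual CacheManager API in this changeset may differ, and tar_dir/archive_name stand in for the values used in the quoted code):

```python
import hashlib

# Hypothetical usage: one generic, pickle-backed cache per checksum type,
# loaded on entry and written back on exit by the context manager.
with CacheManager('%s/md5_cache.pickle' % tar_dir) as md5_cache:
    if archive_name not in md5_cache:
        with open(archive_name, 'rb') as f:
            md5_cache[archive_name] = hashlib.md5(f.read()).hexdigest()
```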

@tfoote (Member) left a review comment:

One small change, otherwise LGTM. I'm about to get on the plane, so feel free to merge without re-review.

    open(self.getArchiveName(), 'rb').read()).hexdigest()
self.src_md5 = hashlib.md5(
    open(self.getArchiveName(), 'rb').read()).hexdigest()
if self.getArchiveName() not in md5_cache and \
@tfoote (Member):

This should be an or, in case one cache is preinitialized but the other isn't.
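
That is, the suggested fix would read (a sketch over the quoted lines):

```python
# Download when the archive is missing from either cache, so the file
# is on disk for whichever checksum still has to be computed.
if self.getArchiveName() not in md5_cache or \
        self.getArchiveName() not in sha256_cache:
    self.downloadArchive()
```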

@allenh1 (Contributor, Author):

I'll add this in #105 and merge this, since you're mid-travel.

if self.getArchiveName() not in md5_cache and \
        self.getArchiveName() not in sha256_cache:
    self.downloadArchive()
    md5_cache[self.getArchiveName()] = hashlib.md5(
@tfoote (Member):

This logic could be collapsed into the cache itself if you pass it the hashing function and a lambda (or similar) that would allow it to call self.downloadArchive. But the complexity would be similar, so this is fine.
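
For illustration, extending the HashCache sketch above, that alternative might look like this (hypothetical; not what this PR implements):

```python
    def get_hash(self, filename, fetch=None):
        # Let the cache trigger the download on a miss via a callback,
        # e.g. fetch=lambda: self.downloadArchive() at the call site.
        if filename not in self.cache:
            if fetch is not None:
                fetch()
            with open(filename, 'rb') as f:
                self.cache[filename] = self.hash_fn(f.read()).hexdigest()
        return self.cache[filename]
```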

@allenh1 merged commit 212ac2f into master on Dec 22, 2017.
@allenh1 deleted the add-checksum-cache branch on December 22, 2017 at 22:21.
zffgithub pushed a commit to zffgithub/superflore that referenced this pull request Apr 11, 2023
* Import pickle to allow the tar dictionary to be read in (to speed up recipe generation).

* Added logic to use the cached version of the hash in the yocto files.

* Wrote a CacheManager class to implement a context manager on the cache files.

* Fixed copyright line in TempFileManaager.py