Add checksum cache #102
Conversation
I have a few comments.
Does computing the hashes really become the bottleneck for the builds? And to confirm, does the archive name fully bake in the version, etc.?
superflore/generators/bitbake/run.py (outdated)

        md5_cache = pickle.load(md5_file)
        md5_file.close()
    except IOError:
        md5_cache = dict()
Should this be None so it's the same as if the option wasn't passed?
I think my line of thought there was: they want to use a cache, so we'll initialize a dictionary to store that option, so we can check later that it's not None.
Sorry, this branch was far from ready for review... I was just opening this so you could see it.
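The None-vs-dict distinction being discussed can be sketched like this (the helper name and structure are mine, not superflore's actual code):

```python
# Sketch only: None means the --tar-archive-dir option was not passed
# (caching disabled), while an empty dict means caching is enabled but
# the cache file did not exist yet. The helper name is hypothetical.
def init_cache(tar_archive_dir):
    if tar_archive_dir is None:
        return None   # option not passed: later code skips caching entirely
    return dict()     # option passed: start with an empty cache

print(init_cache(None))              # -> None
print(init_cache('/tmp/archives'))   # -> {}
```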
superflore/generators/bitbake/run.py (outdated)

        sha256_cache = pickle.load(sha256_file)
        sha256_file.close()
    except IOError:
        sha256_cache = dict()
Same None here as above?
superflore/generators/bitbake/run.py (outdated)

        md5_cache = dict()
    try:
        sha256_file = open('%s/md5_cache.pickle', 'rb')
        sha256_cache = pickle.load(sha256_file)
Are there matching write/dump calls to update the cache with the new hashes somewhere?
Not yet -- this is still a WIP. Still deciding where to put those.
I'm not sure. The OE generation is significantly slower than the ebuild generation, so I'm trying to figure out the source of the slowdown. This is really the only extra step that's not in the ebuild generator, so I'm trying to speed things up here. It could be that this does nothing, but it also gives us the ability to remove the tar archives (since only the checksums are needed).
@tfoote Ok, this is the full changeset. Let me know what you think. I'm going to do a benchmark to see if this helps at all. This holds the full path from the file, including the distro and version.
Ok, here's the benchmark. Without cache:
With cache:
That's somewhat significant, especially in the real time. This also has the added advantage that we can remove the tars after the pickle cache is generated.
Sorry, this ended up with a lot of suggestions. The new approach I'm suggesting is a HashCache class, to allow code reuse and increase readability/decrease branching.
Let me know if my suggestion is clear enough.
superflore/generators/bitbake/run.py (outdated)

    sha256_cache = None
    if args.tar_archive_dir:
        try:
            md5_file = open('%s/md5_cache.pickle' % args.tar_archive_dir, 'rb')
I recommend using a context manager here to open and close the file automatically.
https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/
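A minimal sketch of that suggestion, assuming the path format from the diff (the helper name is mine):

```python
import pickle

def load_cache(tar_archive_dir):
    # The `with` block closes the file automatically, even when
    # pickle.load raises; no explicit close() call is needed.
    try:
        with open('%s/md5_cache.pickle' % tar_archive_dir, 'rb') as md5_file:
            return pickle.load(md5_file)
    except IOError:
        return dict()   # missing or unreadable cache file: start empty
```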
superflore/generators/bitbake/run.py (outdated)

        except IOError:
            md5_cache = dict()
        try:
            sha256_file = open(
context manager here too
    self.downloadArchive()
    md5_cache[self.getArchiveName()] = hashlib.md5(
        open(self.getArchiveName(), 'rb').read()).hexdigest()
    md5_file = open('%s/md5_cache.pickle' % tar_dir, 'wb')
context manager here too
    if self.getArchiveName() not in sha256_cache:
        sha256_cache[self.getArchiveName()] = hashlib.sha256(
            open(self.getArchiveName(), 'rb').read()).hexdigest()
        sha256_file = open('%s/sha256_cache.pickle' % tar_dir, 'wb')
context manager here too
        open(self.getArchiveName(), 'rb').read()).hexdigest()
    if md5_cache is not None:
        if self.getArchiveName() not in md5_cache:
            self.downloadArchive()
Ahh, this is likely the performance bottleneck. It has to download the archive before computing the hash.
        open(self.getArchiveName(), 'rb').read()).hexdigest()
    if md5_cache is not None:
        if self.getArchiveName() not in md5_cache:
            self.downloadArchive()
Also, you download the archive in both the if and the else, but then don't download it for the sha256 case since it's already there.
Also, if you load the cache at the higher level, you should write it out again at the higher level instead of each time you process a new package; just update the internal (md5)_cache object for each computation, and write the new cache values out on exit.
It might be worth making a dedicated class that loads, holds, and computes the hashes conditionally. It can be parameterized on the hash function, and then you just call md5_cache.get_hash(filename) and it will return the stored value or compute it. If md5_cache is a context manager, you can pass it the cache filename and hash function, and it will load the file going in and dump the cache on the way out automatically.
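A sketch of what such a class could look like (all names are illustrative, not superflore's actual API; the PR later implemented a similar idea as its CacheManager class):

```python
import hashlib
import pickle

class HashCache:
    """Sketch of the suggested cache: a context manager parameterized
    on the hash function. It loads the pickle file on entry and dumps
    the (possibly updated) cache on exit."""

    def __init__(self, cache_file, hash_func):
        self.cache_file = cache_file
        self.hash_func = hash_func   # e.g. hashlib.md5 or hashlib.sha256
        self.cache = {}

    def __enter__(self):
        try:
            with open(self.cache_file, 'rb') as f:
                self.cache = pickle.load(f)
        except IOError:
            self.cache = {}          # no cache file yet: start empty
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        with open(self.cache_file, 'wb') as f:
            pickle.dump(self.cache, f)

    def get_hash(self, filename):
        # Query the cache first; compute (and store) only on a miss.
        if filename not in self.cache:
            with open(filename, 'rb') as f:
                self.cache[filename] = self.hash_func(f.read()).hexdigest()
        return self.cache[filename]
```

Usage would then collapse the branching at the call site:

```python
with HashCache('md5_cache.pickle', hashlib.md5) as md5_cache:
    digest = md5_cache.get_hash(archive_name)
```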
> Also if you load the cache at the higher level you should write it out again at the higher level instead of each time you process a new package and just update the internal (md5)_cache object for each computation. And write the new cache values out on exit.
That definitely sounds like a more pythonic way to do this. I'll get right on that.
@tfoote Upon implementing, this turned into the
One small change, otherwise LGTM. I'm about to get on the plane, so feel free to merge without re-review.
        open(self.getArchiveName(), 'rb').read()).hexdigest()
    self.src_md5 = hashlib.md5(
        open(self.getArchiveName(), 'rb').read()).hexdigest()
    if self.getArchiveName() not in md5_cache and \
This should be an or, in case one or the other is preinitialized but not the other.
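A minimal illustration of why `or` is needed here (names and helper are hypothetical, standing in for the archive-download condition in the diff):

```python
# With `and`, an archive present in only one cache would never be
# downloaded, and the other hash could not be computed. With `or`,
# any missing entry triggers the download.
def needs_download(name, md5_cache, sha256_cache):
    return name not in md5_cache or name not in sha256_cache

# Present in both caches: no download needed.
assert not needs_download('pkg.tar', {'pkg.tar': 'x'}, {'pkg.tar': 'y'})
# Present in one cache only: must still download so the missing
# hash can be computed.
assert needs_download('pkg.tar', {'pkg.tar': 'x'}, {})
```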
I'll add this in #105 and merge this, since you're mid-travel.
    if self.getArchiveName() not in md5_cache and \
            self.getArchiveName() not in sha256_cache:
        self.downloadArchive()
        md5_cache[self.getArchiveName()] = hashlib.md5(
This logic could be collapsed into the cache itself if you pass it the hashing function and a lambda or something that would allow it to call self.downloadArchive. But the complexity would be similar so this is fine.
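The collapsed design described there might look roughly like this (all names hypothetical; `fetch` stands in for a lambda wrapping self.downloadArchive):

```python
import hashlib

class AutoFetchHashCache:
    """Hypothetical variant: the cache owns both the hash function and
    a fetch callback, so callers never branch on cache misses."""

    def __init__(self, hash_func, fetch):
        self.hash_func = hash_func  # e.g. hashlib.md5
        self.fetch = fetch          # called to materialize a missing file
        self.cache = {}

    def get_hash(self, path):
        if path not in self.cache:
            self.fetch(path)        # e.g. lambda p: self.downloadArchive()
            with open(path, 'rb') as f:
                self.cache[path] = self.hash_func(f.read()).hexdigest()
        return self.cache[path]
```

As the review notes, the overall complexity is similar either way; this just moves the download decision inside the cache.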
* Import pickle to allow the tar dictionary to be read in (to speed up recipe generation).
* Added logic to use the cached version of the hash in the yocto files.
* Wrote a CacheManager class to implement a context manager on the cache files.
* Fixed copyright line in TempFileManaager.py
This is an experiment to speed up the generation of OE recipes.