serial extract_archive to prevent unnecessary extractions #786
base: main
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##              main     #786     +/-   ##
===========================================
- Coverage   100.00%   99.94%   -0.06%
===========================================
  Files           74       74
  Lines         3419     3425     +6
  Branches       613      617     +4
===========================================
+ Hits          3419     3423     +4
- Misses           0        1     +1
- Partials         0        1     +1
View full report in Codecov by Sentry.
071e0bc to c65939f
Using this script

    from pathlib import Path
    import pymovements as pm
    pm.utils.archives.extract_archive(Path('gazebasevr.zip'))

and a predownloaded gazebase-vr archive, I executed the command time python t.py.

For this branch:

    Extracting gazebasevr.zip to .
    real 6m53,025s
    user 0m31,852s
    sys 0m8,898s

For current main:

    Extracting gazebasevr.zip to .
    real 7m12,829s
    user 0m33,634s
    sys 0m9,835s

Different workloads might affect the performance, but the loop variant should not be much slower. (In the example above, the loop variant was even faster, which I did not expect.)
161b3ec to 265687f
Great, thanks a lot! That was what I had in mind when creating #488.
We should introduce a new argument that lets the user decide whether extraction should continue where it left off or be done again for all members. Something like continue: bool = True would already be sufficient.
3bd6e20 to 75b0b34
e8eec21 to 4f4f0ff
Great, thanks a lot for your work! There are some small issues left to work out, but we're getting there
@@ -138,6 +138,7 @@ def _extract_tar(
        source_path: Path,
        destination_path: Path,
        compression: str | None,
        skip: bool = True,
I'd say that skip is a bit confusing as a name. I suggested continue, but that's a reserved keyword, so let's call it resume. This way it's already clear what the argument does without reading the docs.
    if (
        os.path.exists(os.path.join(destination_path, member)) and
        member[-4:] not in _ARCHIVE_EXTRACTORS and
        tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
This won't really check for the correct size. It's just checking whether the archive member's size is greater than zero.
You need to check that the size of the member is the same as the size of the already existing file. Otherwise a partially extracted file (i.e. the existing file is smaller than the archive member) will be skipped.
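A minimal sketch of the size comparison described above, not the PR's actual code. The helper name member_already_extracted is hypothetical. Note that tarfile.TarInfo(name) builds a new, empty header object rather than inspecting a file on disk, so the sketch takes the expected size from a member returned by TarFile.getmembers() and the actual size from os.path.getsize().

```python
import os
import tarfile


def member_already_extracted(member: tarfile.TarInfo, destination_path: str) -> bool:
    """Return True if a file of the same size already exists at the destination.

    Hypothetical helper for illustration; `member` is expected to come from
    TarFile.getmembers(), so that `member.size` is the size stored in the archive.
    """
    # Only regular files are compared; directories and links are never skipped.
    if not member.isfile():
        return False

    destination_filepath = os.path.join(destination_path, member.name)
    if not os.path.isfile(destination_filepath):
        return False

    # A partially extracted file is smaller than the archive member.
    return os.path.getsize(destination_filepath) == member.size
```

With this check, a file that was only partially written during an interrupted run has a different size than the member and is extracted again.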
        os.path.exists(os.path.join(destination_path, member)) and
        member[-4:] not in _ARCHIVE_EXTRACTORS and
        tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
        skip
It probably makes sense to check skip or resume first, because if it's False you don't need to check all the other conditions and will extract either way.
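One way the ordering could look, as a rough sketch rather than the PR's implementation. The function name extract_tar_members and its return value are assumptions made for illustration. The flag is tested before any filesystem call, so resume=False goes straight to extraction.

```python
import os
import tarfile

# Older Python versions lack extraction filters; only pass one if available.
_EXTRACT_KWARGS = {'filter': 'tar'} if hasattr(tarfile, 'tar_filter') else {}


def extract_tar_members(source_path, destination_path, resume=True):
    """Extract members one by one and return the names that were extracted.

    Hypothetical function for illustration only.
    """
    extracted = []
    with tarfile.open(source_path) as archive:
        for member in archive.getmembers():
            # Check the flag first: with resume=False nothing else is evaluated.
            if resume:
                destination_filepath = os.path.join(destination_path, member.name)
                if (
                    member.isfile()
                    and os.path.isfile(destination_filepath)
                    and os.path.getsize(destination_filepath) == member.size
                ):
                    continue  # already fully extracted, skip this member
            archive.extract(member, destination_path, **_EXTRACT_KWARGS)
            extracted.append(member.name)
    return extracted
```

Returning the list of extracted names is only a convenience for this sketch; it makes the skipping behaviour easy to observe.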
    archive.extractall(destination_path, filter='tar')
    for member in archive.getnames():
        if (
            os.path.exists(os.path.join(destination_path, member)) and
You should probably break this up into separate conditions for readability.
Create a variable like destination_filepath = os.path.join(destination_path, member) and reuse it in each condition.
    archive.extractall(destination_path)
    for member in archive.namelist():
        if (
            os.path.exists(os.path.join(destination_path, member)) and
You will need the same logic here as in _extract_tar().
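For zip archives, the equivalent check might look roughly like the sketch below. The function name extract_zip_members is hypothetical; the uncompressed size of each member is read from ZipInfo.file_size.

```python
import os
import zipfile


def extract_zip_members(source_path, destination_path, resume=True):
    """Extract zip members one by one and return the names that were extracted.

    Hypothetical function for illustration only.
    """
    extracted = []
    with zipfile.ZipFile(source_path) as archive:
        for member in archive.infolist():
            if resume and not member.is_dir():
                destination_filepath = os.path.join(destination_path, member.filename)
                # file_size is the uncompressed size stored in the archive.
                if (
                    os.path.isfile(destination_filepath)
                    and os.path.getsize(destination_filepath) == member.file_size
                ):
                    continue  # already fully extracted, skip this member
            archive.extract(member, destination_path)
            extracted.append(member.filename)
    return extracted
```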
    for member in archive.getnames():
        if (
            os.path.exists(os.path.join(destination_path, member)) and
            member[-4:] not in _ARCHIVE_EXTRACTORS and
It would probably be nice to add a short note that nested archives are extracted regardless of the value of skip / resume.
        ),
    ],
)
def test_extract_archive_destination_path_not_None_no_remove_top_level_no_remove_finished_twice(
I'm not sure if I understand what you are testing here. The test would already pass without this PR, right?
As I don't think that it's worth much effort to test this thoroughly, I'd probably just check that the calls to TarFile.extract() use the expected arguments via unittest.mock (for the second extract_archive() call).
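A rough sketch of what such a mock-based check could look like. It does not call pymovements: extract_tar_resumable is a hypothetical stand-in for the function under test, and count_extract_calls_on_second_run is an illustrative helper. The first extraction runs for real, and TarFile.extract() is patched only for the second run.

```python
import io
import os
import tarfile
import tempfile
from unittest import mock

# Older Python versions lack extraction filters; only pass one if available.
_EXTRACT_KWARGS = {'filter': 'tar'} if hasattr(tarfile, 'tar_filter') else {}


def extract_tar_resumable(source_path, destination_path, resume=True):
    """Hypothetical stand-in for the function under test."""
    with tarfile.open(source_path) as archive:
        for member in archive.getmembers():
            destination_filepath = os.path.join(destination_path, member.name)
            if (
                resume
                and os.path.isfile(destination_filepath)
                and os.path.getsize(destination_filepath) == member.size
            ):
                continue
            archive.extract(member, destination_path, **_EXTRACT_KWARGS)


def count_extract_calls_on_second_run(resume):
    """Extract once for real, then count TarFile.extract() calls on a second run."""
    with tempfile.TemporaryDirectory() as tmp:
        # Build a small archive with a single regular file.
        source_path = os.path.join(tmp, 'archive.tar')
        payload = b'some content'
        with tarfile.open(source_path, 'w') as archive:
            info = tarfile.TarInfo('file.txt')
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))

        destination_path = os.path.join(tmp, 'out')
        os.makedirs(destination_path)
        extract_tar_resumable(source_path, destination_path, resume=resume)

        # Patch only for the second call, so the files from the first run exist.
        with mock.patch.object(tarfile.TarFile, 'extract') as mock_extract:
            extract_tar_resumable(source_path, destination_path, resume=resume)
        return mock_extract.call_count
```

In a real test, the count (or mock_extract.call_args_list) would be asserted for both values of the flag: no calls when resuming, one call per member otherwise.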
resolves #488 eventually
currently, tox does not work locally for whatever reason.
TODO: