serial extract_archive to prevent unnecessary extractions #786

Open · SiQube wants to merge 4 commits into main from prevent-unnecessary-extraction
Conversation

@SiQube (Member) commented Aug 24, 2024

resolves #488 eventually

currently, tox does not work locally for whatever reason.

TODO:

  • time vs old extraction
  • coverage


codecov bot commented Aug 24, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 99.94%. Comparing base (82e87d5) to head (4f4f0ff).

Files with missing lines            Patch %   Lines
src/pymovements/utils/archives.py   80.00%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main     #786      +/-   ##
===========================================
- Coverage   100.00%   99.94%   -0.06%     
===========================================
  Files           74       74              
  Lines         3419     3425       +6     
  Branches       613      617       +4     
===========================================
+ Hits          3419     3423       +4     
- Misses           0        1       +1     
- Partials         0        1       +1     


@SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 071e0bc to c65939f on August 25, 2024 00:37
@SiQube (Member, Author) commented Aug 25, 2024

using this script

from pathlib import Path
import pymovements as pm

pm.utils.archives.extract_archive(Path('gazebasevr.zip'))

and a predownloaded GazeBaseVR dataset, I executed this command:

time python t.py

for this branch:

Extracting gazebasevr.zip to .

real	6m53,025s
user	0m31,852s
sys	0m8,898s

for current main:

Extracting gazebasevr.zip to .

real	7m12,829s
user	0m33,634s
sys	0m9,835s

Different workloads might affect the performance, but it should not be much slower. (In the example above, the loop variant was surprisingly even a bit faster.)

@SiQube marked this pull request as ready for review August 25, 2024 00:48
@SiQube added the bug (Something isn't working) and enhancement (New feature or request) labels Aug 26, 2024
@SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 161b3ec to 265687f on September 30, 2024 06:24
@dkrako (Contributor) left a comment

Great, thanks a lot! That was what I had in mind when creating #488.

We should introduce a new argument to let the user decide whether to continue a previous extraction or extract all members again. Something like continue: bool = True would already be sufficient.
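
For illustration, such a flag on the public helper could be used roughly like this; the keyword name (resume, since continue itself is a reserved keyword, as noted further down) and the exact signature are assumptions, not the merged API:

from pathlib import Path

import pymovements as pm

# Hypothetical usage sketch: the flag name and default are assumptions.
# Enabled, already extracted members would be skipped on a repeated call;
# disabled, everything is extracted again from scratch.
pm.utils.archives.extract_archive(Path('gazebasevr.zip'), resume=True)
pm.utils.archives.extract_archive(Path('gazebasevr.zip'), resume=False)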

src/pymovements/utils/archives.py (outdated, resolved)
@SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 3bd6e20 to 75b0b34 on November 17, 2024 20:08
@SiQube force-pushed the prevent-unnecessary-extraction branch from e8eec21 to 4f4f0ff on December 5, 2024 22:02
@dkrako (Contributor) left a comment

Great, thanks a lot for your work! There are some small issues left to work out, but we're getting there.

@@ -138,6 +138,7 @@ def _extract_tar(
source_path: Path,
destination_path: Path,
compression: str | None,
skip: bool = True,
@dkrako (Contributor):

I'd say that skip is a bit confusing as a name. I suggested continue, but that's a reserved keyword, so let's call it resume. This way it's already clear what the argument does without reading the docs.
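
For illustration, the renamed parameter on the tar helper could look like this; the body is elided and the signature is only a sketch based on the diff context above, not the merged code:

from __future__ import annotations

from pathlib import Path


def _extract_tar(
        source_path: Path,
        destination_path: Path,
        compression: str | None,
        resume: bool = True,
) -> None:
    # Body elided; only the signature reflects the naming suggestion.
    ...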

if (
os.path.exists(os.path.join(destination_path, member)) and
member[-4:] not in _ARCHIVE_EXTRACTORS and
tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
@dkrako (Contributor):

This won't really check for the correct size; it's just checking whether the archive member's size is greater than zero.

You need to check that the size of the member is the same as the size of the already existing file. Otherwise a partially extracted file (i.e. an existing file that is smaller than the archive member) will be skipped.
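
A rough sketch of that comparison, assuming archive is an open tarfile.TarFile and member comes from archive.getnames(); the helper name is made up for illustration:

import os
import tarfile


def _already_extracted(archive: tarfile.TarFile, member: str, destination_path: str) -> bool:
    # Compare the member's recorded size against the file already on disk,
    # so a partially extracted file is not mistaken for a finished one.
    destination_filepath = os.path.join(destination_path, member)
    if not os.path.exists(destination_filepath):
        return False
    return os.path.getsize(destination_filepath) == archive.getmember(member).size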

os.path.exists(os.path.join(destination_path, member)) and
member[-4:] not in _ARCHIVE_EXTRACTORS and
tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
skip
@dkrako (Contributor):

Probably it makes sense to check skip or resume first, because if it's False you don't need to check all the other conditions and will extract either way.

archive.extractall(destination_path, filter='tar')
for member in archive.getnames():
if (
os.path.exists(os.path.join(destination_path, member)) and
@dkrako (Contributor):

You should probably break this up into separate conditions for readability.

Create a variable like destination_filepath = os.path.join(destination_path, member).
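
Folding in the earlier suggestions (checking the flag first, naming the joined path), the per-member loop could read roughly like this; all names are illustrative rather than the merged code, and the nested-archive handling is left out:

import os
import tarfile


def _extract_tar_members(archive: tarfile.TarFile, destination_path: str, resume: bool = True) -> None:
    for member in archive.getnames():
        destination_filepath = os.path.join(destination_path, member)

        # Check the cheap flag first; only then touch the filesystem.
        if resume and os.path.exists(destination_filepath):
            if os.path.getsize(destination_filepath) == archive.getmember(member).size:
                continue  # already fully extracted, skip this member

        archive.extract(member, destination_path)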

archive.extractall(destination_path)
for member in archive.namelist():
if (
os.path.exists(os.path.join(destination_path, member)) and
@dkrako (Contributor):

You will need the same logic here as in _extract_tar().
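
For the zip code path, the same idea would go through zipfile metadata; a rough sketch with assumed names:

import os
import zipfile


def _extract_zip_members(archive: zipfile.ZipFile, destination_path: str, resume: bool = True) -> None:
    for member in archive.namelist():
        destination_filepath = os.path.join(destination_path, member)

        # ZipInfo.file_size is the uncompressed size recorded in the archive.
        if resume and os.path.exists(destination_filepath):
            if os.path.getsize(destination_filepath) == archive.getinfo(member).file_size:
                continue  # already fully extracted, skip this member

        archive.extract(member, destination_path)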

for member in archive.getnames():
if (
os.path.exists(os.path.join(destination_path, member)) and
member[-4:] not in _ARCHIVE_EXTRACTORS and
@dkrako (Contributor) commented Dec 8, 2024:

Probably it would be nice to add a short note that nested archives are extracted regardless of the value of skip / resume.
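
For illustration, a possible wording for such a note in the numpydoc parameter description (assuming the parameter ends up being called resume):

resume: bool
    If True, skip members that already exist at the destination with the
    expected size instead of extracting them again. Nested archives are
    extracted regardless of this setting. (default: True)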

),
],
)
def test_extract_archive_destination_path_not_None_no_remove_top_level_no_remove_finished_twice(
@dkrako (Contributor):

I'm not sure if I understand what you are testing here. The test would already pass without this PR, right?

As I don't think it's worth much effort to test this thoroughly, I'd probably just check the calls to TarFile.extract() for the expected arguments via unittest.mock (for the second extract_archive() call).
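
A hedged sketch of such a test; the archive_path fixture and the exact extract_archive() call are assumptions, and patching tarfile.TarFile.extract follows the suggestion above:

from unittest import mock

import pymovements as pm


def test_second_extract_archive_skips_existing_members(tmp_path, archive_path):
    # archive_path is assumed to be a fixture pointing at a prepared tar archive.
    # First call extracts everything into the temporary directory.
    pm.utils.archives.extract_archive(archive_path, tmp_path)

    # On the second call, TarFile.extract should not be invoked for members
    # that already exist on disk with the expected size.
    with mock.patch('tarfile.TarFile.extract') as mock_extract:
        pm.utils.archives.extract_archive(archive_path, tmp_path)

    mock_extract.assert_not_called()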

@dkrako removed the bug (Something isn't working) label Dec 10, 2024
Labels: enhancement (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Don't extract dataset archives twice
2 participants