serial extract_archive to prevent unnecessary extractions #786

Open · SiQube wants to merge 4 commits into main from prevent-unnecessary-extraction
Conversation

@SiQube (Member) commented Aug 24, 2024

resolves #488 eventually

currently, tox does not work locally for whatever reason.

TODO:

  • time vs old extraction
  • coverage


codecov bot commented Aug 24, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 99.94%. Comparing base (82e87d5) to head (4f4f0ff).

Files with missing lines            Patch %   Lines
src/pymovements/utils/archives.py   80.00%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main     #786      +/-   ##
===========================================
- Coverage   100.00%   99.94%   -0.06%     
===========================================
  Files           74       74              
  Lines         3419     3425       +6     
  Branches       613      617       +4     
===========================================
+ Hits          3419     3423       +4     
- Misses           0        1       +1     
- Partials         0        1       +1     


@SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 071e0bc to c65939f on August 25, 2024 00:37
@SiQube (Member, Author) commented Aug 25, 2024

using this script

from pathlib import Path
import pymovements as pm

pm.utils.archives.extract_archive(Path('gazebasevr.zip'))

and a predownloaded GazeBaseVR dataset, I executed this command:

time python t.py

for this branch:

Extracting gazebasevr.zip to .

real	6m53,025s
user	0m31,852s
sys	0m8,898s

for current main:

Extracting gazebasevr.zip to .

real	7m12,829s
user	0m33,634s
sys	0m9,835s

Different workloads might affect the performance, but it should not be much slower. (In the example above, the loop variant was surprisingly even a bit faster.)

@SiQube marked this pull request as ready for review August 25, 2024 00:48
@SiQube added the bug (Something isn't working) and enhancement (New feature or request) labels Aug 26, 2024
@SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 161b3ec to 265687f on September 30, 2024 06:24
@dkrako (Contributor) left a comment

Great, thanks a lot! That was what I had in mind when creating #488.

We should introduce a new argument to let the user decide whether to continue a previous extraction or extract all members again. Something like continue: bool = True would already be sufficient.
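
For illustration, such a flag on the public helper could be used roughly like this; the keyword name (resume, since continue itself is a reserved keyword, as noted further down) and the exact signature are assumptions, not the merged API:

from pathlib import Path

import pymovements as pm

# Hypothetical usage sketch: the flag name and default are assumptions.
# Enabled, already extracted members would be skipped on a repeated call;
# disabled, everything is extracted again from scratch.
pm.utils.archives.extract_archive(Path('gazebasevr.zip'), resume=True)
pm.utils.archives.extract_archive(Path('gazebasevr.zip'), resume=False)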

src/pymovements/utils/archives.py (outdated, resolved)
@SiQube force-pushed the prevent-unnecessary-extraction branch 2 times, most recently from 3bd6e20 to 75b0b34 on November 17, 2024 20:08
@SiQube force-pushed the prevent-unnecessary-extraction branch from e8eec21 to 4f4f0ff on December 5, 2024 22:02
@dkrako (Contributor) left a comment

Great, thanks a lot for your work! There are some small issues left to work out, but we're getting there.

@@ -138,6 +138,7 @@ def _extract_tar(
source_path: Path,
destination_path: Path,
compression: str | None,
skip: bool = True,
@dkrako (Contributor):

I'd say that skip is a bit confusing as a name. I suggested continue, but that's a reserved keyword, so let's call it resume. This way it's already clear what the argument does without reading the docs.
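
For illustration, the renamed parameter on the tar helper could look like this; the body is elided and the signature is only a sketch based on the diff context above, not the merged code:

from __future__ import annotations

from pathlib import Path


def _extract_tar(
        source_path: Path,
        destination_path: Path,
        compression: str | None,
        resume: bool = True,
) -> None:
    # Body elided; only the signature reflects the naming suggestion.
    ...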

if (
os.path.exists(os.path.join(destination_path, member)) and
member[-4:] not in _ARCHIVE_EXTRACTORS and
tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
@dkrako (Contributor):

This won't really check for the correct size; it's just checking whether the archive member's size is greater than zero.

You need to check that the size of the member is the same as the size of the already existing file. Otherwise a partially extracted file (i.e. an existing file that is smaller than the archive member) will be skipped.
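
A rough sketch of that comparison, assuming archive is an open tarfile.TarFile and member comes from archive.getnames(); the helper name is made up for illustration:

import os
import tarfile


def _already_extracted(archive: tarfile.TarFile, member: str, destination_path: str) -> bool:
    # Compare the member's recorded size against the file already on disk,
    # so a partially extracted file is not mistaken for a finished one.
    destination_filepath = os.path.join(destination_path, member)
    if not os.path.exists(destination_filepath):
        return False
    return os.path.getsize(destination_filepath) == archive.getmember(member).size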

os.path.exists(os.path.join(destination_path, member)) and
member[-4:] not in _ARCHIVE_EXTRACTORS and
tarfile.TarInfo(os.path.join(destination_path, member)).size > 0 and
skip
@dkrako (Contributor):

Probably it makes sense to check skip or resume first, because if it's False you don't need to check all the other conditions and will extract either way.

archive.extractall(destination_path, filter='tar')
for member in archive.getnames():
if (
os.path.exists(os.path.join(destination_path, member)) and
@dkrako (Contributor):

You should probably break this up into separate conditions for readability.

Create a variable like destination_filepath = os.path.join(destination_path, member).
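
Folding in the earlier suggestions (checking the flag first, naming the joined path), the per-member loop could read roughly like this; all names are illustrative rather than the merged code, and the nested-archive handling is left out:

import os
import tarfile


def _extract_tar_members(archive: tarfile.TarFile, destination_path: str, resume: bool = True) -> None:
    for member in archive.getnames():
        destination_filepath = os.path.join(destination_path, member)

        # Check the cheap flag first; only then touch the filesystem.
        if resume and os.path.exists(destination_filepath):
            if os.path.getsize(destination_filepath) == archive.getmember(member).size:
                continue  # already fully extracted, skip this member

        archive.extract(member, destination_path)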

archive.extractall(destination_path)
for member in archive.namelist():
if (
os.path.exists(os.path.join(destination_path, member)) and
@dkrako (Contributor):

You will need the same logic here as in _extract_tar().
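
For the zip code path, the same idea would go through zipfile metadata; a rough sketch with assumed names:

import os
import zipfile


def _extract_zip_members(archive: zipfile.ZipFile, destination_path: str, resume: bool = True) -> None:
    for member in archive.namelist():
        destination_filepath = os.path.join(destination_path, member)

        # ZipInfo.file_size is the uncompressed size recorded in the archive.
        if resume and os.path.exists(destination_filepath):
            if os.path.getsize(destination_filepath) == archive.getinfo(member).file_size:
                continue  # already fully extracted, skip this member

        archive.extract(member, destination_path)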

for member in archive.getnames():
if (
os.path.exists(os.path.join(destination_path, member)) and
member[-4:] not in _ARCHIVE_EXTRACTORS and
@dkrako (Contributor) commented Dec 8, 2024:

Probably it would be nice to add a short note that nested archives are extracted regardless of the value of skip / resume.
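
For illustration, a possible wording for such a note in the numpydoc parameter description (assuming the parameter ends up being called resume):

resume: bool
    If True, skip members that already exist at the destination with the
    expected size instead of extracting them again. Nested archives are
    extracted regardless of this setting. (default: True)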

),
],
)
def test_extract_archive_destination_path_not_None_no_remove_top_level_no_remove_finished_twice(
@dkrako (Contributor):

I'm not sure if I understand what you are testing here. The test would already pass without this PR, right?

As I don't think it's worth much effort to test this thoroughly, I'd probably just check the calls to TarFile.extract() for the expected arguments via unittest.mock (for the second extract_archive() call).
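
A hedged sketch of such a test; the archive_path fixture and the exact extract_archive() call are assumptions, and patching tarfile.TarFile.extract follows the suggestion above:

from unittest import mock

import pymovements as pm


def test_second_extract_archive_skips_existing_members(tmp_path, archive_path):
    # archive_path is assumed to be a fixture pointing at a prepared tar archive.
    # First call extracts everything into the temporary directory.
    pm.utils.archives.extract_archive(archive_path, tmp_path)

    # On the second call, TarFile.extract should not be invoked for members
    # that already exist on disk with the expected size.
    with mock.patch('tarfile.TarFile.extract') as mock_extract:
        pm.utils.archives.extract_archive(archive_path, tmp_path)

    mock_extract.assert_not_called()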

@dkrako removed the bug (Something isn't working) label Dec 10, 2024
Labels: enhancement (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Don't extract dataset archives twice
2 participants