
Don't attempt to add the same file multiple times to the ZIP #1234

Merged
merged 1 commit into Ericsson:master from analysis-zip-multiadd on Dec 8, 2017

Conversation

whisperity (Contributor)

Related to #1219. Not sure yet if this closes it!

When the same file is added to a ZIP multiple times, warnings about it are raised at the .close() of the ZipFile context manager. (ZIP files can seemingly contain the same path twice, but clients can behave nondeterministically about which actual content to use if the contents differ between the entries!)

I am not at all sure why the analysis manager attempts to put the same file into the ZIP twice for @sylvestre. But instead of letting the warnings flow at the end of failure-ZIP creation, we now preemptively check whether the archive already contains the path and show a message about this fact.
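A minimal sketch of the idea being described, with made-up names (add_file_once, failure.zip) rather than the actual CodeChecker code: test whether the archive already has an entry at the target path before writing, and log and skip the duplicate instead of letting zipfile warn later.

import os
import tempfile
import zipfile

def add_file_once(archive, file_path, archive_path):
    # ZipFile.getinfo() raises KeyError when the archive has no entry at
    # the given path, so the try/except doubles as a membership test.
    try:
        archive.getinfo(archive_path)
        print("'{0}' is already in the ZIP, not adding it again."
              .format(archive_path))
    except KeyError:
        archive.write(file_path, archive_path, zipfile.ZIP_DEFLATED)

# Tiny demo: adding the same source file twice stores it only once.
source = os.path.join(tempfile.mkdtemp(), "baz.cpp")
with open(source, "w") as handle:
    handle.write("int main() { return 0; }\n")

with zipfile.ZipFile(os.path.join(tempfile.mkdtemp(), "failure.zip"),
                     "w") as archive:
    arcname = os.path.join("sources-root", source.lstrip(os.sep))
    add_file_once(archive, source, arcname)
    add_file_once(archive, source, arcname)  # skipped with a message
    print(archive.namelist())                # the path appears only once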

@whisperity whisperity added bugfix 🔨 analyzer 📈 Related to the analyze commands (analysis driver) labels Dec 7, 2017
# The file was not in the ZIP file, because we already have the
# content. Let's check if we already have a file record in the
# database or we need to add one.
# The file was not in the ZIP file, because we already
whisperity (Contributor, Author)

The changes in this file are irrelevant to the PR, but creating the commit failed locally for me due to PEP 8 errors:

libcodechecker/server/api/report_server.py:1924:80: E501 line too long (83 > 79 characters)
libcodechecker/server/api/report_server.py:1925:80: E501 line too long (82 > 79 characters)
libcodechecker/server/api/report_server.py:1934:80: E501 line too long (80 > 79 characters)

Contributor

I fixed them in #1166 too.

whisperity (Contributor, Author)

Okay, undid the changes.

gyorb (Contributor) left a comment

Can checking whether the file is already in the archive contents be done based only on the path?

Otherwise LGTM.

whisperity (Contributor, Author)

@gyorb I think we are safe to assume that once the user has /home/foo/bar/baz.cpp and we save it as sources-root/..., it won't change. Hopefully. If it does, we are done for...

@whisperity whisperity force-pushed the analysis-zip-multiadd branch from 9a19ee5 to 00c34d0 on December 7, 2017 14:50
@whisperity whisperity requested a review from gyorb on December 7, 2017 14:50
Xazax-hun (Contributor) left a comment

LG!

os.path.join("sources-root",
archive_path),
zipfile.ZIP_DEFLATED)
_ = archive.getinfo(archive_path)
Contributor

I wonder if this can be a bottleneck. An alternative approach would be to query the file list upfront and not query the info for each path. If this query is cheap, it should not be a problem.
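A sketch of that alternative, with illustrative names (add_files, files_to_add) rather than project code: read the entry list once into a set and test membership against it, instead of asking the archive about every path individually.

import zipfile

def add_files(zip_path, files_to_add):
    # files_to_add: iterable of (path_on_disk, path_inside_zip) pairs.
    with zipfile.ZipFile(zip_path, "a") as archive:
        existing = set(archive.namelist())  # one upfront query
        for file_path, archive_path in files_to_add:
            if archive_path in existing:
                continue                    # already stored, skip it
            archive.write(file_path, archive_path, zipfile.ZIP_DEFLATED)
            existing.add(archive_path)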

whisperity (Contributor, Author)

The ZipFile class (see /usr/lib/python2.7/zipfile.py) contains a NameToInfo dict which maps paths to ZipInfo objects. Dicts have O(n) worst-case lookup time, but O(1) on average.

However, this is currently a strange patch. If you look at the code, you can see that we use sets everywhere for file lists, which also have linear worst-case complexity. As explained earlier, I have no idea why some files are marked to be added to the ZIP multiple times. Perhaps something is wrong with parsing the analyzer output, or perhaps this simply happens because @sylvestre runs with --enable debug checkers, which produce strange output.
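A small illustration of that lookup (NameToInfo is an undocumented implementation detail of zipfile, shown here only to make the cost of the per-path check concrete):

import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("sources-root/tmp/baz.cpp", b"int main() { return 0; }")
    # getinfo() is essentially one lookup in this path -> ZipInfo dict,
    # so the per-file duplicate check is an average-case O(1) hash lookup.
    print(sorted(archive.NameToInfo))
    print(archive.getinfo("sources-root/tmp/baz.cpp").file_size)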

whisperity (Contributor, Author)

And we are using string keys, so this is hopefully fast enough. We also aren't expecting a lot of analyses to fail. The ZIP file already goes over its internal structure at .close() (to finish up the ZIP structure, calculate CRCs and such), which is where it figures out the path duplications (hence the warning in #1219).

Xazax-hun (Contributor)

Maybe it would be good to open a bug to investigate the underlying issue.

whisperity (Contributor, Author)

There already is one, and it's mentioned in the first line of this PR.

@gyorb gyorb added this to the release 6.3 milestone Dec 8, 2017
@gyorb gyorb merged commit eceb906 into Ericsson:master Dec 8, 2017
@whisperity whisperity deleted the analysis-zip-multiadd branch December 18, 2017 09:39