-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't attempt to add the same file multiple times to the ZIP #1234
Conversation
# The file was not in the ZIP file, because we already have the | ||
# content. Let's check if we already have a file record in the | ||
# database or we need to add one. | ||
# The file was not in the ZIP file, because we already |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes in this file irrelevant to the PR, but the commit failed to generate locally for me on pep8 errors:
libcodechecker/server/api/report_server.py:1924:80: E501 line too long (83 > 79 characters)
libcodechecker/server/api/report_server.py:1925:80: E501 line too long (82 > 79 characters)
libcodechecker/server/api/report_server.py:1934:80: E501 line too long (80 > 79 characters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed them in #1166 too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, undid the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking if the the file is already in the content can be done only based on the path?
Otherwise LGTM.
# The file was not in the ZIP file, because we already have the | ||
# content. Let's check if we already have a file record in the | ||
# database or we need to add one. | ||
# The file was not in the ZIP file, because we already |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed them in #1166 too.
@gyorb I think we are safe to assume that once the user has |
9a19ee5
to
00c34d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG!
os.path.join("sources-root", | ||
archive_path), | ||
zipfile.ZIP_DEFLATED) | ||
_ = archive.getinfo(archive_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this can be a bottleneck? An alternative approach would be to query the file list upfront and do not query the info for each path. In case this query is cheap, it should not be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ZipFile
class (see /usr/lib/python2.7/zipfile.py
) contains a NameToInfo
dict which maps paths to ZipInfo
objects. Dicts have O(n)
worst-case lookup time.
However, this is currently a strange patch. If you look at the code, you can see we use set
s everywhere for file lists, which also have a linear worst-case complexity. As explained earlier, I have no idea why some files are marked to be added multiple times to the ZIP. Perhaps something is wrong with parsing the analyzer output, perhaps this simply happens from @sylvestre using --enable debug
checkers which produce strange output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we are using string keys, so this is hopefully fast enough. We also aren't expecting a lot of analyses to fail. The ZIP file already goes over its internal structure at .close()
(to finish up the ZIP structure, calculte CRCs and such), where it already figures out path duplications (hence the warning in #1219).
Maybe it would be great to open a bug to investigate the underlying issue. |
There already is. And it's mentioned in this PR in the first line. |
When a file is added multiple times to a ZIP, at the
.close()
of theZipFile
context manager, warnings are raised about this. (ZIP files seemingly can contain the same path twice, but clients can get nondeterministic about which actual content to use if the contents differ between the paths!)I am absolutely unsure why the analysis manager for @sylvestre attempts to put the same file twice into the ZIP. But instead of letting the warnings flow at the end of failure-zip creation, we preemptively check if the file already contains a path, and show a message about this fact.