
File download via Zipdownloader tool creates damaged archives? #11207

Open
kbrueckmann opened this issue Feb 3, 2025 · 12 comments
Labels
Type: Bug a defect

Comments

@kbrueckmann

What steps does it take to reproduce the issue?

What happens?
The extraction of files from the zip fails:

unzip -v dataverse_files.zip

Archive: dataverse_files.zip
warning [dataverse_files.zip]: 10342 extra bytes at beginning or within zipfile (attempting to process anyway)

Length Method Size Cmpr Date Time CRC-32 Name
10485117 Defl:N 10400432 1% 2025-02-03 16:26 deeb27c3 FUQ.pdf
21014 Defl:N 18286 13% 2025-02-03 16:26 df8d22ca Derived_requirements.docx
...

Apparently, each file has a bad zipfile offset.
We tested the multi-file download for different datasets on different machines and checked the Payara and Apache logs, but found nothing obvious there. The single-file download (i.e., not using the zipper tool) works perfectly.

To whom does it occur (all users, curators, superusers)?
All users.

Which version of Dataverse are you using?
v6.5

Any related open or closed issues to this bug report?
I found none.

@kbrueckmann kbrueckmann added the Type: Bug a defect label Feb 3, 2025
@landreev
Contributor

landreev commented Feb 3, 2025

This is not a brand new installation of the standalone zipper, is it? In other words, is this something that used to work properly and then just stopped working?
The Apache logs (/var/log/httpd/ssl_error_log, specifically) would be the place to look for relevant error messages, yes. So if there is nothing interesting there, that makes it more puzzling.

@landreev
Contributor

landreev commented Feb 3, 2025

If you look at the first few bytes of the zip file, you see these 6 extra bytes before the normal zip header:

head -c8 zipdownload_heiDATA.zip
2000
PK%

(the 6 bytes are "2000" followed by the 2-byte DOS line ending, i.e. "2000\r\n")
Simply removing this first line doesn't fix it either, since these extra lines keep appearing throughout the rest of the zip file.
... this in turn looks like an extra chunked-encoding block-size header added to the stream. Has anything changed recently in how access to the script is configured under Apache?
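The arithmetic behind that suspicion: an HTTP chunked-encoding size line is the chunk length in hex plus CRLF, so an 8 KB block yields exactly the 6-byte prefix seen above (a minimal sketch to illustrate, not the zipper's actual code):

```java
public class HexCheck {
    public static void main(String[] args) {
        // A chunked-encoding size line is the chunk length in hex plus CRLF;
        // for an 8192-byte block that is "2000\r\n" -- exactly 6 bytes.
        String sizeLine = String.format("%x\r\n", 8192);
        System.out.print(sizeLine.length()); // prints 6
    }
}
```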

@kbrueckmann
Author

Thanks for the quick reply! Yes, it used to work properly in previous versions. We're not 100% sure it only stopped working after our upgrade to v6.5 from 6.3 (via 6.4), but as far as I know we changed nothing in the configuration of the script or access to it. I'll ask a colleague to have a look at the specific log you mentioned; hopefully, we'll know more soon.

@lmaylein
Contributor

lmaylein commented Feb 4, 2025

The Apache error log looks quite strange. For a single zip download, I get the same message almost 1,300 times:

[Tue Feb 04 11:15:33.769698 2025] [cgid:error] [pid 80706:tid 80835] [client 147.142.xxx.xxx:47178] AH01215: stderr from /var/www/cgi-bin/zipdownload: offset: 0, length: 8192, referer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/N1T5T8
[Tue Feb 04 11:15:33.769774 2025] [cgid:error] [pid 80706:tid 80835] [client 147.142.xxx.xxx:47178] AH01215: stderr from /var/www/cgi-bin/zipdownload: offset: 0, length: 8192, referer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/N1T5T8

....

[Tue Feb 04 11:15:34.221621 2025] [cgid:error] [pid 80706:tid 80835] [client 147.142.xxx.xxx:47178] AH01215: stderr from /var/www/cgi-bin/zipdownload: offset: 0, length: 8192, referer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/N1T5T8
[Tue Feb 04 11:15:34.223736 2025] [cgid:error] [pid 80706:tid 80835] [client 147.142.xxx.xxx:47178] AH01215: stderr from /var/www/cgi-bin/zipdownload: offset: 0, length: 5544, referer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/N1T5T8

@lmaylein
Contributor

lmaylein commented Feb 4, 2025

And at least the head of the zip file is okay:


head -c8  'dataverse_files.zip'
2000
PK

@landreev
Contributor

landreev commented Feb 4, 2025

And at least the head of the zip file is okay:


head -c8  'dataverse_files.zip'
2000
PK

Well, it's clearly not okay - it has the 6 extra bytes added before the normal "PK" header.
This "2000\r\n" line comes from here in the zipper:

private static final int BUFFER_SIZE = 8192;
private static final byte[] CHUNK_CLOSE = "\r\n".getBytes();
private static final String CHUNK_SIZE_FORMAT = "%x\r\n";

("2000" being hex for 8192).
The zipper implements HTTP Transfer-Encoding: chunked. Since it doesn't know the total size of the zip stream ahead of time and cannot tell the client how many bytes to expect, it sends 8 KB at a time, preceding each block with such a size line telling the client how many bytes follow.

... but what appears to be happening in your case is that, instead of passing this byte stream generated by the zipper to the client as is, Apache (for whatever reason) decides to chunk-encode it again, so the client receives it with these extra header and closing bytes; which of course breaks the zip format.
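That suspected double-encoding can be reproduced with a short, self-contained sketch. The encode helper below only mimics the chunked transfer-coding described above (size line in hex, payload, CRLF, terminated by a zero-length chunk); it is a hypothetical illustration, not the zipper's or Apache's actual code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class DoubleChunkDemo {

    // Mimic chunked transfer-coding: each block is preceded by its size in
    // hex plus CRLF ("%x\r\n") and followed by CRLF; the stream ends with a
    // zero-length chunk ("0\r\n\r\n").
    static byte[] encode(byte[] data, int chunkSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.write(String.format("%x\r\n", len).getBytes(StandardCharsets.US_ASCII));
            out.write(data, off, len);
            out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
        }
        out.write("0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = new byte[8192];      // stand-in for 8 KB of zip data
        zip[0] = 'P'; zip[1] = 'K';       // zip local-file-header signature

        byte[] once = encode(zip, 8192);  // what the zipper hands to Apache
        // A client that decodes one layer of chunking recovers "PK..." intact.
        // But if Apache chunk-encodes the stream a second time...
        byte[] twice = encode(once, 8192);
        // ...the client strips only the outer layer and is left with `once`,
        // whose first 6 bytes are "2000\r\n" (8192 in hex): the exact extra
        // bytes observed before the "PK" signature in the damaged archives.
        System.out.print(new String(once, 0, 6, StandardCharsets.US_ASCII)
                .replace("\r", "\\r").replace("\n", "\\n")); // prints 2000\r\n
        assert twice.length > once.length;
    }
}
```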

I can't imagine that the Dataverse-side upgrade (from 6.3 to 6.5) could have anything to do with this. It would be more likely that it was a change in the installed Apache version, or maybe in the Apache configuration (?). Judging by the headers from your zipper, you appear to be using Apache 2.4.62, and ours is 2.4.37 (we are also using the zipper, and not experiencing this issue).

... OK, that may have been too much not particularly useful information. In more practical terms: I can try to build a version of the zipper that does not apply the chunked encoding to the stream, and we can see if that fixes it.
But I would very much like to find out why this has started happening.

@landreev
Contributor

landreev commented Feb 4, 2025

The Apache errorlog looks quite strange. For a single zip download, I get the same message almost 1,300 times:
...

This is clearly dumped in the log for every 8 KB of output that the zipper produces. Whether it is actually a symptom of the zip stream issue you are experiencing, I'm not sure. It's been a while now, but I recall our production admins reporting that the zipper was flooding the logs with some repeating message back when we first deployed it (even though it was working properly)... and we must have addressed that simply by suppressing the log messages, changing
LogLevel warn
to
LogLevel crit
in ssl.conf. ... I can't find that email thread or Slack exchange though, so I'm not 100% sure it was the same error message you are reporting.

@lmaylein
Contributor

lmaylein commented Feb 4, 2025

Okay. I was about to write that this also happens when I call the zipper via the shell (with export QUERY_STRING=...). But then, that is intentional: the zipper adds those chunk headers itself.
On 22 January there was an update of the Apache packages on the machine:

    Upgrade  httpd-2.4.62-1.el9_5.2.x86_64                 @rhel-9-for-x86_64-appstream-rpms
    Upgraded httpd-2.4.62-1.el9.x86_64                     @@System
    Upgrade  httpd-core-2.4.62-1.el9_5.2.x86_64            @rhel-9-for-x86_64-appstream-rpms
    Upgraded httpd-core-2.4.62-1.el9.x86_64                @@System
    Upgrade  httpd-filesystem-2.4.62-1.el9_5.2.noarch      @rhel-9-for-x86_64-appstream-rpms
    Upgraded httpd-filesystem-2.4.62-1.el9.noarch          @@System
    Upgrade  httpd-tools-2.4.62-1.el9_5.2.x86_64           @rhel-9-for-x86_64-appstream-rpms
    Upgraded httpd-tools-2.4.62-1.el9.x86_64               @@System
    Upgrade  mod_lua-2.4.62-1.el9_5.2.x86_64               @rhel-9-for-x86_64-appstream-rpms
    Upgraded mod_lua-2.4.62-1.el9.x86_64                   @@System
    Upgrade  mod_ssl-1:2.4.62-1.el9_5.2.x86_64             @rhel-9-for-x86_64-appstream-rpms
    Upgraded mod_ssl-1:2.4.62-1.el9.x86_64                 @@System

But we have not changed anything in the Apache configuration itself.

@landreev
Contributor

landreev commented Feb 4, 2025

The zipper already has a -ziponly option, which can be used (java -jar ZipDownloadService-1.0.0.jar -ziponly) to produce the output without the chunking blocks... but then it also skips the HTTP headers, which we need in order to run it under CGI.
I'm going to try to quickly add something like -nochunking, to skip only the encoding.

landreev added a commit that referenced this issue Feb 4, 2025
landreev added a commit that referenced this issue Feb 4, 2025
landreev added a commit that referenced this issue Feb 4, 2025
@landreev
Contributor

landreev commented Feb 4, 2025

Please try this experimental version: https://github.com/IQSS/dataverse/raw/refs/heads/11207-external-zipper-chunking-issue/scripts/zipdownload/target/zipdownloader-0.0.1-test.jar, with the -nochunking option.
I.e., drop the new jar file in your /var/www/cgi-bin (or its equivalent) and modify your zipdownload script so that the last line looks like

java ...  -jar zipdownloader-0.0.1-test.jar -nochunking

and see what happens? - It may or may not work; no promises.

@lmaylein
Contributor

lmaylein commented Feb 5, 2025

and see what happens? - It may or may not work; no promises.

It works. Great. Thank you very much.

@landreev
Contributor

landreev commented Feb 5, 2025

Interesting. I'm going to operate under the assumption that this was due to a change in how newer versions of Apache handle content generated under cgi-bin. So I'll get this new option merged in, update the documentation accordingly, and add a release note explaining that installations using the tool may need the new version...

That said, I'm not entirely sure how many other Dataverse installations, other than yours and ours, are in fact using this zipper tool at this point.
