
File download via Zipdownloader tool creates damaged archives? #11207

Open
kbrueckmann opened this issue Feb 3, 2025 · 12 comments
Labels
Type: Bug a defect

Comments

@kbrueckmann

What steps does it take to reproduce the issue?

What happens?
The extraction of files from the zip fails:

unzip -v dataverse_files.zip

Archive: dataverse_files.zip
warning [dataverse_files.zip]: 10342 extra bytes at beginning or within zipfile (attempting to process anyway)

Length Method Size Cmpr Date Time CRC-32 Name
10485117 Defl:N 10400432 1% 2025-02-03 16:26 deeb27c3 FUQ.pdf
21014 Defl:N 18286 13% 2025-02-03 16:26 df8d22ca Derived_requirements.docx
...

Apparently, each file has a bad zipfile offset.
We tested the multi-file download for different datasets on different machines and checked the Payara and Apache logs, but found nothing obvious there. The single-file download (i.e., not using the zipper tool) works perfectly.

To whom does it occur (all users, curators, superusers)?
All users.

Which version of Dataverse are you using?
v6.5

Any related open or closed issues to this bug report?
I found none.

@kbrueckmann kbrueckmann added the Type: Bug a defect label Feb 3, 2025
@landreev
Contributor

landreev commented Feb 3, 2025

This is not a brand new installation of the standalone zipper, is it? In other words, is this something that used to work properly and then just stopped working?
The Apache logs (/var/log/httpd/ssl_error_log, specifically) would be the place to look for relevant error messages, yes. So if there is nothing interesting there, that makes it more puzzling.

@landreev
Contributor

landreev commented Feb 3, 2025

If you look at the first few bytes of the zip file, you see these 6 extra bytes before the normal zip header:

head -c8 zipdownload_heiDATA.zip
2000
PK%

(the 6 bytes are "2000" followed by the 2-byte DOS line ending, i.e. "2000\r\n")
Simply removing this first line doesn't fix it either, since these extra lines keep appearing throughout the rest of the zip file.
... this in turn looks like an extra chunked-encoding block-size header added to the stream. Has anything changed recently in how access to the script is configured under Apache?
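The arithmetic behind that suspicion: an HTTP chunked-encoding size line is the chunk length in hex plus CRLF, so an 8 KB block yields exactly the 6-byte prefix seen above (a minimal sketch to illustrate, not the zipper's actual code):

```java
public class HexCheck {
    public static void main(String[] args) {
        // A chunked-encoding size line is the chunk length in hex plus CRLF;
        // for an 8192-byte block that is "2000\r\n" -- exactly 6 bytes.
        String sizeLine = String.format("%x\r\n", 8192);
        System.out.print(sizeLine.length()); // prints 6
    }
}
```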

@kbrueckmann
Author

Thanks for the quick reply! Yes, it used to work properly in previous versions. We're not 100% sure it only stopped working after our upgrade to v6.5 from 6.3 (via 6.4), but as far as I know we changed nothing in the configuration of the script or access to it. I'll ask a colleague to have a look at the specific log you mentioned; hopefully, we'll know more soon.

@lmaylein
Contributor

lmaylein commented Feb 4, 2025

The Apache error log looks quite strange. For a single zip download, I get the same message almost 1,300 times:

[Tue Feb 04 11:15:33.769698 2025] [cgid:error] [pid 80706:tid 80835] [client 147.142.xxx.xxx:47178] AH01215: stderr from /var/www/cgi-bin/zipdownload: offset: 0, length: 8192, referer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/N1T5T8
[Tue Feb 04 11:15:33.769774 2025] [cgid:error] [pid 80706:tid 80835] [client 147.142.xxx.xxx:47178] AH01215: stderr from /var/www/cgi-bin/zipdownload: offset: 0, length: 8192, referer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/N1T5T8

....

[Tue Feb 04 11:15:34.221621 2025] [cgid:error] [pid 80706:tid 80835] [client 147.142.xxx.xxx:47178] AH01215: stderr from /var/www/cgi-bin/zipdownload: offset: 0, length: 8192, referer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/N1T5T8
[Tue Feb 04 11:15:34.223736 2025] [cgid:error] [pid 80706:tid 80835] [client 147.142.xxx.xxx:47178] AH01215: stderr from /var/www/cgi-bin/zipdownload: offset: 0, length: 5544, referer: https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/N1T5T8

@lmaylein
Contributor

lmaylein commented Feb 4, 2025

And at least the head of the zip file is okay:


head -c8  'dataverse_files.zip'
2000
PK

@landreev
Contributor

landreev commented Feb 4, 2025

And at least the head of the zip file is okay:


head -c8  'dataverse_files.zip'
2000
PK

Well, it's clearly not okay - it has the 6 extra bytes added before the normal "PK" header.
This "2000\r\n" line comes from here in the zipper:

private static final int BUFFER_SIZE = 8192;
private static final byte[] CHUNK_CLOSE = "\r\n".getBytes();
private static final String CHUNK_SIZE_FORMAT = "%x\r\n";

("2000" being hex for 8192).
The zipper implements HTTP Transfer-Encoding: chunked. Since it doesn't know the total size of the zip stream ahead of time and cannot tell the client how many bytes to expect, it sends 8 KB at a time, preceding each block with such a size line telling the client how many bytes follow.

... but what appears to be happening in your case is that, instead of passing this byte stream generated by the zipper to the client as is, Apache (for whatever reason) decides to chunk-encode it again, so the client receives it with these extra header and closing bytes; which of course breaks the zip format.
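That suspected double-encoding can be reproduced with a short, self-contained sketch. The encode helper below only mimics the chunked transfer-coding described above (size line in hex, payload, CRLF, terminated by a zero-length chunk); it is a hypothetical illustration, not the zipper's or Apache's actual code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class DoubleChunkDemo {

    // Mimic chunked transfer-coding: each block is preceded by its size in
    // hex plus CRLF ("%x\r\n") and followed by CRLF; the stream ends with a
    // zero-length chunk ("0\r\n\r\n").
    static byte[] encode(byte[] data, int chunkSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.write(String.format("%x\r\n", len).getBytes(StandardCharsets.US_ASCII));
            out.write(data, off, len);
            out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
        }
        out.write("0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = new byte[8192];      // stand-in for 8 KB of zip data
        zip[0] = 'P'; zip[1] = 'K';       // zip local-file-header signature

        byte[] once = encode(zip, 8192);  // what the zipper hands to Apache
        // A client that decodes one layer of chunking recovers "PK..." intact.
        // But if Apache chunk-encodes the stream a second time...
        byte[] twice = encode(once, 8192);
        // ...the client strips only the outer layer and is left with `once`,
        // whose first 6 bytes are "2000\r\n" (8192 in hex): the exact extra
        // bytes observed before the "PK" signature in the damaged archives.
        System.out.print(new String(once, 0, 6, StandardCharsets.US_ASCII)
                .replace("\r", "\\r").replace("\n", "\\n")); // prints 2000\r\n
        assert twice.length > once.length;
    }
}
```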

I can't imagine that the Dataverse-side upgrade (from 6.3 to 6.5) could have anything to do with this. It would be more likely that it was a change in the installed Apache version, or maybe in the Apache configuration (?). Judging by the headers from your zipper, you appear to be using Apache 2.4.62, and ours is 2.4.37 (we are also using the zipper, and not experiencing this issue).

... OK, that may have been too much not particularly useful information. In more practical terms: I can try to build a version of the zipper that does not apply the chunked encoding to the stream, and we can see if that fixes it.
But I would very much like to find out why this has started happening.

@landreev
Contributor

landreev commented Feb 4, 2025

The Apache errorlog looks quite strange. For a single zip download, I get the same message almost 1,300 times:
...

This is clearly dumped in the log for every 8 KB of output that the zipper produces. Whether it is actually a symptom of the zip stream issue you are experiencing, I'm not sure. It's been a while now, but I recall our production admins reporting that the zipper was flooding the logs with some repeating message back when we first deployed it (even though it was working properly)... and we must have addressed that simply by suppressing the log messages, changing
LogLevel warn
to
LogLevel crit
in ssl.conf. ... I can't find that email thread or Slack exchange though, so I'm not 100% sure it was the same error message you are reporting.

@lmaylein
Contributor

lmaylein commented Feb 4, 2025

Okay. I was about to write that this also happens when I call the zipper via the shell (with export QUERY_STRING=...). But then, that is intentional: the zipper adds those chunk headers itself.
On 22 January there was an update of the Apache packages on the machine:

    Upgrade  httpd-2.4.62-1.el9_5.2.x86_64                 @rhel-9-for-x86_64-appstream-rpms
    Upgraded httpd-2.4.62-1.el9.x86_64                     @@System
    Upgrade  httpd-core-2.4.62-1.el9_5.2.x86_64            @rhel-9-for-x86_64-appstream-rpms
    Upgraded httpd-core-2.4.62-1.el9.x86_64                @@System
    Upgrade  httpd-filesystem-2.4.62-1.el9_5.2.noarch      @rhel-9-for-x86_64-appstream-rpms
    Upgraded httpd-filesystem-2.4.62-1.el9.noarch          @@System
    Upgrade  httpd-tools-2.4.62-1.el9_5.2.x86_64           @rhel-9-for-x86_64-appstream-rpms
    Upgraded httpd-tools-2.4.62-1.el9.x86_64               @@System
    Upgrade  mod_lua-2.4.62-1.el9_5.2.x86_64               @rhel-9-for-x86_64-appstream-rpms
    Upgraded mod_lua-2.4.62-1.el9.x86_64                   @@System
    Upgrade  mod_ssl-1:2.4.62-1.el9_5.2.x86_64             @rhel-9-for-x86_64-appstream-rpms
    Upgraded mod_ssl-1:2.4.62-1.el9.x86_64                 @@System

But we have not changed anything in the Apache configuration itself.

@landreev
Contributor

landreev commented Feb 4, 2025

The zipper already has a -ziponly option, which can be used (java -jar ZipDownloadService-1.0.0.jar -ziponly) to produce the output without the chunking blocks... but then it also skips the HTTP headers, which we need in order to run it under CGI.
I'm going to try to quickly add something like -nochunking, to skip only the encoding.

landreev added a commit that referenced this issue Feb 4, 2025
landreev added a commit that referenced this issue Feb 4, 2025
landreev added a commit that referenced this issue Feb 4, 2025
@landreev
Contributor

landreev commented Feb 4, 2025

Please try this experimental version: https://github.com/IQSS/dataverse/raw/refs/heads/11207-external-zipper-chunking-issue/scripts/zipdownload/target/zipdownloader-0.0.1-test.jar, with the -nochunking option.
I.e., drop the new jar file in your /var/www/cgi-bin (or its equivalent) and modify your zipdownload script so that the last line looks like

java ...  -jar zipdownloader-0.0.1-test.jar -nochunking

and see what happens? - It may or may not work; no promises.

@lmaylein
Contributor

lmaylein commented Feb 5, 2025

and see what happens? - It may or may not work; no promises.

It works. Great. Thank you very much.

@landreev
Contributor

landreev commented Feb 5, 2025

Interesting. I'm going to operate under the assumption that this was due to a change in how newer versions of Apache handle content generated under cgi-bin. So I'll get this new option merged in, update the documentation accordingly, and add a release note explaining that installations using the tool may need the new version...

That said, I'm not entirely sure how many other Dataverse installations, other than yours and ours, are in fact using this zipper tool at this point.
