Adjustments for new addReadStreamLazy API #87

fkocherovsky · 2024-11-21T18:24:42Z

Hi Josh,
I am glad to see you are back to maintain the package. I just tested your new addReadStreamLazy API, which should solve the issue (#74). Please correct me if I am wrong, but it looks like I found two cases with this API.
One case is when addReadStreamLazy is called multiple times, but streams are not created lazily, on demand, when they should actually start to be pumped. They are created immediately by the API call. The second issue is that the 'getReadStreamFunction' callback is called twice for some entries.
For the case of not lazy stream creation:
My idea to solve issue #74 was to create the next readable stream only when the currently streaming entry is finished, and then the pumpEntries function pops the next entity to pipe. The stream for specific entry should be created at this point, just before the entry is going to be pumped. Your current implementation creates the stream immediately by calling addReadStreamLazy function, which calls pumpEntries(self) at the end. So if you call N subsequent addReadStreamLazy, all N streams are created at this point. It is not much better from the point of view of closing idle streams than creating N streams and to call the old version of addReadStream N times. Some streams might be closed by idle socket timeout before YAZL starts to pump them.
For the case when 'getReadStreamFunction' callback is called twice:
Let’s assume you call addReadStreamLazy(...) and then end(). The first callback is called by addReadStreamLazy -> pumpEntries, and the second one by end() -> pumpEntries. This second call pops an entry which is currently being pumped and isn’t 'FILE_DATA_DONE' yet. As a result, you pump some data twice and also get a broken zip file. I guess you tested the new API with pretty small streams, so you didn’t face the race condition when an entry is still being pumped, and the second pumpEntries call takes this entry to pump again. With bigger streams, the issue is easy to reproduce.
From my point of view, pumpEntries should not be called at the end of addReadStreamLazy. Removing this pumpEntries call solves both issues I described. This is also how I implemented it in my PR (#75) two years ago, and this manually patched version has been running on our PROD servers for two years already. It’s just an option; maybe you have another way to solve it.
Anyway, thank you for your work!

thejoshwolfe · 2024-11-23T13:23:02Z

Oh! you found a bug. I was setting the in-progress state for the entry at the wrong time. good catch!

My idea to solve issue #74 was to create the next readable stream only when the currently streaming entry is finished

yep that's exactly my idea as well. I just implemented it wrong.

fix published in version 3.3.1. let me know if you still encounter buggy behavior.

thejoshwolfe closed this as completed in 687e507 Nov 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjustments for new addReadStreamLazy API #87

Adjustments for new addReadStreamLazy API #87

fkocherovsky commented Nov 21, 2024 •

edited

Loading

thejoshwolfe commented Nov 23, 2024

Adjustments for new addReadStreamLazy API #87

Adjustments for new addReadStreamLazy API #87

Comments

fkocherovsky commented Nov 21, 2024 • edited Loading

thejoshwolfe commented Nov 23, 2024

fkocherovsky commented Nov 21, 2024 •

edited

Loading