Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjustments for new addReadStreamLazy API #87

Closed
fkocherovsky opened this issue Nov 21, 2024 · 1 comment
Closed

Adjustments for new addReadStreamLazy API #87

fkocherovsky opened this issue Nov 21, 2024 · 1 comment

Comments

@fkocherovsky
Copy link

fkocherovsky commented Nov 21, 2024

Hi Josh,
I am glad to see you are back to maintain the package. I just tested your new addReadStreamLazy API, which should solve the issue (#74). Please correct me if I am wrong, but it looks like I found two cases with this API.
One case is when addReadStreamLazy is called multiple times, but streams are not created lazily, on demand, when they should actually start to be pumped. They are created immediately by the API call. The second issue is that the 'getReadStreamFunction' callback is called twice for some entries.
For the case of not lazy stream creation:
My idea to solve issue #74 was to create the next readable stream only when the currently streaming entry is finished, and then the pumpEntries function pops the next entity to pipe. The stream for specific entry should be created at this point, just before the entry is going to be pumped. Your current implementation creates the stream immediately by calling addReadStreamLazy function, which calls pumpEntries(self) at the end. So if you call N subsequent addReadStreamLazy, all N streams are created at this point. It is not much better from the point of view of closing idle streams than creating N streams and to call the old version of addReadStream N times. Some streams might be closed by idle socket timeout before YAZL starts to pump them.
For the case when 'getReadStreamFunction' callback is called twice:
Let’s assume you call addReadStreamLazy(...) and then end(). The first callback is called by addReadStreamLazy -> pumpEntries, and the second one by end() -> pumpEntries. This second call pops an entry which is currently being pumped and isn’t 'FILE_DATA_DONE' yet. As a result, you pump some data twice and also get a broken zip file. I guess you tested the new API with pretty small streams, so you didn’t face the race condition when an entry is still being pumped, and the second pumpEntries call takes this entry to pump again. With bigger streams, the issue is easy to reproduce.
From my point of view, pumpEntries should not be called at the end of addReadStreamLazy. Removing this pumpEntries call solves both issues I described. This is also how I implemented it in my PR (#75) two years ago, and this manually patched version has been running on our PROD servers for two years already. It’s just an option; maybe you have another way to solve it.
Anyway, thank you for your work!

@thejoshwolfe
Copy link
Owner

Oh! you found a bug. I was setting the in-progress state for the entry at the wrong time. good catch!

My idea to solve issue #74 was to create the next readable stream only when the currently streaming entry is finished

yep that's exactly my idea as well. I just implemented it wrong.

fix published in version 3.3.1. let me know if you still encounter buggy behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants