Download from Blob storage gets stuck sometimes and never completes #22321
Comments
Moving this back over, I got confused by the linked repro having 'java-example' in the name and thought this had been misfiled. @kotewar I assume the package you are using is actually
@xirzec - I used v12.10.0 and I am able to reproduce the same issue. Logs here - https://github.com/kotewar/cron-action-test-download-bug/runs/7047939232?check_suite_focus=true
For reference, https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/downloadUtils.ts#L212 shows how the timeout is configured and the value it is set to. And from the screenshots provided in the earlier comments, it appears this value isn't having the intended effect, as we would expect the stuck request to time out (possibly after a few retries) much faster than what's seen.
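For readers following along, here is a minimal sketch of the behaviour being described: an overall timeout enforced around the SDK call so a stuck transfer gets aborted instead of hanging forever. The wiring below (native `AbortController` plus a timer) and the `timeoutInMs` parameter name are illustrative assumptions, not the exact code in downloadUtils.ts.

```ts
// Illustrative sketch only: wrap the SDK download in an overall timeout so a
// stuck transfer is aborted instead of hanging forever. `timeoutInMs` is an
// assumed parameter name; the @azure/storage-blob calls are real.
import { BlobClient } from "@azure/storage-blob";

async function downloadWithTimeout(blobClient: BlobClient, timeoutInMs: number): Promise<Buffer> {
  const controller = new AbortController(); // native AbortController (Node 15+)
  const timer = setTimeout(() => controller.abort(), timeoutInMs);
  try {
    // The abort signal propagates to every in-flight range request.
    return await blobClient.downloadToBuffer(0, undefined, {
      abortSignal: controller.signal,
      onProgress: (p) => console.log(`downloaded ${p.loadedBytes} bytes`),
    });
  } finally {
    clearTimeout(timer);
  }
}
```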
Hi @kotewar , I'm trying to get a repro by purely calling the download interface within @azure/storage-blob. In the meantime, could you share how you use @actions/cache to trigger downloading from storage? It's difficult to debug it from a workflow in GitHub. Thanks
Hi @EmmaZhu,
In case it is helpful, you can find many examples of this error occurring within the failures of this workflow. All or nearly all failures that show a runtime of
I don't know which section of your workflow would need to download with the storage SDK, so I cannot find the needed log from it. @kotewar , thanks for pointing me to the code. I copied your code to try to get a repro and succeeded in downloading an 8 GB blob. From the screenshot you pasted, it seems to be stuck at downloaded length: 2143289343. I'm guessing it may be spending a long time writing the buffer to file, which would appear as no progress update for a long while. Could you help add more logs to see whether the workflow is working on writing the buffer to file? Meanwhile, I'll keep running my repro sample on blobs of various lengths to see whether I can get a repro. Thanks
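A hedged sketch of the kind of extra logging being asked for here: progress events from the SDK, plus explicit logs around the file write, so a hang can be attributed to either the network phase or the disk phase. The function and variable names are placeholders; only the `@azure/storage-blob` calls are real.

```ts
// Sketch of the extra logging suggested above: progress events from the SDK,
// plus explicit logs around the file write. Connection string, container and
// blob names are placeholders.
import * as fs from "fs/promises";
import { BlobServiceClient } from "@azure/storage-blob";

async function downloadAndSave(connectionString: string, containerName: string, blobName: string, destPath: string): Promise<void> {
  const blobClient = BlobServiceClient.fromConnectionString(connectionString)
    .getContainerClient(containerName)
    .getBlobClient(blobName);

  // Network phase: every progress callback proves bytes are still arriving.
  const data = await blobClient.downloadToBuffer(0, undefined, {
    concurrency: 8,
    onProgress: (p) => console.log(`downloaded length: ${p.loadedBytes}`),
  });

  // Disk phase: if the job hangs between these two lines, the file write is
  // the culprit rather than the download itself.
  console.log("download finished, writing buffer to file...");
  await fs.writeFile(destPath, data);
  console.log("file write finished");
}
```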
@EmmaZhu, apologies for not specifying. Here's a direct link. You'll find these failures in the
@EmmaZhu, as I mentioned, the failure doesn't always happen, so if you check the successful jobs in the same run, you'll notice that the same file or segment where it got stuck was downloaded successfully by other jobs. I am attaching the whole workflow logs and the failure execution logs; we had enabled debugging in the Azure SDK to show the download progress with headers and responses. Hope this helps -
Any update on this? Here are two more occurrences within the same workflow run:
👋 , we are seeing multiple customer issues being created for this problem. Do we have any update on this?
@EmmaZhu bumping this up as this is increasingly affecting more and more customers. The worst part is that the download gets stuck forever, and hence the GitHub Actions workflow run gets stuck. Since the run minutes are billed, this is causing a lot of concern. There are two issues in play here:
Hi @bishal-pdMSFT , the hang can have several causes. For now, I can reproduce a hang by unplugging the network cable from my local machine. For a download, the SDK splits it into small pieces and sends out a download request for each piece in parallel. We'll need to set a timeout for each request that downloads a small chunk. I'll make a fix ASAP. Thanks
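A rough illustration of the fix described above (not the actual SDK patch): split the blob into chunks and give each chunk request its own timeout, so a single stalled connection cannot hang the whole download. The chunk size, timeout value, and sequential loop are simplifying assumptions; the SDK itself downloads pieces in parallel.

```ts
// Rough illustration of the per-chunk timeout idea (not the actual SDK fix):
// every ranged request gets its own abort timer, so one stalled connection
// cannot hang the whole download. Chunk size and timeout are arbitrary, and
// the loop is sequential for clarity.
import { BlobClient } from "@azure/storage-blob";

async function downloadChunkedWithTimeouts(
  blobClient: BlobClient,
  chunkSize = 8 * 1024 * 1024,
  perChunkTimeoutMs = 30_000
): Promise<Buffer> {
  const { contentLength = 0 } = await blobClient.getProperties();
  const chunks: Buffer[] = [];

  for (let offset = 0; offset < contentLength; offset += chunkSize) {
    const count = Math.min(chunkSize, contentLength - offset);
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), perChunkTimeoutMs);
    try {
      // A ranged request that stalls is aborted after perChunkTimeoutMs.
      chunks.push(await blobClient.downloadToBuffer(offset, count, { abortSignal: controller.signal }));
    } finally {
      clearTimeout(timer);
    }
  }
  return Buffer.concat(chunks);
}
```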
👋 @EmmaZhu, any update on the fix, or can you provide a timeline for it? We are constantly getting more issues related to this on the actions/cache repo.
We are planning to have a preview release in August and a GA release in September. Would that work for you? Or would a daily build package work for you? Thanks
@EmmaZhu this would qualify for a hot-patch. Can we take that route? We need a quick resolution.
The docker push action has this issue too, but it uses self-implemented code.
@islishude, does it use the same SDK even though the code is self-implemented, or is it a completely self-implemented upload?
The docker push action uses BuildKit, which is written in Go and supports the GitHub Actions cache. Here is a log snapshot from my private repository; it takes 1 hour to save the cache.
Hi @kotewar , could you share some details about the test environment:
I'll do a similar test to try to get a repro. Thanks
Hi @EmmaZhu,
These results are from our production environment. We use the storage accounts to store lots (thousands) of caches on a daily basis. The blobs are all compressed files (for example, tar) that we download and extract, and the size ranges from a few KBs to a few GBs (below 10 GB). We've only seen this issue when the file size is big, though. Is there any possibility of server-side throttling here?
The VMs run on
The client is a GitHub Action that can be found here.
Shared the remaining info on Teams.
Use `actions/cache` directly instead of relying on `actions/setup-node` to see if this solves [the hanging restore](https://github.com/vercel/vercel/actions/runs/3660219547/jobs/6187125554).
- Related to actions/cache#810
- Related to Azure/azure-sdk-for-js#22321
This also shaves a minute off cache restore time since we are caching 250MB instead of multiple GB.
Hi @EmmaZhu, do we have any update on this?
Hi @EmmaZhu, 👋🏽
Hi, I'm having the same issue using the Python Azure SDK. Downloading rather large files (several GBs) sometimes gets stuck (we run the download 48 times per day and in 10-15% of tries it just gets stuck) and never times out, even when I have set the timeout. It looks like it gets stuck close to the end of the file it tries to download. Here is the part of the log showing how it gets stuck:
@ilyadinaburg, thanks for bringing this up. Until today I was assuming this issue only affected the JavaScript SDK.
Btw, sometimes when the download is stuck I also see the following:
If you have a recent repro, could you share the account name and the time segment when the download was happening? We can take a look from the service side to find some clues.
Hi @EmmaZhu, I happened to find this ticket while searching; I have a similar question about an Azure Blob download getting stuck. I have a ticket here: https://learn.microsoft.com/en-us/answers/questions/1122618/azure-blob-download-alway-stuck-with-no-error-afte.html. In general, our project needs to keep reading data from Gen2 storage, and then a Spark job does some aggregation. But from the log, sometimes reading the blob data just gets stuck: no error, no response, and the code cannot go on. Can you take a look at this?
@EmmaZhu , were you able to take a look from the service side for such events? Do let us know if anything is being done regarding this.
Hi @kotewar , we would need an account name and the time segment when the issue happens to look into it. If you have a recent repro, could you share it with me so I can take a look?
I'll try to simulate it and share info on the same.
@EmmaZhu 👋🏽
I have looked into the account you shared with me. From the service side, I see that there was a timeout when trying to write to the client, which means the network might have had an issue and the service could not write to the client at that moment. The JS SDK should now time out if there's no data to download for a long time; did you see the timeout error, or did the progress updates just stop without any error being reported?
Thanks for taking a look at this, @EmmaZhu. Currently we have our own custom timeout implemented to stop the download in case anything like this happens, because the last time I tested, the timeouts were happening but not consistently. Can you please confirm whether these are actually network issues, or whether there is any kind of throttling happening from the server end, since we keep pushing and pulling a lot of data from blob storage continuously? The reason I am focusing on throttling is that when we try downloading data from blob storage using our http-client, we don't see this issue.
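For anyone wanting to do something similar, here is a hedged sketch of a "no progress" watchdog of the kind described above (not the actual implementation used in actions/cache): reset a timer on every progress event and abort the transfer if no bytes arrive within an idle window.

```ts
// Hedged sketch of a "no progress" watchdog (not the actual actions/cache
// implementation): reset a timer on every progress event and abort the
// transfer if no bytes arrive within idleTimeoutMs.
import { BlobClient } from "@azure/storage-blob";

async function downloadWithIdleWatchdog(blobClient: BlobClient, idleTimeoutMs = 60_000): Promise<Buffer> {
  const controller = new AbortController();
  let timer = setTimeout(() => controller.abort(), idleTimeoutMs);

  const resetWatchdog = () => {
    clearTimeout(timer);
    timer = setTimeout(() => controller.abort(), idleTimeoutMs);
  };

  try {
    return await blobClient.downloadToBuffer(0, undefined, {
      abortSignal: controller.signal,
      onProgress: resetWatchdog, // any progress keeps the download alive
    });
  } finally {
    clearTimeout(timer);
  }
}
```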
@EmmaZhu can you please share more details about this? Is it a source-side error (i.e. the client not being able to connect to the storage endpoint) or a storage-endpoint-side error (i.e. storage throttling us, having routing issues, etc.)?
Talked with @kotewar . Currently, the download interface in the SDK sends only one request per download operation. Downloading a large file requires keeping the connection alive for a pretty long time, so there's a big chance of the connection breaking during that window. I'll put the downloading/uploading improvement in our backlog.
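One way to approximate that improvement from the calling side, assuming ranged reads are acceptable: fetch the blob in separate `download(offset, count)` calls and append each range to a local file, so no single HTTP connection has to stay open for the whole transfer. The range size and file handling below are illustrative assumptions, not official SDK guidance.

```ts
// Illustrative workaround for the single long-lived connection: each
// download(offset, count) call is a fresh, short-lived request whose body is
// appended to a local file. Range size and file handling are assumptions.
import * as fs from "fs";
import { pipeline } from "stream/promises";
import { BlobClient } from "@azure/storage-blob";

async function downloadInRanges(blobClient: BlobClient, destPath: string, rangeSize = 32 * 1024 * 1024): Promise<void> {
  const { contentLength = 0 } = await blobClient.getProperties();

  for (let offset = 0; offset < contentLength; offset += rangeSize) {
    const count = Math.min(rangeSize, contentLength - offset);
    const response = await blobClient.download(offset, count);
    // Overwrite on the first range, append on the rest.
    await pipeline(
      response.readableStreamBody!,
      fs.createWriteStream(destPath, { flags: offset === 0 ? "w" : "a" })
    );
  }
}
```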
I have found that the native read stream implementations from both AWS S3 and Azure Blob Storage struggle when tasked with large files. They can also struggle on relatively small files if you are doing a lot of processing while streaming. Taking what @EmmaZhu said about downloading pieces of the file at a time, I wrote a package for Node.js users. It basically takes the BlobClient class's Just thought I would share in case anyone was struggling to find a solution:
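The package itself isn't linked above, but the general idea can be sketched as a Node.js `Readable` that pulls the blob one range at a time (a hypothetical sketch, not the commenter's actual implementation), so downstream processing backpressure never leaves one long-lived download request sitting idle.

```ts
// Hypothetical sketch of the idea behind such a package (not the commenter's
// actual code): a Readable stream that fetches the blob one range at a time.
import { Readable } from "stream";
import { BlobClient } from "@azure/storage-blob";

function createRangedBlobStream(blobClient: BlobClient, rangeSize = 4 * 1024 * 1024): Readable {
  let offset = 0;
  let total: number | undefined;

  return new Readable({
    read() {
      (async () => {
        if (total === undefined) {
          total = (await blobClient.getProperties()).contentLength ?? 0;
        }
        if (offset >= total) {
          this.push(null); // end of blob
          return;
        }
        const count = Math.min(rangeSize, total - offset);
        const start = offset;
        offset += count; // advance before awaiting so later reads stay consistent
        this.push(await blobClient.downloadToBuffer(start, count));
      })().catch((err) => this.destroy(err as Error));
    },
  });
}
```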
Hi @kotewar, we deeply appreciate your input into this project. Regrettably, this issue has remained unresolved for over 2 years and inactive for 30 days, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.
Describe the bug
Download from blob storage fails to complete and gets stuck for a very long time. Here's an example run (with `info` logs from the SDK): https://github.com/kotewar/cron-action-test-download-bug/runs/6935926824?check_suite_focus=true
This doesn't always happen, but once in a while it gets stuck for many users. Many issues have been raised about this by users of actions/cache.
References:
To Reproduce
Steps to reproduce the behavior:
As this is an intermittent issue, it can be reproduced by having a scheduled GitHub Action that creates a huge cache. This workflow file can be used for that.
Expected behavior
The download should complete 100% every time and not get stuck intermittently.
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
The same file gets downloaded by multiple runners most of the time, and this issue is mostly seen when the same file is being downloaded in parallel from Azure Blob Storage.