Download from Blob storage gets stuck sometimes and never completes #22321

Closed

kotewar opened this issue Jun 21, 2022 · 54 comments
Assignees
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • feature-request: This issue requires a new behavior in the product in order to be resolved.
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team.
  • Storage: Storage Service (Queues, Blobs, Files)

Comments

@kotewar

kotewar commented Jun 21, 2022

  • Package Name: Azure/azure-storage-node
  • Package Version: v12.9.0
  • Operating system: Ubuntu 20.04.4 LTS
  • nodejs
    • version: v16.13.0
  • browser
    • name/version: NA
  • typescript
    • version:
  • Is the bug related to documentation in

Describe the bug
Download from Blob storage fails to complete and gets stuck for a very long time. Here's an example run (with info logs from the SDK): https://github.com/kotewar/cron-action-test-download-bug/runs/6935926824?check_suite_focus=true

Screenshot 2022-06-20 at 11 11 10 AM

This doesn't happen every time, but once in a while the download gets stuck for many users. Several issues have been raised about this by users of actions/cache.

References-

To Reproduce
Steps to reproduce the behavior:
As this is an intermittent issue, it can be reproduced by scheduling a GitHub Action that creates a very large cache. This workflow file can be used for that.

Expected behavior
The download should complete 100% every time and not get stuck intermittently.

Screenshots
Screenshot 2022-06-20 at 11 11 10 AM

Additional context
The same file gets downloaded by multiple runners most of the time, and this issue is mostly seen when the same file is being downloaded in parallel from Azure Blob Storage.

@ghost added the needs-triage, customer-reported, and question labels on Jun 21, 2022
@azure-sdk added the Client, needs-team-triage, and Storage labels on Jun 21, 2022
@ghost removed the needs-triage label on Jun 21, 2022
@xirzec transferred this issue from Azure/azure-sdk-for-js on Jun 21, 2022
@ghost added the needs-triage label on Jun 21, 2022
@xirzec transferred this issue from Azure/azure-sdk-for-java on Jun 21, 2022
@xirzec
Member

xirzec commented Jun 21, 2022

Moving this back over, I got confused by the linked repro having 'java-example' in the name and thought this had been misfiled.

@kotewar I assume the package you are using is actually @azure/storage-blob? It looks like your version is slightly out of date; can you reproduce this issue with the latest (12.10.0)?

@xirzec removed the needs-triage and needs-team-triage labels on Jun 21, 2022
@kotewar
Author

kotewar commented Jun 26, 2022

@xirzec - I used v12.10.0 and I am able to reproduce the same issue.

Logs here - https://github.com/kotewar/cron-action-test-download-bug/runs/7047939232?check_suite_focus=true

@dhadka

dhadka commented Jun 27, 2022

For reference, https://github.com/actions/toolkit/blob/main/packages/cache/src/internal/downloadUtils.ts#L212 shows how actions/cache is using the Azure SDK. Note it is passing in

     tryTimeoutInMs: options.timeoutInMs

which is set to 30000 (30 seconds) by default. From the docs:

Optional. Indicates the maximum time in ms allowed for any single try of an HTTP request. A value of zero or undefined means no default timeout on SDK client, Azure Storage server's default timeout policy will be used.

And from the screenshots provided in the earlier comments, it appears this value isn't having the intended effect, as we would expect the stuck request to time out (possibly with a few retries) much faster than what's seen.
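For context, this is roughly how that option is wired up (a sketch following the pattern in the linked downloadUtils.ts, not the exact actions/cache source; the function and parameter names are illustrative):

    import { BlockBlobClient } from "@azure/storage-blob";

    // Sketch: the per-try timeout is applied via the client's retry options, and the
    // download itself is a downloadToBuffer call that the SDK internally splits into
    // parallel range requests.
    async function downloadSegment(
      sasUrl: string,        // illustrative: SAS URL of the cache archive blob
      offset: number,
      count: number,
      timeoutInMs = 30000    // corresponds to options.timeoutInMs in actions/cache
    ): Promise<Buffer> {
      const client = new BlockBlobClient(sasUrl, undefined, {
        retryOptions: { tryTimeoutInMs: timeoutInMs }
      });
      return client.downloadToBuffer(offset, count, {
        concurrency: 8,
        onProgress: p => console.log(`Received ${p.loadedBytes} bytes`)
      });
    }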

@EmmaZhu
Member

EmmaZhu commented Jun 29, 2022

Hi @kotewar ,

I'm trying to get a repro by purely calling the download interface in @azure/storage-blob.

In the meantime, could you share how you use @actions/cache to trigger the download from storage? It's difficult to debug it from a workflow on GitHub.

Thanks
Emma

@kotewar
Author

kotewar commented Jun 29, 2022

Hi @EmmaZhu,
The downloadCacheStorageSDK function is what we use for downloading from Azure Blob Storage using the BlockBlobClient.

@djaglowski

In case it is helpful, you can find many examples of this error occurring within the failures of this workflow. All or nearly all failures that show a runtime of 6h Xm Ys are due to this problem. There are roughly 40 examples in the last week.

@EmmaZhu
Member

EmmaZhu commented Jul 4, 2022

@djaglowski ,

I don't know which section of your workflow downloads with the storage SDK, so I cannot find the relevant log in it.

@kotewar, thanks for pointing me to the code. I have copied your code to try to get a repro and have succeeded in downloading an 8 GB blob.

From the screenshot you pasted, it seems to be stuck at downloaded length 2143289343.
In my testing, the following log line is printed when it completes downloading the first segment and starts writing the buffer to file: "Received 2143289343 of 8589934592 (25.0%), 12.5 MBs/sec".
After it writes this segment to file, it prints the following log and starts downloading another segment:
"Received 2147483647 of 8589934592 (25.0%), 1.1 MBs/sec"

I'm guessing it may be spending a long time writing to the file, which would look like no progress updates for a long time. Could you add more logs to see whether the workflow is busy writing the buffer to file?
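For example, something like this around the write could confirm it (hypothetical instrumentation, not existing actions/cache code; archivePath and segment are illustrative):

    import * as fs from "fs";

    // Hypothetical instrumentation: time how long writing each downloaded segment to disk takes.
    async function writeSegmentWithTiming(archivePath: string, segment: Buffer): Promise<void> {
      const start = Date.now();
      await fs.promises.appendFile(archivePath, segment);
      console.log(`Wrote ${segment.length} bytes to ${archivePath} in ${Date.now() - start} ms`);
    }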

Meanwhile, I'll keep running my repro sample on blobs of various sizes to see whether I can get a repro.

Thanks
Emma

@djaglowski

@EmmaZhu, apologies for not specifying.

Here's a direct link. You'll find these failures in the lint-matrix and unittest-matrix jobs, under the Cache Go step.

@kotewar
Author

kotewar commented Jul 6, 2022

@EmmaZhu, as I mentioned, the failure doesn't happen every time, so if you check the successful jobs in the same run, you'll notice that the same file or segment where it got stuck was downloaded successfully by other jobs.
So yes, this might take time for you to reproduce, but one observation from our end has been the following:
This issue mostly occurs when the same file is being downloaded in parallel by multiple jobs.
Not sure if our finding helps you, but I thought I'd bring it up here in case it gives you a clue.

I am attaching the whole workflow logs and the failure execution logs. We had enabled debugging in the Azure SDK to show the download progress with headers and responses. Hope this helps:
logs_83.zip
Failure logs.txt

@tiwarishub

👋 We are seeing multiple customer issues created for this problem. Do we have any update on this?

@bishal-pdMSFT

@EmmaZhu, bumping this up as it is affecting more and more customers. The worst part is that the download gets stuck forever, and hence the GitHub Actions workflow run gets stuck. Since the run minutes are billed, this is causing a lot of concern.

There are two issues in play here:

  1. The download gets stuck - the repro steps and SDK logs have been provided already, but please let us know if any more information is needed to debug further.
  2. The timeout is not honoured - I think it is higher priority to fix this first, as it will at least fail the stuck download faster.

@EmmaZhu
Member

EmmaZhu commented Jul 26, 2022

Hi @bishal-pdMSFT ,

A stuck download can have several causes. For now, I can reproduce a hang by unplugging the network cable from my local machine. For a download, the SDK splits it into small pieces and sends out a download request for each piece in parallel. We'll need to set a timeout for each request to download the small chunks. I'll make a fix ASAP.
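Conceptually, the per-chunk timeout amounts to something like the following (an illustrative sketch using an AbortController, not the actual SDK change):

    import { BlobClient } from "@azure/storage-blob";

    // Illustrative only: give each chunk request its own deadline so one stalled
    // connection cannot hang the whole download.
    async function downloadChunkWithDeadline(
      blobClient: BlobClient,
      offset: number,
      count: number,
      perChunkTimeoutMs = 30000
    ): Promise<Buffer> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), perChunkTimeoutMs);
      try {
        return await blobClient.downloadToBuffer(offset, count, {
          abortSignal: controller.signal
        });
      } finally {
        clearTimeout(timer);
      }
    }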

Thanks
Emma

@tiwarishub

We'll need to set a timeout for each request to download the small chunks. I'll make a fix ASAP.

👋 @EmmaZhu, any update on the fix, or can you provide a timeline for it? We are constantly getting more issues related to this on the actions/cache repo.

@EmmaZhu
Member

EmmaZhu commented Jul 29, 2022

@tiwarishub ,

We are planning a preview release in August and a GA release in September. Would that work for you? Or would a daily build package work for you?

Thanks
Emma

@bishal-pdMSFT

@EmmaZhu this would qualify for a hot-patch. Can we take that route? We need a quick resolution.

@islishude

The docker push action has this issue too, but it uses self-implemented code.

@kotewar
Author

kotewar commented Nov 16, 2022

@islishude, does it use the same SDK even though the code is self-implemented, or is the upload completely self-implemented?
Also, does the Docker registry you are pushing to use Azure Blob Storage for storing the images?

@islishude

@islishude, does it use the same SDK even though the code is self-implemented, or is the upload completely self-implemented?

Also, does the Docker registry you are pushing to use Azure Blob Storage for storing the images?

The docker push action uses BuildKit; BuildKit is written in Go and supports the GitHub Actions cache.

Here is a log snapshot from my private repository; it takes 1 hour to save the cache.

image

@EmmaZhu
Member

EmmaZhu commented Nov 17, 2022

Hi @kotewar ,

Could you share some details about the test environment:

  • the data: how many blobs there are, their blob types, and their sizes
  • where the client and the account are located: is the client on a VM in the same DC as the account, on a VM in a different DC, or on a machine outside Azure?
  • if the client is located on a VM, could you also share the size of the VM?

I'll do a similar test to try to get a repro.

Thanks
Emma

@kotewar
Author

kotewar commented Nov 22, 2022

Hi @EmmaZhu,

the data: how many blobs there are, their blob types, and their sizes

These results are from a production environment. We use the storage accounts to store thousands of caches on a daily basis. The blobs are all compressed files (for example, tar) that we download and extract, and the sizes range from a few KB to a few GB (below 10 GB). We've seen this issue only when the file size is big, though. Is there any possibility of server-side throttling here?

where the client and the account are located: is the client on a VM in the same DC as the account, on a VM in a different DC, or on a machine outside Azure?

The runners are Standard_DS2_v2 virtual machines in Microsoft Azure (more details here). Regarding the location of the DC, they seem to be spread across multiple regions.

The client is a GitHub action that can be found here.

if the client is located on a VM, could you also share the size of the VM?

I've shared the remaining info on Teams.

kodiakhq bot pushed a commit to vercel/vercel that referenced this issue Dec 10, 2022
Use `actions/cache` directly instead of relying on `actions/setup-node` to see if this solves [the hanging restore](https://github.com/vercel/vercel/actions/runs/3660219547/jobs/6187125554).

- Related to actions/cache#810
- Related to Azure/azure-sdk-for-js#22321


This also shaves a minute off cache restore time since we are caching 250MB instead of multiple GB.
@kotewar
Author

kotewar commented Dec 19, 2022

Hi @EmmaZhu,

Do we have any update on this?

@kotewar
Author

kotewar commented Dec 27, 2022

Hi @EmmaZhu, 👋🏽
Do we have any update on this?

@ilyadinaburg

Hi, I'm having the same issue using the Python Azure SDK. Downloads of rather large files (several GB) sometimes get stuck (we run it 48 times per day and in 10%-15% of tries it just gets stuck) and never time out, even when I have set the timeout. It looks like it gets stuck close to the end of the file it tries to download. Here is the part of the log showing how it gets stuck:

[2023-01-09, 06:21:33 UTC] {connectionpool.py:442} DEBUG - https://xxxx.blob.core.windows.net:443 "GET /folder/file.json?timeout=1800 HTTP/1.1" 206 4194304
...
...
...
[2023-01-09, 06:41:32 UTC] {connectionpool.py:442} DEBUG - https://xxxx.blob.core.windows.net:443 "GET /folder/file.json?timeout=1800 HTTP/1.1" 206 4194304
[2023-01-09, 06:41:33 UTC] {connectionpool.py:442} DEBUG - https://xxxx.blob.core.windows.net:443 "GET /folder/file.json?timeout=1800 HTTP/1.1" 206 592006

@kotewar
Author

kotewar commented Jan 9, 2023

@ilyadinaburg, thanks for bringing this up. Until today I had assumed this issue only affected the JavaScript SDK.
@EmmaZhu, can you please loop the storage team into this issue? It seems to be getting more and more widespread and could be related to backend throttling that we might not be aware of. 🤔

@ilyadinaburg

Btw, sometimes when the download is stuck I also see the following:
[connectionpool] [INFO] Resetting dropped connection: <hostname>
which probably means that Azure is dropping connections for some reason.

@EmmaZhu
Member

EmmaZhu commented Jan 10, 2023

@ilyadinaburg ,

If you have a recent repro, could you share the account name and the time segment when the download was happening? We can take a look from the service side to find some clues.

@nineteen528

Hi @EmmaZhu, I happened to find this ticket while searching. I have a similar problem with an Azure Blob download getting stuck; I have a ticket here: https://learn.microsoft.com/en-us/answers/questions/1122618/azure-blob-download-alway-stuck-with-no-error-afte.html. In general, our project needs to keep reading data from Gen2 storage, and then a Spark job does some aggregation. But from the log, sometimes reading the blob data just gets stuck: no error, no response, the code cannot go on. Can you take a look at this?

@kotewar
Author

kotewar commented Jan 22, 2023

@EmmaZhu, were you able to take a look from the service side at such events? Do let us know if anything is being done about this.

@EmmaZhu
Member

EmmaZhu commented Jan 30, 2023

Hi @kotewar, we would need an account and the time segment when the issue happened in order to look into it. If you have a recent repro, could you share it with me so I can take a look?

@kotewar
Author

kotewar commented Jan 30, 2023

I'll try to simulate it and share the info.

@kotewar
Author

kotewar commented Feb 3, 2023

We would need an account and the time segment when the issue happened in order to look into it. If you have a recent repro, could you share it with me so I can take a look?

@EmmaZhu 👋🏽
I have pinged you the details on Teams.

@EmmaZhu
Member

EmmaZhu commented Feb 10, 2023

I have looked into the account you shared with me.

From the service side, I see that there was a timeout when trying to write to the client, which means the network might have had an issue and the service could not write to the client at that moment.

The JS SDK should now time out if there's no data to download for a long time. Did you see the timeout error, or did the progress updates just stop without any error being reported?

@kotewar
Author

kotewar commented Feb 10, 2023

Thanks for taking a look at this @EmmaZhu

Currently we have our own custom timeout implemented to stop the download in case anything like this happens (roughly along the lines sketched below), because the last time I tested, the timeouts were happening but not consistently.
And right now our concern is more about the network issues you mentioned, because this is happening very frequently for us when trying to download anything from Blob storage.
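For reference, the kind of custom timeout described above can look roughly like this (an illustrative stall watchdog, not the actual actions/cache implementation):

    import { BlockBlobClient } from "@azure/storage-blob";

    // Illustrative stall watchdog: abort the download if no progress event arrives
    // within stallTimeoutMs, instead of relying only on the SDK's per-try timeout.
    async function downloadWithStallWatchdog(
      client: BlockBlobClient,
      offset: number,
      count: number,
      stallTimeoutMs = 60000
    ): Promise<Buffer> {
      const controller = new AbortController();
      let timer = setTimeout(() => controller.abort(), stallTimeoutMs);
      const resetTimer = () => {
        clearTimeout(timer);
        timer = setTimeout(() => controller.abort(), stallTimeoutMs);
      };
      try {
        return await client.downloadToBuffer(offset, count, {
          abortSignal: controller.signal,
          onProgress: resetTimer // each progress callback pushes the deadline out
        });
      } finally {
        clearTimeout(timer);
      }
    }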

Can you please confirm whether these are actually network issues, or whether there is some kind of throttling happening on the server end, because we continuously push and pull a lot of data to and from Blob storage.

The reason I am focusing on throttling is that when we download data from Blob storage using our own http-client, we don't see this issue.

@bishal-pdMSFT

From the service side, I see that there was a timeout when trying to write to the client, which means the network might have had an issue and the service could not write to the client at that moment.

@EmmaZhu, can you please share more details about this? Is it a source-side error (i.e. the client not being able to connect to the storage endpoint) or a storage-endpoint-side error (i.e. storage throttling us, routing issues, etc.)?

@EmmaZhu
Member

EmmaZhu commented Feb 15, 2023

I talked with @kotewar.
For the download failure, the service had successfully handled the request and sent the response headers to the client, and was then writing the response body to the client. The service got a connection-break error after only part of the response body had been written to the client.

Currently, the download interface in the SDK sends only one request for the download operation. Downloading a large file requires keeping the connection alive for a pretty long time, so there's a big chance of the connection breaking during that time.
A possible solution for downloading a large file more stably would be to split the large blob into small pieces and send a download request for each piece, as sketched below.

I'll put the downloading/uploading improvement in our backlog.
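As a rough illustration of that split-into-pieces approach (a sketch only, built on the documented getProperties and downloadToBuffer calls; chunkSize and destPath are illustrative):

    import * as fs from "fs";
    import { BlobClient } from "@azure/storage-blob";

    // Sketch: fetch a large blob as a sequence of bounded range requests instead of one
    // long-lived request, so a broken connection only costs a single chunk.
    async function downloadInRanges(
      blobClient: BlobClient,
      destPath: string,
      chunkSize = 8 * 1024 * 1024
    ): Promise<void> {
      const { contentLength = 0 } = await blobClient.getProperties();
      const file = await fs.promises.open(destPath, "w");
      try {
        for (let offset = 0; offset < contentLength; offset += chunkSize) {
          const count = Math.min(chunkSize, contentLength - offset);
          // Each iteration is an independent, retriable HTTP range request.
          const chunk = await blobClient.downloadToBuffer(offset, count);
          await file.write(chunk, 0, chunk.length, offset);
        }
      } finally {
        await file.close();
      }
    }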

@xirzec added the feature-request label and removed the question label on Mar 30, 2023
@ghost added the needs-team-attention label on Mar 30, 2023
@about14sheep

about14sheep commented Jun 17, 2024

I have found that the native read stream implementations from both AWS S3 and Azure Blob Storage struggle when tasked with large files. They can also struggle on relatively small files if you are doing a lot of processing while streaming.

Taking what @EmmaZhu said about downloading the file a piece at a time, I wrote a package for Node.js users. It basically uses the BlobClient class's downloadToBuffer method to download fixed chunks of the file. The functionality is wrapped in a Node.js readable stream, so it works pretty well as a drop-in replacement for the native read stream.

Just thought I would share in case anyone was struggling with a solution:
https://www.npmjs.com/package/az-blob-readstream
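For anyone sketching the same idea by hand, the general shape looks something like this (illustrative only, not the package's actual API):

    import { Readable } from "stream";
    import { BlobClient } from "@azure/storage-blob";

    // Illustrative: wrap chunked downloadToBuffer calls in a Node.js Readable so consumers
    // see an ordinary read stream while each chunk stays a short, bounded request.
    function createChunkedReadStream(
      blobClient: BlobClient,
      totalSize: number,
      chunkSize = 4 * 1024 * 1024
    ): Readable {
      let offset = 0;
      return new Readable({
        read() {
          if (offset >= totalSize) {
            this.push(null); // signal end of blob
            return;
          }
          const count = Math.min(chunkSize, totalSize - offset);
          blobClient
            .downloadToBuffer(offset, count)
            .then(chunk => {
              offset += count;
              this.push(chunk);
            })
            .catch(err => this.destroy(err));
        }
      });
    }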


Hi @kotewar, we deeply appreciate your input into this project. Regrettably, this issue has remained unresolved for over 2 years and inactive for 30 days, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.

@github-actions bot closed this as not planned on Jul 19, 2024
@github-actions bot locked and limited conversation to collaborators on Jul 19, 2024