
Controlling TransferManager in azblob 0.5.1 #19579

Closed · AsafMah opened this issue Nov 16, 2022 · 16 comments · Fixed by #19635
Labels: Client (This issue points to a problem in the data-plane of the library.), Storage (Storage Service (Queues, Blobs, Files))
@AsafMah commented Nov 16, 2022

Hey, we want to migrate from azblob 0.3 to 0.5.1.

For memory-saving reasons, we currently expose a way for the client to decide whether the TransferManager is a "StaticBuffer" or a "SyncPool".

It seems that in 0.5.1, though, TransferManager became an internal type, and it is no longer configurable by the user.
https://github.com/Azure/azure-sdk-for-go/blob/72b362adee0e743c1cbb80870384044d0f855480/sdk/storage/azblob/blockblob/models.go

Is there a way for the user to control it? There are the concurrency and block-size arguments, but they now seem to only instantiate a "StaticBuffer" kind.
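
For context, this is roughly what we do today; a minimal sketch from memory of the v0.3.x API, so the constructor, option, and type names here are assumptions rather than verified signatures:

```go
import (
	"context"
	"io"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

// uploadWithSyncPool shows the old pattern: the caller builds a TransferManager
// (sync.Pool backed here) and hands it to the upload options. Sketch only.
func uploadWithSyncPool(ctx context.Context, bb *azblob.BlockBlobClient, body io.Reader) error {
	tm, err := azblob.NewSyncPool(4*1024*1024, 16) // 4 MiB buffers, up to 16 in flight
	if err != nil {
		return err
	}
	defer tm.Close()

	_, err = bb.UploadStream(ctx, body, azblob.UploadStreamOptions{
		TransferManager: tm, // no equivalent knob is exposed in 0.5.1
	})
	return err
}
```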

Thanks.

@jhendrixMSFT (Member)

When we did the refactor, this was intentional. However, it looks like we've regressed some functionality.

CC @JeffreyRichter

@JeffreyRichter (Member)

The Go SDK is the ONLY SDK that has a transfer manager. When I asked about its purpose, no one was able to explain why it was necessary. So, ideally, it would go away completely unless there is a great reason for it. To ship sooner and not think about it now, we made it internal.

@jhendrixMSFT (Member)

My understanding from looking at the code is that the TransferManager interface is used to abstract different implementations for managing memory when uploading a blob, allowing apps to fine-tune how memory is allocated. At present, it looks to be used only by the copier. We used to expose two implementations that customers could select, a static buffer and a sync.Pool, with the static buffer being the default.
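
For reference, the interface in question looks roughly like this (reconstructed from memory of the azure-storage-blob-go definition, so the comments are paraphrased and details may differ):

```go
// TransferManager provides the buffers and the goroutine scheduling used
// while uploading a stream. (Sketch; see azure-storage-blob-go for the real one.)
type TransferManager interface {
	// Get returns a buffer to read the next chunk of the stream into.
	Get() []byte
	// Put returns a buffer for reuse or disposal once its chunk is staged.
	Put(b []byte)
	// Run schedules a chunk upload, typically on a pooled goroutine.
	Run(f func())
	// Close releases any resources held by the TransferManager.
	Close()
}
```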

@JeffreyRichter (Member)

I found the publicly-exposed place where copyFromReader is called: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/storage/azblob/blockblob/client.go#L484

@siminsavani-msft: Could you look through this repo's history, find the version of this function that didn't use the TransferManager and this copier stuff, revert the code back to that, and rip out the rest?

@jhendrixMSFT (Member)

@AsafMah can you tell us a bit more about your scenario? Has the lack of access to "Syncpool" negatively impacted your app/performance?

@ncw (Contributor) commented Nov 23, 2022

Please undo this @jhendrixMSFT

I'm currently converting rclone from the old SDK, and not having access to TransferManager means a loss of functionality. Rclone can allocate memory using mmap, which really helps with memory fragmentation.

It also means I can't make a workaround for #19612, which is sad :-(

@AsafMah (Author) commented Nov 23, 2022

Our use case: we publish the Kusto/ADX SDK in Go, which uses storage as a dependency.
We have a client who wants to create a huge number of SDK clients, but each SDK client has a storage client, and with each one holding a static buffer the RAM usage skyrockets.

The sync.Pool helped with this.

@JeffreyRichter (Member)

Is it not possible to use the UploadFile or UploadBuffer methods? UploadFile memory-maps the file and then calls UploadBuffer, so NO buffers are allocated by these methods at all. They are also much faster than UploadStream, because multiple parts of the file/buffer can be accessed concurrently, while UploadStream must read the stream sequentially into allocated buffers (so it is worse on both memory and speed than the UploadFile/UploadBuffer methods).
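
A minimal sketch of what that looks like against the 0.5.x blockblob client (the option field names such as BlockSize/Concurrency are from memory, so check them against the release you use):

```go
import (
	"context"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blockblob"
)

func uploadWholeFile(ctx context.Context, client *blockblob.Client, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// UploadFile memory-maps the file and uploads blocks concurrently,
	// so the SDK allocates no intermediate buffers of its own.
	_, err = client.UploadFile(ctx, f, &blockblob.UploadFileOptions{
		BlockSize:   8 * 1024 * 1024, // assumed option names; sketch only
		Concurrency: 16,
	})
	return err
}
```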

@ncw (Contributor) commented Nov 23, 2022

@JeffreyRichter not for rclone's use - it streams data between cloud providers, so the data is in an io.Reader, not in a file.

@JeffreyRichter (Member)

I'm just not getting it. So, you have an io.Reader over a network connection that is downloading data from somewhere and you want to upload that data to an Azure blob. You can call our UploadStream method and this will work fine. So, this does work. The TransferManager just allowed the reuse of some buffers, but I find it hard to believe that the perf hit of this (and the GCs) is an issue compared to the huge perf hit of uploading blobs to Azure. So, we're talking about a minuscule perf issue here and not a functionality issue. Or, am I missing something? Please give me a detailed explanation of the scenario so I can appreciate it and determine the best sustainable thing for us to do.

@ncw (Contributor) commented Nov 24, 2022

@JeffreyRichter

I'm just not getting it. [...] Please give me a detailed explanation of the scenario so I can appreciate it and determine the best sustainable thing for us to do.

It's not a performance issue, it's a memory-use issue.

As we all know, Go is quite prone to memory fragmentation. If you allocate and deallocate buffers, you'll very likely cause memory fragmentation and make your program use much more memory.

Rclone implements an mmap-based buffer system so the buffers can be returned to the OS without causing any memory fragmentation - this turns out to be important for long-running tasks to keep OS memory usage to a minimum.

If you implement your own transfer manager you can share buffers between concurrent uploads or not - it's your choice how much memory you wish to use.
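
To make that concrete, here is a rough, Unix-only sketch of an mmap-backed implementation of the old Get/Put/Run/Close contract (names invented for illustration; rclone's actual pool is more sophisticated):

```go
import "syscall"

// mmapPool hands out fixed-size buffers backed by anonymous mmap'd pages, so
// returning a buffer releases the memory straight back to the OS instead of
// fragmenting the Go heap. Error handling is minimal; sketch only.
type mmapPool struct {
	size int // buffer size in bytes
}

func (p *mmapPool) Get() []byte {
	buf, err := syscall.Mmap(-1, 0, p.size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return make([]byte, p.size) // fall back to a heap buffer
	}
	return buf
}

func (p *mmapPool) Put(b []byte) {
	// Munmap hands the pages back to the OS immediately.
	// (On a heap fallback buffer it fails harmlessly and is ignored.)
	_ = syscall.Munmap(b)
}

func (p *mmapPool) Run(f func()) { go f() }
func (p *mmapPool) Close()       {}
```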

I'll note also that without the TransferManager hook there appears to be no way to select the sync.Pool transfer manager that is already in the azblob code:

func NewSyncPool(size, concurrency int) (TransferManager, error) {

The transfer manager was put into the original Go SDK for Azure. It's probably worth reading the commit message for that change, as it lays out a lot of good reasons:

Azure/azure-storage-blob-go@b759048

Note that this has been a troublesome component, so it would be nice to re-use the working design from https://github.com/Azure/azure-storage-blob-go.

@JeffreyRichter (Member) commented Nov 24, 2022

Thanks for this info. Let me explain a bit about our philosophy and thinking about our Azure Blob SDKs...

The Azure blob service supports uploading block(s) to a blob (via StageBlock method) and then committing those blocks (via the CommitBlockList method). Our BlockBlobClient must support these operations as they are the low-level building blocks for any other higher-level operations.

As a convenience to customers, our SDK also offers some higher-level operations such as UploadBuffer and UploadFile - these functions are very efficient (allocating no memory) and they internally create some goroutines for concurrency and call the StageBlock/CommitBlockList methods to accomplish the task quickly. These high-level functions are suitable for most of our customers and, if some customers find them insufficient for their needs, then these customers can write their own functions that call StageBlock/CommitBlockList directly to get the customized behavior they desire as there is nothing special about our convenience functions. We designed and tested our convenience functions and document/support them for all our customers in perpetuity. We also focus on ease of use for the mainstream customer scenarios, but these convenience methods do not expose all the options and bells/whistles that the Azure service offers.

Originally, the Blob SDK did not offer the UploadStream method at all. The reason was because all of our SDKs (in all programming languages) take network reliability very seriously. When a network connection to Azure fails for any reason, our SDKs retry the operation. To retry the Azure operation, the stream must be seekable back to position 0 so that the stream can be re-sent to the service. An io.Reader doesn't allow seeking and therefore we did not originally offer this method. However, we had many customers complain that we didn't offer this method and so we worked on a way to implement it in a reliable fashion: we read from the stream into a buffer and then we call StageBlock to upload the buffer. If the network operation fails, we can seek to the beginning of the buffer (offset 0) and then retry the operation. We then repeat this for each block within the io.Reader stream and, when done, we call CommitBlockList.

Even with this UploadStream method now in place, there is still a reliability problem: when our method calls Read on the customer-passed io.Reader, it may return an error due to a network problem. If this happens, there is nothing our SDK can do about it, because we don't know how the io.Reader was obtained and so there is no way that we can retry an operation on it. However, we decided that this problem is not ours to fix, and our UploadStream will just fail and return the io.Reader's Read error. We strongly believe that customers should design their apps to be reliable against network failures, so we don't think customers should be passing an io.Reader over a network stream (which is not seekable) to our UploadStream method; instead, customers should write reliable code that calls our StageBlock/CommitBlockList methods directly. Note that StageBlock requires an io.ReadSeekCloser to ensure that retries work correctly.
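
A bare-bones sketch of that pattern: each block is drained into a retry-safe in-memory buffer, staged, and finally committed. The block-ID scheme and sizes here are illustrative, not the SDK's internal algorithm:

```go
import (
	"bytes"
	"context"
	"encoding/base64"
	"fmt"
	"io"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/streaming"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blockblob"
)

func uploadFromReader(ctx context.Context, client *blockblob.Client, r io.Reader, blockSize int) error {
	var ids []string
	buf := make([]byte, blockSize)

	for i := 0; ; i++ {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			// Fixed-width, base64-encoded block IDs; all IDs must be the same length.
			id := base64.StdEncoding.EncodeToString([]byte(fmt.Sprintf("block-%08d", i)))
			// bytes.Reader is seekable, so the SDK's pipeline can retry this block.
			body := streaming.NopCloser(bytes.NewReader(buf[:n]))
			if _, err := client.StageBlock(ctx, id, body, nil); err != nil {
				return err
			}
			ids = append(ids, id)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return err // a Read failure here is the caller's reliability problem
		}
	}

	_, err := client.CommitBlockList(ctx, ids, nil)
	return err
}
```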

I'm curious as to how resilient the rclone app is in the face of network failures when reading from the source file?

Now, even with all that said, we could certainly improve UploadStream's implementation to re-use our own buffers, and I'm happy to do this work. We would prefer not to expose a mechanism in the public interface that we have to document/test/support in perpetuity, and we worry that the usability of this method will suffer and that customers can easily make mistakes with it. For example, our algorithm requires buffers of a calculated size, and if the customer code allocates these buffers (via a sync.Pool or not), the buffer may not be sized correctly and our method will just have to fail. Frequently, customers blame our SDK when its method fails and then want our help to fix the problem, but the problem was that their code didn't allocate a right-sized buffer. We work hard in our SDK design to avoid the kinds of problems where our methods fail due to a bug in customer callback code and we then must help customers debug their code.

Of course, if anyone ever has special needs that our convenience functions don't accommodate, then it is also possible to implement a custom algorithm calling StageBlock/CommitBlockList directly (just like our convenience methods do). And customers are certainly welcome to start with the source code for our convenience methods and use it as a model. Our convenience methods are not designed to accommodate every customer scenario. That being said, if there are simple changes we can make to our methods to improve them for many customers, we are absolutely open to doing those things. The TransferManager is too heavy-handed, and it is too easy for a customer to mess things up, causing support issues. The people who accepted the PR with TransferManager did not consider the supportability of the SDK and the potential customer issues, and now that we are finally getting close to GA'ing this package, we are looking closely at these issues and cleaning up what it exposes.

By the way, I see that you said that Rclone implements an mmap-based buffer system so the buffers can be returned to the OS without causing any memory fragmentation, which turns out to be important for keeping OS memory usage low in long-running tasks. Based on this, calling UploadStream doesn't seem like the right choice for you anyway, as its intent is to read from the io.Reader stream into some collection of buffers. This means it may be doing additional buffer copies that you don't want either. Instead, if the mmap buffer contains the entire file, then just call UploadBuffer. If the mmap contains subsets of the entire file, then wrap the mmap in a bytes.Reader and call StageBlock directly. Both of these will avoid any additional copies or buffers at all - I think this will make things even faster and more memory-efficient than what you have today. Disclaimer: I haven't seen the rclone code and don't know exactly what it is doing, so I may be wrong about this.
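
For instance, the StageBlock route with an existing mmap'd buffer boils down to something like this (sketch; stageMmapBlock and mmapBuf are illustrative names, not SDK API):

```go
import (
	"bytes"
	"context"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/streaming"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blockblob"
)

// stageMmapBlock uploads one block directly from an existing (e.g. mmap'd)
// byte slice. bytes.Reader is seekable, so the SDK's pipeline can rewind
// and retry the block on a transient network failure; no extra copy is made.
func stageMmapBlock(ctx context.Context, client *blockblob.Client, blockID string, mmapBuf []byte) error {
	body := streaming.NopCloser(bytes.NewReader(mmapBuf))
	_, err := client.StageBlock(ctx, blockID, body, nil)
	return err
}
```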

@ncw (Contributor) commented Nov 25, 2022

I'm curious as to how resilient the rcopy clone app is in the face of network failures when reading from the source file?

rclone will attempt to re-open the input stream and seek to the correct place in it, rather than returning an error from the Read() method of the io.Reader.
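
Roughly, the pattern is a wrapper like the following (names invented for illustration; rclone's real implementation is more involved and limits the number of retries):

```go
import "io"

// reopenReader wraps a remote stream; when a Read fails it drops the broken
// connection and the next Read re-opens the source at the recorded offset,
// instead of surfacing the transient error to the uploader.
type reopenReader struct {
	open   func(offset int64) (io.ReadCloser, error) // re-opens the source at an offset
	rc     io.ReadCloser
	offset int64
}

func (r *reopenReader) Read(p []byte) (int, error) {
	for {
		if r.rc == nil {
			rc, err := r.open(r.offset)
			if err != nil {
				return 0, err
			}
			r.rc = rc
		}
		n, err := r.rc.Read(p)
		r.offset += int64(n)
		if err != nil && err != io.EOF {
			// Transient failure: close and retry from the current offset.
			// (A real implementation would cap retries and back off.)
			r.rc.Close()
			r.rc = nil
			if n > 0 {
				return n, nil // hand back what we got; retry on the next call
			}
			continue
		}
		return n, err
	}
}
```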

Of course, if anyone ever has special needs that any of our convenience functions don't accommodate, then it is also possible to implement a custom algorithm calling StageBlock/CommitBlockList directly (just like our convenience methods do)

That is exactly what rclone did until the TransferManager became controllable. I still have that code, and I can dig it out easily enough and convert it to the new SDK.

I just thought it was nice that rclone could re-use the SDK code rather than re-inventing the wheel.

If you don't want to expose the TransferManager then that's fine - I can bring back this code easily enough.

Instead, if the mmap buffer contains the entire file, then just call UploadBuffer.

It's an anonymous mmap buffer, so it's not backed by a file. The data isn't in there to begin with; the mmap buffer gets used as the buffers for the upload, so the SDK would read the data from the io.Reader into the mmap buffer.

PS: please can you fix #19612 so the buffer management isn't at risk of crashing people's apps!

@JeffreyRichter (Member)

I just had a thought this morning: what if we change our UploadStream implementation to use mmap buffers instead of GC'd buffers, and then explicitly free those buffers when UploadStream returns? This would avoid the delay in freeing these buffers. Then you could just pass your io.Reader to our UploadStream method, we wouldn't have to expose anything extra to customers, and we wouldn't hold on to memory beyond the method's return. Would this work well for you?

And yes, we'll fix any bug (#19612) that causes apps to crash.

@jhendrixMSFT (Member)

Per discussion, we've switched to using anonymous memory-mapped files for buffer allocation. The change will go out in v0.6.0.

@ncw (Contributor) commented Dec 5, 2022

Nice one @jhendrixMSFT - that should help a lot.
