
b2 upload_file to accept stdin #152

Closed
olcto opened this issue May 7, 2016 · 39 comments

@olcto commented May 7, 2016

A basic Unix way for programs to interact with other programs is via stdin and stdout, using a pipe (|). I have a use case where I want to send a backup of a ZFS snapshot to Backblaze B2 cloud storage with compression and encryption. For example, on a Unix command line:

zfs send storage@20160507 | gzip | openssl enc -aes-256-cbc -a -salt | b2 upload_file bucket-name - storage_20160507.gz.ssl

This has many advantages:

  • Local HDD space is saved by not having to save the ZFS snapshot to file
  • Compression is leveraged using gzip
  • The file is encrypted with openssl before it leaves the computer

I have a fork containing a working example of what I need with the B2 Command Line Tool: https://github.com/olcto/B2_Command_Line_Tool

@ppolewicz (Collaborator)

Your code buffers a part in memory. Maybe it is ok, maybe not - I'm just pointing it out for reconsideration.

By the way, there is a branch with encryption enabled natively in b2 CLI. It is being worked on.

@olcto (Author) commented May 8, 2016

I agree my implementation is not very memory efficient; if part_size were increased to the maximum allowable 5GB for a multipart upload, a 32-bit Python interpreter would almost certainly crash (untested). Maybe a good solution would be caching each part (part_size bytes) in a temporary file? I am open to ideas and ways to implement this.

In my use case I would most likely stay with an external program from b2 CLI for encrypting my files.

@olcto (Author) commented May 8, 2016

Just played around with the Python tempfile library, and the data is now chunked into a temporary file. This changes the memory footprint drastically, from a couple hundred MB to about 30 MB, and should be invariant to part_size.
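
For readers following along, here is a rough sketch of that kind of tempfile-based chunking (illustrative only; the helper name and chunk size are assumptions, not the fork's actual code):

import sys
import tempfile

def read_one_part(part_size, chunk_size=1024 * 1024):
    """Copy up to part_size bytes from stdin into a temporary file."""
    part = tempfile.TemporaryFile()
    remaining = part_size
    while remaining > 0:
        chunk = sys.stdin.buffer.read(min(chunk_size, remaining))
        if not chunk:
            break  # stream ended; the final part may be shorter than part_size
        part.write(chunk)
        remaining -= len(chunk)
    part.seek(0)  # rewind so the uploader can read the buffered part back
    return part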

@ppolewicz (Collaborator)

Caching in a temporary file is not resource efficient either.

Properly solving this problem would require the B2 API to accept files of unknown size.

@bwbeach, @svonohr - do you have any idea how we could make this work while keeping resource consumption at a reasonable level?

@olcto (Author) commented May 8, 2016

Just to note, the temporary file is only part_size bytes, which can range from 100MB to 5GB, not the size of the entire uploaded file. It seems like a reasonable compromise between HDD space and RAM usage, especially on memory-constrained systems.

@ppolewicz (Collaborator)

This is not a good solution for embedded systems. They don't have 5GB of temp space or 100MB of RAM to spare.

@olcto (Author) commented May 8, 2016

Considering the minimum part_size for a multipart upload is 100MB, I would think that 100MB is the minimum required cache space (RAM or HDD) to store the streamed data before it can be uploaded by api.raw_api.upload_part. Unless api.raw_api.upload_part can be redesigned to take a streaming input, but that is deeper into the API than I have dug.

@bwbeach (Contributor) commented May 8, 2016

I'm fine with using a temp file on disk. I think most users will have enough space to hold the size of a few parts.

@svonohr commented May 8, 2016

As much as I'd like to have this feature, I see no way to implement it without any drawbacks. The implementation either requires caching large amounts of data, or it is unable to deal with connectivity issues and re-uploading failed parts. The latter is even impossible at the moment, because the hashes of uploaded parts are required in advance, but that's soon to change. I don't know which drawback is easier to accept.

Is there even a way to determine or estimate the size of the stdin stream? Afaik there isn't.

@bwbeach (Contributor) commented May 8, 2016

The maximum file size in B2 is 10TB, and the maximum number of parts in a large file is 10k. Files this big require the part size to be at least 1GB.

@ppolewicz (Collaborator)

As you will use compression, any estimate can be very far off depending on the compression ratio.

I hope we can later reduce the resource consumption once the B2 API starts to accept files of unknown size.

@bwbeach (Contributor) commented May 8, 2016

It would be possible to pass the size and sha1 on the command line, but it seems unlikely that the caller would have that information if they didn't have the whole file.

@svonohr commented May 8, 2016

Maybe it's OK to keep the part size fixed at 100MB. This still allows uploading a 1TB stream before the limit of 10k parts is reached, and I don't see anyone doing that anytime soon. Or, if somebody actually wants to do this, we could offer a command-line option to increase the part size for stream uploads.
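
For reference, the arithmetic behind that 1TB figure, using the decimal units in which the B2 limits are stated:

100 MB per part × 10,000 parts = 1,000,000 MB = 1 TB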

@olcto (Author) commented May 8, 2016

Would it be best to keep the upload stream serialized, so that if there were a temporary connectivity issue the current part could be retried without having to cache multiple parts, and the stream could then continue to be consumed? If the connection loss were permanent, the stream would be lost and the upload terminated, with probably no way to resume that large file upload (it's up to the user to consider the risk of lost data).

If the stream exceeds the file size limit (10TB) or the maximum number of parts (10,000), the upload should be terminated and the uploaded data would be incomplete. Again, it's up to the user to consider the risk.

part_size should be a command line argument for all large uploads, including local files and streams.
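
A minimal sketch of that serialized retry idea (send_part, the retry count, and the backoff are hypothetical, not existing b2 CLI code):

import time

def upload_buffered_part(send_part, part_data, attempts=5, delay=1.0):
    # Only the currently buffered part is kept, so a transient failure can be
    # retried from that buffer; a permanent failure has to abort the upload,
    # because the already-consumed stream cannot be replayed.
    for attempt in range(1, attempts + 1):
        try:
            return send_part(part_data)  # re-send the same buffered bytes
        except ConnectionError:
            if attempt == attempts:
                raise  # treat as permanent loss and terminate the upload
            time.sleep(delay * attempt)  # simple linear backoff before retrying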

@olcto (Author) commented May 15, 2016

I have continued to refine my implementation for uploading a stream with the B2 CLI and have been using it daily to back up my ZFS data (~5GB) without failure.

Current features include:

  • Takes part_size via a command line argument to upload_file; added to the docstring
  • Determines whether the stream is <= part_size and, if so, performs _upload_small_file; otherwise it initiates start_large_file. Both paths perform checks and retries (see the sketch after this list).
  • Checks whether the stream exceeds the maximum number of parts (10,000) and raises an exception.
  • Checks whether the stream exceeds the maximum file size of 10TB and raises an exception.
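
Here is an illustrative sketch of that small-vs-large decision (the callables are hypothetical placeholders, not the fork's actual methods; it assumes stream.read(n) returns n bytes except at the end of the stream):

def upload_stream(stream, part_size, upload_small, start_large, upload_part, finish_large):
    first = stream.read(part_size)   # buffer the first potential part
    rest = stream.read(part_size)    # peek ahead to decide small vs large
    if not rest:
        return upload_small(first)   # the whole stream fit into a single part
    large_file = start_large()
    part_number = 0
    data = first
    while data:
        part_number += 1
        upload_part(large_file, part_number, data)  # checks and retries happen here
        data, rest = rest, stream.read(part_size)
    return finish_large(large_file)

(Note that up to two parts are briefly buffered while peeking ahead to distinguish the small-file case.)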

TODO:

  • How to incorporate progress_listener?
  • Unit Tests
  • Time and $$ to test large limits of 10000 parts and 10TB stream size.

When would it be ready for a pull request? @bwbeach @svonohr @ppolewicz

@ppolewicz (Collaborator)

I think that in the first version of stdin support, progress_listener doesn't need to be supported. There is not much value in it if the total size is unknown. If someone wants it one day, maybe it can be added, but in that case I'd expect the user to say what they expect - and that would become very useful.

Unit tests are a must.

Maybe we don't need to run the test with 10TB, but instead we can check if the condition is met with an artificially decreased constant (down to, for example, 101MB).

We will need to read the source code of your changes and a pull request seems to be the easiest way to organize it - we will be able to discuss your changes on a per-line basis. I think that despite the lack of unit tests, it is already a good moment to create a PR (just mention in the description that it is not ready for merge yet).

@bwbeach (Contributor) commented May 16, 2016

[Sorry for the delay getting back to you. It was Sunday yesterday, and I took the day off. :-)]

I agree that it's fine to skip some features in the first pull request, such as the progress bar. And testing a 10TB file is just not practical. At Backblaze, we test 10TB files; it's a major project. And it takes LOTS of bandwidth that we don't want to pay for, so we do it from inside the data center.

In our development lab, we have a B2 system with a smaller minimum part size, which is useful for testing uploading multiple parts. I'm happy to run some tests there. (Someday, I'd like to expose a developer option to make this feature public.)

Unit tests are a requirement. If you want to make a pull request, we can collaborate on the unit tests.

@olcto mentioned this issue May 16, 2016
@BtbN commented Jun 15, 2016

Just wanted to note that this missing feature is what prevents me from using B2 cloud storage.
I create my backups using zfs send, and there is simply not enough space available to store them as temporary files, so piping via stdin is my only option.

@kazsulec commented Dec 16, 2016

The inability of b2 CLI commands to consume stdin can probably be worked around nicely with the xargs command. To find files with missing "File info" (issue 292), I have used this:

$ b2 ls --long CanonT2i 7z | sed 's/\([^ ]*\)\(.*\)/\1/' | xargs -n 1 b2 get_file_info | grep 'millis\|fileName'

This makes it easy for me to spot the missing src_last_modified_millis lines and the impacted files. Running this on Ubuntu, latest LTS.

@ppolewicz (Collaborator)

@kazsulec Unfortunately, in this case he actually needs to write the file to memory. One could imagine a bash script which uses /dev/shm for buffering, but it would be much easier to integrate this into the b2 CLI.
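
A hypothetical sketch of that kind of workaround, written in Python rather than bash for consistency with the other sketches here (the bucket and file names echo the example at the top of the thread; the whole stream must fit in RAM-backed /dev/shm):

import shutil
import subprocess
import sys
import tempfile

with tempfile.NamedTemporaryFile(dir="/dev/shm") as buffered:
    shutil.copyfileobj(sys.stdin.buffer, buffered)  # buffer the entire piped stream in tmpfs
    buffered.flush()
    subprocess.run(
        ["b2", "upload_file", "bucket-name", buffered.name, "storage_20160507.gz.ssl"],
        check=True,
    )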

@bwbeach (Contributor) commented Dec 16, 2016

Within a month or two, the B2 service will let you provide the SHA1 checksum at the end of the upload for both b2_upload_file and b2_upload_part. You will still be required to know the size of the file at the beginning.

@svonohr commented Dec 16, 2016 via email

@bwbeach (Contributor) commented Dec 16, 2016

That's right: each part.

@svonohr commented Dec 16, 2016 via email

@ppolewicz (Collaborator)

But thanks to the API enhancement, instead of buffering the whole file, we will just need to buffer one part in memory at a time, right?

@svonohr commented Dec 16, 2016 via email

@bwbeach (Contributor) commented Dec 16, 2016

If you know the size of the file, you shouldn't have to buffer anything. You'll be able to stream the data straight to B2.

If you don't know the size, you'll need to be able to buffer the minimum part size.

@svonohr commented Dec 16, 2016 via email

@bwbeach (Contributor) commented Dec 16, 2016

That's true. Without buffering, you would have to restart from the beginning if anything goes wrong.

We don't plan on making the content size optional. S3 doesn't let you omit it either.

@steveh commented Feb 8, 2017

I don't know whether or not S3 allows you to omit the content size, but I note that the AWS CLI supports streaming from stdin: aws/aws-cli#903

@ppolewicz (Collaborator)

They keep data in memory, but S3 "chunks" are much smaller than B2 "parts". If we just implemented it the same way, it would eat 20 times more memory than in the case of aws-cli.

@jmealo commented Apr 29, 2017

I would like to send ZFS backups to B2 and use pipes and uploads of unknown sizes. Where did we land with this?

@jmealo commented Jun 18, 2017

It looks like the API now supports sending the SHA1 sum at the end of the transfer by sending the following HTTP header with your upload request: X-Bz-Content-Sha1: hex_digits_at_end (see docs).
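
A hedged sketch of that hex_digits_at_end flow, following the linked docs (the upload URL, auth token, and payload below are placeholders; a real streaming uploader would compute the digest incrementally while sending instead of holding the whole body in memory):

import hashlib
import requests

upload_url = "UPLOAD_URL_FROM_b2_get_upload_url"      # placeholder
upload_auth_token = "UPLOAD_AUTH_TOKEN"               # placeholder
payload = b"bytes whose SHA1 is only known after the stream has been read"
sha1_hex = hashlib.sha1(payload).hexdigest()          # 40 hex characters

response = requests.post(
    upload_url,
    headers={
        "Authorization": upload_auth_token,
        "X-Bz-File-Name": "storage_20160507.gz.ssl",
        "Content-Type": "b2/x-auto",
        "X-Bz-Content-Sha1": "hex_digits_at_end",     # hash is appended to the body instead
    },
    # Content-Length is the payload length plus the 40 trailing hex digits;
    # requests computes it automatically for an in-memory body like this one.
    data=payload + sha1_hex.encode("ascii"),
)
response.raise_for_status()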

A naive search suggested that this API option has been scaffolded out but not put to any real use.

It looks like it was implemented in #337.

@olcto: Can you validate whether your use case works now and whether this issue can be closed?

@bwbeach (Contributor) commented Jun 18, 2017

Sending the SHA1 at the end doesn't help with the issues discussed above about knowing the size of the file (or part) before uploading.

The minimum part size in B2 was reduced to 5MB, which means that buffering parts in memory while streaming from stdin is a reasonable approach. (Backblaze still recommends a larger part size when feasible, for better upload throughput.)
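
As an illustration of that kind of in-memory buffering (the 5MB constant follows the reduced minimum mentioned above; upload_part is a hypothetical callable, not an existing SDK function):

import hashlib
import sys

PART_SIZE = 5 * 1024 * 1024  # buffer one part of roughly the new 5MB minimum

def stream_parts_from_stdin(upload_part):
    part_number = 1
    while True:
        data = sys.stdin.buffer.read(PART_SIZE)  # only one part held in memory
        if not data:
            break
        upload_part(part_number, data, hashlib.sha1(data).hexdigest())
        part_number += 1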

Anybody interested in working on an implementation of streaming from stdin?

@icodeforlove commented Dec 10, 2018

@jmealo

I needed this badly so I took a stab at it.

https://github.com/icodeforlove/npm_b2pipe

For anyone that needs this edge case, this is a great approach, and it supports concurrency.

@szenti commented Nov 23, 2019

Sending the SHA1 at the end doesn't help with the issues discussed above about knowing the size of the file (or part) before uploading.

As a sidenote: there is a way to determine the upload size for ZFS streams (both for whole snapshots and for incremental streams):

zfs send --dryrun --verbose --parsable ${pool}/${dataset}@${snapshot_name}

A sample output is:

full	data/samba@20191123	356490712
size	356490712

The --compressed parameter is also taken into consideration:

full	data/samba@20191123	204165080
size	204165080
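
A small sketch of reading that size programmatically (the snapshot name is just the example above; depending on the ZFS version the summary may be printed on stdout or stderr, so both are scanned):

import subprocess

def estimated_send_size(snapshot="data/samba@20191123"):
    result = subprocess.run(
        ["zfs", "send", "--dryrun", "--verbose", "--parsable", snapshot],
        capture_output=True, text=True, check=True,
    )
    for line in (result.stdout + result.stderr).splitlines():
        fields = line.split("\t")
        if fields and fields[0] == "size":
            return int(fields[1])  # estimated stream size in bytes
    raise ValueError("no size line in zfs send --dryrun output")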

@adamreed90

Did this ever get implemented in the CLI? I could really use this!

@ppolewicz (Collaborator)

It was merged into the Python SDK, but not into the CLI yet. I have assigned it to a developer now - thanks for the reminder!

@mjurbanski-reef (Contributor)

Support for FIFO files and stdin has been implemented and released in b2>=3.10.0. Closing. Happy uploads!
