
b2 upload_file to accept stdin #152

Closed
olcto opened this issue May 7, 2016 · 39 comments

@olcto commented May 7, 2016

A basic Unix way for programs to interact with other programs is via stdin and stdout, using a pipe (|). I have a use case where I want to send a backup of a ZFS snapshot to Backblaze B2 cloud storage with compression and encryption. For example, on a Unix command line:

zfs send storage@20160507 | gzip | openssl enc -aes-256-cbc -a -salt | b2 upload_file bucket-name - storage_20160507.gz.ssl

This has many advantages:

  • Local HDD space is saved by not having to save the ZFS snapshot to file
  • Compression is leveraged using gzip
  • The file is encrypted with openssl before it leaves the computer

I have a fork containing a working example of what I need with the B2 Command Line Tool: https://github.com/olcto/B2_Command_Line_Tool

@ppolewicz (Collaborator)

Your code buffers a part in memory. Maybe it is ok, maybe not - I'm just pointing it out for reconsideration.

By the way, there is a branch with encryption enabled natively in b2 CLI. It is being worked on.

@olcto (Author) commented May 8, 2016

I agree my implementation is not very memory efficient; if part_size were increased to the maximum allowable 5GB for a multipart upload, a 32-bit Python interpreter would almost certainly crash (untested). Maybe a good solution would be caching each part (part_size bytes) in a temporary file? I am open to ideas and ways to implement this.

In my use case I would most likely stay with an external program from b2 CLI for encrypting my files.

@olcto (Author) commented May 8, 2016

Just played around with the Python tempfile library, and the data is now chunked into a temporary file. This changes the memory footprint drastically, from a couple hundred MB to about 30 MB, and should be invariant to part_size.
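
For readers following along, here is a rough sketch of that kind of tempfile-based chunking (illustrative only; the helper name and chunk size are assumptions, not the fork's actual code):

import sys
import tempfile

def read_one_part(part_size, chunk_size=1024 * 1024):
    """Copy up to part_size bytes from stdin into a temporary file."""
    part = tempfile.TemporaryFile()
    remaining = part_size
    while remaining > 0:
        chunk = sys.stdin.buffer.read(min(chunk_size, remaining))
        if not chunk:
            break  # stream ended; the final part may be shorter than part_size
        part.write(chunk)
        remaining -= len(chunk)
    part.seek(0)  # rewind so the uploader can read the buffered part back
    return part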

@ppolewicz (Collaborator)

Caching in a temporary file is not resource efficient either.

Properly solving this problem would require the B2 API to accept files of unknown size.

@bwbeach, @svonohr - do you have any idea how we could make this work while keeping resource consumption at a reasonable level?

@olcto (Author) commented May 8, 2016

Just to note, the temporary file is only part_size bytes, which can range from 100MB to 5GB, not the size of the entire uploaded file. It seems like a reasonable compromise between HDD space and RAM usage, especially on memory-constrained systems.

@ppolewicz (Collaborator)

This is not a good solution for embedded systems. They don't have 5GB of temp space or 100MB of RAM to spare.

@olcto (Author) commented May 8, 2016

Considering the minimum part_size for a multipart upload is 100MB, I would think that 100MB is the minimum required cache space (RAM or HDD) to store the streamed data before it can be uploaded by api.raw_api.upload_part. Unless api.raw_api.upload_part can be redesigned to take a streaming input, but that is deeper into the API than I have dug.

@bwbeach (Contributor) commented May 8, 2016

I'm fine with using a temp file on disk. I think most users will have enough space to hold the size of a few parts.

@svonohr commented May 8, 2016

As much as I'd like to have this feature, I see no way to implement it without any drawbacks. The implementation either requires caching large amounts of data, or it is unable to deal with connectivity issues and re-uploading failed parts. The latter is even impossible at the moment, because the hashes of uploaded parts are required in advance, but that's soon to change. I don't know which drawback is easier to accept.

Is there even a way to determine or estimate the size of the stdin stream? Afaik there isn't.

@bwbeach (Contributor) commented May 8, 2016

The maximum file size in B2 is 10TB, and the maximum number of parts in a large file is 10k. Files this big require the part size to be at least 1GB.

@ppolewicz (Collaborator)

As you will use compression, any estimate can be very far off depending on the compression ratio.

I hope we can later reduce the resource consumption once the B2 API starts to accept files of unknown size.

@bwbeach (Contributor) commented May 8, 2016

It would be possible to pass the size and sha1 on the command line, but it seems unlikely that the caller would have that information if they didn't have the whole file.

@svonohr commented May 8, 2016

Maybe it's OK to keep the part size fixed at 100MB. This still allows uploading a 1TB stream before the limit of 10k parts is reached, and I don't see anyone doing that anytime soon. Or, if somebody actually wants to do this, we could offer a command-line option to increase the part size for stream uploads.
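
For reference, the arithmetic behind that 1TB figure, using the decimal units in which the B2 limits are stated:

100 MB per part × 10,000 parts = 1,000,000 MB = 1 TB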

@olcto (Author) commented May 8, 2016

Would it be best to keep the upload stream serialized, so that if there were a temporary connectivity issue the current part could be retried without having to cache multiple parts, and the stream could then continue to be consumed? If the connection loss were permanent, the stream would be lost and the upload terminated, with probably no way to resume that large file upload (it's up to the user to consider the risk of lost data).

If the stream exceeds the file size limit (10TB) or the maximum number of parts (10,000), the upload should be terminated and the uploaded data would be incomplete. Again, it's up to the user to consider the risk.

part_size should be a command line argument for all large uploads, including local files and streams.
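
A minimal sketch of that serialized retry idea (send_part, the retry count, and the backoff are hypothetical, not existing b2 CLI code):

import time

def upload_buffered_part(send_part, part_data, attempts=5, delay=1.0):
    # Only the currently buffered part is kept, so a transient failure can be
    # retried from that buffer; a permanent failure has to abort the upload,
    # because the already-consumed stream cannot be replayed.
    for attempt in range(1, attempts + 1):
        try:
            return send_part(part_data)  # re-send the same buffered bytes
        except ConnectionError:
            if attempt == attempts:
                raise  # treat as permanent loss and terminate the upload
            time.sleep(delay * attempt)  # simple linear backoff before retrying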

@olcto (Author) commented May 15, 2016

I have continued to refine my implementation for uploading a stream with the B2 CLI and have been using it daily to back up my ZFS data (~5GB) without failure.

Current features include:

  • Takes part_size via a command line argument to upload_file; added to the docstring
  • Determines whether the stream is <= part_size and, if so, performs _upload_small_file; otherwise it initiates start_large_file. Both paths perform checks and retries (see the sketch after this list).
  • Checks whether the stream exceeds the maximum number of parts (10,000) and raises an exception.
  • Checks whether the stream exceeds the maximum file size of 10TB and raises an exception.
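
Here is an illustrative sketch of that small-vs-large decision (the callables are hypothetical placeholders, not the fork's actual methods; it assumes stream.read(n) returns n bytes except at the end of the stream):

def upload_stream(stream, part_size, upload_small, start_large, upload_part, finish_large):
    first = stream.read(part_size)   # buffer the first potential part
    rest = stream.read(part_size)    # peek ahead to decide small vs large
    if not rest:
        return upload_small(first)   # the whole stream fit into a single part
    large_file = start_large()
    part_number = 0
    data = first
    while data:
        part_number += 1
        upload_part(large_file, part_number, data)  # checks and retries happen here
        data, rest = rest, stream.read(part_size)
    return finish_large(large_file)

(Note that up to two parts are briefly buffered while peeking ahead to distinguish the small-file case.)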

TODO:

  • How to incorporate progress_listener?
  • Unit Tests
  • Time and $$ to test large limits of 10000 parts and 10TB stream size.

When would it be ready for a pull request? @bwbeach @svonohr @ppolewicz

@ppolewicz (Collaborator)

I think that in the first version of stdin support, progress_listener doesn't need to be supported. There is not much value in it if the total size is unknown. If someone wants it one day, maybe it can be added, but in that case I'd expect the user to say what they expect - and that would become very useful.

Unit tests are a must.

Maybe we don't need to run the test with 10TB, but instead we can check if the condition is met with an artificially decreased constant (down to, for example, 101MB).

We will need to read the source code of your changes and a pull request seems to be the easiest way to organize it - we will be able to discuss your changes on a per-line basis. I think that despite the lack of unit tests, it is already a good moment to create a PR (just mention in the description that it is not ready for merge yet).

@bwbeach (Contributor) commented May 16, 2016

[Sorry for the delay getting back to you. It was Sunday yesterday, and I took the day off. :-)]

I agree that it's fine to skip some features in the first pull request, such as the progress bar. And testing a 10TB file is just not practical. At Backblaze, we test 10TB files; it's a major project. And it takes LOTS of bandwidth that we don't want to pay for, so we do it from inside the data center.

In our development lab, we have a B2 system with a smaller minimum part size, which is useful for testing uploading multiple parts. I'm happy to run some tests there. (Someday, I'd like to expose a developer option to make this feature public.)

Unit tests are a requirement. If you want to make a pull request, we can collaborate on the unit tests.

@olcto mentioned this issue May 16, 2016
@BtbN commented Jun 15, 2016

Just wanted to note that this missing feature is what prevents me from using B2 cloud storage.
I create my backups using zfs send, and there is simply not enough space available to store them as temporary files, so piping via stdin is my only option.

@kazsulec commented Dec 16, 2016

The inability of b2 CLI commands to consume stdin can probably be worked around nicely with the xargs command. To find files with missing "File info" (issue 292), I have used this:

$ b2 ls --long CanonT2i 7z | sed 's/\([^ ]*\)\(.*\)/\1/' | xargs -n 1 b2 get_file_info | grep 'millis\|fileName'

This makes it easy for me to spot the missing src_last_modified_millis lines and the impacted files. Running this on Ubuntu, latest LTS.

@ppolewicz (Collaborator)

@kazsulec Unfortunately, in this case he actually needs to write the file to memory. One could imagine a bash script which uses /dev/shm for buffering, but it would be much easier to integrate this into the b2 CLI.
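
A hypothetical sketch of that kind of workaround, written in Python rather than bash for consistency with the other sketches here (the bucket and file names echo the example at the top of the thread; the whole stream must fit in RAM-backed /dev/shm):

import shutil
import subprocess
import sys
import tempfile

with tempfile.NamedTemporaryFile(dir="/dev/shm") as buffered:
    shutil.copyfileobj(sys.stdin.buffer, buffered)  # buffer the entire piped stream in tmpfs
    buffered.flush()
    subprocess.run(
        ["b2", "upload_file", "bucket-name", buffered.name, "storage_20160507.gz.ssl"],
        check=True,
    )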

@bwbeach (Contributor) commented Dec 16, 2016

Within a month or two, the B2 service will let you provide the SHA1 checksum at the end of the upload for both b2_upload_file and b2_upload_part. You will still be required to know the size of the file at the beginning.

@svonohr commented Dec 16, 2016 via email

@bwbeach (Contributor) commented Dec 16, 2016

That's right: each part.

@svonohr commented Dec 16, 2016 via email

@ppolewicz (Collaborator)

But thanks to the API enhancement, instead of buffering the whole file, we will just need to buffer one part in memory at a time, right?

@svonohr commented Dec 16, 2016 via email

@bwbeach (Contributor) commented Dec 16, 2016

If you know the size of the file, you shouldn't have to buffer anything. You'll be able to stream the data straight to B2.

If you don't know the size, you'll need to be able to buffer the minimum part size.

@svonohr commented Dec 16, 2016 via email

@bwbeach (Contributor) commented Dec 16, 2016

That's true. Without buffering, you would have to restart from the beginning if anything goes wrong.

We don't plan on making the content size optional. S3 doesn't let you omit it either.

@steveh commented Feb 8, 2017

I don't know whether or not S3 allows you to omit the content size, but I note that the AWS CLI supports streaming from stdin: aws/aws-cli#903

@ppolewicz (Collaborator)

They keep data in memory, but S3 "chunks" are much smaller than B2 "parts". If we just implemented it the same way, it would eat 20 times more memory than in the case of aws-cli.

@jmealo commented Apr 29, 2017

I would like to send ZFS backups to B2 and use pipes and uploads of unknown sizes. Where did we land with this?

@jmealo commented Jun 18, 2017

It looks like the API now supports sending the SHA1 sum at the end of the transfer by sending the following HTTP header with your upload request: X-Bz-Content-Sha1: hex_digits_at_end (see docs).
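
A hedged sketch of that hex_digits_at_end flow, following the linked docs (the upload URL, auth token, and payload below are placeholders; a real streaming uploader would compute the digest incrementally while sending instead of holding the whole body in memory):

import hashlib
import requests

upload_url = "UPLOAD_URL_FROM_b2_get_upload_url"      # placeholder
upload_auth_token = "UPLOAD_AUTH_TOKEN"               # placeholder
payload = b"bytes whose SHA1 is only known after the stream has been read"
sha1_hex = hashlib.sha1(payload).hexdigest()          # 40 hex characters

response = requests.post(
    upload_url,
    headers={
        "Authorization": upload_auth_token,
        "X-Bz-File-Name": "storage_20160507.gz.ssl",
        "Content-Type": "b2/x-auto",
        "X-Bz-Content-Sha1": "hex_digits_at_end",     # hash is appended to the body instead
    },
    # Content-Length is the payload length plus the 40 trailing hex digits;
    # requests computes it automatically for an in-memory body like this one.
    data=payload + sha1_hex.encode("ascii"),
)
response.raise_for_status()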

A naive search suggested that this API option has been scaffolded out but not put to any real use.

It looks like it was implemented in #337.

@olcto: Can you validate whether your use case works now and whether this issue can be closed?

@bwbeach (Contributor) commented Jun 18, 2017

Sending the SHA1 at the end doesn't help with the issues discussed above about knowing the size of the file (or part) before uploading.

The minimum part size in B2 was reduced to 5MB, which means that buffering parts in memory while streaming from stdin is a reasonable approach. (Backblaze still recommends a larger part size when feasible, for better upload throughput.)
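
As an illustration of that kind of in-memory buffering (the 5MB constant follows the reduced minimum mentioned above; upload_part is a hypothetical callable, not an existing SDK function):

import hashlib
import sys

PART_SIZE = 5 * 1024 * 1024  # buffer one part of roughly the new 5MB minimum

def stream_parts_from_stdin(upload_part):
    part_number = 1
    while True:
        data = sys.stdin.buffer.read(PART_SIZE)  # only one part held in memory
        if not data:
            break
        upload_part(part_number, data, hashlib.sha1(data).hexdigest())
        part_number += 1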

Anybody interested in working on an implementation of streaming from stdin?

@icodeforlove commented Dec 10, 2018

@jmealo

I needed this badly so I took a stab at it.

https://github.com/icodeforlove/npm_b2pipe

For anyone that needs this edge case, this is a great approach, and it supports concurrency.

@szenti commented Nov 23, 2019

Sending the SHA1 at the end doesn't help with the issues discussed above about knowing the size of the file (or part) before uploading.

As a sidenote: there is a way to determine the upload size for ZFS streams (both for whole snapshots and for incremental streams):

zfs send --dryrun --verbose --parsable ${pool}/${dataset}@${snapshot_name}

A sample output is:

full	data/samba@20191123	356490712
size	356490712

The --compressed parameter is also taken into consideration:

full	data/samba@20191123	204165080
size	204165080
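
A small sketch of reading that size programmatically (the snapshot name is just the example above; depending on the ZFS version the summary may be printed on stdout or stderr, so both are scanned):

import subprocess

def estimated_send_size(snapshot="data/samba@20191123"):
    result = subprocess.run(
        ["zfs", "send", "--dryrun", "--verbose", "--parsable", snapshot],
        capture_output=True, text=True, check=True,
    )
    for line in (result.stdout + result.stderr).splitlines():
        fields = line.split("\t")
        if fields and fields[0] == "size":
            return int(fields[1])  # estimated stream size in bytes
    raise ValueError("no size line in zfs send --dryrun output")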

@adamreed90

Did this ever get implemented in the CLI? I could really use this!

@ppolewicz (Collaborator)

It was merged into the Python SDK, but not into the CLI yet. I have assigned it to a developer now - thanks for the reminder!

@mjurbanski-reef (Contributor)

Support for FIFO files and stdin has been implemented and released in b2>=3.10.0. Closing. Happy uploads!
