b2 upload_file to accept stdin #152
Comments
Your code buffers a part in memory. Maybe it is ok, maybe not - I'm just pointing it out for reconsideration. By the way, there is a branch with encryption enabled natively in b2 CLI. It is being worked on.
I agree my implementation is not very memory efficient, especially if … In my use case I would most likely stay with an external program, separate from the b2 CLI, for encrypting my files.
Just played around with the Python tempfile library, and the data is now chunked into a temporary file. This changes the memory footprint drastically, from a couple hundred MBs to about 30 MB, and should be invariant to …
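A minimal sketch of the tempfile approach described above. The 100 MB part size, the 1 MB read chunk, and the helper name are illustrative assumptions, not the exact code from the fork.

```python
import sys
import tempfile

PART_SIZE = 100 * 1024 * 1024  # assumed part size (the B2 minimum at the time)
READ_CHUNK = 1024 * 1024       # read stdin in 1 MB pieces to keep RAM usage low

def read_part_to_tempfile(stream=sys.stdin.buffer):
    """Copy up to PART_SIZE bytes from the stream into an unnamed temp file.

    Returns (tempfile, bytes_written); bytes_written == 0 means end of stream.
    """
    part = tempfile.TemporaryFile()
    written = 0
    while written < PART_SIZE:
        chunk = stream.read(min(READ_CHUNK, PART_SIZE - written))
        if not chunk:
            break
        part.write(chunk)
        written += len(chunk)
    part.seek(0)  # rewind so the uploader can read (and re-read on retry)
    return part, written
```

The temp file keeps only one part on disk at a time, which is why the memory footprint stays roughly constant regardless of the total stream size.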
Just to note, the temporary file is only of …
This is not a good solution for embedded systems. They don't have 5GB of temp space or 100MB of RAM to spare.
Considering the minimum …
I'm fine with using a temp file on disk. I think most users will have enough space to hold the size of a few parts.
As much as I'd like to have this feature, I see no way to implement it without drawbacks. The implementation either requires caching large amounts of data, or it's unable to deal with connectivity issues and re-uploading failed parts. The latter is even impossible at the moment, because the hashes of uploaded parts are required in advance, but that's soon to change. I don't know which drawback is easier to accept. Is there even a way to determine or estimate the size of the stdin stream? As far as I know, there isn't.
The maximum file size in B2 is 10TB, and the maximum number of parts in a large file is 10k. Files this big require the part size to be at least 1GB (10TB divided by 10,000 parts is 1GB per part).
As you will use compression, any estimate can be very far off depending on the compression ratio. I hope we can later reduce the resource consumption when the B2 API starts to accept files of unknown size.
It would be possible to pass the size and sha1 on the command line, but it seems unlikely that the caller would have that information if they didn't have the whole file.
Maybe it's OK to keep the part size fixed at 100MB. This still allows uploading a 1TB stream before the limit of 10k parts is reached. I don't see anyone doing that anytime soon. Or, if somebody actually wants to do this, we could offer a command-line option to increase the part size for stream uploads.
Would it be best to keep the upload stream serialized, so that if there was a temporary connectivity issue the failed part could be retried without having to cache multiple parts, and then consumption of the stream could resume? If it was a permanent connection loss, the stream would be lost and the upload terminated, with probably no way to resume that large file upload (up to the user to consider the risk of lost data). If the stream exceeds the file size limit (10TB) or the number of parts (10k), the upload should be terminated and the data upload would be incomplete. Again, up to the user to consider the risk.
I have continued to refine my implementation to upload a stream with B2 CLI and have been using it daily to back up my ZFS (~5GB of data) without failure. Current features include:
TODO:
When would it be ready for a pull request? @bwbeach @svonohr @ppolewicz
I think that in the first version of stdin support, progress_listener could not be supported. There is not much value in it if the total size is unknown. If someone wants it one day, maybe it can be added, but in that case I'd expect the user to say what they expect - and that will become very useful. Unit tests are a must. Maybe we don't need to run the test with 10TB; instead we can check if the condition is met with an artificially decreased constant (down to, for example, 101MB). We will need to read the source code of your changes, and a pull request seems to be the easiest way to organize that - we will be able to discuss your changes on a per-line basis. I think that despite the lack of unit tests, it is already a good moment to create a PR (just mention in the description that it is not ready for merge yet).
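A sketch of the kind of test being suggested - exercising the part-count limit with an artificially small constant instead of a 10TB stream. The helper name check_part_number and the ability to override the limit are hypothetical stand-ins for whatever the fork actually exposes.

```python
import pytest

MAX_PARTS = 10_000  # B2 limit on the number of parts in a large file

def check_part_number(part_number, max_parts=MAX_PARTS):
    """Abort the upload if the stream needs more parts than B2 allows."""
    if part_number > max_parts:
        raise RuntimeError("input stream exceeds the large-file part limit; upload aborted")

def test_limit_with_artificially_small_constant():
    # Rather than feeding 10 TB through the code, shrink the limit and test the same branch.
    check_part_number(3, max_parts=3)        # still within the limit
    with pytest.raises(RuntimeError):
        check_part_number(4, max_parts=3)    # one part too many
```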
[Sorry for the delay getting back to you. It was Sunday yesterday, and I took the day off. :-)] I agree that it's fine to skip some features in the first pull request, such as the progress bar. And testing a 10TB file is just not practical. At Backblaze, we test 10TB files; it's a major project. And it takes LOTS of bandwidth that we don't want to pay for, so we do it from inside the data center. In our development lab, we have a B2 system with a smaller minimum part size, which is useful for testing uploading multiple parts. I'm happy to run some tests there. (Someday, I'd like to expose a developer option to make this feature public.) Unit tests are a requirement. If you want to make a pull request, we can collaborate on the unit tests.
Just wanted to note that this missing feature is what prevents me from using B2 cloud storage.
The inability of b2 CLI commands to consume stdin can probably be worked around nicely with the xargs command. To find files with missing "File info" (issue 292), I have used this:
$ b2 ls --long CanonT2i 7z | sed 's/([^ ])(.)/\1/' | xargs -n 1 b2 get_file_info | grep 'millis|fileName'
Easy for me to spot the missing src_last_modified_millis lines, and the impacted file. Running this on Ubuntu, latest LTS.
@kazsulec unfortunately in this case he actually needs to write the file to memory. One could imagine a bash script which uses a …
Within a month or two, the B2 service will let you provide the SHA1 checksum at the end of the upload for both b2_upload_file and b2_upload_part. You will still be required to know the size of the file at the beginning.
You mean for each part, right? Even at the moment, starting a large file doesn't require a size.
That's right: each part.
Thinking about this some more, I believe there is little to be gained from sending hashes at the end, for this particular issue. We need some sort of caching anyway, so we know the size of the next part and we can recover in case of an error. While reading the next part from stdin we could already calculate the hash.
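A small illustration of that idea - buffering one part and computing its SHA1 in the same pass over the data. The helper name and the in-memory buffer are assumptions for the sketch, not code from the fork.

```python
import hashlib
import io

def read_part_with_sha1(stream, part_size):
    """Buffer one part from the stream and compute its SHA1 while reading it."""
    sha1 = hashlib.sha1()
    buf = io.BytesIO()
    remaining = part_size
    while remaining > 0:
        chunk = stream.read(min(1024 * 1024, remaining))
        if not chunk:
            break  # end of the stream
        sha1.update(chunk)
        buf.write(chunk)
        remaining -= len(chunk)
    buf.seek(0)
    return buf, buf.getbuffer().nbytes, sha1.hexdigest()
```

With this, both the part size and its hash are known before the upload starts, so the existing upload API can be used and a failed part can be re-sent from the buffer.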
But thanks to the API enhancement, instead of buffering the whole file, we will just need to buffer one part in memory at a time, right?
What API enhancement? Buffering only a single part should be possible already.
If you know the size of the file, you shouldn't have to buffer anything. You'll be able to stream the data straight to B2. If you don't know the size, you'll need to be able to buffer the minimum part size.
Right, but what's the timeline on omitting the content size? Also, it would be impossible to recover from a transmission error.
That's true. Without buffering, you would have to restart from the beginning if anything goes wrong. We don't plan on omitting the content size. S3 doesn't let you do that either.
I don't know whether or not S3 allows you to omit the content size, but I note that it supports streaming from stdin: aws/aws-cli#903
They keep data in memory, but S3 "chunks" are much smaller than B2 "parts" (roughly 5MB versus the 100MB minimum part size at the time). If we just implemented it in the same way, it would eat about 20 times more memory than in the case of aws-cli.
I would like to send ZFS backups to B2 using pipes and uploads of unknown size. Where did we land on this?
It looks like the API now supports sending the SHA1 sum at the end of the transfer by sending the following HTTP header with your upload request: … A naive search suggested that this API option has been scaffolded out but not put to any real use. It looks like it was implemented in #337. @olcto: Can you validate whether your use case works now and whether this issue can be closed?
Sending the SHA1 at the end doesn't help with the issues discussed above about knowing the size of the file (or part) before uploading. The minimum part size in B2 was reduced to 5MB, which means that buffering parts in memory while streaming from stdin is a reasonable approach. (Backblaze still recommends a larger part size, when feasible, for better upload throughput.) Anybody interested in working on an implementation of streaming from stdin?
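A rough sketch of that approach, assuming the reduced 5MB minimum part size: read one part from stdin into memory, hash it, and retry only that buffered part on failure. The upload_part function here is a placeholder, not a real b2sdk or CLI call, and the retry count is an arbitrary choice.

```python
import hashlib
import sys

MIN_PART_SIZE = 5 * 1024 * 1024  # current B2 minimum part size
MAX_RETRIES = 3                  # arbitrary retry budget for this sketch

def upload_part(part_number, data, sha1_hex):
    """Placeholder for the real B2 upload call; not implemented in this sketch."""
    raise NotImplementedError

def stream_upload(stream=sys.stdin.buffer, part_size=MIN_PART_SIZE):
    part_number = 1
    while True:
        data = stream.read(part_size)  # buffer exactly one part in memory
        if not data:
            break                      # end of the stream
        sha1_hex = hashlib.sha1(data).hexdigest()
        for attempt in range(MAX_RETRIES):
            try:
                upload_part(part_number, data, sha1_hex)
                break                  # this part is done, move on
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise              # give up; the stream cannot be replayed
        part_number += 1
```

Since the stream itself cannot be rewound, only the currently buffered part is recoverable; a failure that outlasts the retries still loses the upload, as discussed earlier in the thread.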
I needed this badly, so I took a stab at it: https://github.com/icodeforlove/npm_b2pipe For anyone that needs this edge case, this is a great approach, and it supports concurrency.
As a sidenote: there is a way to determine the upload size for ZFS streams (both for whole snapshots and for incremental streams):
A sample output is: …
The --compressed parameter is also taken into consideration: …
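One way this size estimate can be obtained is with a zfs send dry run. A hedged sketch follows; the zfs send flags (-n dry run, -v verbose, -P parsable) and the "size" line in the output are recalled from memory and may differ between ZFS versions, and the snapshot name is just an example.

```python
import subprocess

def estimate_zfs_send_size(snapshot, compressed=False):
    """Ask `zfs send` for a dry-run estimate of the stream size, in bytes."""
    cmd = ["zfs", "send", "-n", "-v", "-P"]
    if compressed:
        cmd.append("--compressed")  # estimate the size of the compressed stream instead
    cmd.append(snapshot)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # The parsable output is expected to contain a line like: "size <bytes>"
    for line in (out.stdout + out.stderr).splitlines():
        fields = line.split()
        if fields and fields[0] == "size":
            return int(fields[-1])
    raise RuntimeError("no size estimate found in zfs send output")

# Example (hypothetical dataset name): estimate_zfs_send_size("storage@20160507")
```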
Did this ever get implemented in the CLI? I could really use this!
It was merged into the Python SDK, but not into the CLI yet. I have assigned it to a developer now, thanks for the reminder!
Support for FIFO files and stdin has been implemented and released in b2>=3.10.0. Closing. Happy uploads!
A basic Unix way for programs to interact with each other is via stdin and stdout using a pipe (|). I have a use case where I want to send a backup of a ZFS snapshot to Backblaze B2 cloud storage with compression and encryption. For example, on a Unix command line:
zfs send storage@20160507 | gzip | openssl enc -aes-256-cbc -a -salt | b2 upload_file bucket-name - storage_20160507.gz.ssl
This has many advantages:
I have a fork containing a working example of what I need with the B2 Command Line Tool: https://github.com/olcto/B2_Command_Line_Tool