checking checksums and also multi-part resumes #36
1.) Well, I thought that Amazon returns the calculated checksum (of the uploaded file) to the client (Glacier Uploader) so that the client can verify it against the local copy. Or perhaps the client sends the local checksum to Amazon, and Amazon is responsible for verifying it? Either way, does somebody verify that the upload is a correct match?
With glacieruploader you have to do it manually. Amazon only returns the archive ID after upload, but you can calculate the checksum and compare it against the value that Amazon displays in your vault. Or you can list the inventory of the vault and, when the job is ready, compare the returned checksums. I'm not sure if this can be done automatically, because the jobs are asynchronous. And glacieruploader doesn't keep a database of previously uploaded files, so right now I don't know how to automatically check the TreeHashes from a returned inventory listing... If you come up with some code, create a pull request. I'm always happy when I can merge contributions. 😄
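For reference, the manual check is doable locally: Glacier's checksum is a SHA-256 tree hash (hash 1 MiB chunks, then combine pairs until one root remains). Here is a self-contained sketch of that algorithm, assuming an in-memory byte array for simplicity; it is not the project's actual code, and for real archives you would stream from disk instead:

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

public class TreeHash {
    static final int CHUNK = 1024 * 1024; // Glacier tree hashes use 1 MiB leaves

    // Compute the SHA-256 tree hash of the given data, returned as lowercase hex.
    public static String treeHash(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        List<byte[]> hashes = new ArrayList<>();
        if (data.length == 0) {
            hashes.add(md.digest()); // hash of the empty input
        }
        // Leaf level: SHA-256 of each 1 MiB chunk.
        for (int off = 0; off < data.length; off += CHUNK) {
            md.reset();
            md.update(data, off, Math.min(CHUNK, data.length - off));
            hashes.add(md.digest());
        }
        // Combine pairwise until a single root hash remains;
        // an odd leftover hash is promoted unchanged.
        while (hashes.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < hashes.size(); i += 2) {
                if (i + 1 < hashes.size()) {
                    md.reset();
                    md.update(hashes.get(i));
                    md.update(hashes.get(i + 1));
                    next.add(md.digest());
                } else {
                    next.add(hashes.get(i));
                }
            }
            hashes = next;
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : hashes.get(0)) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

You would then compare this value against the checksum Amazon reports for the archive. For inputs under 1 MiB there is a single leaf, so the tree hash equals the plain SHA-256 of the file.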
Suppose we did have the multipart resume capability (i.e. keeping a database of already-uploaded chunks, etc.). In that case, does the whole (large) file show up as a single archive on the inventory listing (with only one checksum for the entire large archive to be verified)?
There is only one archive after a multipart upload. From the Glacier docs:
The pull request from @nitriques added this functionality, and https://github.com/DeuxHuitHuit/glacieruploader/blob/5fc0e89c109c8a328c5774b7d1d1be978854c223/src/main/java/de/kopis/glacier/CommandLineGlacierMultipartUploader.java#L149 shows the line where this happens.
Thanks for listing the line number. :)
I have no idea how they handle that. Maybe you can find information at http://docs.aws.amazon.com/amazonglacier/latest/dev/working-with-archives.html
Are you (or maybe @nitriques) interested in doing this for a fee? (I'm sorry, I'm not sure how to send a PM on GitHub.) EDIT: Specifically, I mean adding the "resume" feature by keeping track of the (multi)parts in a local file.
The original intention of this CLI application was a dumb and easy way to put archives into Amazon Glacier. Maybe you should check some other applications like https://github.com/vsespb/mt-aws-glacier or https://github.com/basak/glacier-cli. Also, I'm pretty short on free time at the moment, sorry.
Okay, and thanks a lot for the links. |
Hi to both of you. Yes, point (1) is correct: checksums are calculated at the end of each transfer, and an error message is displayed if the checksums are different. See this line: https://github.com/MoriTanosuke/glacieruploader/blob/master/src/main/java/de/kopis/glacier/commands/UploadMultipartArchiveCommand.java#L75 Checksums are also calculated for each part sent to Amazon. As for (2), I do not know what you want: do you want to resume an upload after a failed attempt, or do you want to be able to pause the transfer? I am not sure if either of these is possible. You can contact me via Twitter (@nitriques) or via our web form at http://www.deuxhuithuit.com/en/ (click on the contact link). We could work out how much it would cost to do that.
Sure, I can contact you via the web form, but, 2.) let us first decide whether resume-after-failure is even possible (it is an absolutely necessary feature for large archives, and should probably be feasible). These days, even my mother has 30 GB of family photos and videos, and uploading that takes more than 24 hours on her high-speed Shaw Cable connection. During this time, occasionally, the network or power goes down. Doesn't that mean that resume-after-failure is necessary?
Please try it! And tell us how it compares to @MoriTanosuke's implementation!
Absolutely.
I regularly upload files larger than that, and I had no failures, so the problem is really the available bandwidth and QoS. If this is the problem you are trying to solve, we could very easily add a check in the uploadPart method, in order to re-send any part that was not uploaded correctly. This is easier to do than resuming an upload that was started in a previous session, which is a completely different thing. The 24-hour span is really the problem here. How often do you expect your Mom to read the logs? Being able to set a maximum "part retry count" would be a neat solution IMHO.
What if you must boot into Windows for 30 minutes, then go back to Linux? What if you must take the laptop to the library for 6 hours, and then resume the upload after that? The idea of hogging a regular desktop machine for over 30 hours, for just one upload, is inflexible for some people.
I would have thought that you can simply initiate a new upload session at any time, use the old upload ID from the previous (interrupted) session, and upload ONLY the parts that haven't been uploaded yet (in the previous session).
This is a completely separate topic. I am talking about retrying a failure on the fly. You are talking about pausing and resuming an upload, which would require a lot more effort. And you only have 24 hours to do it.
I totally agree with you. Maybe you should consider buying a USB drive then... Or lower the CPU priority of the JVM.
You could. But as I said, it is a lot more complex to do than retry-on-error. You have to store all this information (old upload ID, uploaded parts) yourself somewhere "safe". This would require a major refactor of how it is implemented right now. Remember, HTTP is stateless: each part is uploaded in a separate operation.
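Storing that state "somewhere safe" could be as simple as a properties file next to the archive. The following is only an illustrative sketch under that assumption; none of these names exist in glacieruploader, and a real implementation would also have to validate the state against Amazon (e.g. by listing the parts of the pending upload):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

// Hypothetical resume-state store: records the multipart uploadId and
// which part numbers have already completed, so a later run could skip them.
public class UploadState {
    private final Properties props = new Properties();

    public void setUploadId(String id) { props.setProperty("uploadId", id); }
    public String getUploadId()        { return props.getProperty("uploadId"); }

    public void markPartDone(int part) { props.setProperty("part." + part, "done"); }
    public boolean isPartDone(int part) {
        return "done".equals(props.getProperty("part." + part));
    }

    // Persist after every completed part, so a crash loses at most one part.
    public void save(File f) throws IOException {
        try (OutputStream out = new FileOutputStream(f)) {
            props.store(out, "glacier multipart upload state");
        }
    }

    public static UploadState load(File f) throws IOException {
        UploadState s = new UploadState();
        try (InputStream in = new FileInputStream(f)) {
            s.props.load(in);
        }
        return s;
    }
}
```

On restart, the client would load this file, reuse the saved uploadId, and upload only the parts not yet marked done.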
This is possible with the uploadID: Amazon can give you its checksum, you create your own, and compare. At least, that's what I did when I implemented the multipart upload.
Hm, the retry-on-failure scenario sounds fairly simple. It's basically exception handling while uploading, plus a configuration option for the maximum number of retries. For the pause-and-resume scenario, I'm not sure if this really matches my idea of a simple CLI application. But I'd have to read up on the official docs and maybe build a first prototype to fully understand it. For the 24h example, could this be solved by parallelizing the multipart upload, or would the bandwidth limit still slow the whole upload down over 24h again? Anyhow, if one of the alternatives suits your needs, simply using that sounds like the best solution. Maybe you can drop a comment here after trying it. :-)
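That retry-on-failure idea could be sketched like this (hypothetical names, not the actual glacieruploader API): wrap each part upload in a loop that re-sends on exception, up to a configurable "part retry count", and rethrow once the budget is exhausted:

```java
// Hypothetical retry wrapper around a single part upload.
public class PartRetry {
    // Stand-in for whatever actually sends one part to Glacier.
    public interface PartUpload { void run() throws Exception; }

    // Runs the upload, retrying up to maxRetries extra times on failure.
    // Returns the number of attempts it took; rethrows the last error
    // once the retry budget is exhausted.
    public static int uploadWithRetry(PartUpload upload, int maxRetries) throws Exception {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                upload.run();
                return attempt;
            } catch (Exception e) {
                if (attempt > maxRetries) {
                    throw e; // budget used up, surface the error to the caller
                }
                // otherwise loop and re-send this part
            }
        }
    }
}
```

In the real client this loop would sit inside the uploadPart path, with the maximum retry count exposed as a command-line option.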
Me neither...
The 24-hour limit exists because the uploadID is valid only for 24 hours... And I think it's 24 hours from the first byte you send...
There must be more to it than that: maybe they expire that one and give you a new one, or something like that? From the Glacier FAQ: "Individual archives are limited to a maximum size of 40 terabytes." There is no public network in the solar system today that can send that much data in 24 hours! In fact, most normal high-speed connections can send only a few gigabytes per day. I cannot see Amazon cutting it off HARD at 24 hours.
That's not true!!! You just have to pay for it! Maybe the 24-hour limit is set and the upload gets deleted only if you do not use it anymore. I do not know. If you can find something in Amazon's docs, please let us know. BTW, the AWS market is COMPANIES, not individuals. It's not made to be "user-friendly"; it is more "developer-friendly". So the use case where you have a simple ADSL or cable connection is not their priority.
You mean several parallel connections?
Yes, I suppose you're right.
No, I mean I have 100 Mbps on both sides (down and up) here at work. It's like 10 times faster than cable. Google online backup solutions: you might find something less cheap, but more suitable for home/personal use.
Yes, I apologize, it seems some people do advertise 10Gb commercially, but I'm not sure how widely available that is. |
I don't want to be rude, but I think this might be a good point to stop the discussion. Maybe we can pick up in a different issue after someone has brought light into the dark that is Amazon AWS rules and technical mysteries. ;-) That being said, I'm still interested in first-hand experiences with how other clients handle the upload of large files.
@MoriTanosuke I agree. |
Why did this issue get re-opened?
It shouldn't have been closed in the first place. I just made a mistake, sorry. |
Ah ok thanks! |
So I just took a look at this. The current multipart upload meets the requirements of question 1 of the original issue: "Does Glacier Uploader actually make certain that the checksum after upload matches the checksum that Amazon reports?" This is the case even if the upload is only one part. The multipart implementation may also meet question 2's requirements: "Can Glacier Uploader resume properly after upload is interrupted?" I haven't tested it, but it seems to be in the realm of possibility.
Uncompleted uploads last only for a short period of time on Amazon's servers. So yes, but you would have limited time to do it...
Questions:
1.) Does Glacier Uploader actually make certain that the checksum after upload matches the checksum that Amazon reports?
2.) Can Glacier Uploader resume properly after upload is interrupted?
If the answer to (1) is NO, what is the preferred way to check the checksum (manually)?
Thanks.