This repository has been archived by the owner on Jul 5, 2021. It is now read-only.

checking checksums and also multi-part resumes #36

Open

RezaRob opened this issue May 12, 2013 · 30 comments

@RezaRob

RezaRob commented May 12, 2013

Questions:
1.) Does Glacier Uploader actually make certain that the checksum after upload matches the checksum that Amazon reports?

2.) Can Glacier Uploader resume properly after upload is interrupted?

If the answer to (1) is NO, what is the preferred way to check the checksum (manually?)

Thanks.

@MoriTanosuke
Owner

  1. There is no validation done after uploading the file. The TreeHash can be calculated (https://github.com/MoriTanosuke/glacieruploader#calculate-hash-for-file), but you then have to rely on Amazon's tools to verify it. How would you create a checksum or validate the uploaded file? (See the sketch after this list.)

  2. Hm, I'd say: no. ;-)
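
A minimal sketch of that local calculation, assuming the AWS SDK for Java (the class name here is made up; TreeHashGenerator is the SDK's own helper):

```java
import java.io.File;

import com.amazonaws.services.glacier.TreeHashGenerator;

// Compute the SHA-256 tree hash of a local file so it can be compared by hand
// against whatever checksum Amazon reports for the archive.
public class TreeHashCheck {
    public static void main(String[] args) {
        String localHash = TreeHashGenerator.calculateTreeHash(new File(args[0]));
        System.out.println("Local SHA256 tree hash: " + localHash);
    }
}
```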

@RezaRob
Author

RezaRob commented May 13, 2013

1.) Well, I thought that Amazon returns the calculated checksum (of the uploaded file) to the client (Glacier Uploader) so that the client can verify it against the local copy. OR... perhaps the client sends the local checksum to Amazon, and Amazon is responsible for verifying it?

So regardless, does anybody actually verify that the upload is a correct match?

@MoriTanosuke
Owner

With glacieruploader you have to do it manually. Amazon only returns the archive ID after upload, but you can calculate the checksum and compare it against the value that Amazon displays in your vault.

Or maybe you can list the inventory of the vault and when the job is ready, compare the returned SHA256TreeHash to the calculated one. You can see an example of a job inventory at https://github.com/MoriTanosuke/glacieruploader/blob/4f6d3f872ea4deec3ac686ba98ebdf77da7a4489/src/test/resources/inventorylisting.txt

I'm not sure if this can be done automatically, because the jobs are asynchronous. And glacieruploader doesn't keep a database of previously uploaded files, so right now I don't know how to automatically check the TreeHashes from a returned inventory listing...

If you come up with some code, create a pull request. I'm always happy when I can merge contributions. 😄
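
To illustrate the asynchronous flow described above, a rough sketch (the helper names startInventoryJob and hashMatches are hypothetical; it assumes an already configured AmazonGlacier client and uses Jackson, which the SDK ships with, to read the inventory JSON):

```java
import java.io.File;
import java.io.InputStream;

import com.amazonaws.services.glacier.AmazonGlacier;
import com.amazonaws.services.glacier.TreeHashGenerator;
import com.amazonaws.services.glacier.model.GetJobOutputRequest;
import com.amazonaws.services.glacier.model.InitiateJobRequest;
import com.amazonaws.services.glacier.model.JobParameters;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class InventoryVerifier {

    // Hypothetical helper: kick off an inventory-retrieval job and return its ID.
    // The job runs asynchronously and typically takes hours, so the comparison
    // cannot happen in the same invocation as the upload.
    static String startInventoryJob(AmazonGlacier client, String vault) {
        return client.initiateJob(new InitiateJobRequest()
                .withVaultName(vault)
                .withJobParameters(new JobParameters().withType("inventory-retrieval")))
                .getJobId();
    }

    // Hypothetical helper: once the job is done, compare the SHA256TreeHash from the
    // inventory listing with the tree hash calculated from the local file.
    static boolean hashMatches(AmazonGlacier client, String vault, String jobId,
                               String archiveId, File localFile) throws Exception {
        InputStream body = client.getJobOutput(new GetJobOutputRequest()
                .withVaultName(vault).withJobId(jobId)).getBody();
        JsonNode inventory = new ObjectMapper().readTree(body);
        String localHash = TreeHashGenerator.calculateTreeHash(localFile);
        for (JsonNode archive : inventory.get("ArchiveList")) {
            if (archiveId.equals(archive.get("ArchiveId").asText())) {
                return localHash.equals(archive.get("SHA256TreeHash").asText());
            }
        }
        return false; // archive not listed yet -- inventories can lag behind uploads
    }
}
```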

@RezaRob
Author

RezaRob commented May 13, 2013

Suppose we did have the multipart resume capability (i.e. keeping a database of already-uploaded chunks, etc.). In that case, does the whole (large) file show up as a single archive in the inventory listing (with only one checksum for the entire large archive to be verified)?

@MoriTanosuke
Owner

There is only one archive after a multipart upload. From the Glacier docs:

After uploading all the archive parts, you use the complete operation. Again, you must specify the upload ID in your request. Amazon Glacier creates an archive by concatenating parts in ascending order based on the content range you provided. Amazon Glacier's response to a Complete Multipart Upload request includes an archive ID for the newly created archive.

The pull request from @nitriques added this functionality and https://github.com/DeuxHuitHuit/glacieruploader/blob/5fc0e89c109c8a328c5774b7d1d1be978854c223/src/main/java/de/kopis/glacier/CommandLineGlacierMultipartUploader.java#L149 shows the line where the complete operation is called after uploading the archive.
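
For context, a stripped-down sketch of that complete step (the helper and its surroundings are invented; the SDK calls are the real ones): once all parts are uploaded, a single CompleteMultipartUpload request with the full-archive tree hash yields one archive ID.

```java
import java.io.File;

import com.amazonaws.services.glacier.AmazonGlacier;
import com.amazonaws.services.glacier.TreeHashGenerator;
import com.amazonaws.services.glacier.model.CompleteMultipartUploadRequest;

public class CompleteUpload {
    // Hypothetical helper: finish the multipart upload and return the archive ID.
    static String complete(AmazonGlacier client, String vault, String uploadId, File file) {
        String treeHash = TreeHashGenerator.calculateTreeHash(file);
        return client.completeMultipartUpload(new CompleteMultipartUploadRequest()
                        .withVaultName(vault)
                        .withUploadId(uploadId)
                        .withChecksum(treeHash) // tree hash of the whole archive
                        .withArchiveSize(String.valueOf(file.length())))
                .getArchiveId(); // one archive, one ID, one checksum for the whole file
    }
}
```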

@RezaRob
Author

RezaRob commented May 13, 2013

Thanks for listing the line number. :)
In case of upload interruption (and then resume), how long does Amazon "remember" the already-uploaded parts? Do they say that?

@MoriTanosuke
Owner

I have no idea how they handle that. Maybe you can find information on http://docs.aws.amazon.com/amazonglacier/latest/dev/working-with-archives.html

@RezaRob
Author

RezaRob commented May 13, 2013

Are you (or maybe @nitriques) interested in doing this for a fee? (I'm sorry, I'm not sure how to send a PM on GitHub?)

EDIT: Specifically, I mean adding the "resume" feature by keeping track/a record of the (multi)parts in a local file.

@MoriTanosuke
Owner

The original intention of this CLI application was a dumb and easy way to put archives into Amazon Glacier. Maybe you should check some other applications like https://github.com/vsespb/mt-aws-glacier or https://github.com/basak/glacier-cli.

Also I'm pretty short on free time at the moment, sorry.

@RezaRob
Author

RezaRob commented May 13, 2013

Okay, and thanks a lot for the links.
Just to be sure, I did say "fee" and not "free!!" :)

@nitriques
Contributor

Hi to both of you. Yes, point (1) is covered: checksums are calculated at the end of each transfer, and an error message is displayed if the checksums are different. See this line: https://github.com/MoriTanosuke/glacieruploader/blob/master/src/main/java/de/kopis/glacier/commands/UploadMultipartArchiveCommand.java#L75

Checksums are also calculated for each part sent to Amazon.

As for (2), I am not sure what you mean: do you want to resume an upload after a failed attempt, or do you want to be able to pause the transfer?

I am not sure if either of these is possible. You can contact me via Twitter (@nitriques) or via our web form at http://www.deuxhuithuit.com/en/ (click on the contact link). We could check how much it would cost to do that.

@RezaRob
Author

RezaRob commented May 13, 2013

Sure, I can contact you via the web form, but:
1.) This becomes less attractive if mt-aws-glacier etc. (see above) already do a good job of this. (I haven't tested them.)

2.) Let us first decide if it is even possible to resume after failure (which is an absolutely necessary feature for large archives, and should probably be possible. These days, even my mother has 30 GB of family photos and videos, and that takes more than 24 hours to upload on her high-speed Shaw Cable network. During this time, occasionally, the network or power goes down.)
Apparently, the multipart upload "ID" lasts 24 hours:
https://forums.aws.amazon.com/message.jspa?messageID=385027
I wasn't sure what that means, so I have added another question at the bottom of that thread.

Doesn't that mean that resume-after-failure is possible?

@nitriques
Contributor

@RezaRob

This becomes less attractive if mt-aws-glacier

Try it, please! And tell us how it compares to @MoriTanosuke's implementation!

Let us first decide if it is even possible to resume after failure

Absolutely.

These days, even my mother has 30 GB

I regularly upload files larger than that and have had no failures, so the problem is really the available bandwidth and QoS. If this is the problem you are trying to solve, we could very easily add a check in the uploadPart method to re-send any part that was not uploaded correctly.

This is easier to do than resuming an upload that was started in a previous session, which is a completely different thing.

The 24-hour span is really the problem here. How often do you expect your Mom to read the logs? Being able to set a maximum "part retry count" would be a neat solution IMHO.
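
A rough sketch of what such a "part retry count" could look like, assuming the AWS SDK's uploadMultipartPart call; the helper name, its parameters, and the retry policy are made up for illustration:

```java
import java.io.ByteArrayInputStream;

import com.amazonaws.AmazonClientException;
import com.amazonaws.services.glacier.AmazonGlacier;
import com.amazonaws.services.glacier.model.UploadMultipartPartRequest;
import com.amazonaws.services.glacier.model.UploadMultipartPartResult;

public class PartRetry {
    // Hypothetical helper: re-send a single part up to maxRetries times (maxRetries >= 1).
    // 'checksum' is the locally computed SHA-256 tree hash of this part, and
    // 'range' looks like "bytes 0-1048575/*".
    static UploadMultipartPartResult uploadPartWithRetry(
            AmazonGlacier client, String vault, String uploadId,
            byte[] part, String range, String checksum, int maxRetries) {
        AmazonClientException last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return client.uploadMultipartPart(new UploadMultipartPartRequest()
                        .withVaultName(vault)
                        .withUploadId(uploadId)
                        .withChecksum(checksum)
                        .withRange(range)
                        .withBody(new ByteArrayInputStream(part)));
            } catch (AmazonClientException e) {
                last = e; // network hiccup etc. -- just try this part again
            }
        }
        throw last; // all attempts failed
    }
}
```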

@RezaRob
Author

RezaRob commented May 13, 2013

the problem is really the available bandwidth and QoS

What if you must boot into Windows for 30 minutes, then resume in Linux? What if you must take the laptop to the library for 6 hours, and then resume the upload after that?

The idea of hogging a regular desktop machine for over 30 hours for just one upload is inflexible for some people.

This is easier to do than resuming an upload that was started in a previous session, which is a completely different thing.

I would have thought that you can simply initiate a new upload session at any time, use the old upload ID from the previous (interrupted) session, and upload ONLY the parts that haven't been uploaded yet (in the previous session).
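
The Glacier API does expose a ListParts operation for exactly that: as long as the upload ID is still valid, it reports which byte ranges the service has already received. A sketch (hypothetical helper, assuming an AWS SDK AmazonGlacier client and the upload ID kept from the interrupted session):

```java
import java.util.HashSet;
import java.util.Set;

import com.amazonaws.services.glacier.AmazonGlacier;
import com.amazonaws.services.glacier.model.ListPartsRequest;
import com.amazonaws.services.glacier.model.ListPartsResult;
import com.amazonaws.services.glacier.model.PartListElement;

public class ResumeHelper {
    // Hypothetical helper: ask Glacier which ranges of an unfinished multipart
    // upload it already has, so only the missing ranges need to be re-sent.
    static Set<String> uploadedRanges(AmazonGlacier client, String vault, String uploadId) {
        Set<String> ranges = new HashSet<String>();
        String marker = null;
        do {
            ListPartsResult result = client.listParts(new ListPartsRequest()
                    .withVaultName(vault)
                    .withUploadId(uploadId)
                    .withMarker(marker));
            for (PartListElement part : result.getParts()) {
                ranges.add(part.getRangeInBytes()); // e.g. "0-1048575"
            }
            marker = result.getMarker(); // null when there are no more pages
        } while (marker != null);
        return ranges;
    }
}
```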

@nitriques
Contributor

What if you must boot into Windows for 30 minutes, then resume in Linux? What if you must take the laptop to the library for 6 hours, and then resume the upload after that?

This is a completely separate topic. I am talking about retrying right away after a failure, while the transfer is still running. You are talking about pausing and resuming an upload, which would require a lot more effort to do. And you only have 24 hours to do it.

The idea of hogging a regular desktop machine for over 30 hours for just one upload is inflexible for some people.

I totally agree with you. Maybe you should consider buying a USB drive then... Or lower the CPU priority of the JVM.

I would have thought that you can simply initiate a new upload session at any time, use the old upload ID from the previous (interrupted) session, and upload ONLY the parts that haven't been uploaded yet (in the previous session).

You could. But as I said, it is a lot more complex to do than retry-on-error. You have to store all this information (old upload ID, uploaded parts) yourself somewhere "safe". This would require a major refactor of how it is implemented right now.

Remember, HTTP is stateless: each part is uploaded in a separate operation.
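
One way to keep that information "somewhere safe" would be a small local state file; a sketch using java.util.Properties (file name, keys, and format are all invented for illustration):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class UploadState {
    // Hypothetical on-disk state: the upload ID plus a comma-separated list of part
    // indices that completed successfully. File name, keys and format are invented.
    private final File stateFile;
    private final Properties props = new Properties();

    UploadState(File stateFile) throws IOException {
        this.stateFile = stateFile;
        if (stateFile.exists()) {
            try (FileInputStream in = new FileInputStream(stateFile)) {
                props.load(in);
            }
        }
    }

    // Record a successfully uploaded part and flush the state to disk immediately,
    // so a crash or shutdown loses at most the part currently in flight.
    void remember(String uploadId, int partIndex) throws IOException {
        props.setProperty("uploadId", uploadId);
        String done = props.getProperty("parts", "");
        props.setProperty("parts",
                done.isEmpty() ? String.valueOf(partIndex) : done + "," + partIndex);
        try (FileOutputStream out = new FileOutputStream(stateFile)) {
            props.store(out, "glacieruploader resume state");
        }
    }

    boolean isDone(int partIndex) {
        for (String p : props.getProperty("parts", "").split(",")) {
            if (p.equals(String.valueOf(partIndex))) {
                return true;
            }
        }
        return false;
    }
}
```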

@nitriques
Contributor

@MoriTanosuke

How would you create a checksum or validate the uploaded file?

This is possible with the uploadID: Amazon can give you its checksum, you create your own, and compare. At least, that's what I did when I implemented the multipart upload.
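
For illustration, the compare step could look roughly like this (hypothetical helper; it assumes the checksum returned with the SDK's CompleteMultipartUploadResult, which is the comparison described above for the linked UploadMultipartArchiveCommand line):

```java
import java.io.File;

import com.amazonaws.services.glacier.TreeHashGenerator;
import com.amazonaws.services.glacier.model.CompleteMultipartUploadResult;

public class ChecksumCheck {
    // Compare the tree hash Amazon returns on completion with the locally computed one.
    static boolean uploadVerified(CompleteMultipartUploadResult result, File localFile) {
        String local = TreeHashGenerator.calculateTreeHash(localFile);
        String remote = result.getChecksum();
        if (!local.equals(remote)) {
            System.err.println("Checksum mismatch: local=" + local + " remote=" + remote);
            return false;
        }
        return true;
    }
}
```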

@MoriTanosuke
Owner

Hm, the retry-on-failure scenario sounds fairly simple. It's more like exception handling while uploading and having a configuration option for the maximum number of retries.

For the pause-and-resume scenario I'm not sure if this really matches my idea of a simple CLI application. But I'd have to read up on the official docs and maybe build a first prototype to fully understand it.

For the 24h example, could this be solved by parallelizing the multipart upload, or would the bandwidth limit still hit and slow the whole upload back down to over 24h?

Anyhow, if one of the alternatives suits your needs, simply using that sounds like the best solution. Maybe you can drop a comment here after trying it. :-)
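
On the parallelizing question: spreading the parts over a thread pool is straightforward, but it only helps if the single connection, rather than the line itself, is the bottleneck. A sketch (hypothetical helper; it assumes each UploadMultipartPartRequest already carries vault, upload ID, range, body, and checksum):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.amazonaws.services.glacier.AmazonGlacier;
import com.amazonaws.services.glacier.model.UploadMultipartPartRequest;
import com.amazonaws.services.glacier.model.UploadMultipartPartResult;

public class ParallelParts {
    // Hypothetical helper: upload the prepared parts concurrently on a fixed-size pool.
    static void uploadAll(final AmazonGlacier client,
                          List<UploadMultipartPartRequest> parts, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<UploadMultipartPartResult>> results =
                new ArrayList<Future<UploadMultipartPartResult>>();
        for (final UploadMultipartPartRequest part : parts) {
            results.add(pool.submit(new Callable<UploadMultipartPartResult>() {
                public UploadMultipartPartResult call() {
                    return client.uploadMultipartPart(part);
                }
            }));
        }
        try {
            for (Future<UploadMultipartPartResult> f : results) {
                f.get(); // propagate any part failure
            }
        } finally {
            pool.shutdown();
        }
    }
}
```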

@nitriques
Contributor

I'm not sure if this really matches my idea of a simple CLI application

Me neither...

For the 24h example

The 24-hour delay is because the uploadID is valid for only 24 hours... And I think it's 24 hours from the first byte you send...

@RezaRob
Author

RezaRob commented May 14, 2013

The 24-hour delay is because the uploadID is valid for only 24 hours... And I think it's 24 hours from the first byte you send...

There must be more to it than that: maybe they expire that one and give you a new one, or something like that??? From the Glacier FAQ: "Individual archives are limited to a maximum size of 40 terabytes." There is no public network in the solar system, today, that can send this kind of data in 24 hours! In fact, most normal high-speed connections can send only a few gigabytes per day. I cannot see Amazon cutting it off HARD at 24 hours.

@nitriques
Contributor

There is no public network in the solar system, today, that can send this kind of data in 24 hours

That's not true!!! You just have to pay for it!

Maybe the 24-hour limit is a given and the upload gets deleted only if you do not use it anymore. I do not know. If you can find something in Amazon's docs, please let us know.

BTW, AWS's market is COMPANIES, not individuals. It's not made to be "user-friendly"; it's more "developer-friendly". So the use case where you have a simple ADSL or cable connection is not their priority.

@RezaRob
Author

RezaRob commented May 14, 2013

That's not true!!! You just have to pay for it!

You mean several parallel connections?

It's not made to be "user-friendly"; it's more "developer-friendly". So the use case where you have a simple ADSL or cable connection is not their priority.

Yes, I suppose you're right.

@nitriques
Contributor

You mean several parallel connections?

No, I mean I have 100 Mbps on both sides (down and up) here at work. It's like 10 times faster than cable.
I even saw a company that has 10 Gbps. That's more than a gigabyte per second, or roughly a terabyte in 15 minutes...

Google for online backup solutions: you might find something less cheap, but more suitable for home/personal use.

@RezaRob
Author

RezaRob commented May 14, 2013

Yes, I apologize, it seems some people do advertise 10 Gbps commercially, but I'm not sure how widely available that is.

@MoriTanosuke
Owner

I don't want to be rude, but I think this might be a good point to stop the discussion here. Maybe we can pick it up in a different issue after someone has brought light into the dark that is Amazon AWS rules and technical mysteries. ;-)

That being said, I'm still interested in first-hand experiences of how other clients handle the upload of large files.

@nitriques
Contributor

@MoriTanosuke I agree.

RezaRob closed this as completed Jun 3, 2013
RezaRob reopened this Jun 3, 2013
@nitriques
Contributor

Why did this issue get re-opened?

@RezaRob
Author

RezaRob commented Jun 3, 2013

It shouldn't have been closed in the first place. I just made a mistake, sorry.

@nitriques
Contributor

Ah ok thanks!

@nbarnard

nbarnard commented Jan 4, 2015

So I just took a look at this.

The current multipart upload meets the requirements of question 1 of the original issue: "Does Glacier Uploader actually make certain that the checksum after upload matches the checksum that Amazon reports?" This is the case even if the upload is only one part.

The multipart implementation may also meet question 2's requirements: "Can Glacier Uploader resume properly after upload is interrupted?" I haven't tested it, but it seems as if it's in the realm of possibility.

@nitriques
Contributor

I haven't tested it, but it seems as if it's in the realm of possibility.

Uncompleted uploads only last for a short period of time on Amazon's servers. So yes, but you would have limited time to do it...
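
For what it's worth, Glacier can at least be asked which unfinished multipart uploads it still knows about, so you can see whether a resume is even possible before the upload ID disappears; a sketch using the SDK's ListMultipartUploads call (helper name is made up):

```java
import com.amazonaws.services.glacier.AmazonGlacier;
import com.amazonaws.services.glacier.model.ListMultipartUploadsRequest;
import com.amazonaws.services.glacier.model.ListMultipartUploadsResult;
import com.amazonaws.services.glacier.model.UploadListElement;

public class PendingUploads {
    // Hypothetical helper: print the unfinished multipart uploads Glacier still
    // knows about in a vault.
    static void printPending(AmazonGlacier client, String vault) {
        ListMultipartUploadsResult result = client.listMultipartUploads(
                new ListMultipartUploadsRequest().withVaultName(vault));
        for (UploadListElement upload : result.getUploadsList()) {
            System.out.println(upload.getMultipartUploadId()
                    + " started " + upload.getCreationDate()
                    + " (" + upload.getArchiveDescription() + ")");
        }
    }
}
```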
