Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignoreChecksum: true by default #128

Merged
merged 1 commit into from
May 2, 2015
Merged

Conversation

odeke-em
Copy link
Owner

This is because checksum verification by default was deemed to be
too vigorous and unncessary by default. This discussion stems from
issue #117.

@odeke-em
Copy link
Owner Author

cc @l3iggs @kcwu

This is because checksum verification by default was deemed to be
too vigorous and unncessary by default. This discussion stems from
issue #117.
@kcwu
Copy link

kcwu commented Mar 27, 2015

Well, what I preferred is more like rsync. Not only changing the default, but also change the logic when to do checksum. I will explain the detail later (sorry i'm busy now)

@odeke-em
Copy link
Owner Author

No worries. Am off to bed, this is a work in progress so actually you could get that file, and submit a PR but with the logic implemented and I can close this one: just to cut out the middle man, otherwise I'll wait for your explanation or jump onto it in the morning. Catch you in a few hours, am off to bed.

@l3iggs
Copy link

l3iggs commented Mar 27, 2015

I agree with @kcwu, as close to what rsync does would be my preference as well. Here's how I understand rsync uses checksums (please correct me if I'm wrong here):

  • When deciding if a file is to be skipped, checksums are only used if the --checksum flag is given (by default mod time and file size are used)
  • rsync keeps a running MD4 checksum of the data stream as it goes out on the sending end and as it comes in on the receiving end. Once the transfer is complete, it compares the MD4s of the data received and the data sent to verify transfer integrity. In this case rsync does not explicitly run a checksum of the data after it has been written to disk. It trusts that if no error was returned while writing to disk then the data was stored properly.

I'm pretty sure the first point there is implementable in drive (and is hopefully already taken care of by this PR).

I'm not so sure the second point can be exactly recreated here since we don't have control over both ends of the pipe. The way rsync does it allows for verification of chunked file transfers.
For drive push we may be able to do something similar here for whole file transfers if drive were to keep a running md5 as the whole file goes out and then ask Google what the md5 is for the file after the transfer is complete (and then fire off a retry if the checksums differ). For this to work in practice, I think the Drive API would have to be ready to return the file's md5 immediately following the reception of the file, which may or may not be possible depending on if Google computes & returns a running md5 as the file streams in or if they're computing the md5 over the bytes they have after they're on disk (in which case the md5 would probably be available too late for our purposes).
drive pull transfer verification for a full file should be straightforward: calculate the running md5 as the file streams in and then compare it to the md5 returned by the Drive API to decide if a retry is needed.

I think any file transfer verification checksumming is out of scope for this PR.

@kcwu
Copy link

kcwu commented Mar 27, 2015

Regarding to rolling checksum mentioned by @l3iggs , it is not possible.

What I am concerned are following two cases.

  • size identical, mtime identical
  • size identical, mtime differ

The old behavior: (-ignore-checksum=false)

  • size identical, mtime identical - sync or not depend on checksum
  • size identical, mtime differ - always do checksum and always sync

The last case need to explain, please look https://github.com/odeke-em/drive/blob/master/src/types.go#L267 fileDifferences() and https://github.com/odeke-em/drive/blob/master/src/types.go#L296 op()

fileDifferences() will compute differ flags first. So if size equals and ignoreChecksum=false, it always do checksum. However, after that, even checksum equals, last check of op() if modTimeDiffers(mask) still return OpMod.

The new behavior after this patch: (-ignore-checksum=true)

  • size identical, mtime identical - treat identical, no sync
  • size identical, mtime differ - treat file differ, sync

Following are what I expected

-ignore-checksum=false. I prefer this one as default.

  • size identical, mtime identical - treat identical, no sync
  • size identical, mtime differ - sync or not depend on checksum

with -ignore-checksum=true

  • size identical, mtime identical - treat identical, no sync
  • size identical, mtime differ - treat file differ, sync

maybe somebody like this new flag -always-checksum, i don't know.

  • size identical, mtime identical - sync or not depend on checksum
  • size identical, mtime differ - treat file differ, sync (no need to checksum)

@odeke-em
Copy link
Owner Author

I'll look at this later, however this might be two releases from now.

@kcwu
Copy link

kcwu commented Mar 28, 2015

FYI, I found the behavior of "-always-checksum" I described above, is actually what rakyll's original code does.
https://github.com/rakyll/drive/blob/master/types.go#L130

@odeke-em
Copy link
Owner Author

@kcwu
Copy link

kcwu commented Mar 31, 2015

L267-282 is run before L315-323. In fileDifferences(), checksum calculation is done even if it is not used later.
Is this what you are asking?

@odeke-em
Copy link
Owner Author

checksum calculation is done only if you aren't ignoring it, and if size is the same.

@odeke-em
Copy link
Owner Author

And if you notice, if we are going to actually ignore the checksum by default should then toggling the DifferChecksum bit should produce the expected result.

@kcwu
Copy link

kcwu commented Mar 31, 2015

Hmm, I didn't answer it accurately. Let me try again.

The problem is, L315-323 if "mtime differ", you decided to sync no matter the checksum is identical or not. Given that, calculate checksum is wasting CPU and time.

In other words, according to L315-323, if mtime differs, we actually don't care DifferChecksum bit. The problem appears because original logic have been split into two parts.

@odeke-em
Copy link
Owner Author

See the thing is changing mtime is a separate operation from calculating the checksum so I don't see how those two should be coupled. Actually there are lots of times for which you just want to change the mtime and not the content e.g after touch locally or even on the server.

@odeke-em
Copy link
Owner Author

So as you would see from that, that is not wasting the CPU and time.

@kcwu
Copy link

kcwu commented Mar 31, 2015

Yes, you are right. Sorry I misunderstood how it works.

So, the current behavior should be
The old behavior: (-ignore-checksum=false)

  • size identical, mtime identical - sync or not depend on checksum
  • size identical, mtime differ - sync or not depend on checksum

The new behavior after this patch: (-ignore-checksum=true)

  • size identical, mtime identical - treat identical, no sync
  • size identical, mtime differ - treat file differ, sync

Is this right? If so, what do you think about following behavior? (assume no conflict)

  • size identical, mtime identical - treat identical, no sync, no need to calculate checksum
  • size identical, mtime differ - sync or not depend on checksum

@odeke-em
Copy link
Owner Author

Sorry for the late reply, I was just completing these essays ;) #studentlife

Is this right? If so, what do you think about following behavior? (assume no conflict)

size identical, mtime identical - treat identical, no sync, no need to calculate checksum
size identical, mtime differ - sync or not depend on checksum

size identical, mtime differ - sync or not depend on checksum
mtime should be decoupled from the checksum discussion because mtime is an independent property that is independent of checksum, and must be propagated to the users' files e.g if they upload photos from a media card but maybe getting them to their desktop caused a touch, they'd probably want to keep the dates.

@l3iggs
Copy link

l3iggs commented Mar 31, 2015

Will drive ever get delta upload functionality?
i.e. We're pushing a 10GB file over a 10GB file on the server where only the first 100k are different between the two files (virtual machine disk images for example). Delta upload would mean we don't have to transfer the whole thing to get the cloud version in sync. (this is relevant to this discussion)

@odeke-em
Copy link
Owner Author

Remember you once opened this issue then evaluated and closed it? Also I don't think delta uploads are relevant to this discussion as that is a beast of its own, I'd rather keep this small and solveable. Delta uploads can take their own issue and discussion :)

@l3iggs
Copy link

l3iggs commented Mar 31, 2015

I'm not suggesting we talk in depth about it here. My question was about drive's roadmap. If delta uploads are not on the horizon then I have opinion A on this issue. If they are on the horizon, then I have opinion B. (where A and B are secrets right now :-))

@odeke-em
Copy link
Owner Author

Alright. Since you asked, I've been looking at it offline but it is quite haxorish. It is quite a risky option because it will require quite a bit of work and indexing that I would prefer to keep it off for now before it is well understood. Also Google Drive isn't yet exposing checksum-ing chunks. However it allows for checking versions and etags ( from this you can see how the pieces come together). So to answer your question: on the record no plan yet, off the record yes.

@l3iggs
Copy link

l3iggs commented Apr 1, 2015

Ok, then my opinion on how this should work is this: When checking if a transfer needs to be done (for push or pull):
Drive should compare all metadata (not including checksum) available to it for the local and remote files. If any differences are found, then transfer the file.

If the user passes a --checksum flag, compare all file metadata AND the checksum (which will be calculated over every single file in the transfer at this point), if any differences are found, then transfer the file.

@odeke-em odeke-em merged commit fda55c7 into master May 2, 2015
@odeke-em odeke-em deleted the checksum-off-by-default branch May 9, 2015 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants