upload/download(/organize): --sync mode #365

Closed
yarikoptic opened this issue Feb 5, 2021 · 13 comments · Fixed by #616

@yarikoptic
Member

Both upload and download should have an option to "sync" two copies of a dandiset in its entirety. The need for this functionality was mentioned before in #41 (comment) and #48, but apparently there was no dedicated issue for it, so I decided to file a new one (although #48 is likely to be addressed along the way).

Should be

  • done only toward the dandi-api (no girder) implementation. Thus it is largely blocked ATM by "Handle uploading already-present files in new API" #347
  • smart about avoiding re-uploads of files: files already uploaded must not be re-uploaded. I see multiple ways we could do that. By and large the specific implementation would depend on whether we would like to "preserve" the identity/provenance of an already existing asset (i.e. whether we keep the asset UUID while changing its metadata).
  • smart about "deletion" of assets on the "remote" end (e.g. to not delete until a replacement file is obtained); a rough per-asset sketch follows this list
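
A minimal sketch of that per-asset logic, with hypothetical names (plan_sync, plus plain dicts standing in for the real asset listings; the actual client API may differ):

```python
# Rough sketch of per-asset sync decisions. `local_assets` and
# `remote_assets` map asset paths to content digests; all names here
# are hypothetical, not the actual dandi-cli API.
def plan_sync(local_assets: dict, remote_assets: dict):
    to_upload, to_delete = [], []
    for path, digest in local_assets.items():
        # Skip assets whose content is already on the server (no re-upload)
        if remote_assets.get(path) != digest:
            to_upload.append(path)
    for path in remote_assets:
        # Remote assets with no local counterpart get deleted, but only
        # after uploads finish, so replacements land first
        if path not in local_assets:
            to_delete.append(path)
    return to_upload, to_delete
```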
yarikoptic added the enhancement (New feature or request) label Feb 5, 2021
@satra
Member

satra commented Feb 5, 2021

i'll add my comments to this:

just wanted to file this: we would likely want two upload modes [sync, append (default)] + a special delete mode to remove assets from a dandiset.

it may also be helpful to have a helper that can clear out a dandiset (with sufficient prompting for confirmation - type dandiset id, yes, etc.; see the sketch after the use cases below).

use cases:

  • 2+ groups submitting data collected in two different locations for the same project. each group should not have to download the data from other groups to sync/upload. the merge could happen on the server.
  • i want to clear out assets because the metadata or the NWB file did not have the correct info. it would be good to be able to clear out all the assets in a dandiset.
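
For the clear-out helper, a minimal sketch of the confirmation flow (the delete_all callable is a stand-in for whatever the client provides; the prompt wording is illustrative only):

```python
from typing import Callable

# Sketch of a destructive "clear out" confirmation: require the user to
# retype the dandiset id and answer "yes" before anything is deleted.
def confirm_and_clear(dandiset_id: str, delete_all: Callable[[str], None]) -> None:
    typed = input(f"Type the dandiset id ({dandiset_id}) to confirm: ")
    if typed.strip() != dandiset_id:
        print("Confirmation failed; nothing deleted.")
        return
    if input("Really delete ALL assets? [yes/no] ").strip().lower() != "yes":
        print("Aborted.")
        return
    delete_all(dandiset_id)  # stand-in for the actual delete call
```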

@yarikoptic
Member Author

> two upload modes [sync

dandi upload --sync

> , append (default)]

dandi upload

this should work ok with the "2+ groups" use case

> delete mode to remove assets from a dandiset.
> ...
> i want to clear out assets because the metadata or the NWB file did not have the correct info. it would be good to be able to clear out all the assets in a dandiset.

I think we better have a general ability to re-upload just the metadata, so something like --existing overwrite-metadata of #355. Delete + re-upload might cause avoidable data transfer + minting of new UUIDs (if a UUID is attached to an asset "draft instance", as it is now).

@satra
Member

satra commented Feb 6, 2021

but nwb files have changed - so not just metadata. this is what happened with one of the datasets that ben was working on.

@yarikoptic
Member Author

Then just upload --existing=refresh

@satra
Member

satra commented Feb 6, 2021

to clarify:

no flag                 == rsync -au
sync --existing=refresh == rsync -au --delete

@yarikoptic
Member Author

Whether within the current API implementation or with provisions looking forward in the API (#447), it is time to get sync implemented.

@jwodder
Member

jwodder commented Mar 11, 2021

@yarikoptic So "sync" mode should be the same as upload --existing=refresh/download --existing=refresh, but if a file on the local side (when uploading) or on the remote side (when downloading) does not exist on the other side, delete it?

jwodder added the waiting (Waiting on feedback/directions) label Mar 24, 2021
@yarikoptic
Member Author

I think any deletion should be done only with an explicit instruction/option for that. That is what rsync requires as well:

$> rsync --help | grep delete
--del                    an alias for --delete-during
--delete                 delete extraneous files from dest dirs
--delete-before          receiver deletes before xfer, not during
--delete-during          receiver deletes during the transfer
--delete-delay           find deletions during, delete after
--delete-after           receiver deletes after transfer, not during
--delete-excluded        also delete excluded files from dest dirs
--delete-missing-args    delete missing source args from destination
--ignore-errors          delete even if there are I/O errors
--max-delete=NUM         don't delete more than NUM files

(and IIRC other tools, for s3 etc., support similar option(s)). I think we could add a --delete=(None|during|after) option which would invoke the now-existing delete functionality; a sketch follows.
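
A sketch of how such an option might be declared with click (which dandi-cli's CLI is built on); the option name and choices just mirror the proposal above, not an actual implementation:

```python
import click

@click.command()
@click.option(
    "--delete",
    "delete_mode",
    type=click.Choice(["none", "during", "after"]),  # lowercased here
    default="none",
    help="When to delete extraneous assets from the destination",
)
def upload(delete_mode: str) -> None:
    # "none" keeps current behavior; "during"/"after" would hook into
    # the existing delete functionality at the corresponding stage
    click.echo(f"delete mode: {delete_mode}")
```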

yarikoptic removed the waiting (Waiting on feedback/directions) label May 4, 2021
@jwodder
Member

jwodder commented May 5, 2021

@yarikoptic So what exactly are the desired semantics for the --sync option?

@yarikoptic
Member Author

yarikoptic commented May 5, 2021

--sync pretty much just implies deletion of assets in the target location that are not present in the source location. E.g. for upload --sync of an entire dandiset, it should delete assets on the server not present locally; for a folder(s), delete assets in those folder(s) if not present locally; and for a single asset, it is kinda a no-op.

I think there is not much sense in adding a --delete option for now. Let's just do "after": collect information about the assets we are to delete, ask the user for confirmation ("{len(to_delete)} assets are to be removed from {destination}" or alike), and delete them. "during" might come in handy later if someone runs out of space on a local hard drive, I guess. And we still possibly overwrite the assets at the destination (the --existing option present in both download and upload still applies).
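
A minimal sketch of that "after" flow, assuming to_delete has already been collected and delete_asset is a stand-in for the per-asset delete call:

```python
from typing import Callable, List

# Collect, confirm, then delete -- with deletions strictly after transfers
def delete_after(to_delete: List[str], destination: str,
                 delete_asset: Callable[[str], None]) -> None:
    if not to_delete:
        return
    answer = input(
        f"{len(to_delete)} assets are to be removed from {destination}. "
        "Proceed? [yes/no] "
    )
    if answer.strip().lower() != "yes":
        return
    for path in to_delete:
        delete_asset(path)  # stand-in for the existing delete functionality
```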

@yarikoptic
Member Author

Renames (e.g. if a dandiset is reorganized but the content is the same):

For upload we do not need to be clever in any way, since we would not re-upload blobs that are already known, and since we are not tracking any kind of "provenance asset id" ATM, we have no need to do anything special.

For download it does get trickier. The initial implementation can avoid doing that (figuring out renames), but then we would need to add the logic (a rough sketch follows the list):

  • match assets between source/destination based on size first (fast, and can reduce the need for checksumming?). For those which have matches, compute the digest(s) (the one known on the remote end) for local files to establish the final mapping.
  • Establish a sequence of renamings (tricky, so as not to overwrite something which is yet to be renamed; if it is to be downloaded, I guess it could be the --delete=during kind of behavior)
  • Do the renames + downloads of what is missing
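
A rough sketch of the size-then-digest matching, with hypothetical inputs (local_files maps path to size, remote_assets maps path to (size, digest), and compute_digest is a stand-in checksum helper):

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

def find_renames(
    local_files: Dict[str, int],
    remote_assets: Dict[str, Tuple[int, str]],
    compute_digest: Callable[[str], str],
) -> Dict[str, str]:
    # Group local files by size so only size-matched candidates get
    # checksummed (the expensive step)
    by_size: Dict[int, list] = defaultdict(list)
    for path, size in local_files.items():
        by_size[size].append(path)
    renames: Dict[str, str] = {}  # local path -> remote path it matches
    for rpath, (size, digest) in remote_assets.items():
        for lpath in by_size.get(size, []):
            if lpath != rpath and compute_digest(lpath) == digest:
                renames[lpath] = rpath
                break
    return renames
```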

This --sync code, in particular the renaming, might be reusable for organize as well, where we do have a source (just differently named files) and then we pretty much sync them while giving them new filenames. The "source" and "destination" locations, though, might be the same if re-organizing an existing dandiset with --files-mode move, which would entail a bunch of renames.

@jwodder
Member

jwodder commented May 6, 2021

@yarikoptic For download --sync, what local files should be treated as assets? Should only *.nwb files be treated as assets, or should all files be, with some filtering out of VCS files?

@yarikoptic
Member Author

great question. For upload we have a devel option which triggers the switch between

paths = find_files(".*", paths) if allow_any_path else find_dandi_files(paths)

I think we should just use the same find_files invocation here (it should ignore VCS files, dot files and dirs, etc.) and see how that works out ;)
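
For illustration, that could boil down to something like the following (assuming find_files lives in dandi.utils, as the upload code above suggests; the exact module path may differ):

```python
# Treat any non-VCS, non-dot file under the given paths as a local
# asset candidate for download --sync, mirroring the upload code path
from dandi.utils import find_files  # import location assumed

def local_asset_paths(paths):
    # ".*" matches everything; find_files skips VCS and dot files/dirs
    return list(find_files(".*", paths))
```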

yarikoptic added a commit that referenced this issue May 10, 2021
Add "sync" option for upload & download