Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass credentials to git-annex for downloading embargoed text assets #21

Merged
merged 9 commits into from
Feb 2, 2024

Conversation

jwodder
Copy link
Member

@jwodder jwodder commented Jan 30, 2024

Closes #19.

@jwodder jwodder added bugfix Fixing something broken embargo Handling embargoed Dandisets & assets on the Archive labels Jan 30, 2024
@jwodder
Copy link
Member Author

jwodder commented Jan 30, 2024

@yarikoptic The tests are failing (both here and on my Mac) because, for some reason, git-annex can't find the git-annex-remote-datalad program despite datalad being installed in the virtual environment, and the error message that git-annex emits shows a different PATH value than expected. Neither tox nor click is resetting the PATH (I checked). Does git or git-annex reset PATH? Do you have any other ideas?

@jwodder
Copy link
Member Author

jwodder commented Jan 30, 2024

@yarikoptic Nevermind; it was something I did.

@jwodder
Copy link
Member Author

jwodder commented Jan 30, 2024

@yarikoptic The tests are currently failing because addurl can't download the test embargoed text file, and the error message just says "Unauthorized." I can get this to work using your script, both for the production Archive and staging, but I can't find any meaningful differences between that script and what I'm doing in this PR. Could you take a look at my PR and sanity-check it?

@yarikoptic
Copy link
Member

nothing immediate come to mind (you did produce providers config etc...). I will keep eyeballing -- but would it probably be possible to just add a debug invocation like

datalad -l debug download-url https://api-staging.dandiarchive.org/api/assets/6e8d374b-cfeb-498c-90d8-17aa533a4ef2/download/

before feeding that file.txt into addurl , ie just invoke datalad directly and see if it can download that file within that dataset or if not then may be why

@yarikoptic
Copy link
Member

good news -- reproduced locally, so will try to debug....

@yarikoptic
Copy link
Member

oh -- it is the --raw option you added to addurl -- it makes behavior a bit odd: it does still consult external remote for claimurl but then does not use it to actually download.

removing it proceeds forward until

        for path in embargoed_dandiset.blob_assets.keys():
            whereis = json.loads(
                await areadcmd("git-annex", "whereis", "--json", "--", path, cwd=ds.path)
            )
>           (web_urls,) = [
                w["urls"] for w in whereis["whereis"] if w["description"] == "web"
            ]
E           ValueError: not enough values to unpack (expected 1, got 0)

which I didn't troubleshoot further

@jwodder
Copy link
Member Author

jwodder commented Jan 31, 2024

@yarikoptic I can't find why we decided to use --raw, but the docs describe the option as:

Prevent special handling of urls by yt-dlp, bittorrent, and other special remotes. This will for example, make addurl download the .torrent file and not the contents it points to.

I can see why we would have originally wanted to set this, but I don't think that reason applies any longer — that is, assuming that:

  • The yt-dlp and bittorrent behaviors are the only built-in special cases (Indeed, this page lists only these as special uses for addurl), and "other special remotes" are things that the git-annex user has to set up
  • yt-dlp only applies to download URLs pointing to established video streaming sites (which our download URLs do not do)
  • bittorrent only applies to files ending in .torrent; as we currently only use addurl for text files, and the identify library does not classify .torrent files as text (nor should it), this won't apply to us

So, unless there's another special case that --raw blocks that we want to block, I'm going to remove that option.

@jwodder
Copy link
Member Author

jwodder commented Jan 31, 2024

@yarikoptic The ValueError is happening because, now that the datalad remote is set up, registering a download URL for a binary blob via registerurl causes the URL to be associated with the datalad remote rather than the web remote. Is this acceptable? Will the program need to do anything to undo this when unembargoing a dataset?

@yarikoptic
Copy link
Member

re --raw, good thinking for original motivation behind using it! I think it is worth enhancement to git-annex, thus filed https://git-annex.branchable.com/todo/addurl_--raw-except_REMOTEs__63__/ for @joeyh.

re registerurl association with datalad remote -- I think it is fine.

As for "undo"ing that -- actually have to ask @joeyh since nothing straightforward comes to mind [*]: https://git-annex.branchable.com/forum/how_to___34__move__34___URL_between_remotes__63__/

[*] I guess, besides modifying .web files in git-annex branch to IIRC remove some indicator that URL is to be handled by an external remote, we could probably just git remote rm datalad and run another git annex registerurl but that then I guess (didn't try; so if works) would duplicate info for that URL as it could be handled by an external remote and by web.

@yarikoptic
Copy link
Member

Will the program need to do anything to undo this when unembargoing a dataset?

could you please for now just add raise NotImplementedError("waiting on Joey's input, see https://git-annex.branchable.com/forum/how_to___34__move__34___URL_between_remotes__63__/ ") there so we could proceed as is (just addressing for that now URLs are associated with datalad remote).

@jwodder
Copy link
Member Author

jwodder commented Feb 2, 2024

@yarikoptic Done: 9f5adba

Copy link

codecov bot commented Feb 2, 2024

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (e76c760) 75.64% compared to head (8a12ad6) 74.59%.

Files Patch % Lines
src/backups2datalad/adataset.py 66.66% 5 Missing and 4 partials ⚠️
src/backups2datalad/syncer.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #21      +/-   ##
==========================================
- Coverage   75.64%   74.59%   -1.05%     
==========================================
  Files          17       17              
  Lines        2681     2712      +31     
  Branches      589      594       +5     
==========================================
- Hits         2028     2023       -5     
- Misses        482      519      +37     
+ Partials      171      170       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jwodder jwodder marked this pull request as ready for review February 2, 2024 20:02
@yarikoptic yarikoptic merged commit 8e239fa into main Feb 2, 2024
5 of 7 checks passed
@yarikoptic yarikoptic deleted the gh-19 branch February 2, 2024 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix Fixing something broken embargo Handling embargoed Dandisets & assets on the Archive
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Seems to not be able to access/modify private repos with current github token?
2 participants