Document why signed S3 URLs might be giving 400s when called from inside us-west-2 #188
Comments
You actually get a 400, and here is the smallest sample case:
When running from inside
which is pretty clear and useful! And on my laptop, this prints:
So we have a reproducible setup now. fsspec uses aiohttp under the hood, so this is the same issue fsspec is facing |
This is likely the aiohttp bug actually: aio-libs/aiohttp#2610 |
Here is the same code with requests:
import requests
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])
This actually produces the correct output on both my laptop and on us-west-2!
This is most likely because requests implemented request/request#1184, while the equivalent bug with aiohttp is still open. This is amazing news, as this means that fixing aio-libs/aiohttp#2610 should get fsspec to work, which means most of the pangeo stack would work after that. It will still have lower performance than using s3 directly when in us-west-2, so work there still needs to be done. But this will at least make sure regular https URLs work when both inside and outside us-west-2 |
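To make the redirect behaviour described above concrete, here is a minimal sketch of the rule "drop the Authorization header when a redirect leaves the original host". It is a simplified illustration, not the actual logic of requests or aiohttp; the function name `headers_after_redirect` and the example URLs are made up for this sketch.

```python
from urllib.parse import urlsplit


def headers_after_redirect(original_url, redirect_url, headers):
    """Return the headers to send when following a redirect.

    Simplified rule (an assumption for illustration, not the exact logic of
    any library): keep Authorization only if the redirect stays on the same
    host, and drop it otherwise.
    """
    same_host = urlsplit(original_url).hostname == urlsplit(redirect_url).hostname
    if same_host:
        return dict(headers)
    # Header names are case-insensitive, so compare in lowercase.
    return {k: v for k, v in headers.items() if k.lower() != "authorization"}


headers = {"Authorization": "Basic dXNlcjpwYXNz", "Accept": "*/*"}

# Hypothetical redirect from a data host to a signed S3 URL on another host.
to_s3 = headers_after_redirect(
    "https://data.example.org/file.h5",
    "https://bucket.s3.us-west-2.amazonaws.com/file.h5?X-Amz-Signature=abc",
    headers,
)
print(to_s3)
```

With a rule like this, the signed S3 URL receives no extra credentials, which is what a signed URL expects.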
aiohttp has documented this should not be the case, based on the note here: https://docs.aiohttp.org/en/stable/client_advanced.html?highlight=redirects#custom-request-headers |
I also looked at the request being made by aiohttp, and see the following:
So I think this confirms that the Authorization header is being retained during redirects. |
I've now found @betolink's comment in aio-libs/aiohttp#5783 (comment), which made me realize that what we want is for the credentials to be forwarded when we are redirected to earthdata login, but then dropped. What we are getting instead is the credentials being sent to every host in the redirect chain. |
AHA, so what's actually happening is that we are setting the basic auth on the session, rather than on the request. So it's being sent to every request from the session, including S3! This actually now is unrelated to the aiohttp bug |
if I move the
So the question now really is why does requests work? Separately, it should be possible for us to subclass aiohttp's ClientSession to pass per-host basicauth so it can provide appropriate auth to different hosts in the chain, and just send basic auth to earthdata. |
ok, so I have discovered why it works with requests but not with aiohttp. It is because requests supports netrc lol! So at the first redirect, requests drops the Authorization header, but when making the request to EDL, it reads the netrc file directly and sends the appropriate credentials! So that is why it works by default with requests, and not with aiohttp. So to summarize, the current problem is that we pass parameters to fsspec that are set at the
import asyncio
import aiohttp
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
auth = aiohttp.BasicAuth(username, password)
async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get(url, auth=auth) as response:
            print(response.status)
            print((await response.read())[:30])
asyncio.run(main())
This actually will fail with a HTTP Basic request denied error anywhere, which makes sense - the
If I recreate this with
import requests
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
username = "yuvipanda"
password = "mypassword"
resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])
I get the exact same behavior. WHICH IS GREAT! So the problem now isn't to do with redirects at all, it is really - how do we make sure to send the HTTP Basic Creds just to EDL? Because right now, the reason this works with non-cloud datasets is that we are actually leaking plaintext EDL creds to all of them, completely negating the point of OAuth2 :D |
I see |
So the current issue is really that aiohttp has no way to say 'for this domain, send this authentication information'. requests accidentally provides this with netrc, but otherwise doesn't afaict. |
So, netrc support is actually the easiest way to make sure that we can send specific Basic Auth credentials only to specific Hosts. So I made this PR adding it to aiohttp! aio-libs/aiohttp#7131 If merged and released, this should sort of automatically make fsspec work again. |
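As a rough illustration of why netrc gives host-scoped credentials, the sketch below looks up Basic auth by hostname using only the standard library. It is not the code from the aiohttp PR; the helper name `basic_auth_for`, the placeholder credentials, and the example URLs are assumptions made for this example.

```python
import base64
import netrc
import os
import tempfile
from urllib.parse import urlsplit


def basic_auth_for(url, netrc_path):
    """Return an Authorization header value for the URL's host, or None.

    Credentials are looked up by hostname, so they are only ever offered to
    hosts that have an entry in the netrc file.
    """
    host = urlsplit(url).hostname
    entry = netrc.netrc(netrc_path).authenticators(host)
    if entry is None:
        return None
    login, _account, password = entry
    token = base64.b64encode(f"{login}:{password}".encode()).decode()
    return f"Basic {token}"


# Throwaway netrc file with placeholder credentials, for the demo only.
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as f:
    f.write("machine urs.earthdata.nasa.gov login someuser password somepass\n")
    demo_netrc = f.name

edl = basic_auth_for("https://urs.earthdata.nasa.gov/oauth/authorize", demo_netrc)
s3 = basic_auth_for("https://bucket.s3.us-west-2.amazonaws.com/file.h5", demo_netrc)
os.remove(demo_netrc)
print(edl is not None, s3)
```

Only the EDL host gets a header here; the S3 host gets nothing, so the signed URL is left alone.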
Amazing work @yuvipanda! I'm just catching up with this thread. One thing I'd like to mention is that -if possible- it would be preferable to have a solution/workaround that does not rely on having a |
@betolink so I think these tokens (https://urs.earthdata.nasa.gov/documentation/for_users/user_token) should get rid of the need for netrc completely. I have no idea why people are restricted to just two tokens per user - that makes it definitely harder to use :( |
I dug some more into what
What we need is something like I think this is a fairly well scoped and small change to fsspec that would be extremely useful! I'm super swamped though, I am hoping someone else can implement this? |
Opened fsspec/filesystem_spec#1142 to discuss what would help solve the issue from fsspec in allowing us to use tokens! |
Turns out this already exists in fsspec - any kwargs you pass in actually get passed directly to the requests, exactly what we wanted! So the following code works for me :)
from fsspec.implementations.http import HTTPFileSystem
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"
token = 'my-long-token'
fs = HTTPFileSystem(headers={
    "Authorization": f"Bearer {token}"
})
with fs.open(url) as f:
    print(f.read()[:30])
yay! |
ok, so current summary is:
Unfortunately, there is a limit of only two tokens per user in earthdata login right now, so you cannot just generate a token for each machine you would use it on, as you can with GitHub personal access tokens. However, the lack of need for specific files means this would also work with dask. |
Here is an example of it working with xarray!
|
This is awesome @yuvipanda! I feel like we need to refactor this library to only use CMR tokens everywhere instead of monkey-patching OAuth2 redirects for cloud-hosted data. I wish DAAC hosted data would follow the same behavior with bearer tokens. i.e.
Also, maybe we only need one token even if we use it concurrently from different processes? I haven't tested but I suspect it should work. |
@betolink yeah we should only need one token even if it is used concurrently. So the token only works for some datasets but not all? And works for cloud datasets but not on-prem? Does it work for any on prem thing at all? |
I'm afraid it won't work for on-prem data, it may work for some data hosted at the ASF DAAC marked on-prem but actually hosted at AWS. This is tremendous progress! Now there is a clear path for one of the most common access patterns! |
@betolink feels like long term, the right way is to get the access token to work for all data, and support the earthdatalogin folks in this mission. In the meantime, netrc is the more universal solution, once we get the aiohttp pr merged. But that is slightly messy when it comes to dask, because it requires populating a specific file in the dask worker which is not always easy. Does that sound right? |
Me and @briannapagan did another bit of deep dive here, and made some more progress. There seem to be two primary packages supporting earthdata login on the server side:
We have established that TEA already supports bearer tokens (https://github.com/asfadmin/thin-egress-app/blob/7b0f7110b1694f553af2b71594cc19e40c179ea9/lambda/app.py#L183). But what of the apache2 module?! As of Sep 2021, it also supports bearer tokens! https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187 is the appropriate merge commit, and we discovered an internal JIRA ticket named
With some more sleuthing, we discovered https://forum.earthdata.nasa.gov/viewtopic.php?t=3290. We tracked that through looking for URSFOUR-1858, mentioned in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/8c4796c0467a1d5dcb8740fb86f23474db8258e3. That merge was the only further activity on the apache module since the merge for token support. Looking through that earthdata forum post, we see that LPDAAC (which maintains the dataset talked about there) mentions deploying 'some apache change' to help with that. So the hypothesis I had was:
I tested this hypothesis by trying to send a token to
So the pathway to using tokens everywhere, including onprem, boils down to getting all the DAACs to use the latest version of the official earthdata apache2 module. This is great news for many reasons:
|
Also, passing |
NSIDC also seems to have the latest version of the apache module - So looks like some (many?) DAACs have this deployed, and some don't. |
@betolink in fact, the exact URL you used to test tokens earlier in #188 (comment) works now. My suspicion is that NSIDC deployed the latest version of the apache2 module very recently? |
ASDC also supports tokens, as tested with Again, I'm using the presence of |
ORNL also supports it, as tested via |
Note that uppercase |
podaac (tested with |
This issue has spawned off many different things, so here's a quick summary:
1. Support using HTTPS + .netrc with xarray universally
Currently, it is not possible to use
2. Support using EDL user tokens universally
For cloud access, EDL tokens already work with xarray (#188 (comment) has an example). However, it doesn't work universally - many on-prem servers don't support EDL tokens yet, although some do. Me (from outside NASA) and @briannapagan (from inside) are pushing on this, getting EDL token support rolled out more universally. If you are inside a DAAC, we could use your help!
3. Determine when to use s3:// protocol vs https:// protocol when inside us-west-2
End goal for education
My intuitive end goal here is to be able to tell people to 'use HTTPS links with token auth' universally, regardless of where they are accessing the data from, with an addendum suggesting using the |
An addendum to #188 (comment) that me and @briannapagan discovered is OpenDAP, offered mostly by the Hyrax server. It also uses the apache module for authentication (https://opendap.github.io/hyrax_guide/Master_Hyrax_Guide.html#_earthdata_login_oauth2), regardless of whether it is on-prem or on the cloud. So my understanding is that all opendap behind earthdata is using the apache module, so those would also need the module updated to support the token. This also means that the apache module is going to be with us for a long time, not just for on-prem work, as it is used for cloud hosted opendap too. |
Writing this as a reminder: apache requires a capital B in Bearer, which matters for on-premise files. This also works for cloud files, so we should use the following: |
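The snippet originally posted with this comment is not preserved here. As a stand-in, this is a minimal standard-library sketch of building the header with a capital "B"; the helper name `bearer_headers`, the placeholder token, and the example URL are assumptions made for illustration.

```python
import urllib.request


def bearer_headers(token):
    """Build request headers carrying an EDL user token.

    "Bearer" is spelled with a capital B, as noted in the thread.
    """
    return {"Authorization": f"Bearer {token.strip()}"}


# Placeholder token and URL; the request object is built but never sent.
headers = bearer_headers("my-long-token")
req = urllib.request.Request("https://data.example.org/file.h5", headers=headers)
print(req.get_header("Authorization"))
```

The same dictionary could be passed as the `headers` argument in the fsspec `HTTPFileSystem` example earlier in the thread.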
May not be the right thread, but dropping a note here so it's more permanent than Slack: A while ago, I stumbled across these Twitter threads about some climate data stored in Zarr on OpenStorageNetwork's S3 buckets with HTTP URLs. The example they show accesses Zarr directly via an HTTP URL. https://twitter.com/charlesstern/status/1574497421245108224?s=20&t=rLvID-0c1j1NxHgy0JOCjQ Here's a direct link to the corresponding Pangeo feedstock (in case Twitter dies): https://pangeo-forge.org/dashboard/feedstock/79 From what I can tell, the underlying storage here is OpenStorageNetwork, which provides the S3 API via Ceph. How exactly all of this is wired and optimized is a bit beyond me, but the end result is compelling and may have some interesting lessons for how we do S3/HTTP. |
Bringing in @cisaacstern to maybe provide some extra feedback to Alexey's last message. |
Happy to contribute however I can! We do currently use an OSN allocation as our default storage target for Pangeo Forge. |
Do I have to be in the AWS Seems so, as my notebook that doesn't work from my laptop does work when running in an EC2 instance in Oregon... |
Hi Alex, yes, you must have an EC2 instance running in the same region as the S3 bucket ( Andy Barrett |
Ok, I got it working using @yuvipanda's code above. Needs a little fix in a related project, which I've raised as a PR: nasa/EMIT-Data-Resources#24 |
@alexgleith just FYI, there are a few catches when we access HTTPS:// instead of S3://:
Finally, if we are running our code in us-west-2, we can use S3FS with the S3:// urls and we can use earthaccess to get us the authenticated sessions if we know the DAAC.
import earthaccess
import xarray as xr
earthaccess.login()
url = "s3://some_nasa_dataset"
fs = earthaccess.get_s3fs_session("LPDAAC")
# we open our granule in a s3fs context and we work as usual
with fs.open(url) as file:
    dataset = xr.open_dataset(file) |
Thanks @betolink For the work I'm doing, it's exploratory so performance isn't important yet. And I don't think that for the NetCDF files chunking matters, since they're not optimised for it. (Happy to be corrected there!) I'm just doing a little project on the EMIT data and there's enough complexity in the data itself that I'm happy with the HTTPS loading process. Thanks for your help! |
Just linking some benchmarks from @hrodmn comparing s3:// and https:// access for a year's worth of Harmonized Landsat Sentinel-2 (HLS) data from LP-DAAC on us-west-2 at https://hrodmn.dev/posts/nasa-s3/index.html. There's about a 0.25 seconds speed advantage (8.08s with |
This is great @weiji14! Just this week @yuvipanda and I were talking about this and the pros and cons of defaulting to HTTPS. earthaccess handles the switch already: if it's running in AWS it will use S3, and HTTPS if not. It does this by requesting the instance metadata on an IP range only available inside AWS (although it does not check the region yet): https://github.com/nsidc/earthaccess/blob/54b688b906776f5c845483dd00676f6c681feb10/earthaccess/store.py#LL67C36-L67C36
If we request data like
granules = earthaccess.search_data(...)
ds = xr.open_mfdataset(earthaccess.open(granules))
and run this code in AWS, it will use the S3 links and S3FS to open them. |
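The decision discussed here (S3 links inside the bucket's region, HTTPS everywhere else) can be sketched as a small pure function. This is only an illustration of the idea, not how earthaccess implements it; `pick_link`, its parameters, and the example URLs are hypothetical, and how the current region is detected is deliberately left out.

```python
def pick_link(https_url, s3_url, current_region, bucket_region="us-west-2"):
    """Prefer the s3:// link only when running in the bucket's region.

    current_region is None when the code is not running in AWS at all.
    """
    if current_region is not None and current_region == bucket_region:
        return s3_url
    return https_url


https_url = "https://data.example.org/granule.h5"  # placeholder
s3_url = "s3://example-bucket/granule.h5"  # placeholder

print(pick_link(https_url, s3_url, current_region=None))
print(pick_link(https_url, s3_url, current_region="us-west-2"))
print(pick_link(https_url, s3_url, current_region="us-east-1"))
```

Checking the region as well as "inside AWS" covers the case of an EC2 instance outside us-west-2, where direct S3 access would not work.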
Just to note, aiohttp finally made a release! So |
Sweet! Should we add a pin for aiohttp and mark this resolved @betolink ? |
@yuvipanda @betolink Can this be fully closed out now? Do we still need to pin this? We weren't seeing this referenced in https://github.com/nsidc/earthaccess/blob/main/pyproject.toml |
Sometimes when you make a request to a URL behind earthdata login, after a series of redirects, you get sent to a signed S3 URL. This should be transparent to the client, as the URL itself contains all the authentication needed for access.
However, sometimes, in some clients, you get a generic 403 Forbidden here without much explanation. It has something to do with other auth being sent alongside (see #187 for more vague info).
We should document what this is, and why you get the 403. This documentation would allow developing workarounds for various clients if needed.