Document why signed S3 URLs might be giving 400s when called from inside us-west-2 #188

Open
yuvipanda opened this issue Dec 13, 2022 · 47 comments

Comments

@yuvipanda

Sometimes when you make a request to a URL behind earthdata login, after a series of redirects, you get sent to a signed S3 URL. This should be transparent to the client, as the URL itself contains all the authentication needed for access.

However, sometimes, in some clients, you get a generic 403 Forbidden here without much explanation. It has something to do with other auth being sent alongside (see #187 for more vague info).

We should document what this is, and why you get the 403. This documentation would allow developing workarounds for various clients if needed.

@yuvipanda yuvipanda changed the title Document why signed S3 URLs might be giving 403s when called from inside us-west-2 Document why signed S3 URLs might be giving 400s when called from inside us-west-2 Dec 15, 2022
@yuvipanda
Author

yuvipanda commented Dec 15, 2022

You actually get a 400, and here is the smallest sample case:

import asyncio
import aiohttp
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')

auth = aiohttp.BasicAuth(username, password)

async def main():
    async with aiohttp.ClientSession(auth=auth) as session:
        async with session.get(url) as response:
            print(response.status)
            print((await response.read())[:30])

asyncio.run(main())

When running from inside us-west-2, this prints:

400
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidArgument</Code><Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message><ArgumentName>Authorization</ArgumentName><ArgumentValue>Basic eXV2aXBhbmRhOmFpc2hlZTh3b29naGFobmdpZW1vb3Nob0thaXhpaWJl</ArgumentValue><RequestId>XM26KTSJ4X85W6YR</RequestId><HostId>gjjlJGJmgjalTBXzAnnMg4eBl2MCd3k9UD4klvAO3Rjd18TOB3QCgDC3bAMwciPyIRrStqrD4SQ=</HostId></Error>

which is pretty clear and useful!

And on my laptop, this prints:

200
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08\x00\x04\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

So we have a reproducible setup now. fsspec uses aiohttp under the hood, so this is the same issue fsspec is facing.
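The S3 error above boils down to two auth mechanisms arriving in one request: the signature in the query string and the Authorization header. A minimal sketch of that check, using only the standard library, might look like this. The function names are made up for illustration and are not part of aiohttp, fsspec, or any AWS SDK.

```python
from urllib.parse import parse_qs, urlsplit


def is_presigned_s3_url(url: str) -> bool:
    """Return True if the URL carries SigV4 query-string authentication."""
    query = parse_qs(urlsplit(url).query)
    return "X-Amz-Signature" in query or "X-Amz-Algorithm" in query


def has_conflicting_auth(url: str, headers: dict) -> bool:
    """Return True if a presigned URL would also be sent an Authorization header."""
    header_names = {name.lower() for name in headers}
    return is_presigned_s3_url(url) and "authorization" in header_names


# Hypothetical URL, shaped like the redirect target seen in this issue.
signed = (
    "https://bucket.s3.us-west-2.amazonaws.com/key.h5"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Signature=abc"
)
print(has_conflicting_auth(signed, {"Authorization": "Basic xyz"}))
```

A client could run a check like this before following a redirect and drop the header when it returns True.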

@yuvipanda
Author

This is likely the aiohttp bug actually: aio-libs/aiohttp#2610

@yuvipanda
Author

Here is the same code with requests:

import requests
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')

resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])

This actually produces the correct output on both my laptop and on us-west-2!

200
b'\x89HDF\r\n\x1a\n\x00\x00\x00\x00\x00\x08\x08'

This is most likely because requests implemented request/request#1184, while the equivalent bug with aiohttp is still open.

This is amazing news, as this means that fixing aio-libs/aiohttp#2610 should get fsspec to work, which means most of the pangeo stack would work after that. It will still have lower performance than using s3 directly when in us-west-2, so work there still needs to be done. But this will at least make sure regular https URLs work both inside and outside us-west-2.

@yuvipanda
Author

aiohttp has documented this should not be the case, based on the note here: https://docs.aiohttp.org/en/stable/client_advanced.html?highlight=redirects#custom-request-headers

@yuvipanda
Author

I also looked at the request being made by aiohttp, and see the following:

RequestInfo(url=URL('https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5?A-userid=yuvipanda&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2D3OGJNTHYLUSH3P/20221215/us-west-2/s3/aws4_request&X-Amz-Date=20221215T210703Z&X-Amz-Expires=3109&X-Amz-Security-Token=FwoGZXIvYXdzEN7//////////wEaDDp5wsiHWectpsmbPiK4AdzdhJBq0QIbppB7sa9DQ2po6R29dB1t2g0ACyx3h4keIqL4FLppwe3TShd9rcdJqC11UxTiOKoiVUVcrt%2BbwLAcd8wfVIMfUpze8ChSWCekiBQtIzyJGeelId6jn38rPFD71lXGUeaM/di/BFT6txD5j9g8br7BuQI8Jhwycn93lWgKv8zrfGgHwREt6wIaQ63ugKpseloAeGO0le6pz9oPL5P4cYn9SZjhGa7LgqqeeRHIGQKCHHEojJXunAYyLe6bzYyOU0h/2QqKZrFudhm772RwPg0LuXexViJ1Ae28OYexT/8xDD68yfsWjg%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=dbca3da4e6e9f3c1257db628be1d4aaeb3b2f67d931d53bf27440db980edebf6'), method='GET', headers=<CIMultiDictProxy('Host': 'nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'Python/3.9 aiohttp/3.8.3', 'Authorization': 'Basic <removed>')>, real_url=URL('https://nsidc-cumulus-prod-protected.s3.us-west-2.amazonaws.com/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5?A-userid=yuvipanda&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2D3OGJNTHYLUSH3P/20221215/us-west-2/s3/aws4_request&X-Amz-Date=20221215T210703Z&X-Amz-Expires=3109&X-Amz-Security-Token=FwoGZXIvYXdzEN7//////////wEaDDp5wsiHWectpsmbPiK4AdzdhJBq0QIbppB7sa9DQ2po6R29dB1t2g0ACyx3h4keIqL4FLppwe3TShd9rcdJqC11UxTiOKoiVUVcrt%2BbwLAcd8wfVIMfUpze8ChSWCekiBQtIzyJGeelId6jn38rPFD71lXGUeaM/di/BFT6txD5j9g8br7BuQI8Jhwycn93lWgKv8zrfGgHwREt6wIaQ63ugKpseloAeGO0le6pz9oPL5P4cYn9SZjhGa7LgqqeeRHIGQKCHHEojJXunAYyLe6bzYyOU0h/2QqKZrFudhm772RwPg0LuXexViJ1Ae28OYexT/8xDD68yfsWjg%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=dbca3da4e6e9f3c1257db628be1d4aaeb3b2f67d931d53bf27440db980edebf
'))

So I think this confirms that the Authorization header is being retained during redirects.

@yuvipanda
Author

I've now found @betolink's comment in aio-libs/aiohttp#5783 (comment), which made me realize that what we want is for the credentials to be forwarded when we are redirected to earthdata login, but then dropped. What we are getting instead is the credentials being sent to every host in the redirect chain.

@yuvipanda
Author

AHA, so what's actually happening is that we are setting the basic auth on the session, rather than on the request. So it's being sent to every request from the session, including S3! This actually now is unrelated to the aiohttp bug

@yuvipanda
Author

If I move the auth= to just the request, I get a basic 401 denied, because the Basic auth is dropped during the redirect, which is correct and documented aiohttp behavior.

So the question now really is why does requests work?

Separately, it should be possible for us to subclass aiohttp's ClientSession to pass per-host basicauth so it can provide appropriate auth to different hosts in the chain, and just send basic auth to earthdata.
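As a rough sketch of the per-host idea, the core piece is a lookup that returns an Authorization header only when the request's host has credentials registered. Everything below is illustrative: `headers_for` and the `credentials_by_host` mapping are hypothetical names, and wiring this into aiohttp (for example by passing `allow_redirects=False` and following each `Location` manually) is left out.

```python
import base64
from urllib.parse import urlsplit

EDL_HOST = "urs.earthdata.nasa.gov"


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def headers_for(url: str, credentials_by_host: dict) -> dict:
    """Return auth headers for this URL's host, or no headers at all."""
    host = urlsplit(url).hostname
    creds = credentials_by_host.get(host)
    if creds is None:
        return {}
    return {"Authorization": basic_auth_header(*creds)}


# Placeholder credentials and a made-up redirect chain for illustration.
creds = {EDL_HOST: ("user", "pass")}
chain = [
    "https://data.nsidc.earthdatacloud.nasa.gov/some/granule.h5",
    "https://urs.earthdata.nasa.gov/oauth/authorize",
    "https://bucket.s3.us-west-2.amazonaws.com/granule.h5?X-Amz-Signature=abc",
]
for hop in chain:
    print(urlsplit(hop).hostname, sorted(headers_for(hop, creds)))
```

Only the EDL hop gets the header; the data host and the signed S3 URL get none, which is the behavior described above.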

@yuvipanda
Author

yuvipanda commented Dec 15, 2022

ok, so I have discovered why it works with requests but not with aiohttp.

It is because requests supports netrc lol!

So at the first redirect, requests drops the Authorization header, but when making the request to EDL, it reads netrc file directly and sends the appropriate credentials! So that is why it works by default with requests, and not with aiohttp.
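To make the netrc step concrete, here is a small sketch of a per-host lookup using only the standard library `netrc` module. It approximates the idea and is not a copy of what requests does internally; `netrc_credentials` is a hypothetical helper name.

```python
import netrc
from urllib.parse import urlsplit


def netrc_credentials(url, netrc_path=None):
    """Return (login, password) for the URL's host from a netrc file, or None.

    With netrc_path=None the default ~/.netrc is used.
    """
    host = urlsplit(url).hostname
    try:
        entry = netrc.netrc(netrc_path).authenticators(host)
    except (FileNotFoundError, netrc.NetrcParseError):
        return None
    if entry is None:
        return None
    login, _account, password = entry
    return login, password
```

Because the lookup is keyed on the host of each request, credentials for urs.earthdata.nasa.gov are only found when the request actually goes to EDL, and the S3 hop finds nothing.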

So to summarize, the current problem is that we pass parameters to fsspec that are set at the ClientSession level, and those are sent with every request. So the Authorization header is also sent when making the request to S3, and it fails. This is validated with the following code:

import asyncio
import aiohttp
import netrc
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

username, _, password = netrc.netrc().authenticators('urs.earthdata.nasa.gov')

auth = aiohttp.BasicAuth(username, password)

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get(url, auth=auth) as response:
            print(response.status)
            print((await response.read())[:30])

asyncio.run(main())

This will actually fail with an HTTP Basic request denied error anywhere, which makes sense: the Authorization header is dropped at the first redirect to EDL, and then we get an access denied.

If I recreate this with requests by deleting my netrc file:

import requests
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"


username = "yuvipanda"
password = "mypassword"
resp = requests.get(url, auth=(username, password))
print(resp.status_code)
print(resp.content[:15])

I get the exact same behavior.

WHICH IS GREAT! So the problem now isn't to do with redirects at all, it is really - how do we make sure to send the HTTP Basic Creds just to EDL? Because right now, the reason this works with non-cloud datasets is that we are actually leaking plaintext EDL creds to all of them, completely negating the point of OAuth2 :D

@yuvipanda
Author

I see trust_env passed along to the aiohttp session, but aiohttp only uses this for proxies, not for authenticating to servers themselves.

@yuvipanda
Author

So the current issue is really that aiohttp has no way to say 'for this domain, send this authentication information'. requests accidentally provides this with netrc, but otherwise doesn't afaict.

@yuvipanda
Author

So, netrc support is actually the easiest way to make sure that we can send specific Basic Auth credentials only to specific Hosts. So I made this PR adding it to aiohttp! aio-libs/aiohttp#7131

If merged and released, this should sort of automatically make fsspec work again.

@betolink
Member

Amazing work @yuvipanda! I'm just catching up with this thread. One thing I'd like to mention is that, if possible, it would be preferable to have a solution/workaround that does not rely on having a .netrc (even though that is what we have been doing for the tutorials).

@yuvipanda
Author

@betolink so I think these tokens (https://urs.earthdata.nasa.gov/documentation/for_users/user_token) should get rid of the need for netrc completely. I have no idea why people are restricted to just two tokens per user - that makes it definitely harder to use :(

@yuvipanda
Author

I dug some more into what fsspec would need to do for us to use client tokens.

fsspec currently supports a client_kwargs that allows setting headers and other misc options for all requests. This accidentally works now when making requests behind EDL from outside us-west-2, but doesn't work from inside (for all the reasons outlined in this issue). So we can not use the auth tokens with it either.

What we need is something like request_kwargs (that is passed into places like https://github.com/fsspec/filesystem_spec/blob/45de5b509bacf8a62d99848bb2361cc78733ad09/fsspec/implementations/http.py#L242 and everywhere else requests are constructed). This allows these params to be set just for the originating request, but not for any follow-on redirects from there. This wouldn't help when using username / password for EDL (as the username / password needs to be sent for a request along the redirect path, not the originating request), but would work for using tokens (as they must be only sent to the originating request).

I think this is a fairly well scoped and small change to fsspec that would be extremely useful! I'm super swamped though, I am hoping someone else can implement this?

@yuvipanda
Author

Opened fsspec/filesystem_spec#1142 to discuss what would help solve the issue from fsspec in allowing us to use tokens!

@yuvipanda
Author

yuvipanda commented Dec 22, 2022

Turns out this already exists in fsspec - any kwargs you pass in actually get passed directly to the requests, exactly what we wanted!

So the following code works for me :)

from fsspec.implementations.http import HTTPFileSystem
url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

token = 'my-long-token'
fs = HTTPFileSystem(headers={
    "Authorization": f"Bearer {token}"
})

with fs.open(url) as f:
    print(f.read()[:30])

yay!

@yuvipanda
Author

ok, so current summary is:

  1. aio-libs/aiohttp#7131 (Support using netrc for non-proxy HTTP credentials) adds .netrc support to aiohttp, and hence to fsspec. This is needed for earthdata login access to work consistently in AWS us-west-2 with fsspec the same way it works elsewhere, while using earthdata username / password to login.
  2. However, I think we should recommend everyone use tokens for actually authenticating programmatically - https://urs.earthdata.nasa.gov/documentation/for_users/user_token. This already works with fsspec - just pass headers as a kwarg as shown in the comment above, rather than as a part of client_kwargs. yay!

Unfortunately, there is a limit of only two tokens per user in earthdata login right now, so you cannot just generate a token for each machine you would use it on, as you can with GitHub Personal Access Tokens. However, since no specific files are needed, this would also work with dask.
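Since a token needs no file on disk, one way to get it onto dask workers is an environment variable. The sketch below assumes a variable named `EARTHDATA_TOKEN`, which is a name chosen here for illustration and not something this thread establishes; `bearer_headers` is likewise a hypothetical helper.

```python
import os


def bearer_headers(env_var: str = "EARTHDATA_TOKEN") -> dict:
    """Build an Authorization header from a token stored in the environment."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to an Earthdata Login user token")
    return {"Authorization": f"Bearer {token}"}
```

The result can be passed as the `headers` kwarg to `HTTPFileSystem`, in the same way as the hard-coded token in the earlier comment.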

@yuvipanda
Author

Here is an example of it working with xarray!

from fsspec.implementations.http import HTTPFileSystem
import xarray as xr

url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

token = 'my-long-token'
fs = HTTPFileSystem(headers={
    "Authorization": f"Bearer {token}"
})
ds = xr.open_dataset(fs.open(url))
ds


@betolink
Member

betolink commented Dec 23, 2022

This is awesome @yuvipanda! I feel like we need to refactor this library to only use CMR tokens everywhere instead of monkey-patching OAuth2 redirects for cloud-hosted data. I wish DAAC hosted data would follow the same behavior with bearer tokens. i.e.

# bearer token for the win with cloud hosted data !!
# url = "https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL08/005/2018/10/14/ATL08_20181014001049_02350102_005_01.h5"

# =( bearer token? don't know him.
url = "https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL08.005/2019.02.21/ATL08_20190221121851_08410203_005_01.h5"

Also, maybe we only need one token even if we use it concurrently from different processes? I haven't tested but I suspect it should work.

@yuvipanda
Author

@betolink yeah we should only need one token even if it is used concurrently.

So the token only works for some datasets but not all? And works for cloud datasets but not on-prem? Does it work for any on prem thing at all?

@betolink
Member

I'm afraid it won't work for on-prem data, it may work for some data hosted at the ASF DAAC marked on-prem but actually hosted at AWS.

This is tremendous progress! Now there is a clear path for one of the most common access patterns!

@yuvipanda
Author

@betolink feels like long term, the right way is to get the access token to work for all data, and support the earthdatalogin folks in this mission. In the meantime, netrc is the more universal solution, once we get the aiohttp PR merged. But that is slightly messy when it comes to dask, because it requires populating a specific file in the dask worker, which is not always easy. Does that sound right?

@yuvipanda
Author

Me and @briannapagan did another bit of deep dive here, and made some more progress.

There seem to be two primary packages supporting earthdata login on the server side:

  • TEA (thin-egress-app)
  • The apache2 module (apache-urs-authentication-module)

We have established that TEA already supports bearer tokens (https://github.com/asfadmin/thin-egress-app/blob/7b0f7110b1694f553af2b71594cc19e40c179ea9/lambda/app.py#L183). But what of the apache2 module?!

As of Sep 2021, it also supports bearer tokens! https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187 is the appropriate merge commit, and we discovered an internal JIRA ticket named URSFOUR-1600 that also tracks this feature.

With some more sleuthing, we discovered https://forum.earthdata.nasa.gov/viewtopic.php?t=3290. We tracked that through looking for URSFOUR-1858, mentioned in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/8c4796c0467a1d5dcb8740fb86f23474db8258e3. That merge was the only further activity on the apache module since the merge for token support. Looking through that earthdata forum post, we see that LPDAAC (which maintains the dataset talked about there) mentions deploying 'some apache change' to help with that. So the hypothesis I had was:

  1. LPDAAC ran into some other unrelated issue,
  2. Which required code changes to the apache module, which was done via URSFOUR-1858
  3. They have deployed this change to their servers
  4. However, since this change was deployed, it is also likely that LPDAAC has included URSFOUR-1600 (user token support) in the deployment as well. Not necessarily explicitly, but just as a side effect of trying to deploy the more recent URSFOUR-1858.

I tested this hypothesis by trying to send a token to https://e4ftl01.cr.usgs.gov/ASTT/AG5KMMOH.041/2001.04.01/ASTER_GEDv4.1_A2001091.h5 - a dataset hosted by LPDAAC. And behold, it works! So all data hosted by LPDAAC supports tokens :)

So the pathway to using tokens everywhere, including on-prem, boils down to getting all the DAACs to use the latest version of the official earthdata apache2 module.

This is great news for many reasons:

  1. No new code needs to be written! This all is already done.
  2. LPDAAC already deployed this, so it isn't a brand new deployment
  3. This is the official apache module that DAACs are already using, not some newfangled software.

@yuvipanda
Author

Also, passing -v to curl prints the response headers, which usually contain < Server: Apache to indicate they are using the apache2 server - and hence most likely 'on-prem' (aka not coming from S3).

@yuvipanda
Author

NSIDC also seems to have the latest version of the apache module - https://n5eil01u.ecs.nsidc.org/DP7/ATLAS/ATL06.005/2020.03.08/ATL06_20200308234154_11190602_005_01.h5 works with the token!

So looks like some (many?) DAACs have this deployed, and some don't.

@yuvipanda
Author

@betolink in fact, the exact URL you used to test tokens earlier in #188 (comment) works now. My suspicion is that NSIDC deployed the latest version of the apache2 module very recently?

@yuvipanda
Author

ASDC also supports tokens, as tested with https://asdc.larc.nasa.gov/data/CALIPSO/LID_L2_VFM-Standard-V4-20/2010/09/CAL_LID_L2_VFM-Standard-V4-20.2010-09-01T00-14-43ZN.hdf.

Again, I'm using the presence of Server: Apache to distinguish on-prem vs S3 hosted data. I think it's reasonably accurate.
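The Server-header heuristic can be written down as a tiny classifier. This is only a sketch of the rule of thumb described here, not a reliable detector: proxies can rewrite or hide the header, and `guess_backend` is a made-up name. The `AmazonS3` value is what S3 normally reports in its own responses.

```python
def guess_backend(response_headers: dict) -> str:
    """Guess where a file is served from, based only on the Server header."""
    server = ""
    for name, value in response_headers.items():
        if name.lower() == "server":
            server = value.strip().lower()
    if server.startswith("apache"):
        return "on-prem (apache)"
    if server.startswith("amazons3"):
        return "s3"
    return "unknown"


print(guess_backend({"Server": "Apache"}))
print(guess_backend({"Server": "AmazonS3"}))
```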

@yuvipanda
Author

yuvipanda commented Jan 5, 2023

ORNL also supports it, as tested via https://daac.ornl.gov/daacdata/deltax/DeltaX_Ecogeomorphic_Products/data/DeltaX_EcoGeoCells_2021_TerrebonneEast_std_superpixels.tif.

@yuvipanda
Author

Note that uppercase Bearer is what I'm using, as that's what the apache module supports (see line 684 in https://git.earthdata.nasa.gov/projects/AAM/repos/apache-urs-authentication-module/commits/e13ddeb1c3be7767a3214191f9de31e8cc311187#mod_auth_urs.c).

@yuvipanda
Author

PO.DAAC (tested with https://podaac-tools.jpl.nasa.gov/drive/files/allData/topex/L1B/altsdr/001/altsdr001052.txt) and SEDAC (tested with https://sedac.ciesin.columbia.edu/downloads/data/urbanspatial/urbanspatial-urban-land-backscatter-time-series-1993-2020/urbanspatial-urban-land-backscatter-time-series-1993-2020-seasonal-urban-netcdf.zip) don't have the latest version of the apache module yet.

@yuvipanda
Author

yuvipanda commented Jan 6, 2023

This issue has spawned off many different things, so here's a quick summary:

1. Support using HTTPS + .netrc with xarray universally

Currently, it is not possible to use .netrc files with xarray if you are running from inside us-west-2. So if you are inside us-west-2 and want to access cloud hosted data with xarray, you must use S3 (not plain HTTPS). Once aio-libs/aiohttp#7131 lands and a new release of aiohttp is made, this issue will go away. So code that uses HTTPS+netrc will universally work, regardless of it being in us-west-2 or elsewhere.

2. Support using EDL user tokens universally

For cloud access, EDL tokens already work with xarray (#188 (comment) has an example). However, it doesn't work universally - many on-prem servers don't support EDL tokens yet, although some do. Me (from outside NASA) and @briannapagan (from inside) are pushing on this, getting EDL token support rolled out more universally. If you are inside a DAAC, we could use your help!

3. Determine when to use s3:// protocol vs https:// protocol when inside us-west-2

s3:// links only work from inside us-west-2, so we should have clear documentation on when users should use the s3:// protocol vs just https. From inside us-west-2, there could be a performance difference between these two, but my personal intuition is that it is not significant enough for many use cases, especially beginner cases. This is the part on which the least amount of work has been done so far. We would need some test cases comparing s3 vs https from inside us-west-2 to establish this performance difference.
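For the s3 vs https comparison, a minimal timing harness could look like the sketch below. It times any zero-argument callable, so the same harness can wrap an s3fs read and an HTTPS read of the same granule. `time_reads` is a hypothetical helper, and the callable in the demo is a stand-in, not a real data access.

```python
import statistics
import time


def time_reads(read_once, repeats: int = 5) -> dict:
    """Call read_once() several times and summarize wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_once()
        durations.append(time.perf_counter() - start)
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in workload; in practice this would open and read a granule.
print(time_reads(lambda: sum(range(100_000)), repeats=3))
```

Running it once with an s3:// reader and once with an https:// reader from the same us-west-2 machine would give comparable numbers, though caching between repeats can skew the later runs.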

End goal for education

My intuitive end goal here is to be able to tell people to 'use HTTPS links with token auth' universally, regardless of where they are accessing the data from, with an addendum suggesting using the s3:// protocol under specific performance circumstances. A step along the way is to be able to tell people to use HTTPS links with netrc universally.

@yuvipanda
Author

An addendum to #188 (comment) that me and @briannapagan discovered is OpenDAP, offered mostly by the Hyrax server. It also uses the apache module for authentication (https://opendap.github.io/hyrax_guide/Master_Hyrax_Guide.html#_earthdata_login_oauth2), regardless of whether it is on-prem or on the cloud. So my understanding is that all opendap behind earthdata is using the apache module, so those would also need the module updated to support the token. This also means that the apache module is going to be with us for a long time, not just for on-prem work, as it is used for cloud hosted opendap too.

@briannapagan
Contributor

Writing this as a reminder: apache requires a capital B in Bearer, which matters for on-premise files. This also works for cloud files, so we should use the following:
curl -H "Authorization: Bearer TOKEN" -L --url 'URL' > out
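For code that builds the header in Python, a small guard can enforce the capital B before the request is sent. This is a sketch only; `normalize_bearer` is a hypothetical helper and not part of any library mentioned in this thread.

```python
def normalize_bearer(value: str) -> str:
    """Return the header value with the scheme spelled exactly 'Bearer'."""
    scheme, _, token = value.strip().partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        raise ValueError("expected a value of the form 'Bearer <token>'")
    return f"Bearer {token}"


print(normalize_bearer("bearer my-long-token"))
```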

@ashiklom

May not be the right thread, but dropping a note here so it's more permanent than Slack:

A while ago, I stumbled across these Twitter threads about some climate data stored in Zarr on OpenStorageNetwork's S3 buckets with HTTP URLs. The example they show accesses Zarr directly via an HTTP URL.

https://twitter.com/charlesstern/status/1574497421245108224?s=20&t=rLvID-0c1j1NxHgy0JOCjQ
https://twitter.com/charlesstern/status/1574499938465038336?s=20&t=rLvID-0c1j1NxHgy0JOCjQ

Here's a direct link to the corresponding Pangeo feedstock (in case Twitter dies): https://pangeo-forge.org/dashboard/feedstock/79

From what I can tell, the underlying storage here is OpenStorageNetwork, which provides the S3 API via Ceph. How exactly all of this is wired and optimized is a bit beyond me, but the end result is compelling and may have some interesting lessons for how we do S3/HTTP.

@briannapagan
Contributor

Bringing in @cisaacstern to maybe provide some extra feedback to Alexey's last message.

@cisaacstern

Happy to contribute however I can! We do currently use an OSN allocation as our default storage target for Pangeo Forge.

@alexgleith

Do I have to be in the AWS us-west-2 region to access data direct from S3?

Seems so, as my notebook that doesn't work from my laptop does work when running in an EC2 instance in Oregon...

@andypbarrett
Collaborator

Hi Alex,

yes, you must have an EC2 instance running in the same region as the S3 bucket (us-west-2 for NASA data) to "Directly Access" the data.

Andy Barrett

@alexgleith

Ok, I got it working using @yuvipanda's code above.

Needs a little fix in a related project, which I've raised as a PR: nasa/EMIT-Data-Resources#24

@betolink
Member

@alexgleith just FYI, there are a few catches when we access HTTPS:// instead of S3://:

  • Speed: when we use HTTPS we are going through NASA's CloudFront proxy and opening a dataset could be slower than using the S3:// schema URLs. This is why earthaccess (this library) picks the right access pattern depending on where the code is running (us-west-2 or not).
  • Chunking affecting performance: related to the first point, if a file is chunked into hundreds of chunks, each will result in a separate HTTPS request that has to go through the proxy and some datasets will be slower to access than others because of this.

Finally, if we are running our code in us-west-2, we can use S3FS with the S3:// urls and we can use earthaccess to get us the authenticated sessions if we know the DAAC.

import earthaccess
import xarray as xr

earthaccess.login()
url = "s3://some_nasa_dataset"
fs = earthaccess.get_s3fs_session("LPDAAC")

# we open our granule in a s3fs context and we work as usual
with fs.open(url) as file:
    dataset = xr.open_dataset(file)

@alexgleith

Thanks @betolink

For the work I'm doing, it's exploratory so performance isn't important yet. And I don't think that for the NetCDF files chunking matters, since they're not optimised for it. (Happy to be corrected there!)

I'm just doing a little project on the EMIT data and there's enough complexity in the data itself that I'm happy with the HTTPS loading process. Thanks for your help!

@weiji14

weiji14 commented Jun 6, 2023

  • Speed: when we use HTTPS we are going through NASA's CloudFront proxy and opening a dataset could be slower than using the S3:// schema URLs. This is why earthaccess (this library) picks the right access pattern depending on where the code is running (us-west-2 or not).

Just linking some benchmarks from @hrodmn comparing s3:// and https:// access for a year's worth of Harmonized Landsat Sentinel-2 (HLS) data from LP-DAAC on us-west-2 at https://hrodmn.dev/posts/nasa-s3/index.html. The difference is only a fraction of a second (8.08s with s3, 7.74s with https), which is fairly small, but if earthaccess can handle switching between s3/https based on the compute region, that would be awesome!

@betolink
Member

betolink commented Jun 7, 2023

This is great @weiji14! Just this week @yuvipanda and I were talking about this and the pros and cons of defaulting to HTTPS. earthaccess handles the switch already: if it's running in AWS it will use S3, and HTTPS if not. It does this by requesting the instance metadata on an IP range only available inside AWS (although it does not check the region yet): https://github.com/nsidc/earthaccess/blob/54b688b906776f5c845483dd00676f6c681feb10/earthaccess/store.py#LL67C36-L67C36

if we request data like

import earthaccess
import xarray as xr
granules = earthaccess.search_data(...)
ds = xr.open_mfdataset(earthaccess.open(granules))

and run this code in AWS, it will use the S3 links and S3FS to open them.
On a related issue... I still notice a lot of latency when we try to open files even in region (compared to just downloading them to our EC2 instance), something that needs to be further documented. In this example with stack_stac I'm not sure if under the hood they use S3FS or not.
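Since the region is not checked yet, here is a rough sketch of what a region-aware choice could look like. It is not earthaccess's implementation: `pick_link`, the `links` dictionary, and the `compute_region` argument are all assumptions for illustration, and how the compute region gets detected is left to the caller.

```python
from typing import Optional


def pick_link(
    links: dict,
    compute_region: Optional[str] = None,
    bucket_region: str = "us-west-2",
) -> str:
    """Prefer the s3:// link only when compute runs in the bucket's region."""
    if compute_region == bucket_region and "s3" in links:
        return links["s3"]
    return links["https"]


# Placeholder links for a single granule.
links = {
    "s3": "s3://some-bucket/granule.h5",
    "https": "https://data.example.org/granule.h5",
}
print(pick_link(links, compute_region="us-west-2"))
print(pick_link(links, compute_region="us-east-1"))
print(pick_link(links))
```

Anything outside the bucket's region, including a laptop where no region is known, falls back to HTTPS.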

@yuvipanda
Author

Just to note, aiohttp finally made a release! So fsspec now supports netrc correctly!

@mfisher87 mfisher87 added documentation Improvements or additions to documentation and removed documentation Improvements or additions to documentation labels Mar 1, 2024
@mfisher87
Collaborator

mfisher87 commented Mar 1, 2024

Sweet! Should we add a pin for aiohttp and mark this resolved @betolink ?

@asteiker
Member

asteiker commented Oct 29, 2024

@yuvipanda @betolink Can this be fully closed out now? Do we still need to pin this? We weren't seeing this referenced in https://github.com/nsidc/earthaccess/blob/main/pyproject.toml

10 participants