Factorize client interface and provide "download all in range" #64

LucHermitte · 2024-09-10T15:00:21Z

This PR is quite big, I had to make some choices while trying to leave the current API as it is. More on the topic below.

The main changes I propose are:

a new method to query the list of all the files that intersect a time range -- the current API only permits to request a set of files given a finite and exhaustive list of datetimes and missions
an harmonization of ASFClient and DataspaceClient by:
- introducing a root class Client that factorizes a common interface for querying and authenticating
a new design:
- clients instances are not authenticated but they permit to query what is available on their respective servers
- when we call the authenticate() method on each clients we obtain a session object that provides the unique download_all method

A few smaller changes are included:

.netrc file is now either explicitly given as parameter, or obtained from $NETRC environment variable (name already used by other tools), or defaulted to ~/.netrc
I've added typing information in quite a few places (to silence my LSP client, and to remove possible errors)
Tests of ASFClient API avoid to query the list of available files again and again. Instead, a cached baseline is used. And only one test queries the lists of available files, and makes sure the new list contains that the cached baseline. EDIT: at this point I missed the usage of VCR. Yet I think we don't need multiple and redundant slow requests when recording the cassettes.
Cached EOF in ASFClient are no longer static and global, but local to each client instance -- this way we know exactly what can be expected from each test, and we can compare the EOF list in the cache from the list of remote EOFs available on ASF server.
I've introduced an enum OrbitType to avoid typo errors and assertions to silence pylint.

Finally, there are things that are half-way done as I didn't want to completely modify the current API in case you were explicitly relying on it.

Query methods return objects of different types List[str] in ASF case and List[dict] in Copernicus dataspace case. Ideally, we would return a List[SentinelOrbit] which all download*() methods would know how to convert into something that makes sense at their level
I have the feeling both clients did not handle the T0/T1 offsets in the same way. I'd recommend to centralize this at Client level, but I'd prefer to know exactly how you want it done
The implementation of the parallel download isn't the same in both cases. As I don't know which approach was the one you prefer, I haven't tried to factorize anything. IMO, download_all should be at Session level, and call download_one which would be specialized for each client.

There is no need to authenticate is order to search for available products

So far, `query_orbit` and `query_orbit_by_dt` (on Dataspace side) return only one orbit file: the one that contains the requested time range. The new `query_orbits_by_dt_range` method returns all orbit file contained in the requested time range. In order to factorize code, internal `_QueryOrbitFile` classes are defined.

EOF files on ASF server are listed once only in tests. Ideally this should have been mocked, but interrogation is still kept. Other tests will rely on a snapshot of precise orbit files that is cached. A test dedicated to testing the use of the cached baseline has been defined.

...instead of global and shared among all ASFClient instances.

And factorize out the creation of CDSE Access Token in order to to be able to inject 2FA token from the command line. In the end, no token nor secrets are recorded.

This will concern tests in test_eof.py. DB searching is tested in test_asf_client.py

LucHermitte · 2024-09-11T15:10:42Z

Also. I've curated the cassettes from all authentication related secrets.

The strange design in test_eof around cdse_access_token is to support the recording and the usage of cassettes with or without 2FA activated on CDSE account.

scottstanie · 2024-09-12T15:06:29Z

This is quite a PR @LucHermitte ! it might take a a little awhile to get through it all, but it looks like a lot of improvements. my only initial questions are

I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used one in the code (and once in a test)
I've been reminded by ASF that they now have an easier, password-free storage of orbits on S3 :

 aws --no-sign-request s3 ls s3://s1-orbits/AUX_POEORB/ | head -3
2024-07-26 22:04:55    4409614 S1A_OPER_AUX_POEORB_OPOD_20210203T122423_V20210113T225942_20210115T005942.EOF
2024-07-26 22:05:06    4409601 S1A_OPER_AUX_POEORB_OPOD_20210204T122413_V20210114T225942_20210116T005942.EOF
2024-07-26 22:06:36    4409874 S1A_OPER_AUX_POEORB_OPOD_20210205T122416_V20210115T225942_20210117T005942.EOF

do you think the interface changes would still accommodate that addition (which I expect to start using as my own default, since not needing a password is nice, and it'll go much faster on EC2 instances)?

LucHermitte · 2024-09-12T18:01:28Z

it might take a a little awhile to get through it all

Of course. I know. :(

I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used one in the code (and once in a test)

In the tests definitively. I guess set should be enough. In the code however it's quite practical as I didn't want to add more complexity with a hash on SentinelOrbit (and use set). Also may be there is a way to sort on one criteria and apply a kind of uniq function on another criteria (given a priority order). If you know/see of a standard way of doing it, we could easily remove the dependency here. Any idea?

I've been reminded by ASF that they now have an easier, password-free storage of orbits on S3 [...] do you think the interface changes would still accommodate that addition (which I expect to start using as my own default, since not needing a password is nice, and it'll go much faster on EC2 instances)?

Quite certainly. The idea behind the changes was to keep a single and unique interface which is quite simple:

search according to one criteria or another
download the references found

It should even be possible to add a new S3ASFClient that has the same interface and which doesn't uses a cache. This could be a wrapper around boto3.

Also, IIRC, I've seen another REST based interface to request references to filenames on earthdata server. As I didn't want to bring more changes, I didn't try to replace the current approach. I didn't have much time to investigate if there is indeed another way to search for files according to whatever time and mission based criteria.

scottstanie · 2024-09-12T18:12:47Z

I'm wondering if there's a way to do that orbit selection without adding the new dependency of sortedcontainers, since it looks like that's only used one in the code (and once in a test)

In the tests definitively. I guess set should be enough. In the code however it's quite practical as I didn't want to add more complexity with a hash on SentinelOrbit (and use set). Also may be there is a way to sort on one criteria and apply a kind of uniq function on another criteria (given a priority order). If you know/see of a standard way of doing it, we could easily remove the dependency here. Any idea?

what about something like this using groupby (not fully tested)

    candidates = [
        item
        for item in data
        if item.start_time <= (t0 - margin0) and item.stop_time >= (t1 + margin1)
    ]

    if not candidates:
        raise ValidityError(
            "none of the input products completely covers the requested "
            "time interval: [t0={}, t1={}]".format(t0, t1)
        )

    # Sort candidates by all attributes except created_time
    candidates.sort(key=lambda x: (x.start_time, x.stop_time, x.id))

    # Group by everything except created_time and select the one with the latest created_time
    result = []
    for _, group in groupby(candidates, key=lambda x: (x.start_time, x.stop_time, x.id)):
        result.append(max(group, key=lambda x: x.created_time))

    return result

(unsure if that's the exact key we want to group on)

LucHermitte · 2024-09-13T16:01:52Z

I've removed the dependency to sortedcontainers. Using sorted() instead of SortedList is a bit slower on my machine : 40ms. This is negligible.

Regarding my manual implementation of uniq, it does it's job is even less time (2ms). As I'm not used to group_by and as there is no id field in SentinelOrbit, I let it as it is for now.

(I can join my flawed benchmarking test if you're interested -- I don't think it makes sense in sentineleof code base)

Luc Hermitte added 22 commits September 6, 2024 13:57

WIP: factorize client interface

cc425ff

Use $NETRC (before ~/.netrc) if no file is given

a6c6f1f

Add typing notations

bca24fe

Authenticate only to download

1c2667a

There is no need to authenticate is order to search for available products

Update ASF test

28cd506

Fix typing notations & some pylint warnings

b94de9a

Separate search from authenticated download w/ ASF

4023d85

Improve/fix typing

f907cf4

Implement query all eofs that intersect time range

350b014

Make the EOF cache local to easu ASF client

d4b9804

...instead of global and shared among all ASFClient instances.

Test ASF-query of all EOFs intersecting time range

d3551ad

Address some of pylint/mypy remarks

56c2088

Declare dependency to sortedcontainers

7f042ce

Declare dependency to sortedcontainers (2)

c014cb4

Update cassettes without secrets

699cfbf

Remove secrets from cassettes

2d3c2c7

Update secret curated cassettes

738b883

Prevent cassettes from recording secrets

4bf239c

And factorize out the creation of CDSE Access Token in order to to be able to inject 2FA token from the command line. In the end, no token nor secrets are recorded.

Use pre-recorded baseline for search ASF EOF files

962e539

This will concern tests in test_eof.py. DB searching is tested in test_asf_client.py

Use pre-recorded baseline for search ASF EOF files

ea7da1e

Luc Hermitte added 2 commits September 13, 2024 12:50

Use logger for debug logs

892179e

Remove dependency to sortedcontainers

192e627

Don't apply offsets when searching range w/ ASF

8b4c5f4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Factorize client interface and provide "download all in range" #64

Factorize client interface and provide "download all in range" #64

LucHermitte commented Sep 10, 2024 •

edited

Loading

LucHermitte commented Sep 11, 2024

scottstanie commented Sep 12, 2024

LucHermitte commented Sep 12, 2024

scottstanie commented Sep 12, 2024

LucHermitte commented Sep 13, 2024

Factorize client interface and provide "download all in range" #64

Are you sure you want to change the base?

Factorize client interface and provide "download all in range" #64

Conversation

LucHermitte commented Sep 10, 2024 • edited Loading

LucHermitte commented Sep 11, 2024

scottstanie commented Sep 12, 2024

LucHermitte commented Sep 12, 2024

scottstanie commented Sep 12, 2024

LucHermitte commented Sep 13, 2024

LucHermitte commented Sep 10, 2024 •

edited

Loading