Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with get_collections #320

Closed
system123 opened this issue Sep 21, 2022 · 9 comments · Fixed by #480
Closed

Issue with get_collections #320

system123 opened this issue Sep 21, 2022 · 9 comments · Fixed by #480
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@system123
Copy link

I am seeing a problem where if I call get_collections directly it is very slow and only the first page of the collections are returned. If I reimplement the get_collections function it runs almost instantaneously and returns the entire list of collections.

Here is a minimal example:

This takes almost 30 seconds to run and returns only the first page of collections (10 in total).

from pystac_client import Client

STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
catalog = Client.open(f'{STAC_URL}/OB_DAAC')

for c in catalog.get_collections():
    print(c)

This is the re-implementation which runs as expected:

from pystac_client import Client, CollectionClient 
from pystac_client._utils import Modifiable, call_modifier 

STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
catalog = Client.open(f'{STAC_URL}/OB_DAAC')

def get_collections(catalog):
    url = f"{catalog.get_self_href()}/collections"

    for page in catalog._stac_io.get_pages(url):
        if "collections" not in page:
            continue

        for col in page["collections"]:
            collection = CollectionClient.from_dict(
                            col, root=catalog, modifier=catalog.modifier
                        )
            call_modifier(catalog.modifier, collection)
            yield collection

for col in get_collections(catalog):
    print(col)

This implementation is basically copy pasted from the code, so I am not sure why calling it directly on the class doesn't provide the same performance or output.

@TomAugspurger
Copy link
Collaborator

TomAugspurger commented Sep 21, 2022

I think that https://cmr.earthdata.nasa.gov/stac/OB_DAAC is missing a /collections in its conformsTo, so pystac-client doesn't know that the /collections endpoint exists.

If "https://api.stacspec.org/v1.0.0-rc.1/collections", were added to that conformsTo, then things should be as fast as expected. @system123 do you know who maintains that endpoint? (or perhaps @gadomski knows?)

@gadomski
Copy link
Member

gadomski commented Sep 21, 2022

For APIs that don't advertise their collections endpoint, you can use the ignore_conformance parameter (or --ignore-conformance on the command line). This instructs the Client to assume a fully-featured API, even if conformance links are not provided.

I tested this out on the command line and here's the results. First, without --ignore-conformance:

$ time stac-client collections https://cmr.earthdata.nasa.gov/stac/OB_DAAC | jq '. | length'
10
stac-client collections https://cmr.earthdata.nasa.gov/stac/OB_DAAC  0.35s user 0.09s system 0% cpu 59.677 total

Then, with --ignore-conformance:

$ time stac-client collections --ignore-conformance https://cmr.earthdata.nasa.gov/stac/OB_DAAC | jq '. | length'
897
stac-client collections --ignore-conformance   1.33s user 0.13s system 5% cpu 28.014 total

It may be reasonable for pystac-client to (e.g.) throw a warning when iterating collections without the collections endpoint, since this is surprising but not error-worthy behavior (IMO). @philvarner @matthewhanson what do you think?

EDIT: @TomAugspurger we may maintain that endpoint but I don't know, I'm asking around.

@gadomski
Copy link
Member

@TomAugspurger that endpoint is maintained by the NASA CMR team.

@TomAugspurger
Copy link
Collaborator

Thanks. I opened nasa/cmr-stac#236 to track that. We can leave this open to address whether pystac-client should warn when taking the slow path. I don't have a strong preference, as long as there's a relatively easy way to silence those warnings (like a parameter when creating the catalog).

@philvarner
Copy link
Collaborator

I think a warning would be a good idea. Only implementing Core & Item Search happens, so calling collections on such an API could be surprisingly slow for the user if there's a deeply-nested tree of collections.

@system123
Copy link
Author

@gadomski Thanks, adding ignore_conformance=True solved the issue for me – although I do think having a warning suggesting to use ignore_conformance=True could be useful as I missed this in the documentation and the fact that the collections endpoint does exist the slower performance seemed strange.

@TomAugspurger Thanks for opening an issue on NASA CMR.

@philvarner
Copy link
Collaborator

philvarner commented Sep 22, 2022

I'll also say that the meaning of ignore_conformance is a bit confusing, since that doesn't point to the behavior that happens. Maybe something more like use_collections_endpoint would be clearer.

Obviously some issues with changing the semantics now, but I think these heuristic behaviors (like /collections vs child links) are problematic because they do things that aren't clear to the user. If we didn't already have the behavior, I'd say that it should use the Features or Collections conformance class by default or if there's a data link rel but those aren't advertised, but require the user to set a flag(s) like use_collections_endpoint=True or browse_for_collections=True otherwise. That way, it's explicit that the client is using a slower behavior.

@gadomski
Copy link
Member

Maybe something more like use_collections_endpoint would be clearer.

Agreed. In general, I'd like to see granular configuration (probably in a config-style object) where you can enable/disable(/warn/error) various heuristics, since this type of issue seems to keep popping up (e.g. related to #136 you could force the client to use the items endpoint even if its not advertised).

@gadomski gadomski added the enhancement New feature or request label Jan 13, 2023
@gadomski gadomski added this to the 0.5.2 milestone Jan 19, 2023
@gadomski gadomski self-assigned this Jan 19, 2023
@gadomski gadomski removed this from the 0.5.2 milestone Jan 24, 2023
@gadomski gadomski added this to the 0.7.0 milestone Feb 6, 2023
This was referenced Apr 14, 2023
@jsignell
Copy link
Member

jsignell commented May 4, 2023

Ok just to make sure this comes full circle now that #480 is in (and there should probably be more docs on this).

First you try running it naively:

In [1]: from pystac_client import Client
   ...: 
   ...: STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
   ...: catalog = Client.open(f'{STAC_URL}/OB_DAAC')
   ...: 
   ...: for c in catalog.get_collections():
   ...:     print(c)
   ...: 
/home/jsignell/pystac-client/pystac_client/client.py:399: DoesNotConformTo: Server does not conform to COLLECTIONS, FEATURES
  self._warn_about_fallback("COLLECTIONS", "FEATURES")
/home/jsignell/pystac-client/pystac_client/client.py:399: FallbackToPystac: Falling back to pystac. This might be slow.
  self._warn_about_fallback("COLLECTIONS", "FEATURES")
<CollectionClient id=Turbid9.v0>
<CollectionClient id=GreenBay.v0>
<CollectionClient id=Catlin_Arctic_Survey.v0>
... # this goes slowly

Then you notice the warnings about conformance and falling back to pystac so you do add_conforms_to

In [2]: from pystac_client import Client
   ...: 
   ...: STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
   ...: catalog = Client.open(f'{STAC_URL}/OB_DAAC')
   ...: catalog.add_conforms_to("COLLECTIONS")
   ...: 
   ...: for c in catalog.get_collections():
   ...:     print(c)
   ...: 
/home/jsignell/pystac-client/pystac_client/client.py:601: MissingLink: No link with rel='data' could be found on this Client.
  href = self._get_href("data", data_link, "collections")
<CollectionClient id=Turbid9.v0>
<CollectionClient id=GreenBay.v0>
<CollectionClient id=Catlin_Arctic_Survey.v0>
... # this goes fast

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants