
Performance to construct a large ItemCollection #49

Closed · TomAugspurger opened this issue May 7, 2021 · 1 comment

@TomAugspurger (Collaborator):
This issue documents some slowness on moderately large queries. In the snippet below we fetch 8,234 items. It takes about a minute to construct the results.

import rasterio.features
import pystac_client

area_of_interest = {
    "type": "Polygon",
    "coordinates": [
        [
            [-123.46435546875, 46.4605655457854],
            [-119.608154296875, 46.4605655457854],
            [-119.608154296875, 48.26125565204099],
            [-123.46435546875, 48.26125565204099],
            [-123.46435546875, 46.4605655457854],
        ]
    ],
}
bbox = rasterio.features.bounds(area_of_interest)
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

search = stac.search(
    bbox=bbox,
    datetime="2016-01-01/2020-12-31",
    collections=["sentinel-2-l2a"],
    limit=2500,  # fetch items in batches of 2500
)

print(search.matched())  # 8234
items = list(search.items())

I ran list(search.items()) under snakeviz and came up with this result: https://gistcdn.rawgit.org/TomAugspurger/fb5b3bde8cee09d2d9aa2f7215edf2b2/94e4ec2ae97bec2169f9263e8f41183418e885d9/mosaic-static.html
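For reference, the snakeviz view above comes from a cProfile run. A minimal self-contained sketch of that workflow, with a dummy `build_items` function standing in for `list(search.items())` so it runs offline:

```python
import cProfile
import io
import pstats

def build_items(n):
    # Stand-in for list(search.items()); profile the real call the same way.
    return [{"id": f"item-{i}", "properties": {}} for i in range(n)]

profiler = cProfile.Profile()
profiler.enable()
items = build_items(8234)
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For the interactive view, you can instead dump the stats to a file (`python -m cProfile -o out.prof script.py`) and open it with `snakeviz out.prof`.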

A few notes:

  1. We spend roughly two-thirds of our time in stac_io.get_pages, which includes I/O: waiting on the endpoint (and maybe parsing the JSON into Python objects?)
  2. We spend the other third of our time in item_collection.from_dict

Some ideas for optimization:

  1. Most of the time in item_collection.from_dict is spent on a deepcopy in pystac.Item.from_dict. It might be safe to skip that copy (since these dicts come straight off the network with no other references) and add a copy=False flag to pystac.Item.from_dict, allowing it to mutate the incoming dict.
  2. Maybe pystac_client.Client or .search could provide a raw=True/False flag to allow skipping constructing pystac Items?
  3. Maybe some kind of async magic would speed up the reads? Hard to say, since I don't know how much time is spent waiting for results vs. parsing JSON. I don't know if it's a good idea to parse JSON on the asyncio event loop.
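To illustrate idea 1, here is a hedged sketch of what such a copy flag might look like. `item_from_dict` and its `copy_input` parameter are hypothetical stand-ins, not actual pystac API:

```python
import copy

def item_from_dict(d, copy_input=True):
    # Hypothetical stand-in for pystac.Item.from_dict, which today always
    # deepcopies its input. A copy flag would let callers that own the dict
    # (e.g. JSON freshly parsed off the network) skip the copy.
    source = copy.deepcopy(d) if copy_input else d
    source["normalized"] = True  # stand-in for the normalization from_dict performs
    return source

raw = {"id": "item-1", "properties": {}}
safe = item_from_dict(raw)                    # raw is left untouched
fast = item_from_dict(raw, copy_input=False)  # mutates raw; no deepcopy cost
```

The trade-off is the usual one: skipping the copy means the caller must not reuse the input dict afterwards, which is why it would have to be opt-in.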
@matthewhanson (Member):

Thanks for this @TomAugspurger, I think you're right that the deepcopy can probably be skipped.
I also like the idea of optionally not converting to PySTAC Items, or separating that step out; this would be useful for the case of just fetching and saving the results.

Async could help, but it would require some sweeping changes in PySTAC and pystac-client (there's an experimental async branch of PySTAC, but it's outdated now). More importantly, the STAC spec currently only requires next links for paging, so pages have to be fetched sequentially.
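A sketch of why paging is inherently sequential: each page's rel="next" link is only known after that page has been fetched. The `fetch` callable and the in-memory pages below are illustrative, not pystac-client internals:

```python
def get_pages(fetch, first_url):
    # Follow rel="next" links one at a time; the next URL is only known
    # after the current page arrives, so requests can't be parallelized.
    url = first_url
    while url is not None:
        page = fetch(url)
        yield page
        url = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )

# Tiny in-memory stand-in for a paged STAC search endpoint.
pages = {
    "/search?page=1": {"features": ["a", "b"],
                       "links": [{"rel": "next", "href": "/search?page=2"}]},
    "/search?page=2": {"features": ["c"], "links": []},
}
features = [f for page in get_pages(pages.__getitem__, "/search?page=1")
            for f in page["features"]]
```

Async could still overlap JSON parsing with waiting on the next response, but it cannot issue the page requests concurrently under this linking scheme.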

I'll look into the other suggestions; I'll be doing some work on this next week.
