
Performance to construct a large ItemCollection #49

Closed · TomAugspurger opened this issue May 7, 2021 · 1 comment

@TomAugspurger (Collaborator):
This issue documents some slowness on moderately large queries. In the snippet below we fetch 8,234 items. It takes about a minute to construct the results.

import rasterio.features
import pystac_client

area_of_interest = {
    "type": "Polygon",
    "coordinates": [
        [
            [-123.46435546875, 46.4605655457854],
            [-119.608154296875, 46.4605655457854],
            [-119.608154296875, 48.26125565204099],
            [-123.46435546875, 48.26125565204099],
            [-123.46435546875, 46.4605655457854],
        ]
    ],
}
bbox = rasterio.features.bounds(area_of_interest)
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

search = stac.search(
    bbox=bbox,
    datetime="2016-01-01/2020-12-31",
    collections=["sentinel-2-l2a"],
    limit=2500,  # fetch items in batches of 2500
)

print(search.matched())  # 8234
items = list(search.items())

I ran list(search.items()) under snakeviz and came up with this result: https://gistcdn.rawgit.org/TomAugspurger/fb5b3bde8cee09d2d9aa2f7215edf2b2/94e4ec2ae97bec2169f9263e8f41183418e885d9/mosaic-static.html
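For reference, the snakeviz view above comes from a cProfile run. A minimal self-contained sketch of that workflow, with a dummy `build_items` function standing in for `list(search.items())` so it runs offline:

```python
import cProfile
import io
import pstats

def build_items(n):
    # Stand-in for list(search.items()); profile the real call the same way.
    return [{"id": f"item-{i}", "properties": {}} for i in range(n)]

profiler = cProfile.Profile()
profiler.enable()
items = build_items(8234)
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For the interactive view, you can instead dump the stats to a file (`python -m cProfile -o out.prof script.py`) and open it with `snakeviz out.prof`.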

A few notes:

  1. We spend roughly two-thirds of our time in stac_io.get_pages, which includes I/O: waiting on the endpoint (and maybe parsing the JSON into Python objects?)
  2. We spend the other third of our time in item_collection.from_dict

Some ideas for optimization:

  1. Most of the time in item_collection.from_dict is spent on a deepcopy in pystac.Item.from_dict. It might be safe to skip that copy (since these dicts come straight off the network with no other references) and add a copy=False flag to pystac.Item.from_dict, allowing it to mutate the incoming dict.
  2. Maybe pystac_client.Client or .search could provide a raw=True/False flag to allow skipping constructing pystac Items?
  3. Maybe some kind of async magic would speed up the reads? Hard to say, since I don't know how much time is spent waiting for results vs. parsing JSON. I don't know if it's a good idea to parse JSON on the asyncio event loop.
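To illustrate idea 1, here is a hedged sketch of what such a copy flag might look like. `item_from_dict` and its `copy_input` parameter are hypothetical stand-ins, not actual pystac API:

```python
import copy

def item_from_dict(d, copy_input=True):
    # Hypothetical stand-in for pystac.Item.from_dict, which today always
    # deepcopies its input. A copy flag would let callers that own the dict
    # (e.g. JSON freshly parsed off the network) skip the copy.
    source = copy.deepcopy(d) if copy_input else d
    source["normalized"] = True  # stand-in for the normalization from_dict performs
    return source

raw = {"id": "item-1", "properties": {}}
safe = item_from_dict(raw)                    # raw is left untouched
fast = item_from_dict(raw, copy_input=False)  # mutates raw; no deepcopy cost
```

The trade-off is the usual one: skipping the copy means the caller must not reuse the input dict afterwards, which is why it would have to be opt-in.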
@matthewhanson (Member):

Thanks for this @TomAugspurger, I think you're right that the deepcopy can probably be skipped.
I also like the idea of optionally not converting to PySTAC Items, or separating that step out; this would be useful for the case of just fetching and saving the results.

Async could help, but it would require some sweeping changes in PySTAC and pystac-client (there's an experimental async branch of PySTAC, but it's outdated now). More importantly, the STAC spec currently only requires next links for paging, so pages have to be fetched sequentially.
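A sketch of why paging is inherently sequential: each page's rel="next" link is only known after that page has been fetched. The `fetch` callable and the in-memory pages below are illustrative, not pystac-client internals:

```python
def get_pages(fetch, first_url):
    # Follow rel="next" links one at a time; the next URL is only known
    # after the current page arrives, so requests can't be parallelized.
    url = first_url
    while url is not None:
        page = fetch(url)
        yield page
        url = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )

# Tiny in-memory stand-in for a paged STAC search endpoint.
pages = {
    "/search?page=1": {"features": ["a", "b"],
                       "links": [{"rel": "next", "href": "/search?page=2"}]},
    "/search?page=2": {"features": ["c"], "links": []},
}
features = [f for page in get_pages(pages.__getitem__, "/search?page=1")
            for f in page["features"]]
```

Async could still overlap JSON parsing with waiting on the next response, but it cannot issue the page requests concurrently under this linking scheme.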

I'll look into the other suggestions; I'll be doing some work on this next week.
