Slowness in `Item.to_dict()`, seemingly from links #546

TomAugspurger · 2021-07-09T13:38:27Z

I'm trying to diagnose a slowdown I'm observing in pystac.to_dict() is slower. Right now I'm unsure if the slowdown is due to a change in pystac or in our STAC endpoint. Anyway, if I call clear_links() prior to to_dict() then things are faster (goes from about 1s to 200µs)

In [1]: import pystac

In [2]: item = pystac.Item.from_file("https://planetarycomputer-staging.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20160208T190542_R0
   ...: 13_T10TET")
   ...: %time _ = item.to_dict()
CPU times: user 25 ms, sys: 339 µs, total: 25.4 ms
Wall time: 987 ms

In [3]: item = pystac.Item.from_file("https://planetarycomputer-staging.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20160208T190542_R0
   ...: 13_T10TET")
   ...: item.clear_links()
   ...: %time _ = item.to_dict()
CPU times: user 167 µs, sys: 18 µs, total: 185 µs
Wall time: 192 µs

So a couple questions:

Why is having those links present slowing things down? Glancing at the output of to_dict, I don't see why it would need to make an HTTP call or anything like that, right? Maybe through some chain calling link.get_href()?
Have there been any recent changes that would have introduced this slowdown? We've also been making changes to that STAC endpoint that could have slowed things down. Brief testing shows that RC1 - RC3 all have similar performance on this same endpoint (about 1s to do to_dict()
to_dict() has a parameter include_self_link. Could we add a parameter to exclude all links? For my application, I only need the assets and extension info, so making requests to fetch the link

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2021-07-09T13:49:03Z

All the time is seemingly spent on Item.get_root()

link = item.links[0]
%lprun -f pystac.link.Link.get_href link.get_href()

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   112                                               def get_href(self) -> Optional[str]:
   121                                                   # get the self href
   122         1          8.0      8.0      0.0          if self.is_resolved():
   123                                                       href = cast(pystac.STACObject, self.target).get_self_href()
   124                                                   else:
   125         1         31.0     31.0      0.0              href = cast(Optional[str], self.target)
   126                                           
   127         1    1650663.0 1650663.0    100.0          if href and is_absolute_href(href) and self.owner and self.owner.get_root():

which seemingly does the (slow) I/O

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   187                                               def get_root(self) -> Optional["Catalog_Type"]:
   196         1         15.0     15.0      0.0          root_link = self.get_root_link()
   197         1          1.0      1.0      0.0          if root_link:
   198         1          2.0      2.0      0.0              if not root_link.is_resolved():
   199         1    1651180.0 1651180.0    100.0                  root_link.resolve_stac_object()

I notice that the (slow) conditional above is followed by another conditional at

pystac/pystac/link.py

Lines 134 to 137 in 12eff70

    
           if root and root.is_relative() and self.rel in rel_links: 
        
               owner_href = self.owner.get_self_href() 
        
               if owner_href is not None: 
        
                   href = make_relative_href(href, owner_href)

, and only if that's true do we actually do anything to the output. In our case, https://planetarycomputer-staging.microsoft.com/api/stac/v1/ is not relative, so we don't actually do anything. I'm not sure how the item / link could know that without actually reading the Catalog though...

There was existing unused arguments for seting the root on to_dict. This commit implements and tests that the root is set to the parameter if present, and adds the parameter to ItemCollection, which itself does not have a root but instead passes the root onto each of the Items parsed during the to_dict call. Related to #546

lossyrob · 2021-07-09T22:13:29Z

For this test:

import pystac
from pystac_client import Client
import planetary_computer as pc

from time import perf_counter

stac = Client.open("https://planetarycomputer-staging.microsoft.com/api/stac/v1")

area_of_interest = {
    "type": "Polygon",
    "coordinates": [
        [
            [-122.27508544921875, 47.54687159892238],
            [-121.96128845214844, 47.54687159892238],
            [-121.96128845214844, 47.745787772920934],
            [-122.27508544921875, 47.745787772920934],
            [-122.27508544921875, 47.54687159892238],
        ]
    ],
}

search = stac.search(
    intersects=area_of_interest,
    datetime="2016-01-01/2020-12-31",
    collections=["sentinel-2-l2a"],
    query={
        'eo:cloud_cover': { 'le': 25 }
    },
    limit=500,  # fetch items in batches of 500
)
t1 = perf_counter()
items = list(search.get_items())
print(f"Found {len(items)} items")
t2 = perf_counter()
signed_items = pystac.ItemCollection(
    [pc.sign(item) for item in items]
)
t3 = perf_counter()
dicts = [item.to_dict() for item in items]
t4 = perf_counter()

print(f"Search took {t2-t1} seconds")
print(f"Sign took {t3-t2} seconds")
print(f"to_dict took {t4-t2} seconds")

Without the above PRs:

Search took 1.8861020999975153 seconds
Sign took 2.654093600001943 seconds
to_dict took 123.35427039999922 seconds

With them:

Search took 2.9420552999981737 seconds
Sign took 1.4421263000003819 seconds
to_dict took 1.4572951000009198 seconds

TomAugspurger · 2021-07-12T19:36:56Z

Fixed by #549 + stac-utils/pystac-client#72 (I think).

gjoseph92 · 2021-08-30T19:52:00Z

I'm still experiencing extreme to_dict() slowness for the same reason (get_root()) with pystac== 1.1.0 and pystac_client== 0.2.0.

Is that expected? I thought both of those changes had been released now.

FYI @TomAugspurger I was going to release a new stackstac version today with your and @scottyhq's pystac updates, but I'm going to hold off until this is addressed because waiting multiple minutes for stackstac.stack is pretty poor UX.

TomAugspurger · 2021-08-30T20:07:03Z

You're correct, both of these changes should be available in the versions of pystac & pystac-client you posted.

With the https://planetarycomputer.microsoft.com/api/stac/v1 endpoint, things are better, but still not great:

%%time
catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
items = catalog.search(
    intersects=dict(type="Point", coordinates=[-106, 35.7]),
    collections=["sentinel-2-l2a"],
    datetime="2019-01-01/2020-01-01"
).get_all_items()

CPU times: user 396 ms, sys: 19.6 ms, total: 415 ms
Wall time: 1.36 s

(that gives 281 items instead of 294, dunno why)

and

%%time
dict_items = [item.to_dict() for item in items]

CPU times: user 3.95 s, sys: 243 ms, total: 4.2 s
Wall time: 19.2 s

Most of the time seems to be spent in item.to_links via item.to_dict() (this is with an item from the planterycomputer endpoint)

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   288                                               def to_dict(self, include_self_link: bool = True) -> Dict[str, Any]:
   289         1          3.0      3.0      0.3          links = self.links
   290         1          2.0      2.0      0.2          if not include_self_link:
   291                                                       links = [x for x in links if x.rel != pystac.RelType.SELF]
   292                                           
   293         1        224.0    224.0     25.2          assets = {k: v.to_dict() for k, v in self.assets.items()}
   294                                           
   295         1          2.0      2.0      0.2          if self.datetime is not None:
   296         1         43.0     43.0      4.8              self.properties["datetime"] = datetime_to_str(self.datetime)
   297                                                   else:
   298                                                       self.properties["datetime"] = None
   299                                           
   300         1          4.0      4.0      0.4          d: Dict[str, Any] = {
   301         1          4.0      4.0      0.4              "type": "Feature",
   302         1         42.0     42.0      4.7              "stac_version": pystac.get_stac_version(),
   303         1          5.0      5.0      0.6              "id": self.id,
   304         1          1.0      1.0      0.1              "properties": self.properties,
   305         1          1.0      1.0      0.1              "geometry": self.geometry,
   306         1        531.0    531.0     59.7              "links": [link.to_dict() for link in links],
   307         1          3.0      3.0      0.3              "assets": assets,
   308                                                   }
   309                                           
   310         1          4.0      4.0      0.4          if self.bbox is not None:
   311         1          3.0      3.0      0.3              d["bbox"] = self.bbox
   312                                           
   313         1          4.0      4.0      0.4          if self.stac_extensions is not None:
   314         1          1.0      1.0      0.1              d["stac_extensions"] = self.stac_extensions
   315                                           
   316         1          2.0      2.0      0.2          if self.collection_id:
   317         1          1.0      1.0      0.1              d["collection"] = self.collection_id
   318                                           
   319         2          5.0      2.5      0.6          for key in self.extra_fields:
   320         1          4.0      4.0      0.4              d[key] = self.extra_fields[key]
   321                                           
   322         1          1.0      1.0      0.1          return d

TomAugspurger · 2021-08-30T20:18:31Z

It's definitely related to link resolution though. If you clear the links on the earth-search endpoint, things look just fine in to_dict()

items2 = []
for item in items:
    item = item.clone()
    item.clear_links()
    items2.append(item)
    
ic2 = pystac.ItemCollection(items2)
%%time dict_items = [item.to_dict() for item in a_items[:20]]

CPU times: user 13.4 ms, sys: 3.77 ms, total: 17.2 ms
Wall time: 16.6 ms

I wonder what's best here. At a minimum, there's probably some work to do in pystac to speed up to_dict on links if possible. But since stackstac doesn't use links at all (I think), should it clear the links on pystac.ItemCollection / List[pystac.Item] inputs (at the cost of cloning the items)?

gjoseph92 · 2021-08-30T20:22:51Z

Yeah, I was thinking the same. Clearing links would be reasonable now, but I'm also a little bit worried about the Asset.to_dict() time from your profile.

I haven't looked yet but I'm curious

What links is pystac resolving?
Why does it need to be resolved? (I would have thought these collections used absolute references, at least for assets)
Is it a different link every time, or the same thing (root collection?) that could be cached?

TomAugspurger · 2021-08-30T20:30:01Z

All good questions, and I don't have a good sense for any of them, but I can try!

At a minimum, I think links to the root collection and self are common (required?)
I think it can be either relative or absolute. I think all of ours are absolute though (maybe that could be a bit set on Item / ItemCollection?)
The root collection / catalog will be common across all items, so there's a potential for an optimization there (if you're willing to cache that somewhere)

I made a snakeviz profile at https://gistcdn.githack.com/TomAugspurger/66c6084c0898c1c713e54163fc7769d2/raw/2282976c21185c38731d162878aa033075263821/item_to_dict_static.html (LMK if you need help reading that. https://jiffyclub.github.io/snakeviz/#interpreting-results has an explanation)

import pystac, pystac_client
%load_ext snakeviz

url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2B_MSIL2A_20191230T174729_R098_T13SDV_20201003T112518"
item = pystac.read_file(url)

%snakeviz item.to_dict()

@duckontheweb do you or anyone else have time / interest to look into this?

duckontheweb · 2021-08-31T12:46:33Z

@TomAugspurger @gjoseph92 Thanks for bringing this up. I can take a look sometime in the next couple of days.

I've noticed that issue with slowness in Link.to_dict before and I think caching that get_root call should make things much better. I agree with @gjoseph92 that the Asset.to_dict slowness is concerning because that method should just be assigning values to a dictionary.

One thing @lossyrob and I have discussed a bit is moving towards having all JSON-like classes (Asset, Link, etc.) have an underlying dictionary for all of their fields rather than storing them in instance attributes and then converting them to a dictionary. In that case, our from_dict and to_dict methods could be much faster (since they would either be returning or copying dicitonaries) at the expense of slightly slower attribute access (since we would be using property accessors instead of instance attributes). I may take a crack at implementing this for the Asset class to see if it helps with this performance issue.

duckontheweb · 2021-09-02T00:43:46Z

Regarding the slow Link.to_dict method:

That method makes a call to Item.get_root here. If that Item's root link has not been resolved, then this results in a network request to resolve that link to a Catalog. Since pystac_client does not pass a root argument to ItemCollection in the get_all_items method none of the Items have resolved root links, so a new network request is made for each one.

I think the best solution here is to pass root=self.client when instantiating ItemCollection as is done in get_item_collections.

cc: @matthewhanson

lossyrob · 2021-09-02T02:22:17Z

... I think caching that get_root call should make things much better.

Caching the get_root call is tough, because the mechanism for caching in PySTAC is centered around a root. The root has a resolution cache, which can resolve links via calls like this.

However, when no root is set, there's no caching structure in place and reading the same href over and over will be very expensive.

As you said @duckontheweb, a good solution to this particular issue is to ensure the root is set on the Items via the ItemCollection.

However, this seems to keep popping up as an inefficiency, and is hard to diagnose. Perhaps we should rethink keeping caching to a per-root basis, and instead make caching based on a singleton? This may only make sense for the HREF caching - the caching by object IDs that happens off of the root is necessary for proper object resolution (e.g. in cases where links point to the same object read at different times, but should represent the same target object in memory). There's probably some cache invalidation headaches that a global HREF cache would cause, but it might be worth removing these types of performance issues in a way that doesn't necessitate users of the library to always remember to pass in root.

TomAugspurger · 2021-09-02T11:33:38Z

I think the best solution here is to pass root=self.client when instantiating ItemCollection as is done in get_item_collections.

stac-utils/pystac-client#90 sets that.

Until stac-utils/pystac#546 is resolved, pystac is very painful to use, so I'd rather not have it be the lead example. Partially Revert "Fix for Pystac ItemCollections (#69)" This reverts commit 98809b4.

TomAugspurger · 2021-10-28T12:01:51Z

Just noting that the slowness in #546 (comment) is fixed now, possibly by stac-utils/pystac-client#90.

#546 (comment) has some more thoughts on a way to speed things up even more. Do we want to keep this issue open to track that work? Or should we make a new one dedicated to it?

gjoseph92 · 2021-10-28T15:10:14Z

Confirmed, it does seem better with pystac-client=0.3.0! Thanks @TomAugspurger.

I feel like this issue could be closed, and those ideas moved to a new one?

duckontheweb · 2021-10-28T17:29:10Z

I'm +1 for closing this and opening a new issue. #617 starts to implement some of the ideas from #546 (comment), but needs some more work.

gjoseph92 · 2021-10-28T22:03:21Z

FYI to_dict is still a bit slow because of stac-utils/pystac-client#115

Losing the root collection was causing `to_dict()` on signed `ItemCollections` to be extremely slow, like stac-utils/pystac#546. By preserving the root catalog stac-utils/pystac-client#72, it's much faster.

scottyhq · 2021-11-12T00:51:43Z

Wanted to point out that while things work well with pystac_client search returns, the from_file method is still slow. I don't really understand the root resolution issues. But this affects loading saved search results, example below:

%%time

import pystac_client #0.3.0
import pystac #1.1

URL = "https://earth-search.aws.element84.com/v0"
catalog = pystac_client.Client.open(URL)

results = catalog.search(
    intersects=dict(type="Point", coordinates=[-105.78, 35.79]),
    collections=["sentinel-s2-l2a-cogs"],
    datetime="2010-04-01/2021-12-31",
    limit=1000,
)

print(f"{results.matched()} items found") # 604 items
stac_items = results.get_all_items() 
stac_items.save_object('items.json')
# Wall time: 2.0s

%%time 
items = [item.to_dict() for item in stac_items]
# Wall time: 69.3 ms

%%time 
stac_items = pystac.ItemCollection.from_file('items.json')
items = [item.to_dict() for item in stac_items]
# !!!! Wall time: 1min 13s !!!

I feel like the from_file() can do the root setting automatically since root urls are in the search results? Or else that method also needs to have a root kwarg:

pystac/pystac/item_collection.py

Line 165 in b17aeb5

def from_file(

%%time
# Awkward workaround
import json
with open('items.json') as f:
    featureCol = json.load(f)
    
e84 = pystac_client.Client.open('https://earth-search.aws.element84.com/v0')
stac_items = pystac.ItemCollection.from_dict(featureCol, preserve_dict=False, root=e84)
items = [item.to_dict() for item in stac_items]
# Wall time: 654 ms

lossyrob · 2021-11-12T03:28:25Z

With #663, and using the new flag:

stac_items = pystac.ItemCollection.from_file('items.json')
items = [item.to_dict(transform_hrefs=False) for item in stac_items]

this should be as fast as your workaround example. Also ItemCollection.to_dict() will not transform HREFs by default.

duckontheweb · 2021-11-17T13:50:58Z

#663 is included in v1.2.0, but I am leaving this issue open and assigning to a future release milestone so we can continue to explore some of the options outlined in the above comment and in #546 (comment)

pjhartzell · 2023-02-01T00:29:57Z

Closing Summary

The issue was slowness in the pystac.Item.to_dict() method. The cause was identified as root link resolution. The problem was compounded when calling to_dict() on each Item in an ItemCollection, in which case the root link was resolved each time. Several fixes were implemented:

The first set of fixes handled Items returned from a pystac-client search. By allowing the root to be set in pystac.ItemCollection.from_dict(), pystac-client can pass the client Catalog as the root for Items returned from a search. This speeds up pystac.Item.to_dict() calls since the root link is already resolved.
- pystac, allow root to be set: Allow root to be set during to_dict #549
- pystac-client, pass client Catalog as root in search().get_items(): Set client Catalog as Item root in ItemSearch pystac-client#72 by way of release 0.2.0-rc.1 pystac-client#74
- pystac-client, pass client Catalog as root in search().get_all_items(): PERF: Avoid copy in pystac for ItemCollections pystac-client#90
A second fix provided an option to disable transforming link hrefs to be relative to the link owner's root catalog in cases where the root catalog type is RELATIVE_PUBLISHED or SELF_CONTAINED. This eliminates any root link resolutions necessary for that transformation. The new option, transform_hrefs, defaults to True for Catalogs, Collections, Items, and Links. It defaults to False for ItemCollections, which handles the specific problem that was raised: slow pystac.Item.to_dict() performance for Items in an ItemCollection loaded from a file (ItemCollection.from_file())
- pystac, allow skipping href transformation: Allow PySTAC to skip transforming hrefs to relative or absolute based on CatalogType #663

There was discussion of additional caching for further performance improvements, but the complexity of the solution seems high. A topic that continues to surface - using a dictionary for all fields in JSON-like classes (Asset, Link, etc.) - was also investigated (see #617) as a performance boost since the to_dict() methods would be returning or copying dictionaries rather than converting classes. However, this change is about more than just performance, e.g., it would enable flexible changes to Item fields, and is probably better addressed in a dedicated issue or RFC.

There are now mechanisms in place that address the original issue. It can be re-opened if needed.

TomAugspurger mentioned this issue Jul 9, 2021

Avoid slow to_dict in cloudless-mosaic microsoft/PlanetaryComputerExamples#42

Merged

This was referenced Jul 9, 2021

Allow root to be set during to_dict #549

Merged

Set client Catalog as Item root in ItemSearch stac-utils/pystac-client#72

Closed

TomAugspurger closed this as completed Jul 12, 2021

duckontheweb reopened this Aug 31, 2021

duckontheweb mentioned this issue Aug 31, 2021

RFC: Use mutable dictionaries for object properties #617

Closed

5 tasks

This was referenced Oct 26, 2021

Workaround poor pystac performance, or switch examples away from pystac gjoseph92/stackstac#79

Closed

Fix for Pystac ItemCollections gjoseph92/stackstac#69

Merged

gjoseph92 mentioned this issue Oct 28, 2021

Switch examples back to satsearch gjoseph92/stackstac#80

Merged

gjoseph92 mentioned this issue Oct 28, 2021

Default low items-per-page limit leads to bad performance stac-utils/pystac-client#115

Closed

gjoseph92 mentioned this issue Oct 29, 2021

PERF: maintain root when signing ItemCollections microsoft/planetary-computer-sdk-for-python#30

Merged

lossyrob mentioned this issue Nov 12, 2021

Changes to main for v1.2.0 release #653

Merged

6 tasks

duckontheweb added this to the 1.2.0 milestone Nov 12, 2021

lossyrob mentioned this issue Nov 12, 2021

Allow PySTAC to skip transforming hrefs to relative or absolute based on CatalogType #663

Merged

5 tasks

duckontheweb modified the milestones: 1.2.0, 1.4.0 Nov 17, 2021

duckontheweb mentioned this issue Jan 24, 2022

Establish performance testing framework and benchmarks #729

Closed

duckontheweb removed this from the 1.4.0 milestone Feb 16, 2022

gadomski added the bug Things which are broken label Jan 9, 2023

gadomski mentioned this issue Jan 20, 2023

New feature: StacIO that retries network requests #958

Closed

broesch-e84 assigned pjhartzell Jan 30, 2023

gadomski added this to the 1.7 milestone Jan 31, 2023

pjhartzell closed this as completed Feb 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slowness in `Item.to_dict()`, seemingly from links #546

Slowness in `Item.to_dict()`, seemingly from links #546

TomAugspurger commented Jul 9, 2021

TomAugspurger commented Jul 9, 2021

lossyrob commented Jul 9, 2021

TomAugspurger commented Jul 12, 2021

gjoseph92 commented Aug 30, 2021

TomAugspurger commented Aug 30, 2021

TomAugspurger commented Aug 30, 2021

gjoseph92 commented Aug 30, 2021

TomAugspurger commented Aug 30, 2021 •

edited

Loading

duckontheweb commented Aug 31, 2021

duckontheweb commented Sep 2, 2021

lossyrob commented Sep 2, 2021

TomAugspurger commented Sep 2, 2021

TomAugspurger commented Oct 28, 2021

gjoseph92 commented Oct 28, 2021

duckontheweb commented Oct 28, 2021

gjoseph92 commented Oct 28, 2021

scottyhq commented Nov 12, 2021 •

edited

Loading

lossyrob commented Nov 12, 2021

duckontheweb commented Nov 17, 2021

pjhartzell commented Feb 1, 2023 •

edited

Loading

Slowness in Item.to_dict(), seemingly from links #546

Slowness in Item.to_dict(), seemingly from links #546

Comments

TomAugspurger commented Jul 9, 2021

TomAugspurger commented Jul 9, 2021

lossyrob commented Jul 9, 2021

TomAugspurger commented Jul 12, 2021

gjoseph92 commented Aug 30, 2021

TomAugspurger commented Aug 30, 2021

TomAugspurger commented Aug 30, 2021

gjoseph92 commented Aug 30, 2021

TomAugspurger commented Aug 30, 2021 • edited Loading

duckontheweb commented Aug 31, 2021

duckontheweb commented Sep 2, 2021

lossyrob commented Sep 2, 2021

TomAugspurger commented Sep 2, 2021

TomAugspurger commented Oct 28, 2021

gjoseph92 commented Oct 28, 2021

duckontheweb commented Oct 28, 2021

gjoseph92 commented Oct 28, 2021

scottyhq commented Nov 12, 2021 • edited Loading

lossyrob commented Nov 12, 2021

duckontheweb commented Nov 17, 2021

pjhartzell commented Feb 1, 2023 • edited Loading

Closing Summary

Slowness in `Item.to_dict()`, seemingly from links #546

Slowness in `Item.to_dict()`, seemingly from links #546

TomAugspurger commented Aug 30, 2021 •

edited

Loading

scottyhq commented Nov 12, 2021 •

edited

Loading

pjhartzell commented Feb 1, 2023 •

edited

Loading