# Slowness in Item.to_dict(), seemingly from links (#546)
All the time is seemingly spent in a single call, which seemingly does the (slow) I/O. I notice that the (slow) conditional above is followed by another conditional (lines 134 to 137 in 12eff70).

---
There were existing unused arguments for setting the root on `to_dict`. This commit implements and tests that the root is set to the parameter if present, and adds the parameter to `ItemCollection`, which itself does not have a root but instead passes the root on to each of the Items parsed during the `to_dict` call. Related to #546.

---
For this test:

```python
import pystac
from pystac_client import Client
import planetary_computer as pc
from time import perf_counter

stac = Client.open("https://planetarycomputer-staging.microsoft.com/api/stac/v1")

area_of_interest = {
    "type": "Polygon",
    "coordinates": [
        [
            [-122.27508544921875, 47.54687159892238],
            [-121.96128845214844, 47.54687159892238],
            [-121.96128845214844, 47.745787772920934],
            [-122.27508544921875, 47.745787772920934],
            [-122.27508544921875, 47.54687159892238],
        ]
    ],
}

search = stac.search(
    intersects=area_of_interest,
    datetime="2016-01-01/2020-12-31",
    collections=["sentinel-2-l2a"],
    query={"eo:cloud_cover": {"le": 25}},
    limit=500,  # fetch items in batches of 500
)

t1 = perf_counter()
items = list(search.get_items())
print(f"Found {len(items)} items")
t2 = perf_counter()
signed_items = pystac.ItemCollection([pc.sign(item) for item in items])
t3 = perf_counter()
dicts = [item.to_dict() for item in items]
t4 = perf_counter()

print(f"Search took {t2 - t1} seconds")
print(f"Sign took {t3 - t2} seconds")
print(f"to_dict took {t4 - t3} seconds")
```

Without the above PRs:

With them:

---
Fixed by #549 + stac-utils/pystac-client#72 (I think).

---
I'm still experiencing extreme slowness. Is that expected? I thought both of those changes had been released now. FYI @TomAugspurger, I was going to release a new stackstac version today with your and @scottyhq's pystac updates, but I'm going to hold off until this is addressed, because waiting multiple minutes for `to_dict()` is painful.

---
You're correct, both of these changes should be available in the versions of pystac & pystac-client you posted. With the https://planetarycomputer.microsoft.com/api/stac/v1 endpoint, things are better, but still not great:

```python
%%time
catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
items = catalog.search(
    intersects=dict(type="Point", coordinates=[-106, 35.7]),
    collections=["sentinel-2-l2a"],
    datetime="2019-01-01/2020-01-01"
).get_all_items()
```

```
CPU times: user 396 ms, sys: 19.6 ms, total: 415 ms
Wall time: 1.36 s
```

(that gives 281 items instead of 294, dunno why) and

```python
%%time
dict_items = [item.to_dict() for item in items]
```

```
CPU times: user 3.95 s, sys: 243 ms, total: 4.2 s
Wall time: 19.2 s
```

Most of the time seems to be spent in link resolution.

---
It's definitely related to link resolution though. If you clear the links, things look just fine, even on the earth-search endpoint:

```python
items2 = []
for item in items:
    item = item.clone()
    item.clear_links()
    items2.append(item)
ic2 = pystac.ItemCollection(items2)
```

```python
%%time
dict_items = [item.to_dict() for item in items2[:20]]
```

```
CPU times: user 13.4 ms, sys: 3.77 ms, total: 17.2 ms
Wall time: 16.6 ms
```

I wonder what's best here. At a minimum, there's probably some work to do in the link-resolution code.

---
Yeah, I was thinking the same. Clearing links would be reasonable for now, but I'm also a little bit worried about the side effects of doing that. I haven't looked yet, but I'm curious where the time is actually going.

---
All good questions, and I don't have a good sense for any of them, but I can try!

I made a snakeviz profile at https://gistcdn.githack.com/TomAugspurger/66c6084c0898c1c713e54163fc7769d2/raw/2282976c21185c38731d162878aa033075263821/item_to_dict_static.html (LMK if you need help reading that; https://jiffyclub.github.io/snakeviz/#interpreting-results has an explanation).

```python
import pystac, pystac_client

%load_ext snakeviz

url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2B_MSIL2A_20191230T174729_R098_T13SDV_20201003T112518"
item = pystac.read_file(url)

%snakeviz item.to_dict()
```

@duckontheweb do you or anyone else have time / interest to look into this?

---
@TomAugspurger @gjoseph92 Thanks for bringing this up. I can take a look sometime in the next couple of days. I've noticed that issue with slowness in `to_dict` as well. One thing @lossyrob and I have discussed a bit is moving towards having all JSON-like classes (`Asset`, `Link`, etc.) keep their fields in a dictionary.

---
Regarding the slow `to_dict`: that method makes a call to `get_root`, which can trigger slow link resolution when no root has been cached. I think the best solution here is to pass the root down to the Items (e.g. via the `ItemCollection`).

cc: @matthewhanson
---

Caching the `get_root` call is tough, because the caching mechanism in PySTAC is centered around a root. The root has a resolution cache, which can resolve links via calls like this one. However, when no root is set, there's no caching structure in place, and reading the same href over and over will be very expensive.

As you said @duckontheweb, a good solution to this particular issue is to ensure the root is set on the Items via the ItemCollection. However, this keeps popping up as an inefficiency, and it is hard to diagnose. Perhaps we should rethink keeping caching on a per-root basis, and instead base caching on a singleton? This may only make sense for the HREF caching; the caching by object ID that hangs off the root is necessary for proper object resolution (e.g. in cases where links point to the same object read at different times, but should represent the same target object in memory). A global HREF cache would probably bring some cache-invalidation headaches, but it might be worth it to remove this class of performance issue in a way that doesn't require users of the library to always remember to pass in a root.
---

stac-utils/pystac-client#90 sets that.

---
Until stac-utils/pystac#546 is resolved, pystac is very painful to use, so I'd rather not have it be the lead example. Partially revert "Fix for Pystac ItemCollections (#69)". This reverts commit 98809b4.

---
Just noting that the slowness in #546 (comment) is fixed now, possibly by stac-utils/pystac-client#90. #546 (comment) has some more thoughts on speeding things up even more. Do we want to keep this issue open to track that work, or should we make a new one dedicated to it?

---
Confirmed, it does seem better with pystac-client=0.3.0! Thanks @TomAugspurger. I feel like this issue could be closed, and those ideas moved to a new one?

---
I'm +1 for closing this and opening a new issue. #617 starts to implement some of the ideas from #546 (comment), but needs some more work.

---
FYI:

> Losing the root collection was causing `to_dict()` on signed `ItemCollections` to be extremely slow, like stac-utils/pystac#546. By preserving the root catalog (stac-utils/pystac-client#72), it's much faster.

---
Wanted to point out that while things work well with pystac_client search returns, round-tripping an `ItemCollection` through a file is still extremely slow:

```python
%%time
import pystac_client  # 0.3.0
import pystac  # 1.1

URL = "https://earth-search.aws.element84.com/v0"
catalog = pystac_client.Client.open(URL)
results = catalog.search(
    intersects=dict(type="Point", coordinates=[-105.78, 35.79]),
    collections=["sentinel-s2-l2a-cogs"],
    datetime="2010-04-01/2021-12-31",
    limit=1000,
)
print(f"{results.matched()} items found")  # 604 items
stac_items = results.get_all_items()
stac_items.save_object('items.json')
# Wall time: 2.0s
```

```python
%%time
items = [item.to_dict() for item in stac_items]
# Wall time: 69.3 ms
```

```python
%%time
stac_items = pystac.ItemCollection.from_file('items.json')
items = [item.to_dict() for item in stac_items]
# !!!! Wall time: 1min 13s !!!
```

I feel like the root handling around pystac/pystac/item_collection.py line 165 (in b17aeb5) is the culprit.
```python
%%time
# Awkward workaround
import json

with open('items.json') as f:
    featureCol = json.load(f)

e84 = pystac_client.Client.open('https://earth-search.aws.element84.com/v0')
stac_items = pystac.ItemCollection.from_dict(featureCol, preserve_dict=False, root=e84)
items = [item.to_dict() for item in stac_items]
# Wall time: 654 ms
```

---
With #663, and using the new flag:

```python
stac_items = pystac.ItemCollection.from_file('items.json')
items = [item.to_dict(transform_hrefs=False) for item in stac_items]
```

this should be as fast as your workaround example. Also, `ItemCollection.to_dict()` will not transform HREFs by default.

---
#663 is included in v1.2.0, but I am leaving this issue open and assigning it to a future release milestone so we can continue to explore some of the options outlined in the above comment and in #546 (comment).

---
Closing.

**Summary**: The issue was slowness in `Item.to_dict()`, driven by link resolution (and the associated I/O) when no root was set. There are now mechanisms in place that address the original issue, including preserving the root and the `transform_hrefs=False` flag on `to_dict()`.

There was discussion of additional caching for further performance improvements, but the complexity of the solution seems high. A topic that continues to surface, using a dictionary for all fields in JSON-like classes (`Asset`, `Link`, etc.), was also investigated as a performance boost (see #617), since a dictionary-backed representation would make serialization much cheaper.

It can be re-opened if needed.
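The "dictionary for all fields" idea mentioned in the summary can be sketched in a few lines. This is a hypothetical illustration, not pystac's implementation: `DictBackedAsset` is a made-up class showing why `to_dict()` becomes a cheap copy when the parsed JSON is kept as the backing store.

```python
from copy import deepcopy

class DictBackedAsset:
    """Hypothetical JSON-like class that stores all fields in one dict."""

    def __init__(self, data: dict):
        self._data = data  # the parsed JSON, kept as-is

    @property
    def href(self) -> str:
        return self._data["href"]  # attribute access reads the dict

    def to_dict(self) -> dict:
        # Serialization is just a copy: no per-field reconstruction,
        # and crucially no link resolution or I/O.
        return deepcopy(self._data)

asset = DictBackedAsset({"href": "s3://bucket/B02.tif", "type": "image/tiff"})
print(asset.href)
print(asset.to_dict() == {"href": "s3://bucket/B02.tif", "type": "image/tiff"})
```

The `deepcopy` keeps callers from mutating the object's internal state through the returned dict; dropping it would be faster still, at the cost of aliasing.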
---

*Original issue description:*

I'm trying to diagnose a slowdown I'm observing where `pystac.to_dict()` is slower than it used to be. Right now I'm unsure if the slowdown is due to a change in pystac or in our STAC endpoint. Anyway, if I call `clear_links()` prior to `to_dict()` then things are faster (goes from about 1s to 200µs). So a couple of questions:

1. In `to_dict`, I don't see why it would need to make an HTTP call or anything like that, right? Maybe through some chain calling `link.get_href()`?
2. `to_dict()` has a parameter `include_self_link`. Could we add a parameter to exclude all links? For my application, I only need the assets and extension info, so making requests to fetch the link targets is wasted time.