A few issues have come up over the past few months related to slow performance within the library. We do not currently have any benchmarks for memory usage or runtimes built into our testing suite, so it is hard to catch performance regressions or evaluate possible performance improvements.
The goal of this issue is to articulate a plan for a first pass at performance benchmarking within the library. This should include deciding which parts of the code we want to benchmark and selecting a performance benchmarking library (or libraries) to use in our testing suite.
I do not have extensive personal experience in this area, so input from others who have set up performance testing frameworks in the past is highly desired. To start the discussion, here are some initial thoughts:
What to Benchmark?
tests.test_writing.py::STACWritingTest::test_testcases
This could be a good test of the round-trip of reading, copying, and writing a catalog for each of our test catalogs.
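For example, a first pass would not even need a framework; a minimal sketch of the round-trip timing (the catalog path and the particular read/copy/write calls here are illustrative assumptions, not a prescription):

```python
import tempfile
import time

import pystac

# Placeholder path; in practice this would iterate over the catalogs
# used by tests.test_writing.
TEST_CATALOG = "tests/data-files/catalogs/test-case-1/catalog.json"

def time_round_trip(href: str) -> float:
    """Time one read -> copy -> write cycle for a single catalog."""
    start = time.perf_counter()
    catalog = pystac.Catalog.from_file(href)    # read
    copy = catalog.full_copy()                  # copy
    with tempfile.TemporaryDirectory() as tmp:  # write
        copy.normalize_and_save(tmp, catalog_type=pystac.CatalogType.SELF_CONTAINED)
    return time.perf_counter() - start

print(f"round trip: {time_round_trip(TEST_CATALOG):.3f}s")
```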
to_dict methods
Slowness in Item.to_dict(), seemingly from links #546 brought up some performance issues in the use of to_dict, and this comment raises some questions about how we cache STAC objects. It would be good to have some benchmarks here so we can fine-tune this in terms of both memory and speed.
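For example, both numbers can be captured with the standard library alone; a quick sketch (the item path is a placeholder):

```python
import timeit
import tracemalloc

import pystac

item = pystac.Item.from_file("tests/data-files/item/sample-item.json")  # placeholder path

# Wall-clock time per call.
n = 1000
seconds = timeit.timeit(item.to_dict, number=n)
print(f"to_dict: {seconds / n * 1e6:.1f} µs per call")

# Peak memory allocated during a single call.
tracemalloc.start()
item.to_dict()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"to_dict peak allocation: {peak / 1024:.1f} KiB")
```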
Methods for fetching items/children in catalogs
Efficiently fetching a specific child object #99 brings up the issue of efficiently fetching specific items from a catalog, and it would be good to have some metrics around this.
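A sketch of what such a metric could look like, assuming the current get_child/get_item accessors (the ids and path are placeholders):

```python
import time

import pystac

catalog = pystac.Catalog.from_file("catalog.json")  # placeholder path

def time_call(fn, *args, **kwargs) -> float:
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

# Both accessors currently walk links, so timings like these should
# surface the lookup cost raised in #99.
print("get_child:", time_call(catalog.get_child, "some-child-id"))
print("get_item (recursive):", time_call(catalog.get_item, "some-item-id", recursive=True))
```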
What Tool to Use?
Given the interest in using async and multithreaded techniques to improve performance (see #609 and #274), any library we choose should probably be able to handle asynchronous and multithreaded code. Here are some options from a brief survey of the landscape:
asv (airspeed velocity)
Well-documented and supported speed benchmarking framework with built-in visualization capabilities. Not sure what it would look like to support asynchronous or multi-threaded code.
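For reference, asv benchmarks are plain classes in which the method-name prefix selects the metric: time_* methods are timed and peakmem_* methods report peak memory, so it would cover both axes we care about. A minimal sketch (the item path is a placeholder):

```python
# benchmarks/benchmarks.py -- discovered by asv via asv.conf.json
import pystac

class ItemToDictSuite:
    def setup(self):
        # Runs before each measurement and is excluded from the results.
        self.item = pystac.Item.from_file("tests/data-files/item/sample-item.json")

    def time_to_dict(self):
        self.item.to_dict()

    def peakmem_to_dict(self):
        self.item.to_dict()
```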
yappi
Has support for async, gevent, and multithreaded processing, seems well-supported, and claims to be very fast. Only profiles timing, not memory usage (as far as I can tell).
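Usage is a start/stop API wrapped around the code under test; a sketch of profiling a multithreaded read (the path is a placeholder):

```python
import threading

import yappi
import pystac

yappi.set_clock_type("wall")  # CPU clock would hide time spent blocked on I/O
yappi.start()

threads = [
    threading.Thread(target=pystac.Catalog.from_file, args=("catalog.json",))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

yappi.stop()
yappi.get_func_stats().print_all()  # per-function timings aggregated across threads
```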
profile/cProfile
Already included in the standard library, but would take some work to make it support multi-threaded or async code. Only profiles timing, not memory usage. Can be combined with snakeviz for visualizing results.
guppy3
Memory profiler with some related blog posts on how to track down memory issues.
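A sketch of how these last two might be combined, timing with cProfile (then `snakeviz roundtrip.prof` to visualize) and snapshotting the heap with guppy3 (the workload is a placeholder):

```python
import cProfile

import pystac
from guppy import hpy

# Time profile: dump stats to a file for snakeviz to visualize.
profiler = cProfile.Profile()
profiler.enable()
catalog = pystac.Catalog.from_file("catalog.json").full_copy()  # placeholder workload
profiler.disable()
profiler.dump_stats("roundtrip.prof")

# Memory snapshot: guppy3 groups live objects by type, which helps
# track down what is still holding references after the copy.
print(hpy().heap())
```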
The nice thing about asv is that it tracks benchmarks over time, making it relatively easy to detect performance regressions and tie them back to specific commits. You just need somewhere to run it (we could maybe use the same server sitting in my closet running the pandas benchmarks, but it's somewhat flaky).
I suspect, but am not sure, that you can profile async code with asv by creating an event loop in the setup method and creating and running an asyncio.Task for the actual async code.
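Something like this sketch, assuming asv's setup/time_* conventions (the coroutine is a stand-in for a real async read path):

```python
import asyncio

class AsyncReadSuite:
    def setup(self):
        # One loop per benchmark process; setup time is excluded from results.
        self.loop = asyncio.new_event_loop()

    def teardown(self):
        self.loop.close()

    def time_run_task(self):
        async def fake_async_read():
            await asyncio.sleep(0)  # placeholder for real async work

        # Create a Task on the loop and run it to completion, as suggested above.
        task = self.loop.create_task(fake_async_read())
        self.loop.run_until_complete(task)
```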
➕ 1 for benchmarking memory usage. I am trying to use pystac to iterate over all the items and assets in a static catalog with over a million items, and I'm seeing excessive memory usage. I will probably just process the catalog as raw JSON instead.
cc: @TomAugspurger @lossyrob @matthewhanson @gadomski @scottyhq