A few issues have come up over the past few months related to slow performance within the library. We do not currently have any benchmarks for memory usage or runtimes built into our testing suite, so it is hard to catch performance regressions or evaluate possible performance improvements.
The goal of this issue is to articulate a plan for a first pass at performance benchmarking within the library. This should include deciding which parts of the code we want to benchmark and selecting a performance benchmarking library (or libraries) to use in our testing suite.
I do not have extensive personal experience in this area, so input from others who have set up performance testing frameworks in the past is highly desired. To start the discussion, here are some initial thoughts:
What to Benchmark?
tests.test_writing.py::STACWritingTest::test_testcases
This could be a good test of the round-trip of reading, copying, and writing a catalog for each of our test catalogs.
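For example, a first pass would not even need a framework; a minimal sketch of the round-trip timing (the catalog path and the particular read/copy/write calls here are illustrative assumptions, not a prescription):

```python
import tempfile
import time

import pystac

# Placeholder path; in practice this would iterate over the catalogs
# used by tests.test_writing.
TEST_CATALOG = "tests/data-files/catalogs/test-case-1/catalog.json"

def time_round_trip(href: str) -> float:
    """Time one read -> copy -> write cycle for a single catalog."""
    start = time.perf_counter()
    catalog = pystac.Catalog.from_file(href)    # read
    copy = catalog.full_copy()                  # copy
    with tempfile.TemporaryDirectory() as tmp:  # write
        copy.normalize_and_save(tmp, catalog_type=pystac.CatalogType.SELF_CONTAINED)
    return time.perf_counter() - start

print(f"round trip: {time_round_trip(TEST_CATALOG):.3f}s")
```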
to_dict methods
Slowness in Item.to_dict(), seemingly from links #546 brought up some performance issues in the use of to_dict, and this comment raises some questions about how we cache STAC objects. It would be good to have some benchmarks here so we can fine-tune this in terms of both memory and speed.
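For example, both numbers can be captured with the standard library alone; a quick sketch (the item path is a placeholder):

```python
import timeit
import tracemalloc

import pystac

item = pystac.Item.from_file("tests/data-files/item/sample-item.json")  # placeholder path

# Wall-clock time per call.
n = 1000
seconds = timeit.timeit(item.to_dict, number=n)
print(f"to_dict: {seconds / n * 1e6:.1f} µs per call")

# Peak memory allocated during a single call.
tracemalloc.start()
item.to_dict()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"to_dict peak allocation: {peak / 1024:.1f} KiB")
```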
Methods for fetching items/children in catalogs
Efficiently fetching a specific child object #99 brings up the issue of efficiently fetching specific items from a catalog, and it would be good to have some metrics around this.
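A sketch of what such a metric could look like, assuming the current get_child/get_item accessors (the ids and path are placeholders):

```python
import time

import pystac

catalog = pystac.Catalog.from_file("catalog.json")  # placeholder path

def time_call(fn, *args, **kwargs) -> float:
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

# Both accessors currently walk links, so timings like these should
# surface the lookup cost raised in #99.
print("get_child:", time_call(catalog.get_child, "some-child-id"))
print("get_item (recursive):", time_call(catalog.get_item, "some-item-id", recursive=True))
```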
What Tool to Use?
Given the interest in using async and multithreaded techniques to improve performance (see #609 and #274), any library we choose should probably be able to handle asynchronous and multithreaded code. Here are some options from a brief survey of the landscape:
asv (airspeed velocity)
Well-documented and supported speed benchmarking framework with built-in visualization capabilities. Not sure what it would look like to support asynchronous or multi-threaded code.
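For reference, asv benchmarks are plain classes in which the method-name prefix selects the metric: time_* methods are timed and peakmem_* methods report peak memory, so it would cover both axes we care about. A minimal sketch (the item path is a placeholder):

```python
# benchmarks/benchmarks.py -- discovered by asv via asv.conf.json
import pystac

class ItemToDictSuite:
    def setup(self):
        # Runs before each measurement and is excluded from the results.
        self.item = pystac.Item.from_file("tests/data-files/item/sample-item.json")

    def time_to_dict(self):
        self.item.to_dict()

    def peakmem_to_dict(self):
        self.item.to_dict()
```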
yappi
Has support for async, gevent, and multithreaded processing, seems well-supported, and claims to be very fast. Only profiles timing, not memory usage (as far as I can tell).
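Usage is a start/stop API wrapped around the code under test; a sketch of profiling a multithreaded read (the path is a placeholder):

```python
import threading

import yappi
import pystac

yappi.set_clock_type("wall")  # CPU clock would hide time spent blocked on I/O
yappi.start()

threads = [
    threading.Thread(target=pystac.Catalog.from_file, args=("catalog.json",))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

yappi.stop()
yappi.get_func_stats().print_all()  # per-function timings aggregated across threads
```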
profile/cProfile
Already included in the standard library, but would take some work to make it support multi-threaded or async code. Only profiles timing, not memory usage. Can be combined with snakeviz for visualizing results.
guppy3
Memory profiler with some related blog posts on how to track down memory issues.
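A sketch of how these last two might be combined, timing with cProfile (then `snakeviz roundtrip.prof` to visualize) and snapshotting the heap with guppy3 (the workload is a placeholder):

```python
import cProfile

import pystac
from guppy import hpy

# Time profile: dump stats to a file for snakeviz to visualize.
profiler = cProfile.Profile()
profiler.enable()
catalog = pystac.Catalog.from_file("catalog.json").full_copy()  # placeholder workload
profiler.disable()
profiler.dump_stats("roundtrip.prof")

# Memory snapshot: guppy3 groups live objects by type, which helps
# track down what is still holding references after the copy.
print(hpy().heap())
```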
The nice thing about asv is that it tracks benchmarks over time, making it relatively easy to detect performance regressions and tie them back to specific commits. You just need somewhere to run it (we could maybe use the same server sitting in my closet running the pandas benchmarks, but it's somewhat flaky).
I suspect, but am not sure, that you can profile async code with asv by creating an event loop in the setup method and creating and running an asyncio.Task for the actual async code.
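Something like this sketch, assuming asv's setup/time_* conventions (the coroutine is a stand-in for a real async read path):

```python
import asyncio

class AsyncReadSuite:
    def setup(self):
        # One loop per benchmark process; setup time is excluded from results.
        self.loop = asyncio.new_event_loop()

    def teardown(self):
        self.loop.close()

    def time_run_task(self):
        async def fake_async_read():
            await asyncio.sleep(0)  # placeholder for real async work

        # Create a Task on the loop and run it to completion, as suggested above.
        task = self.loop.create_task(fake_async_read())
        self.loop.run_until_complete(task)
```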
➕ 1 for benchmarking memory usage. I am trying to use pystac to iterate over all the items and assets in a static catalog with over a million items, and I'm seeing excessive memory usage. I will probably just process the catalog as raw JSON instead.
cc: @TomAugspurger @lossyrob @matthewhanson @gadomski @scottyhq