Make pystac handle item creation asynchronously #609

xenon-dev · 2021-08-20T12:54:47Z

As of version 1.1.0 stac items are handeled in a synchronized way. This is fine for a few items, but has a long runtime with over hundreds of items.

I use our own LGLN script to build the catalog up from collections and items, calling our own function for the heavy lifting, normalizing and then save the catalog in the end.

With pythons concurrent lib, I was able to fasten up the item creation from tiff files with the ThreadPoolExecutor, but having trouble with using a ProcessPoolExecutor, which would be the right way to do it, cause the task is pretty much just cpu-bound, not I/O-bound in any way.

I can see a 400% increase in item creation speed with 16 processes spawned instead of one, but have to comment out the save method in the end, which then generated nothing.

Unfortunately, the catalog.save method does not wait at all for the catalog.add_item method to return and so tries to build a dictionary from an empty list, which results in an error.

To circumvent this, it would be really great, if pystac itself internally could handle the creation of items and adding to the catalog asynchronously, maybe with an extra parameter, to be synchronious handling still be the default, because "it just works".

def addImageItemsToCatalog(imageList, catalog):
    """multithreaded execution inspired by:
    https://stackoverflow.com/questions/21143162/python-wait-on-all-of-concurrent-futures-threadpoolexecutors-futures
    Performance increase was measured to be around 100%, measured with 93 stac items in 2 different catalogs,
    compared to singlethreaded.
    Could be further increased with multiprocess implementation (e.g. use "ProcessPoolExecutor" instead),
    but a way to have the "catalog.save" method be waiting on the execution has to be found, otherwise
    pyStac will fail.
    """
    executor = concurrent.futures.ProcessPoolExecutor(MULTI_THREAD_SWEET_SPOT) 
    futures = [executor.submit(writeImageFileItems, imageFile, catalog) for imageFile in imageList]
    concurrent.futures.wait(futures, timeout=None, return_when=ALL_COMPLETED)
    # VVV this is done on the main thread, without waiting, which crashes :(
    catalog.normalize_hrefs("generated/dop")
    print("Saving catalog to " + catalog.get_self_href())
    catalog.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)

The resulting error would be the following, which I think is clear to me, because .save() does not wait for anything to be in there:

Traceback (most recent call last):
  File "C:\dev\repo\task-3d-2.0\prototypes\dopCatalogBuilder\main.py", line 9, in <module>
    main()
  File "C:\dev\repo\task-3d-2.0\prototypes\dopCatalogBuilder\main.py", line 5, in main
    dopCatalogBuilder.build()
  File "C:\dev\repo\task-3d-2.0\prototypes\dopCatalogBuilder\dopCatalogBuilder.py", line 100, in build
    dop_catalog = createCatalogFromAreaDictionary(dopImages, subcatalogs, catalogTitle="test raster Orthophotos",
  File "C:\dev\repo\task-3d-2.0\prototypes\dopCatalogBuilder\dopCatalogBuilder.py", line 114, in createCatalogFromAreaDictionary
    addImageItemsToCatalog(images, dopCatalog)
  File "C:\dev\repo\task-3d-2.0\prototypes\dopCatalogBuilder\dopCatalogBuilder.py", line 133, in addImageItemsToCatalog
    catalog.save(catalog_type=pystac.CatalogType.SELF_CONTAINED)
  File "C:\Users\dev\anaconda3\envs\catalogBuilder\lib\site-packages\pystac\catalog.py", line 753, in save
    child.save()
  File "C:\Users\dev\anaconda3\envs\catalogBuilder\lib\site-packages\pystac\catalog.py", line 783, in save
    self.save_object(
  File "C:\Users\dev\anaconda3\envs\catalogBuilder\lib\site-packages\pystac\stac_object.py", line 341, in save_object
    stac_io.save_json(dest_href, self.to_dict(include_self_link=include_self_link))
  File "C:\Users\dev\anaconda3\envs\catalogBuilder\lib\site-packages\pystac\collection.py", line 520, in to_dict
    d["extent"] = self.extent.to_dict()
AttributeError: 'list' object has no attribute 'to_dict'

The text was updated successfully, but these errors were encountered:

xenon-dev · 2021-08-23T09:44:59Z

@matthewhanson commented in Gitter, that there is a stale branch here (which just wasn´t merged because it limits the functionality to be just async, which shouldn´t be the default):

https://github.com/stac-utils/pystac/tree/mah/async

I am personally trying to use that branch to handle more than 1000+ items gracefully.

Still it would be fantastic, if that async branch could be merged into main and made usage available through either a cli argument or a switch directly in the code.

keewis · 2023-04-14T08:15:46Z

As far as I can tell, save does two things: recursively compute the href for children / items, and actually save the object to disk.

I wonder if it would make sense to expose both separately, so we can use whatever parallelizing framework we want (concurrent.futures, dask, beam, ...) to do the actual saving (call .save_object() or use .to_dict() and save manually). Something like

# recursively yield (resolved?) links as Catalog / Collection / Item
descendants = catalog.normalize_hrefs(dest_href).get_descendants()

might help already? Alternatively, having a function that returns a mapping (or yields dict.items() tuples) of dest_href → object would also work.

gadomski · 2023-06-26T19:15:20Z

Closing, not planning on doing async in PySTAC at this time. See this comment for background on the rationale.

duckontheweb mentioned this issue Jan 24, 2022

Establish performance testing framework and benchmarks #729

Closed

duckontheweb mentioned this issue Feb 14, 2022

Add async I/O methods [WIP] #749

Closed

9 tasks

gadomski added this to the 2.0 milestone Jan 31, 2023

gadomski closed this as not planned Won't fix, can't repro, duplicate, stale Jun 26, 2023

gadomski removed this from the 2.0 milestone Jun 26, 2023

gadomski added the wontfix label Jun 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make pystac handle item creation asynchronously #609

Make pystac handle item creation asynchronously #609

xenon-dev commented Aug 20, 2021 •

edited

Loading

xenon-dev commented Aug 23, 2021

keewis commented Apr 14, 2023

gadomski commented Jun 26, 2023

Make pystac handle item creation asynchronously #609

Make pystac handle item creation asynchronously #609

Comments

xenon-dev commented Aug 20, 2021 • edited Loading

xenon-dev commented Aug 23, 2021

keewis commented Apr 14, 2023

gadomski commented Jun 26, 2023

xenon-dev commented Aug 20, 2021 •

edited

Loading