Make pystac handle item creation asynchronously #609
Comments
@matthewhanson commented in Gitter that there is a stale branch here (which just wasn't merged because it limits the functionality to async only, which shouldn't be the default). I am personally trying to use that branch to handle more than 1000 items gracefully. Still, it would be fantastic if that async branch could be merged into main and made available through either a CLI argument or a switch directly in the code.
As far as I can tell, I wonder if it would make sense to expose both separately, so we can use whatever parallelizing framework we want ( # recursively yield (resolved?) links as Catalog / Collection / Item
descendants = catalog.normalize_hrefs(dest_href).get_descendants() might help already? Alternatively, having a function that returns a mapping (or yields |
Closing, not planning on doing |
As of version 1.1.0, STAC items are handled synchronously. This is fine for a few items, but the runtime gets long with hundreds of items.
I use our own LGLN script to build the catalog from collections and items: it calls our own function for the heavy lifting, normalizes the catalog, and saves it at the end.
With Python's concurrent.futures module, I was able to speed up item creation from TIFF files using the ThreadPoolExecutor, but I am having trouble using a ProcessPoolExecutor, which would be the right way to do it, because the task is almost entirely CPU-bound, not I/O-bound.
I can see a 400% increase in item creation speed with 16 processes spawned instead of one, but I have to comment out the save call at the end, which means nothing is generated.
Unfortunately, the catalog.save method does not wait for the catalog.add_item calls to return, so it tries to build a dictionary from an empty list, which results in an error.
To get around this, it would be really great if pystac itself could handle item creation and adding items to the catalog asynchronously, maybe behind an extra parameter, so that synchronous handling stays the default, because "it just works".
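One thing that may explain the empty catalog: arguments passed to a ProcessPoolExecutor are pickled, so each worker process gets its own copy of the catalog, and anything add_item does there never reaches the parent process that calls save. Below is a minimal sketch of a workaround, assuming the workers only return plain dicts and the parent does all catalog mutation afterwards. `build_item_dict`, `build_all`, and the file paths are hypothetical names for illustration, not part of pystac or of the LGLN script.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from pathlib import Path


def build_item_dict(tiff_path: str) -> dict:
    """Hypothetical worker standing in for the CPU-bound metadata extraction.

    It returns a plain dict, which is picklable and can therefore be sent
    back from a worker process to the parent.
    """
    path = Path(tiff_path)
    return {"id": path.stem, "href": tiff_path}


def build_all(tiff_paths: list[str], executor: Executor) -> list[dict]:
    """Run the worker on every path and block until all results are back."""
    # Executor.map yields results in input order; list() drains the
    # iterator, so this function returns only after every worker finished.
    return list(executor.map(build_item_dict, tiff_paths))


if __name__ == "__main__":
    # Hypothetical input paths; the worker above never opens them.
    paths = [f"/data/tile_{i}.tif" for i in range(4)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        item_dicts = build_all(paths, pool)
    # Only here, back in the parent process, would the results be added
    # to the catalog and saved.
    print(len(item_dicts))
```

If the worker returned a complete STAC item dict instead of this toy one, the parent could convert each result with `pystac.Item.from_dict`, pass it to `catalog.add_item`, and call `catalog.save` once after the pool has exited, so save never runs before the items exist.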
The resulting error is the following, which makes sense to me, because .save() does not wait for anything to be in the catalog: