Write Items or collections in parallel #690

Closed
ljstrnadiii opened this issue Dec 20, 2021 · 2 comments

ljstrnadiii commented Dec 20, 2021

I am pretty new to STAC and PySTAC, but I am currently giving them a try.

I have 1 TB of GeoTIFFs, and I have built a function that creates Items by scanning over all the files, categorizes them into Collections, adds the Collections, and writes everything out to a Catalog. As a first pass, I wrote a custom StacIO object as a simple way to persist to GCS:

from typing import Any, Union

import gcsfs
from pystac import Link
from pystac.stac_io import DefaultStacIO


class GCSStacIO(DefaultStacIO):
    """
    Helper class to configure how we read and write our catalog. Extending
    the default StacIO class lets the catalog route its reads and writes
    through files on GCS for persistence!
    """

    def __init__(self):
        super().__init__()
        self.fs = gcsfs.GCSFileSystem()

    @staticmethod
    def _to_path(href: Union[str, Link]) -> str:
        # Links resolve to their absolute HREF; gcsfs expects the path
        # without the "gs://" scheme prefix.
        path = href.get_absolute_href() if isinstance(href, Link) else href
        if path.startswith("gs://"):
            path = path[len("gs://"):]
        return path

    def read_text(self, source: Union[str, Link], *args: Any, **kwargs: Any) -> str:
        # Open in text mode so we return str rather than bytes.
        with self.fs.open(self._to_path(source), "r") as file:
            return file.read()

    def write_text(
        self, dest: Union[str, Link], txt: str, *args: Any, **kwargs: Any
    ) -> None:
        with self.fs.open(self._to_path(dest), "w") as file:
            file.write(txt)

as described here.
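For reference, I use it roughly like this before saving (the bucket path is just a placeholder):

import pystac

# Route all catalog reads and writes through the custom GCS StacIO above.
pystac.StacIO.set_default(GCSStacIO)

catalog = pystac.Catalog(id="my-catalog", description="Catalog for my GeoTIFFs")
# ... build Collections and Items and add them to the catalog ...
catalog.normalize_hrefs("gs://my-bucket/stac")
catalog.save(catalog_type=pystac.CatalogType.ABSOLUTE_PUBLISHED)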

When I save, it takes quite a while, since we seem to be writing out many very small JSON files (one per Item). Is there any way to write these out in parallel? (Assuming each Item should in fact be saved as its own small JSON file.)

So, three questions:

  1. Is there a way to consolidate these Item files?
  2. If not, is there a way to parallelize the writes? I can't seem to find where save on the Catalog recursively calls save on the Items to write them to JSON. (Could someone point me to that?)
  3. Is there any way to open a saved catalog and add just a single Collection?

Thanks a ton! Cool stuff!

@duckontheweb
Contributor

@ljstrnadiii Thanks for raising this.

I opened #749 to add async I/O operations to the library, which should help with this.

  1. Is there a way to consolidate these Item files?

The answer to this really depends on the content and structure of your GeoTIFFs. STAC Items are meant to represent distinct spatiotemporal resources (think satellite imagery scenes), so files that are part of the same scene should be represented as assets on a single Item. However, if the GeoTIFFs are not part of a single scene, then you are probably doing the right thing by creating separate Items. @m-mohr @cholmes may be able to offer additional guidance on this as well.
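If several GeoTIFFs really do belong to the same scene, grouping them as assets on one Item looks roughly like this (the ID, footprint, timestamp, and HREFs below are all placeholders):

from datetime import datetime

import pystac

# Placeholder footprint/bbox for the scene; use the real geometry in practice.
scene_bbox = [-105.0, 39.0, -104.0, 40.0]
scene_geometry = {
    "type": "Polygon",
    "coordinates": [[
        [-105.0, 39.0], [-104.0, 39.0], [-104.0, 40.0], [-105.0, 40.0], [-105.0, 39.0],
    ]],
}

item = pystac.Item(
    id="scene-001",
    geometry=scene_geometry,
    bbox=scene_bbox,
    datetime=datetime(2021, 12, 20),
    properties={},
)

# One asset per band/file, all attached to the same Item.
item.add_asset(
    "red",
    pystac.Asset(href="gs://my-bucket/scene-001/red.tif", media_type=pystac.MediaType.COG),
)
item.add_asset(
    "nir",
    pystac.Asset(href="gs://my-bucket/scene-001/nir.tif", media_type=pystac.MediaType.COG),
)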

2. If not, is there a way to parallelize the writes? I can't seem to find where save on the Catalog recursively calls save on the Items to write them to JSON. (Could someone point me to that?)

I've opened #749 to introduce async I/O operations into the library, and any feedback on that would be much appreciated. Currently, the recursive saving of Items happens here. In the PR, those save requests are made in batches of asynchronous requests here.
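If you need something sooner, one rough workaround (not built into PySTAC) is to normalize the HREFs, serialize the Items yourself, and fan the writes out over a thread pool using the GCSStacIO from above:

import json
from concurrent.futures import ThreadPoolExecutor

import pystac


def save_items_in_parallel(catalog: pystac.Catalog, stac_io: pystac.StacIO, workers: int = 16) -> None:
    # get_all_items walks the catalog recursively in PySTAC 1.x
    # (newer releases spell this get_items(recursive=True)).
    items = list(catalog.get_all_items())

    def write(item: pystac.Item) -> None:
        # Each Item already knows its destination after normalize_hrefs().
        stac_io.write_text(item.get_self_href(), json.dumps(item.to_dict()))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so exceptions from worker threads are raised.
        list(pool.map(write, items))

You would still call Catalog.save for the Catalog and Collection JSON, and a plain recursive save would re-write any resolved Items, so treat this as a sketch rather than a drop-in replacement.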

3. Is there any way to open a saved catalog and add just a single Collection?

Calling Catalog.save will only save child Catalogs, Collections, and Items that have been resolved in memory (see the checks here and here). This means that you should be able to...

  1. Open a saved Catalog
  2. Open up a single child Collection or create a new Collection
  3. Call Catalog.save

...and PySTAC will only save the root Catalog and the modified/new Collection without walking the rest of the Catalog.
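A rough sketch of that workflow, assuming the catalog lives at gs://my-bucket/stac/catalog.json (a placeholder path) and the GCSStacIO above has been registered as the default IO:

import pystac

pystac.StacIO.set_default(GCSStacIO)

# Opening the catalog only reads the root; children stay as unresolved links.
catalog = pystac.Catalog.from_file("gs://my-bucket/stac/catalog.json")

new_collection = pystac.Collection(
    id="new-collection",
    description="Collection added after the initial save",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[None, None]]),
    ),
)
# save() does not normalize HREFs, so set the new Collection's HREF explicitly.
new_collection.set_self_href("gs://my-bucket/stac/new-collection/collection.json")
catalog.add_child(new_collection)

# Only the root Catalog and the resolved/new child are written;
# the other, unresolved Collections are left untouched.
catalog.save(catalog_type=pystac.CatalogType.ABSOLUTE_PUBLISHED)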

@gadomski added this to the 2.0 milestone Jan 31, 2023
@gadomski
Member

We discussed async support during today's stac-utils working group meeting. The group coalesced around the idea that async operations would not be appropriate or realistic for PySTAC in its current form. If folks want to do async work, they should write their own IO and just use the PySTAC data structures. One example of this pattern is this method in stac-asset.
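For anyone landing here later, that pattern looks roughly like the sketch below. This is not a PySTAC or stac-asset API, just the PySTAC data structures plus your own async I/O; write_item and the thread offloading are assumptions:

import asyncio
import json

import pystac


async def write_item(item: pystac.Item, stac_io: pystac.StacIO) -> None:
    # Offload the blocking write (e.g. the GCSStacIO above) to a worker thread
    # (requires Python 3.9+ for asyncio.to_thread).
    await asyncio.to_thread(
        stac_io.write_text, item.get_self_href(), json.dumps(item.to_dict())
    )


async def write_items(items: list[pystac.Item], stac_io: pystac.StacIO) -> None:
    # Fan the writes out concurrently.
    await asyncio.gather(*(write_item(item, stac_io) for item in items))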

@gadomski closed this as not planned Jun 26, 2023
@gadomski removed this from the 2.0 milestone Jun 26, 2023