Write Items or collections in parallel #690

Closed
ljstrnadiii opened this issue Dec 20, 2021 · 2 comments

ljstrnadiii commented Dec 20, 2021

I am pretty new to STAC and PySTAC, but I am currently giving them a try.

I have 1 TB of GeoTIFFs, and I have built a function that creates Items by scanning over all the files, categorizes them into Collections, adds the Collections, and writes everything out to a Catalog. As a first pass, I wrote a custom StacIO object as a simple way to persist to GCS:

from typing import Any, Union

import gcsfs
from pystac import Link
from pystac.stac_io import DefaultStacIO


class GCSStacIO(DefaultStacIO):
    """
    Helper class to configure how we read and write our catalog. Extending
    the default StacIO class lets the catalog route its reads and writes
    through files on GCS for persistence!
    """

    def __init__(self):
        super().__init__()
        self.fs = gcsfs.GCSFileSystem()

    @staticmethod
    def _to_path(href: Union[str, Link]) -> str:
        # Links resolve to their absolute HREF; gcsfs expects the path
        # without the "gs://" scheme prefix.
        path = href.get_absolute_href() if isinstance(href, Link) else href
        if path.startswith("gs://"):
            path = path[len("gs://"):]
        return path

    def read_text(self, source: Union[str, Link], *args: Any, **kwargs: Any) -> str:
        # Open in text mode so we return str rather than bytes.
        with self.fs.open(self._to_path(source), "r") as file:
            return file.read()

    def write_text(
        self, dest: Union[str, Link], txt: str, *args: Any, **kwargs: Any
    ) -> None:
        with self.fs.open(self._to_path(dest), "w") as file:
            file.write(txt)

as described here.
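For reference, I use it roughly like this before saving (the bucket path is just a placeholder):

import pystac

# Route all catalog reads and writes through the custom GCS StacIO above.
pystac.StacIO.set_default(GCSStacIO)

catalog = pystac.Catalog(id="my-catalog", description="Catalog for my GeoTIFFs")
# ... build Collections and Items and add them to the catalog ...
catalog.normalize_hrefs("gs://my-bucket/stac")
catalog.save(catalog_type=pystac.CatalogType.ABSOLUTE_PUBLISHED)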

When I save, it takes quite a while, since we seem to be writing out many very small JSON files (one per Item). Is there any way to write these out in parallel? (Assuming each Item should in fact be saved as its own small JSON file.)

So, three questions:

  1. Is there a way to consolidate these Item files?
  2. If not, is there a way to parallelize the writes? I can't seem to find where save on the Catalog recursively calls save on the Items to write them to JSON. (Could someone point me to that?)
  3. Is there any way to open a saved catalog and add just a single Collection?

Thanks a ton! Cool stuff!

@duckontheweb
Contributor

@ljstrnadiii Thanks for raising this.

I opened #749 to add async I/O operations to the library, which should help with this.

  1. Is there a way to consolidate these Item files?

The answer to this really depends on the content and structure of your GeoTIFFs. STAC Items are meant to represent distinct spatiotemporal resources (think satellite imagery scenes), so files that are part of the same scene should be represented as assets on a single Item. However, if the GeoTIFFs are not part of a single scene, then you are probably doing the right thing by creating separate Items. @m-mohr @cholmes may be able to offer additional guidance on this as well.
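If several GeoTIFFs really do belong to the same scene, grouping them as assets on one Item looks roughly like this (the ID, footprint, timestamp, and HREFs below are all placeholders):

from datetime import datetime

import pystac

# Placeholder footprint/bbox for the scene; use the real geometry in practice.
scene_bbox = [-105.0, 39.0, -104.0, 40.0]
scene_geometry = {
    "type": "Polygon",
    "coordinates": [[
        [-105.0, 39.0], [-104.0, 39.0], [-104.0, 40.0], [-105.0, 40.0], [-105.0, 39.0],
    ]],
}

item = pystac.Item(
    id="scene-001",
    geometry=scene_geometry,
    bbox=scene_bbox,
    datetime=datetime(2021, 12, 20),
    properties={},
)

# One asset per band/file, all attached to the same Item.
item.add_asset(
    "red",
    pystac.Asset(href="gs://my-bucket/scene-001/red.tif", media_type=pystac.MediaType.COG),
)
item.add_asset(
    "nir",
    pystac.Asset(href="gs://my-bucket/scene-001/nir.tif", media_type=pystac.MediaType.COG),
)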

2. If not, is there a way to parallelize the writes? I can't seem to find where save on the Catalog recursively calls save on the Items to write them to JSON. (Could someone point me to that?)

I've opened #749 to introduce async I/O operations into the library, and any feedback on that would be much appreciated. Currently, the recursive saving of Items happens here. In the PR, those save requests are made in batches of asynchronous requests here.
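If you need something sooner, one rough workaround (not built into PySTAC) is to normalize the HREFs, serialize the Items yourself, and fan the writes out over a thread pool using the GCSStacIO from above:

import json
from concurrent.futures import ThreadPoolExecutor

import pystac


def save_items_in_parallel(catalog: pystac.Catalog, stac_io: pystac.StacIO, workers: int = 16) -> None:
    # get_all_items walks the catalog recursively in PySTAC 1.x
    # (newer releases spell this get_items(recursive=True)).
    items = list(catalog.get_all_items())

    def write(item: pystac.Item) -> None:
        # Each Item already knows its destination after normalize_hrefs().
        stac_io.write_text(item.get_self_href(), json.dumps(item.to_dict()))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so exceptions from worker threads are raised.
        list(pool.map(write, items))

You would still call Catalog.save for the Catalog and Collection JSON, and a plain recursive save would re-write any resolved Items, so treat this as a sketch rather than a drop-in replacement.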

3. Is there any way to open a saved catalog and add just a single Collection?

Calling Catalog.save will only save child Catalogs, Collections, and Items that have been resolved in memory (see the checks here and here). This means that you should be able to...

  1. Open a saved Catalog
  2. Open up a single child Collection or create a new Collection
  3. Call Catalog.save

...and PySTAC will only save the root Catalog and the modified/new Collection without walking the rest of the Catalog.
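A rough sketch of that workflow, assuming the catalog lives at gs://my-bucket/stac/catalog.json (a placeholder path) and the GCSStacIO above has been registered as the default IO:

import pystac

pystac.StacIO.set_default(GCSStacIO)

# Opening the catalog only reads the root; children stay as unresolved links.
catalog = pystac.Catalog.from_file("gs://my-bucket/stac/catalog.json")

new_collection = pystac.Collection(
    id="new-collection",
    description="Collection added after the initial save",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[None, None]]),
    ),
)
# save() does not normalize HREFs, so set the new Collection's HREF explicitly.
new_collection.set_self_href("gs://my-bucket/stac/new-collection/collection.json")
catalog.add_child(new_collection)

# Only the root Catalog and the resolved/new child are written;
# the other, unresolved Collections are left untouched.
catalog.save(catalog_type=pystac.CatalogType.ABSOLUTE_PUBLISHED)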

@gadomski added this to the 2.0 milestone Jan 31, 2023
@gadomski
Member

We discussed async support during today's stac-utils working group meeting. The group coalesced around the idea that async operations would not be appropriate or realistic for PySTAC in its current form. If folks want to do async work, they should write their own IO and just use the PySTAC data structures. One example of this pattern is this method in stac-asset.
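For anyone landing here later, that pattern looks roughly like the sketch below. This is not a PySTAC or stac-asset API, just the PySTAC data structures plus your own async I/O; write_item and the thread offloading are assumptions:

import asyncio
import json

import pystac


async def write_item(item: pystac.Item, stac_io: pystac.StacIO) -> None:
    # Offload the blocking write (e.g. the GCSStacIO above) to a worker thread
    # (requires Python 3.9+ for asyncio.to_thread).
    await asyncio.to_thread(
        stac_io.write_text, item.get_self_href(), json.dumps(item.to_dict())
    )


async def write_items(items: list[pystac.Item], stac_io: pystac.StacIO) -> None:
    # Fan the writes out concurrently.
    await asyncio.gather(*(write_item(item, stac_io) for item in items))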

@gadomski closed this as not planned Jun 26, 2023
@gadomski removed this from the 2.0 milestone Jun 26, 2023