Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multi-band COGs #62

Open
TomAugspurger opened this issue Jun 23, 2021 · 12 comments
Open

Support multi-band COGs #62

TomAugspurger opened this issue Jun 23, 2021 · 12 comments

Comments

@TomAugspurger
Copy link
Contributor

This is mentioned in the README, but I thought I'd open an issue that I can link to.

Currently, multi-band cogs are not supported by stackstac.

>>> import pystac
>>> import stackstac

>>> item = pystac.read_file(
...     "https://planetarycomputer.microsoft.com/api/stac/v1/collections/naip/items/fl_m_2608004_nw_17_060_20191215_20200113"
... )
>>> ds = stackstac.stack([item.to_dict()])
>>> ds.shape
(1, 1, 12240, 11040)

That COG has 4 bands, so the shape should be (1, 4, 12240, 11040).

I started a branch, but haven't had a chance to finish it off. I'll post here if / when I pick it up again.

@kylebarron
Copy link
Contributor

You could probably parse both eo:bands and raster:bands if it exists. Seems you could optimize the Dask arrays in some ways if you already know e.g. what the data type of the file is.

@TomAugspurger
Copy link
Contributor Author

@gjoseph92 implementation-wise, do you have a sense for which of these would be preferable?

  1. Rewrite the incoming items to take a multiband asset like image and replace it with one asset per band. Each new asset would have the same href, so we also need to include the band index number somewhere in the item. The Reader would read a single band.
  2. Rewrite the asset detection logic to look for eo:bands in each asset. Then (somehow) update the Dataset construction logic to know about these bands. The Readers would maybe need to be updated to know that they're supposed to read all the bands.

If I had to guess, 1 sounds a bit easier to shoehorn into the current setup, but I'm not confident in that.

@gjoseph92
Copy link
Owner

Thanks for opening this @TomAugspurger. It's true that 1 might be easier, but I think we should go with some form of 2 for a few reasons. I had always intended Readers to support multi-band; I just cut that corner to get the first release out.

Depending on the file format, it may not be possible to read the individual bands within an asset separately. For example by default, GeoTIFFs are pixel-interleaved. Therefore, we really don't want to have a separate "virtual" asset per band, because then fetching any of R, G, or B would require reading RGB—and trying to fetch R, G, and B would triple-read the data (GDAL block-caching aside; we don't want to rely on that too much) and open triple datasets, which will then be further copied per thread!

So basically, we need dask's chunking along the bands to match up with how the data is actually physically chunked. In the case of Sentinel-2, if we asked stackstac for assets=["B08", "visual-10m", "B11"], I'd want the chunks along the bands axis to be (1, 3, 1). That way, whether you access stack[:, 1] or stack[:, 1:3], dask would fetch just one asset, which would read all three RGB bands.

Of course non-GeoTIFF formats (or band-interleaved GeoTIFFs) might support efficient sub-selection by band, but since stackstac primarily cares about STAC best practices right now, and GeoTIFF is STAC best practice, I think it'd be reasonable to start by assuming assets don't support efficient band sub-selection.

So we'll say we always have one chunk per asset, and the length of that chunk equals the number of bands in that asset.

Then, I think we'd do something like:

  1. In prepare_items, figure out which bands are in the asset.

    IMO this is actually the hardest part and the reason I skipped it, because I didn't want to deal with item_assets vs eo:bands on the Item vs eo:bands on the Asset, and item_assets won't even work currently because prepare_items is designed around taking a sequence of STAC items, not a Collection. Now that ItemCollection exists, I'm not sure if this is less of a pain—can an ItemCollection have an item_assets field? In practice, will any STAC API actually fill that in? Curious if @matthewhanson has any recommendations here. Once you figure this hard part out, then please do Index bands by common_name when available? #3 as well!

  2. Add the expected number of bands to the asset table.

  3. Make Readers check that the asset, when opened, actually has that number of bands, then read all of them (read() vs read(1)). I want to guard against incorrect STACs breaking the dask array downstream.

  4. Set the chunks of the output array along the bands dimension as described above (probably with an adjust_chunks={"b": bands_chunks} or something in the blockwise here).

@scotjohn
Copy link

Are there any new developments concerning this? I wonder if there is a workaround that allows us to work with multiband data (apart from creating single band copies).

Or can we simply input single band VRTs pointing to the multiband data, without loosing to much performance?

@gjoseph92
Copy link
Owner

Unless @TomAugspurger has been working on it, there are no updates. Updating the dask and rasterio side of things to handle multi-band assets is comparatively straightforward. The main thing I haven't wanted to deal with is parsing the various places in STAC where the number of bands per asset can be defined.

There isn't a workaround for this that I know of. If you want to take a stab at it yourself, contributions are always welcome!

@scotjohn
Copy link

@gjoseph92 thank you for your answer. I tried it out: Working with single-band VRTs pointing to multiband COGs worked well for me, allowing to use stackstac in the intended way, i.e. single-band raster inputs. Also it seems to have no impact on processing speed (concluding from a small, limited comparative experiement).

@gjoseph92
Copy link
Owner

@scotjohn mind sharing an example of how you set up the VRTs? Did you rewrite the STAC and create your own asset files, which were actually VRTs pointing to the original assets?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 20, 2022 via email

@scotjohn
Copy link

@scotjohn mind sharing an example of how you set up the VRTs? Did you rewrite the STAC and create your own asset files, which were actually VRTs pointing to the original assets?

Yes, it is as you described.

I work with 4-band PlanetFusion data that are available as multiband COGs. For each COG I create four VRTs, i.e. one per band, which become the assets. I have written the STAC from scratch. So I am not sure how this fits into the problem of existing STACs and public data catalogs.

@scotjohn
Copy link

To generate VRTs from COGs:

def create_bandwise_vrt(in_file: Path, out_dir: Path, n_bands: int):
    for i in range(1, n_bands + 1):
        out_vrt = Path(out_dir, in_file.stem + f"_B{i:02}.vrt")
        in_files = [str(in_file)]
        vrt_options = gdal.BuildVRTOptions(
            bandList=[i]
        )
        gdal.BuildVRT(str(out_vrt), in_files, options=vrt_options)

@julianblue
Copy link

Hi @gjoseph92 ,
is this feature planned to be integrated some point soon or is there a blocker that needs support ?

@gjoseph92
Copy link
Owner

@julianblue my current job doesn't allocate time for me to work on stackstac, so there's no timeline for this happening. I'm opening to reviewing PRs though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants