Figure out multiband writes #44

Open
jessjaco opened this issue Jan 5, 2024 · 5 comments

Comments

jessjaco (Collaborator) commented Jan 5, 2024

Since the popular approach is to write only single-band TIFFs, datasets built from the same source data end up running the dask graph once per band. Threaded writes don't resolve this. Loading into memory before calling the write function works but is wasteful (by contrast, writing the corresponding multiband TIFF to memory at least compresses the data). One solution might be to write the multiband file to memory, then parse out the bands. Or, look in the STAC guidance for a possible way to write multiband assets.
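To make the recomputation concrete, here's a minimal sketch. The dataset construction is a stand-in for a lazily loaded odc.stac.load result, not this repo's actual code path, and rioxarray is assumed as the writer:

```python
import numpy as np
import xarray as xr
import rioxarray  # noqa: F401  (registers the .rio accessor)

# Stand-in for a lazily loaded dataset: two bands derived from one
# shared chunked source, mimicking odc.stac.load output.
src = xr.DataArray(
    np.random.rand(256, 256), dims=("y", "x")
).chunk({"y": 128, "x": 128})
ds = xr.Dataset({"red": src * 2, "nir": src + 1})
ds = ds.rio.write_crs("EPSG:4326")

for name, band in ds.data_vars.items():
    # Each to_raster call re-executes the shared upstream graph
    # (here `src`; in practice the load/mask/scale pipeline).
    band.rio.to_raster(f"{name}.tif", driver="COG")
```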

jessjaco (Collaborator, Author) commented Jan 5, 2024

See gjoseph92/stackstac#62

jessjaco (Collaborator, Author) commented Jan 5, 2024

From that link:

  1. You can write multiband STAC items, but
  2. neither stackstac nor odc.stac.load (I tested it) can read them.

One workaround is to create a VRT for each band, but there is a very good point that GeoTIFFs are pixel-interleaved by default, though they can be interleaved by band. However, while the GDAL GeoTIFF driver supports band interleaving, the GDAL COG driver doesn't. This is even more confusing because the COG standard supports BSQ (band-sequential) layout. So while the VRT approach may work, there are performance considerations on read (as an aside, this may be why some operations on the multiband tide data are so memory intensive). Though also consider writing a COG using the GeoTIFF driver, as we used to do; a sketch of both options follows.
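A hedged sketch of both workarounds using GDAL's Python bindings; the file names, band count, and creation options are illustrative assumptions:

```python
from osgeo import gdal

src = "multiband.tif"  # hypothetical multiband GeoTIFF

# Option 1: one VRT per band. BuildVRT with bandList references a
# single band in place, but reads of a pixel-interleaved source
# still decode the other bands.
for band_num in range(1, 4):
    gdal.BuildVRT(f"band_{band_num}.vrt", [src], bandList=[band_num])

# Option 2: rewrite as a tiled, band-interleaved GeoTIFF with the
# GTiff driver (the COG driver doesn't expose INTERLEAVE=BAND).
gdal.Translate(
    "multiband_bsq.tif",
    src,
    creationOptions=["TILED=YES", "INTERLEAVE=BAND", "COMPRESS=DEFLATE"],
)
```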

jessjaco (Collaborator, Author) commented Jan 8, 2024

I think the simplest (and probably workable) solution is to offer the option to load a dataset into memory right before writing, when the values have (in most cases) been scaled to their minimal representation. Not dissimilar to what Alex was doing in the PR I declined last week. These shouldn't be that large for the grid size we're dealing with. The only frustration is that we will then have two versions in memory at once: one uncompressed as an xarray Dataset, and one compressed as a blob.

This works against the ultimate goal of never having a whole dataset in memory at once, but that hasn't yet been possible (unless we use the dask writer to S3 from odc, which we haven't).
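A minimal sketch of that option, assuming the same lazily loaded `ds` as in the earlier sketch and rioxarray for the writes (output naming is illustrative):

```python
import rioxarray  # noqa: F401  (registers the .rio accessor)

# One pass over the dask graph; all bands end up in memory together.
ds = ds.compute()

for name, band in ds.data_vars.items():
    # Writes now read from memory instead of re-running the graph.
    band.rio.to_raster(f"{name}.tif", driver="COG")
```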

alexgleith (Contributor) commented:

> unless we use the dask writer to s3 from odc

I have some hesitations about that writer. It's not using GDAL at all, and I worry about its maintenance. I also had an issue when trying to use it, but that might have just been my environment.

> the simplest (probably) workable solution is to offer the option to load a dataset right before write

I've been having errors when not loading data into memory before writing, possibly only with big dask graphs. Doing the load before writing has proven reliable.

jessjaco (Collaborator, Author) commented Feb 2, 2024

> > unless we use the dask writer to s3 from odc
>
> I have some hesitations about that writer. It's not using GDAL at all, and I worry about its maintenance. I also had an issue when trying to use it, but that might have just been my environment.
>
> > the simplest (probably) workable solution is to offer the option to load a dataset right before write
>
> I've been having errors when not loading data into memory before writing, possibly only with big dask graphs. Doing the load before writing has proven reliable.

I haven't had errors, but if the bands are written to separate files, it will load common source bands (like qa_pixel) multiple times. My guess is that this is part of the issue you were experiencing that made you implement multithreaded writes. One way around the repeated loads is sketched below.
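For illustration, rioxarray can write a whole Dataset as a single multiband raster in one pass, so shared source bands are read only once. This is a hedged sketch, not this repo's API, and whether a multiband output is readable downstream is exactly the open question of this issue:

```python
import rioxarray  # noqa: F401  (registers the .rio accessor)

# All data_vars become bands of one file, and the shared dask graph
# (including any qa_pixel masking) executes a single time.
# Assumes all data_vars share a dtype, which rioxarray requires here.
ds.rio.to_raster("multiband.tif", driver="COG")
```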
