Use smaller internal data format when possible #63

Open
kylebarron opened this issue Jun 25, 2021 · 1 comment

@kylebarron
Contributor

From looking at some examples, it appears that data is always loaded into float64 arrays. For example, in https://github.com/gjoseph92/stackstac/blob/5f984b211993380955b5d3f9eba3f3e285f6952c/examples/show.ipynb, loading the RGB bands of a Sentinel 2 asset (rgb = stack.sel(band=["B04", "B03", "B02"]).persist()) creates an xarray DataArray of dtype float64. It seems to me that you could improve performance (or at least memory usage) by using a smaller data type when possible.

You could look at the raster:bands object if it exists to optimize the xarray data type. If the extension doesn't exist, or if the bands have mixed dtypes, then fall back to float64?
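To make the suggestion concrete, here is a minimal sketch of that fallback logic. It assumes STAC items are plain dicts and that each asset may carry a `raster:bands` list whose entries have a `data_type` field, as in the raster extension. The function name `choose_dtype` and its parameters are hypothetical and not part of stackstac.

```python
def choose_dtype(items, asset_ids, fallback="float64"):
    """Return the one data_type shared by all requested assets, else `fallback`.

    Hypothetical helper: `items` are STAC items as plain dicts and
    `asset_ids` are the asset keys to be stacked.
    """
    seen = set()
    for item in items:
        for asset_id in asset_ids:
            asset = item.get("assets", {}).get(asset_id)
            if asset is None:
                continue  # this item lacks the asset, so it tells us nothing
            bands = asset.get("raster:bands")
            if not bands:
                return fallback  # extension missing: native dtype is unknown
            for band in bands:
                data_type = band.get("data_type")
                if data_type is None:
                    return fallback  # band does not declare a dtype
                seen.add(data_type)
    # Use the native dtype only when every band agrees on a single one.
    return seen.pop() if len(seen) == 1 else fallback
```

Returning the fallback as soon as any asset lacks the metadata keeps the behavior conservative: a smaller dtype is only chosen when every requested band declares the same one.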

@gjoseph92
Owner

Using float64 by default was an intentional choice because

  • raster:bands didn't exist when I wrote everything a few months ago, so there was no way to know without actually fetching data what the native dtype of the asset would be. But we have to know that ahead of time to correctly construct the dask array. So float64 seemed like the safest default, since anything else could lose precision.

  • rescale=True by default, which uses the scale_offset metadata defined within each GeoTIFF (not known within the STAC metadata) to apply rescaling. So even if the asset were uint16 to begin with, it could become float64 after applying rescaling—yet another reason why that default made sense.

    However, from what I've seen, nobody really sets the scale_offset metadata at the GeoTIFF level, so I think removing this rescaling might be reasonable. It would make thinking about dtypes a lot easier.

Note that you can control the dtype using the dtype= parameter to stackstac.stack. You'll also want to set rescale=False if doing this, as noted in the docs.
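As a rough sketch of that workaround: the wrapper below passes `dtype=` and `rescale=False` to `stackstac.stack`. The wrapper name `stack_native` is hypothetical. The `fill_value=0` argument is an assumption on my part: the default fill value is NaN, which an integer dtype cannot represent, so this assumes your installed version accepts `fill_value` and that 0 is an acceptable nodata marker for your data.

```python
def stack_native(items, dtype="uint16", fill_value=0, **kwargs):
    """Stack STAC items in a caller-chosen dtype instead of float64.

    Hypothetical convenience wrapper around stackstac.stack.
    """
    import stackstac  # imported lazily so the module loads without it

    return stackstac.stack(
        items,
        dtype=dtype,            # e.g. "uint16" for Sentinel-2 reflectance bands
        fill_value=fill_value,  # assumption: NaN cannot be stored in an integer dtype
        rescale=False,          # skip scale/offset so values stay in `dtype`
        **kwargs,
    )
```

A uint16 array takes 2 bytes per pixel versus 8 for float64, so for data that really is uint16 this should cut memory use to roughly a quarter.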

I'd really like to make this automatic though. I think raster:bands is the missing link to allow us to do that. Having data_type, scale, offset, and nodata in metadata really changes the game!
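One way the automatic choice could work for a single band, sketched under assumptions: the band is one entry of `raster:bands` as a plain dict with optional `data_type`, `scale`, `offset`, and `nodata` keys, and a missing `scale` or `offset` is treated as 1 and 0. The function `plan_band` and the shape of its return value are hypothetical, not an existing stackstac API.

```python
def plan_band(band):
    """Decide the output dtype and fill value for one raster:bands entry.

    Hypothetical sketch: returns a dict with "dtype", "rescale" and
    "fill_value" keys.
    """
    data_type = band.get("data_type")
    scale = band.get("scale", 1)
    offset = band.get("offset", 0)
    needs_rescale = scale != 1 or offset != 0

    if data_type is None or needs_rescale:
        # Unknown native dtype, or rescaling would produce fractional
        # values: stay with the safe float64 default and fill with NaN.
        return {"dtype": "float64", "rescale": needs_rescale,
                "fill_value": float("nan")}

    # No rescaling needed: keep the native dtype and reuse the declared
    # nodata value (None if the metadata does not give one).
    return {"dtype": data_type, "rescale": False,
            "fill_value": band.get("nodata")}
```

For example, a band declared as `{"data_type": "uint16", "nodata": 0}` would stay uint16, while the same band with `"scale": 0.0001` would be planned as float64.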
