Scale offset from item asset #202
Conversation
Thanks @RSchueder! Not the easiest part of the code to work on. There are some tricky bits around how this integrates with the `dtype`, and how to use other parts of `raster:bands`. I think overall, the approach here is good.
pyproject.toml
Outdated
```diff
@@ -15,7 +15,7 @@ license = {text = "MIT"}
 name = "stackstac"
 readme = "README.md"
 requires-python = ">=3.8,<4.0"
-version = "0.4.3"
+version = "0.4.4"
```
version = "0.4.4" | |
version = "0.4.3" |
I'd prefer to do the version bump as a separate PR, like #176 for example.
Sounds good
stackstac/prepare.py
Outdated
```python
try:
    assert len(raster_bands) == 1
    asset_scale = raster_bands[0].get('scale', np.nan)
    asset_offset = raster_bands[0].get('offset', np.nan)
except AssertionError:
    raise ValueError(f'raster:bands has more than one element for asset {asset_id}.')
```
Suggested change:
```diff
-try:
-    assert len(raster_bands) == 1
-    asset_scale = raster_bands[0].get('scale', np.nan)
-    asset_offset = raster_bands[0].get('offset', np.nan)
-except AssertionError:
-    raise ValueError(f'raster:bands has more than one element for asset {asset_id}.')
+if len(raster_bands) != 1:
+    raise ValueError(
+        f"raster:bands has {len(raster_bands)} elements for asset {asset_id!r}. "
+        "Multi-band rasters are not currently supported.\n"
+        "If you don't care about this asset, you can skip it by giving a list "
+        "of asset IDs you *do* want in `assets=`, and leaving this one out."
+    )
+asset_scale = raster_bands[0].get('scale', np.nan)
+asset_offset = raster_bands[0].get('offset', np.nan)
```
No need to catch our own `AssertionError` here; just raise the `ValueError` when necessary.
stackstac/rio_reader.py
Outdated
```diff
-scale, offset = reader.scale_offset
+scale, offset = self.scale_offset

 if np.isnan(scale) or np.isnan(offset):
```
Note that it seems to be valid for only one of `scale` and `offset` to be set in `raster:bands`: https://github.com/stac-extensions/raster#transform-height-measurement-to-water-level. In that case, it seems like it would still be appropriate to apply the rescaling from STAC metadata?
Ah, very interesting! Yes, I agree.
I have enabled rescaling to be used even if only `scale` or `offset` is present.
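A minimal sketch of that behavior (the helper name is hypothetical; NaN is the "not provided" sentinel used in the diff above):

```python
import numpy as np

def apply_rescale(data: np.ndarray, scale: float, offset: float) -> np.ndarray:
    # Fall back to the identity value for whichever of scale/offset is
    # missing, so rescaling still applies when only one of the two is given.
    scale = 1.0 if np.isnan(scale) else scale
    offset = 0.0 if np.isnan(offset) else offset
    return data * scale + offset
```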
stackstac/rio_reader.py
Outdated
```python
if scale != 1:
    if self.dtype in INTEGER_DTYPES:
        raise ValueError(f'Requested asset dtype ({self.dtype}) is not compatible with '
```
I think the check here would need to be more sophisticated. The goal of this check is to prevent silent information loss, and to alert users that they need to either not rescale, or change the output data type to something else that can fit the rescaled values.

But just checking if the output dtype is integral is too strict. Rescaling in an integer dtype could be perfectly fine, if you're scaling by an integer and the resulting values still fit in the new dtype. Rescaling from float to int could even be fine; maybe the values are all integers, but unnecessarily stored in float64 on disk.

Ideally, we'd probably temporarily set NumPy to raise on overflow and just try to do the rescale, handling that error if it occurs, but `np.errstate` is probably not threadsafe, which this method needs to be.

Note that right now, the docs for `rescale` say:

> Note that this could produce floating-point data when the original values are ints, so set dtype accordingly. You will NOT be warned if the cast to dtype is losing information!

which makes me feel that for this first PR, maybe it's better to not validate at all than to have overly strict validation.
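For illustration, one thread-safe alternative that avoids `np.errstate` could look like this (a hypothetical helper, not code from the PR): rescale, cast, then round-trip the cast back and compare elementwise, so any overflow or truncation shows up as a mismatch.

```python
import numpy as np

def rescale_checked(data: np.ndarray, scale: float, offset: float, dtype) -> np.ndarray:
    """Rescale `data`, then cast to `dtype`, raising if the cast loses information."""
    rescaled = data * scale + offset
    cast = rescaled.astype(dtype)
    # If the cast changed any value (overflow, truncation), casting back to
    # the rescaled dtype won't reproduce the original values.
    if not np.array_equal(cast.astype(rescaled.dtype), rescaled, equal_nan=True):
        raise ValueError(f"rescaled values do not fit losslessly in {dtype}")
    return cast
```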
Hmm, yes, I see how it is overly strict. I will remove the validation.
stackstac/rio_reader.py
Outdated
```diff
@@ -304,6 +304,7 @@ def __init__(
     dtype: np.dtype,
     fill_value: Union[int, float],
     rescale: bool,
```
It feels a little weird to have both `rescale` and `scale_offset`. We should probably just remove `rescale`, and at the same time remove support for rescaling from metadata in the COG. As mentioned in #63 (comment), this doesn't seem to get used often anyway, and it makes things more complex (it's unclear which should take precedence).

Then, for `stack(..., rescale=False)`, we'd just pass a no-op `scale_offset` of `(1, 0)`.
I have made the recommended changes; happy to get feedback on whether the way I use `DEFAULT_SCALE` and `DEFAULT_OFFSET` across the repo makes sense.
stackstac/rio_reader.py
Outdated
```diff
@@ -304,6 +304,7 @@ def __init__(
     dtype: np.dtype,
     fill_value: Union[int, float],
     rescale: bool,
```
Also, the docstring for the `rescale` parameter should be updated after this to mention that the values come from STAC metadata.
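A hedged sketch of wording that docstring might use (illustrative only, not the final text; the last sentence echoes the existing docs quoted above):

```python
RESCALE_DOC = """
rescale:
    Whether to rescale pixel values by the ``scale`` and ``offset`` given in
    the ``raster:bands`` STAC metadata for each asset, if present. Note that
    this could produce floating-point data when the original values are ints,
    so set ``dtype`` accordingly. You will NOT be warned if the cast to
    ``dtype`` is losing information!
"""
```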
Updated.

@gjoseph92 Thanks very much for the initial review. I was wondering if you had any additional suggestions for this PR? Thanks!
stackstac/rio_reader.py
Outdated
```python
DEFAULT_SCALE = 1
DEFAULT_OFFSET = 0
```
nit: I'd actually rather not have these pulled out into variables, because they're not something that could ever change (`DEFAULT_SCALE` is never going to be 2, for instance), and the fact that they're 1 and 0 is somewhat significant to the code (`if scale != DEFAULT_SCALE: result *= scale` is strange unless you know `DEFAULT_SCALE` is 1). To me, it's easier to just read/write `0` instead of `DEFAULT_OFFSET`.
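For contrast, a sketch of the literal style being suggested (the helper name is hypothetical), where 1 and 0 read as identity values on sight:

```python
import numpy as np

def apply_scale_offset(result: np.ndarray, scale: float, offset: float) -> np.ndarray:
    # 1 and 0 are the identity values for scale and offset; written as
    # literals, the intent of each guard is obvious without a lookup.
    if scale != 1:
        result = result * scale
    if offset != 0:
        result = result + offset
    return result
```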
Sounds good. I had just done this because otherwise we'd have the numbers 1 and 0 hard-coded in numerous places. You are right that they will not change.
stackstac/rio_reader.py
Outdated
```python
if result.dtype != self.dtype:
    result = result.astype(self.dtype, copy=False)
```
I am happy to remove this and delegate to #208, as it is authored by you. What is your timeline for merging that?
I was hoping to hear confirmation from @Berhinj that it solves the problem before merging, since I don't have a reproducer. However, I'm quite confident it'll work, so I'll just merge it now so we can get these both in.
I think I'd slightly prefer that approach to what you have here, just because it maybe saves one copy (letting GDAL read the data into an array of the desired output dtype, vs. copying into a new array). There are so many copies internally already, though, that I doubt it matters much.
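For reference, a hedged sketch of the #208 approach, assuming rasterio's `out_dtype` read parameter (the file path is illustrative):

```python
import numpy as np
import rasterio

# Ask GDAL to deliver the desired output dtype directly, rather than
# reading in the file's native dtype and casting into a new array afterwards.
with rasterio.open("example.tif") as ds:
    result = ds.read(1, out_dtype=np.dtype("float64"))
```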
@gjoseph92 Comments from round 2 are addressed, except for the conflict with #208, which I will probably address after it is merged to master (assuming you would like to do #208 first).
Thanks @RSchueder, great work! I'm going to:
- merge #208 (Explicitly pass dtype during dataset read)
- push a commit here fixing the conflict
- merge this
If you don't mind (since there's still no testing infrastructure in stackstac 😞), once it's merged, could you test this out yourself and confirm it works as expected? (I assume you have something in mind, since you went to the trouble of opening this PR.) After that, I'll tag a release.
@gjoseph92 Sounds like a good plan. I have a functional test in my company repo that should validate this use case. I will install from main, demonstrate, and get back to you.
@gjoseph92 When installing this branch on my system, the following test passes:

You can ignore the method for obtaining the items that constitute the content of the test, aside from recognizing that they are

If you want to run it yourself or encode it in the repo somehow, I can serialize the items to omit the search component of the test. What do you think of this?
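For concreteness, that serialization could look something like this (a sketch; the pystac-client usage, URL, and collection are illustrative assumptions, not from the PR):

```python
import json
import pystac_client

# Snapshot the STAC items once, so the test no longer depends on a live search.
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(collections=["sentinel-2-l2a"], max_items=4)
items = [item.to_dict() for item in search.items()]
with open("test_items.json", "w") as f:
    json.dump(items, f)
```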
@gjoseph92 I also wanted to point out that this is a breaking change for my existing codebase:

I suspect that this is caused by me specifying
Seems good! Thanks for testing it out.

Interesting. So right now, the docstring for lines 207 to 209 in 7836a36

However, now that we know the scaling factors ahead of time, I think we can actually do better than this. If you specify

Any interest in opening a PR to do that?
@gjoseph92 The PR for the fix you describe above has been created here: #209
FYI @RSchueder, this is now released in 0.5.0. Thanks for your contribution!
This PR enables users to set `rescale=True` in `stackstac.stack` and have the `scale` and `offset` values be taken from the STAC Item, with priority over the COG.
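A hedged usage sketch (assuming `items` is a list of STAC items, e.g. from a pystac-client search; the asset name and dtype are illustrative):

```python
import numpy as np
import stackstac

# With rescale=True, the scale/offset from each asset's raster:bands
# STAC metadata are applied to the pixel values as data * scale + offset.
data = stackstac.stack(
    items,              # STAC items from a search (hypothetical here)
    assets=["lwir11"],  # illustrative asset ID
    rescale=True,
    dtype="float64",
    fill_value=np.nan,
)
```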