Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scale offset from item asset #202

Merged
merged 9 commits into from
Jul 23, 2023

Conversation

RSchueder
Copy link
Contributor

This MR enables users to set rescale=True in stackstac.stack and have the scale and offset values be taken from the STAC Item with priority over the COG.

Copy link
Owner

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RSchueder! Not the easiest part of the code to work on. There are some tricky bits around how this integrates with the dtype, and how to use other parts of raster:bands. I think overall, the approach here is good.

pyproject.toml Outdated
@@ -15,7 +15,7 @@ license = {text = "MIT"}
name = "stackstac"
readme = "README.md"
requires-python = ">=3.8,<4.0"
version = "0.4.3"
version = "0.4.4"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
version = "0.4.4"
version = "0.4.3"

I'd prefer to do the version bump as a separate PR, like #176 for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

stackstac/prepare.py Show resolved Hide resolved
Comment on lines 148 to 153
try:
assert len(raster_bands) == 1
asset_scale = raster_bands[0].get('scale', np.nan)
asset_offset = raster_bands[0].get('offset', np.nan)
except AssertionError:
raise ValueError(f'raster:bands has more than one element for asset {asset_id}.')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
assert len(raster_bands) == 1
asset_scale = raster_bands[0].get('scale', np.nan)
asset_offset = raster_bands[0].get('offset', np.nan)
except AssertionError:
raise ValueError(f'raster:bands has more than one element for asset {asset_id}.')
if len(raster_bands) != 1:
raise ValueError(
f"raster:bands has {len(raster_bands)} elements for asset {asset_id!r}. "
"Multi-band rasters are not currently supported.\n"
"If you don't care about this asset, you can skip it by giving a list "
"of asset IDs you *do* want in `assets=`, and leaving this one out."
)
asset_scale = raster_bands[0].get('scale', np.nan)
asset_offset = raster_bands[0].get('offset', np.nan)

No need to catch our own AssertionError here; just raise the ValueError when necessary.

scale, offset = reader.scale_offset
scale, offset = self.scale_offset

if np.isnan(scale) or np.isnan(offset):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it seems to be valid for only one of scale and offset to be set in raster:bands: https://github.com/stac-extensions/raster#transform-height-measurement-to-water-level.

In that case, it seems like it would still be appropriate to apply the rescaling from STAC metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah very interesting! yes I agree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have enabled rescaling to be used even if only scale or offset are present.

if scale != 1:
if self.dtype in INTEGER_DTYPES:
raise ValueError(f'Requested asset dtype ({self.dtype}) is not compatible with '
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the check here would need to be more sophisticated. The goal of this check is to prevent silent information loss, and alert users that they need to either not rescale, or change the output data type to something else that can fit the rescaled values.

But just checking if the output dtype is integral is too strict. Rescaling in an integer dtype could be perfectly fine, if you're scaling by an integer, and the resulting values still fit in the new dtype. Rescaling from float to int could even be fine; maybe the values are all integers, but unnecessarily stored in float64 on disk.

Ideally we'd probably temporarily set NumPy to raise on overflow and just try to do the rescale, and handle that error if it occurs, but np.errstate is probably not threadsafe, which this method needs to be.

Note that right now, the docs for rescale say

Note that this could produce floating-point data when the original values are ints, so set dtype accordingly. You will NOT be warned if the cast to dtype is losing information!

which makes me feel that for this first PR, maybe it's better to not validate at all than to have overly strict validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yes I see how it is overly strict. I will remove the validation.

stackstac/rio_reader.py Outdated Show resolved Hide resolved
@@ -304,6 +304,7 @@ def __init__(
dtype: np.dtype,
fill_value: Union[int, float],
rescale: bool,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a little weird to have both rescale and scale_offset. We should probably just remove rescale, and at the same time, remove support for rescaling from metadata in the COG. As mentioned in #63 (comment), this doesn't seem to get used often anyway, and makes things more complex (it's unclear which should take precedence).

Then, for stack(..., rescale=False), we'd just pass a no-op scale_offset of (1, 0).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the recommended changes, happy to get feedback to see if it makes sense how I use DEFAULT_SCALE, DEFAULT_OFFSET across the repo.

stackstac/prepare.py Show resolved Hide resolved
@@ -304,6 +304,7 @@ def __init__(
dtype: np.dtype,
fill_value: Union[int, float],
rescale: bool,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the docstring for the rescale parameter should be updated after this to mention that the values come from STAC metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@RSchueder RSchueder requested a review from gjoseph92 June 29, 2023 00:31
@RSchueder
Copy link
Contributor Author

@gjoseph92 Thanks very much for the initial review. I was wondering if you had any additional suggestions for this MR? Thanks!

Comment on lines 68 to 69
DEFAULT_SCALE = 1
DEFAULT_OFFSET = 0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd actually rather not have these be pulled out into variables, because they're not something that could ever change (DEFAULT_SCALE is never going to be 2, for instance), and the fact that they're 0 and 1 are somewhat significant to the code (if scale != DEFAULT_SCALE: result *= scale is strange unless you know DEFAULT_SCALE is 1) . To me, it's easier to just read/write 0 instead of DEFAULT_OFFSET.

Copy link
Contributor Author

@RSchueder RSchueder Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I had just done this because then we have the number 1 and 0 hard-coded in numerous places. You are right that they will not change.

Comment on lines 403 to 404
if result.dtype != self.dtype:
result = result.astype(self.dtype, copy=False)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this becomes unnecessary after #208, so depending on merge order, it can be removed. (This is actually just a different way of solving #206.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to remove and delegate to #208 as it is authored by you. What is your timeline for merging that?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to hear confirmation from @Berhinj that it solves the problem before merging, since I don't have a reproducer. However, I'm quite confident it'll work, so I'll just merge it now so we can get these both in.

I think I'd slightly prefer that approach to what you have here, just because it maybe saves one copy (letting GDAL read the data into an array of the desired output dtype, vs copying into a new array). There are so many copies internally already though, I doubt it matters much.

stackstac/rio_reader.py Show resolved Hide resolved
stackstac/stack.py Outdated Show resolved Hide resolved
@RSchueder RSchueder requested a review from gjoseph92 July 20, 2023 17:28
@RSchueder
Copy link
Contributor Author

@gjoseph92 comments round 2 addressed, except for the conflict with #208, which I will probably address after it is merged to master (assuming you would like to do #208 first).

Copy link
Owner

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RSchueder, great work! I'm going to:

If you don't mind (since there's still no testing infrastructure in stackstac 😞), once it's merged, could you test this out yourself and confirm it works as expected? (I assume you have something in mind, since you went to the trouble of opening this PR.) After that, I'll tag a release.

Comment on lines 403 to 404
if result.dtype != self.dtype:
result = result.astype(self.dtype, copy=False)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to hear confirmation from @Berhinj that it solves the problem before merging, since I don't have a reproducer. However, I'm quite confident it'll work, so I'll just merge it now so we can get these both in.

I think I'd slightly prefer that approach to what you have here, just because it maybe saves one copy (letting GDAL read the data into an array of the desired output dtype, vs copying into a new array). There are so many copies internally already though, I doubt it matters much.

@gjoseph92 gjoseph92 merged commit 7836a36 into gjoseph92:main Jul 23, 2023
@RSchueder
Copy link
Contributor Author

@gjoseph92 sounds like a good plan, I have a functional test in my company repo that should validate this use case. I will install from main, demonstrate, and get back to you.

@RSchueder
Copy link
Contributor Author

RSchueder commented Jul 27, 2023

@gjoseph92 When installing this branch in my system, the following test passes:

from common.eo.catalog.eo_catalog import EOCatalog
from common.eo.catalog.stac.config import Catalog
from common.geometry import polygon_from_bounds
from common.types import EOProductType, TimeRange


def test_scale_offset():
    aoi = [-97.8028, 49.8776, -97.1300, 50.0845]
    toi = TimeRange(value='2021-12-01/2022-03-01')
    bands = ["red"]
    aoi = polygon_from_bounds(aoi)

    client = EOCatalog(product_type=EOProductType('sentinel-2-l2a'), source=Catalog('skyfox'))
    stac_items = client.search(
        aoi=aoi,
        toi=toi,
        cloud_cover=100,
    )
    x_idx = 500
    y_idx = 500
    datacube = client.stack(stac_items[:1], bands, rescale=False)
    unscaled_pixel = datacube[0, 0, y_idx, x_idx].values
    assert unscaled_pixel == 9288

    datacube = client.stack(stac_items[:1], bands, rescale=True)
    scaled_pixel = datacube[0, 0, y_idx, x_idx].values
    assert scaled_pixel == (unscaled_pixel * 0.0001) - 0.1


You can ignore the method for obtaining the items that constitute the content of the test aside from recognizing that they are sentinel-2-l2a items that have scale and offset data in their metadata.

stackstac.stack is called in client.stack() in the following way

            stack = stackstac.stack(
                list(items),
                assets=query_band_names,
                epsg=modal_epsg_code,
                resolution=product_type.RESOLUTION,
                dtype=product_type.DTYPE if not rescale else 'float64',
                fill_value=product_type.FILL_VALUE,
                bounds_latlon=bounds_latlon,
                rescale=rescale
            )

If you want to run it yourself or encode it in the repo somehow I can serialize the items to omit the search component of the test.

What do you think of this?

@RSchueder
Copy link
Contributor Author

RSchueder commented Jul 27, 2023

@gjoseph92 I also wanted to point out that this is a breaking change for my existing codebase:

  File "/usr/local/lib/python3.10/dist-packages/stackstac/rio_reader.py", line 408, in read
    result *= scale
  File "/usr/local/lib/python3.10/dist-packages/numpy/ma/core.py", line 4285, in __imul__
    self._data.__imul__(np.where(self._mask, self.dtype.type(1),
numpy.core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('uint16') with casting rule 'same_kind'

I suspect that this is caused by me specifying dtype='uint16' in stackstac.stack, but the default for rescale is True, causing the float scale to be multiplied by the int result. What do you think about this behaviour?

@gjoseph92
Copy link
Owner

What do you think of this?

Seems good! Thanks for testing it out.

I suspect that this is caused by me specifying dtype='uint16' in stackstac.stack, but the default for rescale is True, causing the float scale to be multiplied by the int result. What do you think about this behaviour?

Interesting. So right now, the doctoring for rescale says:

Default: True. Note that this could produce floating-point data when the
original values are ints, so set ``dtype`` accordingly. You will NOT be warned
if the cast to ``dtype`` is losing information!

However, now that we know the scaling factors ahead of time, I think we can actually do better than this. If you specify dtyple=uint16, and we know any scaling or offset factor is floating-point, we could raise an error about this during stack explaining the problem and telling you to either set rescale=False or change the dtype. That would be much more helpful behavior, especially since this error from NumPy is hard to interpret.

Any interest in opening a PR to do that?

@RSchueder
Copy link
Contributor Author

RSchueder commented Jul 28, 2023

@gjoseph92 the PR for the fix you describe above has been created here: #209

@gjoseph92
Copy link
Owner

FYI @RSchueder, this is now released in 0.5.0. Thanks for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants