Scale offset from item asset #202
Conversation
Thanks @RSchueder! Not the easiest part of the code to work on. There are some tricky bits around how this integrates with the `dtype`, and how to use other parts of `raster:bands`. I think overall, the approach here is good.
pyproject.toml
Outdated
```diff
@@ -15,7 +15,7 @@ license = {text = "MIT"}
 name = "stackstac"
 readme = "README.md"
 requires-python = ">=3.8,<4.0"
-version = "0.4.3"
+version = "0.4.4"
```
version = "0.4.4" | |
version = "0.4.3" |
I'd prefer to do the version bump as a separate PR, like #176 for example.
Sounds good
stackstac/prepare.py
Outdated
```python
try:
    assert len(raster_bands) == 1
    asset_scale = raster_bands[0].get('scale', np.nan)
    asset_offset = raster_bands[0].get('offset', np.nan)
except AssertionError:
    raise ValueError(f'raster:bands has more than one element for asset {asset_id}.')
```
Suggested change:
```diff
-try:
-    assert len(raster_bands) == 1
-    asset_scale = raster_bands[0].get('scale', np.nan)
-    asset_offset = raster_bands[0].get('offset', np.nan)
-except AssertionError:
-    raise ValueError(f'raster:bands has more than one element for asset {asset_id}.')
+if len(raster_bands) != 1:
+    raise ValueError(
+        f"raster:bands has {len(raster_bands)} elements for asset {asset_id!r}. "
+        "Multi-band rasters are not currently supported.\n"
+        "If you don't care about this asset, you can skip it by giving a list "
+        "of asset IDs you *do* want in `assets=`, and leaving this one out."
+    )
+asset_scale = raster_bands[0].get('scale', np.nan)
+asset_offset = raster_bands[0].get('offset', np.nan)
```
No need to catch our own `AssertionError` here; just raise the `ValueError` when necessary.
stackstac/rio_reader.py
Outdated
```diff
-scale, offset = reader.scale_offset
+scale, offset = self.scale_offset

 if np.isnan(scale) or np.isnan(offset):
```
Note that it seems to be valid for only one of `scale` and `offset` to be set in `raster:bands`: https://github.com/stac-extensions/raster#transform-height-measurement-to-water-level. In that case, it seems like it would still be appropriate to apply the rescaling from STAC metadata?
Ah, very interesting! Yes, I agree.
I have enabled rescaling to be used even if only `scale` or `offset` is present.
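A minimal sketch of that behavior (the helper name is hypothetical; NaN is the "not provided" sentinel used in the diff above):

```python
import numpy as np

def apply_rescale(data: np.ndarray, scale: float, offset: float) -> np.ndarray:
    # Fall back to the identity value for whichever of scale/offset is
    # missing, so rescaling still applies when only one of the two is given.
    scale = 1.0 if np.isnan(scale) else scale
    offset = 0.0 if np.isnan(offset) else offset
    return data * scale + offset
```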
stackstac/rio_reader.py
Outdated
```python
if scale != 1:
    if self.dtype in INTEGER_DTYPES:
        raise ValueError(f'Requested asset dtype ({self.dtype}) is not compatible with '
```
I think the check here would need to be more sophisticated. The goal of this check is to prevent silent information loss, and to alert users that they need to either not rescale, or change the output data type to something else that can fit the rescaled values.

But just checking if the output dtype is integral is too strict. Rescaling in an integer dtype could be perfectly fine, if you're scaling by an integer and the resulting values still fit in the new dtype. Rescaling from float to int could even be fine; maybe the values are all integers, but unnecessarily stored in float64 on disk.

Ideally, we'd probably temporarily set NumPy to raise on overflow and just try to do the rescale, handling that error if it occurs, but `np.errstate` is probably not threadsafe, which this method needs to be.

Note that right now, the docs for `rescale` say:

> Note that this could produce floating-point data when the original values are ints, so set dtype accordingly. You will NOT be warned if the cast to dtype is losing information!

which makes me feel that for this first PR, maybe it's better to not validate at all than to have overly strict validation.
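For illustration, one thread-safe alternative that avoids `np.errstate` could look like this (a hypothetical helper, not code from the PR): rescale, cast, then round-trip the cast back and compare elementwise, so any overflow or truncation shows up as a mismatch.

```python
import numpy as np

def rescale_checked(data: np.ndarray, scale: float, offset: float, dtype) -> np.ndarray:
    """Rescale `data`, then cast to `dtype`, raising if the cast loses information."""
    rescaled = data * scale + offset
    cast = rescaled.astype(dtype)
    # If the cast changed any value (overflow, truncation), casting back to
    # the rescaled dtype won't reproduce the original values.
    if not np.array_equal(cast.astype(rescaled.dtype), rescaled, equal_nan=True):
        raise ValueError(f"rescaled values do not fit losslessly in {dtype}")
    return cast
```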
Hmm, yes, I see how it is overly strict. I will remove the validation.
stackstac/rio_reader.py
Outdated
```diff
@@ -304,6 +304,7 @@ def __init__(
     dtype: np.dtype,
     fill_value: Union[int, float],
     rescale: bool,
```
It feels a little weird to have both `rescale` and `scale_offset`. We should probably just remove `rescale`, and at the same time remove support for rescaling from metadata in the COG. As mentioned in #63 (comment), this doesn't seem to get used often anyway, and it makes things more complex (it's unclear which should take precedence).

Then, for `stack(..., rescale=False)`, we'd just pass a no-op `scale_offset` of `(1, 0)`.
I have made the recommended changes; happy to get feedback on whether the way I use `DEFAULT_SCALE` and `DEFAULT_OFFSET` across the repo makes sense.
stackstac/rio_reader.py
Outdated
```diff
@@ -304,6 +304,7 @@ def __init__(
     dtype: np.dtype,
     fill_value: Union[int, float],
     rescale: bool,
```
Also, the docstring for the `rescale` parameter should be updated after this to mention that the values come from STAC metadata.
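A hedged sketch of wording that docstring might use (illustrative only, not the final text; the last sentence echoes the existing docs quoted above):

```python
RESCALE_DOC = """
rescale:
    Whether to rescale pixel values by the ``scale`` and ``offset`` given in
    the ``raster:bands`` STAC metadata for each asset, if present. Note that
    this could produce floating-point data when the original values are ints,
    so set ``dtype`` accordingly. You will NOT be warned if the cast to
    ``dtype`` is losing information!
"""
```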
Updated.

@gjoseph92 Thanks very much for the initial review. I was wondering if you had any additional suggestions for this PR? Thanks!
stackstac/rio_reader.py
Outdated
```python
DEFAULT_SCALE = 1
DEFAULT_OFFSET = 0
```
nit: I'd actually rather not have these pulled out into variables, because they're not something that could ever change (`DEFAULT_SCALE` is never going to be 2, for instance), and the fact that they're 1 and 0 is somewhat significant to the code (`if scale != DEFAULT_SCALE: result *= scale` is strange unless you know `DEFAULT_SCALE` is 1). To me, it's easier to just read/write `0` instead of `DEFAULT_OFFSET`.
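For contrast, a sketch of the literal style being suggested (the helper name is hypothetical), where 1 and 0 read as identity values on sight:

```python
import numpy as np

def apply_scale_offset(result: np.ndarray, scale: float, offset: float) -> np.ndarray:
    # 1 and 0 are the identity values for scale and offset; written as
    # literals, the intent of each guard is obvious without a lookup.
    if scale != 1:
        result = result * scale
    if offset != 0:
        result = result + offset
    return result
```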
Sounds good. I had just done this because otherwise we'd have the numbers 1 and 0 hard-coded in numerous places. You are right that they will not change.
stackstac/rio_reader.py
Outdated
```python
if result.dtype != self.dtype:
    result = result.astype(self.dtype, copy=False)
```
I am happy to remove this and delegate to #208, as it is authored by you. What is your timeline for merging that?
I was hoping to hear confirmation from @Berhinj that it solves the problem before merging, since I don't have a reproducer. However, I'm quite confident it'll work, so I'll just merge it now so we can get these both in.
I think I'd slightly prefer that approach to what you have here, just because it maybe saves one copy (letting GDAL read the data into an array of the desired output dtype, vs. copying into a new array). There are so many copies internally already, though, that I doubt it matters much.
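For reference, a hedged sketch of the #208 approach, assuming rasterio's `out_dtype` read parameter (the file path is illustrative):

```python
import numpy as np
import rasterio

# Ask GDAL to deliver the desired output dtype directly, rather than
# reading in the file's native dtype and casting into a new array afterwards.
with rasterio.open("example.tif") as ds:
    result = ds.read(1, out_dtype=np.dtype("float64"))
```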
@gjoseph92 Comments from round 2 are addressed, except for the conflict with #208, which I will probably address after it is merged to master (assuming you would like to do #208 first).
Thanks @RSchueder, great work! I'm going to:
- merge #208 (Explicitly pass dtype during dataset read)
- push a commit here fixing the conflict
- merge this
If you don't mind (since there's still no testing infrastructure in stackstac 😞), once it's merged, could you test this out yourself and confirm it works as expected? (I assume you have something in mind, since you went to the trouble of opening this PR.) After that, I'll tag a release.
@gjoseph92 Sounds like a good plan. I have a functional test in my company repo that should validate this use case. I will install from main, demonstrate, and get back to you.
@gjoseph92 When installing this branch on my system, the following test passes:

You can ignore the method for obtaining the items that constitute the content of the test, aside from recognizing that they are

If you want to run it yourself or encode it in the repo somehow, I can serialize the items to omit the search component of the test. What do you think of this?
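For concreteness, that serialization could look something like this (a sketch; the pystac-client usage, URL, and collection are illustrative assumptions, not from the PR):

```python
import json
import pystac_client

# Snapshot the STAC items once, so the test no longer depends on a live search.
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(collections=["sentinel-2-l2a"], max_items=4)
items = [item.to_dict() for item in search.items()]
with open("test_items.json", "w") as f:
    json.dump(items, f)
```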
@gjoseph92 I also wanted to point out that this is a breaking change for my existing codebase:

I suspect that this is caused by me specifying
Seems good! Thanks for testing it out.

Interesting. So right now, the docstring for lines 207 to 209 in 7836a36

However, now that we know the scaling factors ahead of time, I think we can actually do better than this. If you specify

Any interest in opening a PR to do that?
@gjoseph92 The PR for the fix you describe above has been created here: #209
FYI @RSchueder, this is now released in 0.5.0. Thanks for your contribution!
This PR enables users to set `rescale=True` in `stackstac.stack` and have the `scale` and `offset` values be taken from the STAC Item, with priority over the COG.
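A hedged usage sketch (assuming `items` is a list of STAC items, e.g. from a pystac-client search; the asset name and dtype are illustrative):

```python
import numpy as np
import stackstac

# With rescale=True, the scale/offset from each asset's raster:bands
# STAC metadata are applied to the pixel values as data * scale + offset.
data = stackstac.stack(
    items,              # STAC items from a search (hypothetical here)
    assets=["lwir11"],  # illustrative asset ID
    rescale=True,
    dtype="float64",
    fill_value=np.nan,
)
```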