Implement get_abs_max on BucketResampler #418
Conversation
Add a (failing) unit test for the new (not yet implemented) get_abs_max and get_abs_min methods on the BucketResampler class.
First attempt to implement get_abs_max for the bucket resampler. It's failing because ``da.where`` returns ``NotImplemented``. I don't understand the problem.
Fix the implementation for get_abs_max. Remove implementation and test for get_abs_min because it's harder to implement and I don't need it.
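The trick behind the fixed implementation can be sketched in plain numpy (illustrative only; the real code operates on dask arrays and pyresample's bin indices): compute the per-bin minimum and maximum, then keep whichever extreme has the larger magnitude.

```python
import numpy as np

# Hypothetical small example: six values falling into four bins.
data = np.array([1.0, -5.0, 3.0, -2.0, 4.0, 0.5])
bins = np.array([0, 0, 1, 2, 2, 3])
n_bins = 4

vmax = np.full(n_bins, -np.inf)
vmin = np.full(n_bins, np.inf)
np.fmax.at(vmax, bins, data)  # per-bin maximum
np.fmin.at(vmin, bins, data)  # per-bin minimum

# Keep the signed value whose absolute value is largest; the real
# method would additionally replace empty bins with a fill value.
abs_max = np.where(-vmin > vmax, vmin, vmax)
# abs_max is now [-5., 3., 4., 0.5]
```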
Codecov Report

@@            Coverage Diff            @@
##              main     #418    +/-  ##
=========================================
  Coverage    93.95%   93.96%
=========================================
  Files           65       65
  Lines        11269    11287     +18
=========================================
+ Hits         10588    10606     +18
  Misses         681      681
Remove some code that is unreachable and untested
Fix bug in get_abs_max that was getting the wrong longitudes. Uses the new method in the bucket resampler implemented at pytroll/pyresample#418
Recover the unit test accidentally deleted in a previous commit.
I don't understand why this fails. It works in practice.
It also seems to be very slow ☹ Computing
I think this was @zxdawn's problem in the other PR. It is hard to get this stuff to perform well.
I'm using it in the parallax correction (pytroll/satpy#1904), which, as it stands now using
@gerritholl Here's the question I asked on StackOverflow. The root cause appears to be the slow pandas groupby() step. You can test how much time it costs.
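For context, the suspected bottleneck boils down to a pattern like this (a hypothetical, self-contained sketch of a per-bin reduction via pandas groupby; the data and bin counts are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data: values scattered into 100 bin indices.
rng = np.random.default_rng(42)
values = rng.normal(size=1_000)
bin_idx = rng.integers(0, 100, size=1_000)

# Per-bin maximum via pandas groupby -- simple to write,
# but reportedly slow when the number of bins reaches millions.
per_bin_max = pd.Series(values).groupby(bin_idx).max()
```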
It's hard to see what's happening here, but you could try something like ``flox.groupby_reduce(data, by=???, expected_groups=((0,1,2,3),), isbin=True, func="nanmax")``. Saying By default
A minimal example would be useful. I've been trying to collect problematic usecases and eventually hope to write some "user story" like blogposts so we can do some community-wide brainstorming. The really cool thing about
Thanks @dcherian for the reply! I don't have a minimal example yet. In my non-minimal real use case, I have 3712×3712 ≅ 13.7M bins. 99% of the bins have exactly one value. About 0.5% have 2 values and less than 0.1% have 3 or more values, with the largest number of values in the image being 5. Slightly less than 0.5% have 0 values. The chunks are much larger than the bins. I have no time dimension. My aim is to get the largest absolute value in each bin, or nan for bins without any values. I suspect that my current approach is wildly suboptimal, regardless of using dask or not. For the 99% I should just keep the single value, and only the ones with 2 or more should need any calculations. I should certainly look at flox. Unfortunately, I'm neither very well versed in dask nor in pandas, so I'm not sure yet how to approach optimising my implementation (and at the same time keep it flexible and generic enough for it to fit in pyresample).
Thanks @gerritholl! What are your chunk sizes, then?
It sounds like this is mostly reindexing and a tiny amount of groupby-reduction. Are the lat, lon for the input image stored as numpy arrays, i.e. do you know ahead of time which x, y pixels belong to which bin? I'm guessing these are dask arrays (based on the documentation for bucketing). In that case, any general solution (like the one in flox) will effectively rechunk to one chunk along x, y for the output grid. Can you start with that chunking scheme instead?
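A sketch of what "start with that chunking scheme" could look like, assuming the inputs are dask arrays (the shapes and chunk sizes here are illustrative):

```python
import dask.array as da

# Hypothetical input: a 2-D field chunked into many small blocks.
data = da.ones((3712, 3712), chunks=(512, 512))

# Rechunk to a single chunk along x/y before bucketing, so the
# groupby-reduction does not have to shuffle between many chunks.
data_single = data.rechunk((-1, -1))
print(data_single.numblocks)  # → (1, 1)
```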
It's not worth it right now: grouping by multiple dask variables has not been implemented yet. It would help if you provided some example code. I can take a look next week, but no promises. I'd need the array to group, the arrays that we are grouping by (lat, lon, I'm guessing), and the output grid. It's OK if the stuff is being read from a file.
Each of the arrays for data, lat, and lon has shape

I really appreciate your offer for assistance! I will work on some example code.
If I increase ``max_computes`` in the unit test, the test passes.
Likely related to the slowness is the observation that the unit test passes only with
I suppose that ideally, they should pass with

Seeing that the slowness is probably a consequence of
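For reference, a ``max_computes`` guard like the one discussed here can be implemented with a counting dask scheduler along these lines (a sketch; the class name and wiring are illustrative, modelled on the custom-scheduler pattern used in pytroll test utilities):

```python
import dask
import dask.array as da


class CountingScheduler:
    """Raise if dask computes more than max_computes times (illustrative)."""

    def __init__(self, max_computes):
        self.max_computes = max_computes
        self.total_computes = 0

    def __call__(self, dsk, keys, **kwargs):
        self.total_computes += 1
        if self.total_computes > self.max_computes:
            raise RuntimeError(
                f"Too many computes: {self.total_computes} > {self.max_computes}")
        # Delegate the actual work to the synchronous scheduler.
        return dask.get(dsk, keys, **kwargs)


# Any compute inside this block beyond the first will raise RuntimeError.
with dask.config.set(scheduler=CountingScheduler(max_computes=1)):
    da.ones(5).sum().compute()
```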
I think that's up to @pnuu or @zxdawn and you to agree on. I'd be ok with a future PR, but would also be OK if it was part of this PR. And yes, ideally there would be zero computes inside the resampler. If there have to be, then we at least want to make them intentional and retained so they only have to happen a minimal number of times and preferably don't require the user to compute the same tasks again when they compute their final result.
I would like to get this PR merged, so that the tests on pytroll/satpy#1904 can pass and a first version of parallax correction might be merged into Satpy main. If this PR gets merged, I will open a pyresample issue noting that:
For both, it would be very helpful to complete #368.
@zxdawn do you have time to give a formal review of this PR?
@djhoese Sorry, I'm on vacation in Iceland and will be back on 10th April. If I find any free time, I will check it.
@gerritholl Just one small comment: when I see "absolute max", the maximum of the absolute values comes to mind. Actually, your function returns the original value that has the largest absolute value. Maybe it's better to clarify this in the function?
Right! Clarified in the method docstring.
Rename min/max local variables so as not to shadow builtins
Change max-computes from 3 to 0 in get_abs_max test
The latest tests pass if and only if the latest version of #368 is merged first.
Very useful, thanks a lot for adding this! Just a few questions.
In the bucket resampler get_abs_max method, add the missing fill_value to the method documentation.
Refactor the BucketResampler.get_abs_max method. Move the calculation of the abs_max from min and max to its own (pseudoprivate) method.
LGTM
LGTM, just needs #368 to be merged so the tests can complete.
I assume in the future the ``get_max`` and ``get_min`` logic could be combined to allow for more optimized performance on this, but for now I think this looks pretty good.
Implement get_abs_max on the BucketResampler class.
``git diff origin/main **/*py | flake8 --diff``