
Implement get_abs_max on BucketResampler #418

Merged

merged 13 commits from gerritholl:get-abs-max into pytroll:main on May 6, 2022

Conversation

gerritholl
Collaborator

@gerritholl gerritholl commented Feb 4, 2022

Implement get_abs_max on the BucketResampler class.

Add a (failing) unit test for the new (not yet implemented) get_abs_max
and get_abs_min methods on the BucketResampler class.
First attempt to implement get_abs_max for the bucket resampler.  It's
failing because ``da.where`` returns ``NotImplemented``.  I don't
understand the problem.
Fix the implementation for get_abs_max.

Remove the implementation and test for get_abs_min because it's harder to
implement and I don't need it.
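
A hedged sketch of that idea (not the actual pyresample code; the helper name is made up): given per-bucket max and min, keep whichever extreme has the larger magnitude.

```python
import dask.array as da

def _abs_max_from_min_max(bucket_min, bucket_max):
    """Per bucket, return the signed value with the largest absolute value (sketch)."""
    # keep the minimum where its magnitude exceeds the maximum's, otherwise keep the maximum
    return da.where(abs(bucket_min) > abs(bucket_max), bucket_min, bucket_max)

print(_abs_max_from_min_max(da.from_array([-5.0, -1.0]), da.from_array([1.0, 3.0])).compute())
# [-5.  3.]
```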
@gerritholl gerritholl marked this pull request as ready for review February 4, 2022 15:35
@codecov

codecov bot commented Feb 4, 2022

Codecov Report

Merging #418 (8520670) into main (70f7fbd) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #418   +/-   ##
=======================================
  Coverage   93.95%   93.96%           
=======================================
  Files          65       65           
  Lines       11269    11287   +18     
=======================================
+ Hits        10588    10606   +18     
  Misses        681      681           
Flag Coverage Δ
unittests 93.96% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pyresample/bucket/__init__.py 97.95% <100.00%> (+0.10%) ⬆️
pyresample/test/test_bucket.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 70f7fbd...8520670.

Remove some code that is unreachable and untested
@coveralls

coveralls commented Feb 4, 2022

Coverage Status

Coverage increased (+0.01%) to 93.791% when pulling 8520670 on gerritholl:get-abs-max into 70f7fbd on pytroll:main.

gerritholl added a commit to gerritholl/satpy that referenced this pull request Feb 4, 2022
Fix bug in get_abs_max that was getting the wrong longitudes.  Uses the
new method in the bucket resampler implemented at
pytroll/pyresample#418
Recover the unit test accidentally deleted in a previous commit.
@gerritholl
Collaborator Author

I don't understand why this fails. It works in practice.

@gerritholl
Collaborator Author

It also seems to be very slow ☹

Computing br.get_abs_max() for a full-disc image takes 20.7 seconds on my machine, and just constructing the dask graph (calling br.get_abs_max() without computing) takes another 24.9 seconds, so in total it takes around 45 seconds to get the absolute maximum for a full-disc image… not great.
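
As a generic, hedged illustration of the distinction being measured, i.e. time spent building the dask graph versus time spent computing it (the array and operation below are made up and are not the pyresample call):

```python
import time
import dask
import dask.array as da

arr = da.random.random((3712, 3712), chunks=(1024, 1024))

t0 = time.perf_counter()
graph = abs(arr).max()          # lazily builds the task graph only
t1 = time.perf_counter()
value = dask.compute(graph)     # actually executes the graph
t2 = time.perf_counter()
print(f"graph construction: {t1 - t0:.3f} s, compute: {t2 - t1:.3f} s")
```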

@djhoese
Member

djhoese commented Feb 7, 2022

I think this was @zxdawn's problem in the other PR. It is hard to get this stuff to perform well.

@zxdawn
Member

zxdawn commented Feb 7, 2022

@djhoese Yes, #368. That's due to the slow process of calculating the min/max of the grouped results. Maybe @dcherian has some ideas? He's developing flox, which supports dask for groupby().

@gerritholl
Collaborator Author

I'm using it in the parallax correction (pytroll/satpy#1904), which, as it stands now using .get_abs_max(), is too slow to be used operationally (but I still need to check whether that has other causes than .get_abs_max() being slow).

@zxdawn
Member

zxdawn commented Feb 7, 2022

@gerritholl Here's the question I asked on StackOverflow. The root cause is probably the slow pandas groupby(). You can test how much time it costs.

@dcherian

dcherian commented Feb 7, 2022

It's hard to see what's happening here, but you could try something like:

flox.groupby_reduce(data, by=???, expected_groups=((0,1,2,3),), isbin=True, func="nanmax")

Saying isbin=True treats expected_groups like bin edges. It also uses pd.cut whose defaults are different from np.digitize so be careful there.
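
A slightly expanded, runnable version of that one-liner, assuming flox is installed; the data and bin edges here are invented for illustration, and the same call also accepts dask arrays:

```python
import numpy as np
import flox

data = np.array([0.5, -1.5, 1.7, 2.2, 2.6])
coord = np.array([0.1, 0.9, 1.4, 2.5, 2.6])   # the "by" array we bin on

result, groups = flox.groupby_reduce(
    data,
    coord,
    expected_groups=((0, 1, 2, 3),),  # interpreted as bin edges because isbin=True
    isbin=True,
    func="nanmax",
)
print(groups)   # pd.cut-style intervals: (0, 1], (1, 2], (2, 3]
print(result)   # nanmax of the data values falling into each bin
```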

By default flox implements whatever dask.dataframe does for dask.array. So if that's working this should work too (and hopefully better).

  1. How is this data chunked relative to the binning? Are your chunksizes big in space, small in time, and you're binning in space?
  2. Also how many bins are in a single chunk?
  3. The first step is a blockwise groupby-reduction. Do you expect this to significantly reduce chunksize given the number of bins, and distribution of bins in each chunk?

A minimal example would be useful. I've been trying to collect problematic use cases and eventually hope to write some "user story"-style blog posts so we can do some community-wide brainstorming.

The really cool thing about flox here is you can do "custom aggregations" which stuff a lot of work into a single task (you can do this with dask.dataframe.groupby too). So you can accumulate max and min at the same time (2x fewer tasks!) and do your funny where business in the finalize step. See https://flox.readthedocs.io/en/latest/custom.html for some documentation (but really it's best to just look at how things work in flox/aggregations.py).

@gerritholl
Collaborator Author

Thanks @dcherian for the reply! I don't have a minimal example yet. In my non-minimal real use case, I have 3712×3712 ≈ 13.7M bins. 99% of the bins have exactly one value. About 0.5% have 2 values and fewer than 0.1% have 3 or more values, with the largest number of values in the image being 5. Slightly less than 0.5% have 0 values. The chunks are much larger than the bins. I have no time dimension. My aim is to get the largest absolute value in each bin, or nan for bins without any values.
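
A toy numpy illustration of that aim (made-up sample data; the real arrays are 2-D and dask-backed):

```python
import numpy as np

data = np.array([1.0, -5.0, 2.0, 3.0, -1.0])
bins = np.array([0, 0, 1, 2, 2])      # bucket index of each sample
n_bins = 4                            # bucket 3 receives no samples

maxes = np.full(n_bins, -np.inf)
mins = np.full(n_bins, np.inf)
np.maximum.at(maxes, bins, data)      # per-bucket maximum
np.minimum.at(mins, bins, data)       # per-bucket minimum

# keep whichever extreme has the larger magnitude; NaN where a bucket is empty
abs_max = np.where(np.abs(mins) > np.abs(maxes), mins, maxes)
abs_max[~np.isfinite(abs_max)] = np.nan
print(abs_max)                        # [-5.  2.  3. nan]
```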

I suspect that my current approach is wildly suboptimal, regardless of using dask or not. For the 99% I should just keep the single value, and only the bins with 2 or more values should need any calculations.

I should certainly look at flox. Unfortunately, I'm neither very well versed in dask nor in pandas, so I'm not sure yet how to approach optimising my implementation (and at the same time keep it flexible and generic enough for it to fit in pyresample).

@dcherian

dcherian commented Feb 8, 2022

Thanks @gerritholl. What are your chunk sizes, then?

My aim is to get the largest absolute value in each bin, or nan for bins without any values.

It sounds like this is mostly reindexing and a tiny amount of groupby-reduction. Are the lat, lon for the input image stored as numpy arrays, i.e. do you know ahead of time which x, y pixels belong to which bin?

I'm guessing these are dask arrays (based on the documentation for bucketing). In that case, any general solution (like the one in flox) will effectively rechunk to one chunk along x,y for the output grid. Can you start with that chunking scheme instead?

I should certainly look at flox.

It's not worth it right now; grouping by multiple dask variables has not been implemented yet.

It would help if you provided some example code; I can take a look next week, but no promises. I'd need the array to group, the arrays we are grouping by (lat, lon, I'm guessing), and the output grid. It's OK if the stuff is being read from a file.

@gerritholl
Collaborator Author

Each of the arrays for data, lat, and lon has shape (3712, 3712) and chunk size (1024, 1024), all with dtype float64.
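
A hedged illustration of the single-chunk suggestion from the previous comment, applied to arrays of this shape (the variable names are invented):

```python
import dask.array as da

data = da.zeros((3712, 3712), chunks=(1024, 1024), dtype="float64")

# one chunk along both spatial dimensions, so a later groupby-reduction over
# the output grid does not have to rechunk
data_single_chunk = data.rechunk({0: -1, 1: -1})
print(data_single_chunk.chunks)   # ((3712,), (3712,))
```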

I really appreciate your offer for assistance! I will work on some example code.

If I increase max_computes in the unit test, the test passes
@gerritholl
Collaborator Author

Likely related to the slowness is the observation that the unit test passes only with CustomScheduler(max_computes=3), not with 2 or 1.

@gerritholl
Collaborator Author

Likely related to the slowness is the observation that the unit test passes only with CustomScheduler(max_computes=3), not with 2 or 1.

I suppose that ideally, they should pass with CustomScheduler(max_computes=0), but that's not the case for .get_max() or .get_min() either.

Seeing that the slowness is probably a consequence of .get_max() and .get_min() being slow (they share almost all code between them), which may be due to the premature calculations, is this something I should fix here, or may I defer it to another PR (such as #368)?
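
For context, a minimal sketch of a compute-counting scheduler of the kind the test uses (an assumed mechanism, not the actual CustomScheduler implementation in pyresample):

```python
import dask
import dask.array as da

class MaxComputeScheduler:
    """Fail if dask computes more often than allowed (illustrative sketch only)."""

    def __init__(self, max_computes=0):
        self.max_computes = max_computes
        self.computes = 0

    def __call__(self, dsk, keys, **kwargs):
        self.computes += 1
        if self.computes > self.max_computes:
            raise RuntimeError(f"Too many computes: {self.computes} > {self.max_computes}")
        return dask.get(dsk, keys, **kwargs)

arr = da.arange(10, chunks=5)
with dask.config.set(scheduler=MaxComputeScheduler(max_computes=1)):
    arr.sum().compute()   # the first compute is allowed; a second one here would raise
```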

@djhoese
Member

djhoese commented Feb 10, 2022

I think that's up to @pnuu or @zxdawn and you to agree on. I'd be OK with a future PR, but would also be OK if it was part of this PR. And yes, ideally there would be zero computes inside the resampler. If there have to be any, then we at least want to make them intentional and retained, so they only have to happen a minimal number of times and preferably don't require the user to compute the same tasks again when they compute their final result.

@zxdawn zxdawn mentioned this pull request Feb 10, 2022
@gerritholl gerritholl changed the title Implement get_abs_max and get_abs_min on BucketResampler Implement get_abs_max on BucketResampler Apr 6, 2022
@gerritholl gerritholl requested review from pnuu and djhoese April 6, 2022 11:16
@gerritholl gerritholl requested a review from mraspaud April 6, 2022 11:16
@gerritholl
Collaborator Author

gerritholl commented Apr 6, 2022

I would like to get this PR merged, so that the tests on pytroll/satpy#1904 can pass and a first version of parallax correction might be merged into Satpy main. If this PR gets merged, I will open a pyresample issue noting that:

  • get_abs_max should work with 0 computes, and (related)
  • get_abs_max should be faster.

For both, it would be very helpful to complete #368.

@djhoese
Member

djhoese commented Apr 6, 2022

@zxdawn do you have time to give a formal review of this PR?

@zxdawn
Member

zxdawn commented Apr 6, 2022

@djhoese sorry, I'm on vacation in Iceland and will be back on 10 April. If I find any free time, I will check it.

@djhoese
Member

djhoese commented Apr 6, 2022

I'm on vacation

Get off of GitHub!

@zxdawn
Member

zxdawn commented Apr 12, 2022

@gerritholl Just one small comment: When I see absolute max, the maximum value of absolute values comes to my mind. Actually, your function gets the original value which has the largest absolute value. Maybe it's better to clarify it in the function?
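
A small invented example of the distinction:

```python
import numpy as np

vals = np.array([2.0, -5.0])

print(np.max(np.abs(vals)))             # 5.0  -> the maximum of the absolute values
print(vals[np.argmax(np.abs(vals))])    # -5.0 -> the original (signed) value with
                                        #         the largest absolute value
```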

@gerritholl
Collaborator Author

@gerritholl Just one small comment: When I see absolute max, the maximum value of absolute values comes to my mind. Actually, your function gets the original value which has the largest absolute value. Maybe it's better to clarify it in the function?

Right! Clarified in the method docstring.

Rename min/max local variables so as not to shadow builtins
Change max_computes from 3 to 0 in get_abs_max test
@gerritholl
Collaborator Author

The latest tests pass if and only if the latest version of #368 is merged first.

Member

@sfinkens sfinkens left a comment


Very useful, thanks a lot for adding this! Just a few questions.

Review comments (all resolved) on pyresample/test/test_bucket.py and pyresample/bucket/__init__.py.
In the bucket resampler get_abs_max method, add the missing fill_value
to the method documentation.
Refactor the BucketResampler.get_abs_max method.  Move the calculation
of the abs_max from min and max to its own (pseudoprivate) method.
@gerritholl
Collaborator Author

gerritholl commented May 4, 2022

The test passes only with #368, due to the requirement that there be no calculations, so it would be better to merge #368 first.

Member

@sfinkens sfinkens left a comment


LGTM

Member

@pnuu pnuu left a comment


LGTM, just needs #368 to be merged so the tests can complete.

Member

@djhoese djhoese left a comment


I assume that in the future the get_max and get_min logic could be combined to allow for more optimized performance here, but for now I think this looks pretty good.

@djhoese djhoese merged commit 0cb8914 into pytroll:main May 6, 2022
Development

Successfully merging this pull request may close these issues.

Add get_abs_max (and get_abs_min) to BucketResampler
8 participants