
Implement get_abs_max on BucketResampler #418

Merged

merged 13 commits from gerritholl:get-abs-max into pytroll:main on May 6, 2022

Conversation

gerritholl
Collaborator

@gerritholl gerritholl commented Feb 4, 2022

Implement get_abs_max on the BucketResampler class.

Add a (failing) unit test for the new (not yet implemented) get_abs_max
and get_abs_min methods on the BucketResampler class.
First attempt to implement get_abs_max for the bucket resampler.  It's
failing because ``da.where`` returns ``NotImplemented``.  I don't
understand the problem.
Fix the implementation for get_abs_max.

Remove the implementation and test for get_abs_min because it's harder to
implement and I don't need it.
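
A hedged sketch of that idea (not the actual pyresample code; the helper name is made up): given per-bucket max and min, keep whichever extreme has the larger magnitude.

```python
import dask.array as da

def _abs_max_from_min_max(bucket_min, bucket_max):
    """Per bucket, return the signed value with the largest absolute value (sketch)."""
    # keep the minimum where its magnitude exceeds the maximum's, otherwise keep the maximum
    return da.where(abs(bucket_min) > abs(bucket_max), bucket_min, bucket_max)

print(_abs_max_from_min_max(da.from_array([-5.0, -1.0]), da.from_array([1.0, 3.0])).compute())
# [-5.  3.]
```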
@gerritholl gerritholl marked this pull request as ready for review February 4, 2022 15:35
@codecov

codecov bot commented Feb 4, 2022

Codecov Report

Merging #418 (8520670) into main (70f7fbd) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #418   +/-   ##
=======================================
  Coverage   93.95%   93.96%           
=======================================
  Files          65       65           
  Lines       11269    11287   +18     
=======================================
+ Hits        10588    10606   +18     
  Misses        681      681           
Flag Coverage Δ
unittests 93.96% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pyresample/bucket/__init__.py 97.95% <100.00%> (+0.10%) ⬆️
pyresample/test/test_bucket.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 70f7fbd...8520670.

Remove some code that is unreachable and untested
@coveralls

coveralls commented Feb 4, 2022

Coverage Status

Coverage increased (+0.01%) to 93.791% when pulling 8520670 on gerritholl:get-abs-max into 70f7fbd on pytroll:main.

gerritholl added a commit to gerritholl/satpy that referenced this pull request Feb 4, 2022
Fix bug in get_abs_max that was getting the wrong longitudes.  Uses the
new method in the bucket resampler implemented at
pytroll/pyresample#418
Recover the unit test accidentally deleted in a previous commit.
@gerritholl
Collaborator Author

I don't understand why this fails. It works in practice.

@gerritholl
Collaborator Author

It also seems to be very slow ☹

Computing br.get_abs_max() for a full-disc image takes 20.7 seconds on my machine, and just constructing the dask graph (calling br.get_abs_max() without computing) takes another 24.9 seconds, so in total it takes around 45 seconds to get the absolute maximum for a full-disc image… not great.
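
As a generic, hedged illustration of the distinction being measured, i.e. time spent building the dask graph versus time spent computing it (the array and operation below are made up and are not the pyresample call):

```python
import time
import dask
import dask.array as da

arr = da.random.random((3712, 3712), chunks=(1024, 1024))

t0 = time.perf_counter()
graph = abs(arr).max()          # lazily builds the task graph only
t1 = time.perf_counter()
value = dask.compute(graph)     # actually executes the graph
t2 = time.perf_counter()
print(f"graph construction: {t1 - t0:.3f} s, compute: {t2 - t1:.3f} s")
```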

@djhoese
Member

djhoese commented Feb 7, 2022

I think this was @zxdawn's problem in the other PR. It is hard to get this stuff to perform well.

@zxdawn
Member

zxdawn commented Feb 7, 2022

@djhoese Yes, #368. That's due to the slow process of calculating the min/max of the grouped results. Maybe @dcherian has some ideas? He's developing flox, which supports dask for groupby().

@gerritholl
Collaborator Author

I'm using it in the parallax correction (pytroll/satpy#1904), which, as it stands now using .get_abs_max(), is too slow to be used operationally (but I still need to check whether that has other causes than .get_abs_max() being slow).

@zxdawn
Member

zxdawn commented Feb 7, 2022

@gerritholl Here's the question I asked on StackOverflow. The root cause is probably the slow pandas groupby(). You can test how much time it costs.

@dcherian

dcherian commented Feb 7, 2022

It's hard to see what's happening here, but you could try something like:

flox.groupby_reduce(data, by=???, expected_groups=((0,1,2,3),), isbin=True, func="nanmax")

Saying isbin=True treats expected_groups like bin edges. It also uses pd.cut whose defaults are different from np.digitize so be careful there.
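
A slightly expanded, runnable version of that one-liner, assuming flox is installed; the data and bin edges here are invented for illustration, and the same call also accepts dask arrays:

```python
import numpy as np
import flox

data = np.array([0.5, -1.5, 1.7, 2.2, 2.6])
coord = np.array([0.1, 0.9, 1.4, 2.5, 2.6])   # the "by" array we bin on

result, groups = flox.groupby_reduce(
    data,
    coord,
    expected_groups=((0, 1, 2, 3),),  # interpreted as bin edges because isbin=True
    isbin=True,
    func="nanmax",
)
print(groups)   # pd.cut-style intervals: (0, 1], (1, 2], (2, 3]
print(result)   # nanmax of the data values falling into each bin
```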

By default flox implements whatever dask.dataframe does for dask.array. So if that's working this should work too (and hopefully better).

  1. How is this data chunked relative to the binning? Are your chunksizes big in space, small in time, and you're binning in space?
  2. Also how many bins are in a single chunk?
  3. The first step is a blockwise groupby-reduction. Do you expect this to significantly reduce chunksize given the number of bins, and distribution of bins in each chunk?

A minimal example would be useful. I've been trying to collect problematic use cases and eventually hope to write some "user story"-style blog posts so we can do some community-wide brainstorming.

The really cool thing about flox here is you can do "custom aggregations" which stuff a lot of work into a single task (you can do this with dask.dataframe.groupby too). So you can accumulate max and min at the same time (2x fewer tasks!) and do your funny where business in the finalize step. See https://flox.readthedocs.io/en/latest/custom.html for some documentation (but really it's best to just look at how things work in flox/aggregations.py).

@gerritholl
Collaborator Author

Thanks @dcherian for the reply! I don't have a minimal example yet. In my non-minimal real use case, I have 3712×3712 ≈ 13.7M bins. 99% of the bins have exactly one value. About 0.5% have 2 values and fewer than 0.1% have 3 or more values, with the largest number of values in the image being 5. Slightly less than 0.5% have 0 values. The chunks are much larger than the bins. I have no time dimension. My aim is to get the largest absolute value in each bin, or nan for bins without any values.
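
A toy numpy illustration of that aim (made-up sample data; the real arrays are 2-D and dask-backed):

```python
import numpy as np

data = np.array([1.0, -5.0, 2.0, 3.0, -1.0])
bins = np.array([0, 0, 1, 2, 2])      # bucket index of each sample
n_bins = 4                            # bucket 3 receives no samples

maxes = np.full(n_bins, -np.inf)
mins = np.full(n_bins, np.inf)
np.maximum.at(maxes, bins, data)      # per-bucket maximum
np.minimum.at(mins, bins, data)       # per-bucket minimum

# keep whichever extreme has the larger magnitude; NaN where a bucket is empty
abs_max = np.where(np.abs(mins) > np.abs(maxes), mins, maxes)
abs_max[~np.isfinite(abs_max)] = np.nan
print(abs_max)                        # [-5.  2.  3. nan]
```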

I suspect that my current approach is wildly suboptimal, regardless of using dask or not. For the 99% I should just keep the single value, and only the bins with 2 or more values should need any calculations.

I should certainly look at flox. Unfortunately, I'm neither very well versed in dask nor in pandas, so I'm not sure yet how to approach optimising my implementation (and at the same time keep it flexible and generic enough for it to fit in pyresample).

@dcherian

dcherian commented Feb 8, 2022

Thanks @gerritholl. What are your chunk sizes, then?

My aim is to get the largest absolute value in each bin, or nan for bins without any values.

It sounds like this is mostly reindexing and a tiny amount of groupby-reduction. Are the lat, lon for the input image stored as numpy arrays, i.e. do you know ahead of time which x, y pixels belong to which bin?

I'm guessing these are dask arrays (based on the documentation for bucketing). In that case, any general solution (like the one in flox) will effectively rechunk to one chunk along x,y for the output grid. Can you start with that chunking scheme instead?

I should certainly look at flox.

It's not worth it right now; grouping by multiple dask variables has not been implemented yet.

It would help if you provided some example code; I can take a look next week, but no promises. I'd need the array to group, the arrays we are grouping by (lat, lon, I'm guessing), and the output grid. It's OK if the stuff is being read from a file.

@gerritholl
Collaborator Author

Each of the arrays for data, lat, and lon has shape (3712, 3712) and chunk size (1024, 1024), all with dtype float64.
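
A hedged illustration of the single-chunk suggestion from the previous comment, applied to arrays of this shape (the variable names are invented):

```python
import dask.array as da

data = da.zeros((3712, 3712), chunks=(1024, 1024), dtype="float64")

# one chunk along both spatial dimensions, so a later groupby-reduction over
# the output grid does not have to rechunk
data_single_chunk = data.rechunk({0: -1, 1: -1})
print(data_single_chunk.chunks)   # ((3712,), (3712,))
```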

I really appreciate your offer for assistance! I will work on some example code.

If I increase max_computes in the unit test, the test passes
@gerritholl
Collaborator Author

Likely related to the slowness is the observation that the unit test passes only with CustomScheduler(max_computes=3), not with 2 or 1.

@gerritholl
Collaborator Author

Likely related to the slowness is the observation that the unit test passes only with CustomScheduler(max_computes=3), not with 2 or 1.

I suppose that ideally, they should pass with CustomScheduler(max_computes=0), but that's not the case for .get_max() or .get_min() either.

Seeing that the slowness is probably a consequence of .get_max() and .get_min() being slow (they share almost all code between them), which may be due to the premature calculations, is this something I should fix here, or may I defer it to another PR (such as #368)?
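
For context, a minimal sketch of a compute-counting scheduler of the kind the test uses (an assumed mechanism, not the actual CustomScheduler implementation in pyresample):

```python
import dask
import dask.array as da

class MaxComputeScheduler:
    """Fail if dask computes more often than allowed (illustrative sketch only)."""

    def __init__(self, max_computes=0):
        self.max_computes = max_computes
        self.computes = 0

    def __call__(self, dsk, keys, **kwargs):
        self.computes += 1
        if self.computes > self.max_computes:
            raise RuntimeError(f"Too many computes: {self.computes} > {self.max_computes}")
        return dask.get(dsk, keys, **kwargs)

arr = da.arange(10, chunks=5)
with dask.config.set(scheduler=MaxComputeScheduler(max_computes=1)):
    arr.sum().compute()   # the first compute is allowed; a second one here would raise
```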

@djhoese
Member

djhoese commented Feb 10, 2022

I think that's up to @pnuu or @zxdawn and you to agree on. I'd be OK with a future PR, but would also be OK if it was part of this PR. And yes, ideally there would be zero computes inside the resampler. If there have to be any, then we at least want to make them intentional and retained, so they only have to happen a minimal number of times and preferably don't require the user to compute the same tasks again when they compute their final result.

@zxdawn zxdawn mentioned this pull request Feb 10, 2022
@gerritholl gerritholl changed the title Implement get_abs_max and get_abs_min on BucketResampler Implement get_abs_max on BucketResampler Apr 6, 2022
@gerritholl gerritholl requested review from pnuu and djhoese April 6, 2022 11:16
@gerritholl gerritholl requested a review from mraspaud April 6, 2022 11:16
@gerritholl
Collaborator Author

gerritholl commented Apr 6, 2022

I would like to get this PR merged, so that the tests on pytroll/satpy#1904 can pass and a first version of parallax correction might be merged into Satpy main. If this PR gets merged, I will open a pyresample issue noting that:

  • get_abs_max should work with 0 computes, and (related)
  • get_abs_max should be faster.

For both, it would be very helpful to complete #368.

@djhoese
Member

djhoese commented Apr 6, 2022

@zxdawn do you have time to give a formal review of this PR?

@zxdawn
Member

zxdawn commented Apr 6, 2022

@djhoese sorry, I'm on vacation in Iceland and will be back on 10 April. If I find any free time, I will check it.

@djhoese
Member

djhoese commented Apr 6, 2022

I'm on vacation

Get off of GitHub!

@zxdawn
Member

zxdawn commented Apr 12, 2022

@gerritholl Just one small comment: When I see absolute max, the maximum value of absolute values comes to my mind. Actually, your function gets the original value which has the largest absolute value. Maybe it's better to clarify it in the function?
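
A small invented example of the distinction:

```python
import numpy as np

vals = np.array([2.0, -5.0])

print(np.max(np.abs(vals)))             # 5.0  -> the maximum of the absolute values
print(vals[np.argmax(np.abs(vals))])    # -5.0 -> the original (signed) value with
                                        #         the largest absolute value
```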

@gerritholl
Collaborator Author

@gerritholl Just one small comment: When I see absolute max, the maximum value of absolute values comes to my mind. Actually, your function gets the original value which has the largest absolute value. Maybe it's better to clarify it in the function?

Right! Clarified in the method docstring.

Rename min/max local variables so as not to shadow builtins
Change max_computes from 3 to 0 in get_abs_max test
@gerritholl
Collaborator Author

The latest tests pass if and only if the latest version of #368 is merged first.

Member

@sfinkens sfinkens left a comment


Very useful, thanks a lot for adding this! Just a few questions.

Review comments (all resolved) on pyresample/test/test_bucket.py and pyresample/bucket/__init__.py.
In the bucket resampler get_abs_max method, add the missing fill_value
to the method documentation.
Refactor the BucketResampler.get_abs_max method.  Move the calculation
of the abs_max from min and max to its own (pseudoprivate) method.
@gerritholl
Collaborator Author

gerritholl commented May 4, 2022

The test passes only with #368, due to the requirement that there be no calculations, so it would be better to merge #368 first.

Member

@sfinkens sfinkens left a comment


LGTM

Member

@pnuu pnuu left a comment


LGTM, just needs #368 to be merged so the tests can complete.

Member

@djhoese djhoese left a comment


I assume that in the future the get_max and get_min logic could be combined to allow for more optimized performance here, but for now I think this looks pretty good.

@djhoese djhoese merged commit 0cb8914 into pytroll:main May 6, 2022
Development

Successfully merging this pull request may close these issues.

Add get_abs_max (and get_abs_min) to BucketResampler
8 participants