Refactor bilinear #300

Merged
merged 96 commits into from
Oct 18, 2020

pnuu
Member

@pnuu pnuu commented Sep 9, 2020

This is a big refactoring of the bilinear interpolation. Most of the bits are also re-used in the legacy Numpy version, but I didn't want to spend too much time pulling out all the parts of pyresample.bilinear.get_sample_from_bil_info().

There will be some renaming and such that can be improved, but I've run out of steam for now 😅 Suggestions are welcome!

This refactoring also brings some performance improvements when using pre-computed resampling info. With the script below, using Satpy, I got the following timings:

EDIT: Timings updated for latest Satpy master (Oct-7 2020) with minimal other load on the laptop.

Pyresample master branch

  • initial run (generate=False): 3 m 14 s (previously 3 m 28 s)
  • initial run (generate=True): 4 m 20 s (previously 4 m 42 s)
  • reusing the cached resampling data (generate=False): 18 s
  • reusing the cached resampling data (generate=True): 1 m 21 s (previously 1 m 25 s)

This PR:

  • initial run (generate=False): 1 m 36 s (previously 2 m 50 s)
  • initial run (generate=True): 2 m 14 s (previously 3 m 30 s)
  • reusing the cached resampling data (generate=False): 15 s (previously 16 s)
  • reusing the cached resampling data (generate=True): 45 s (previously 48 s)
#!/usr/bin/env python

import os
os.environ['DASK_NUM_WORKERS'] = '2'
os.environ['OMP_NUM_THREADS'] = '1'

import glob
from satpy import Scene

def main():
    fnames = glob.glob('/home/lahtinep/data/satellite/geo/msg/*201611281100*__')

    glbl = Scene(reader='seviri_l1b_hrit', filenames=fnames)
    glbl.load(["natural_color", "fog", "overview",
               "hrv_clouds", "hrv_fog", "convection"],
              generate=True)
#              generate=False)
    lcl = glbl.resample('euro4', resampler="bilinear", cache_dir="/tmp", reduce_data=False)
    lcl.save_datasets(base_dir='/tmp')


if __name__ == "__main__":
    main()

@pnuu
Member Author

pnuu commented Sep 10, 2020

I'm going to ignore the DeepCode errors/warnings. The accesses to private attributes are within the tests and in deprecated functions (resample_bilinear(), get_bil_info() and get_sample_from_bil_info()), for which I don't want to change the names of the attributes.

The resample_bilinear() function should be replaced with a .resample() convenience method in a later PR.

@pnuu
Member Author

pnuu commented Sep 14, 2020

Any other suggestions?

@djhoese made a PR for Satpy (pytroll/satpy#1361) so that Windows testing would happen in Travis instead of Appveyor. Should we make a similar change to Pyresample as well, merge that, rebase/merge it into this PR and check that the Windows tests also pass?

@djhoese
Member

djhoese commented Sep 14, 2020

How backwards compatible is this PR?

I would wait to turn off appveyor. Plus the Azure builds are failing for some other reason. Not sure why yet.

@pnuu
Member Author

pnuu commented Sep 14, 2020

This is fully backwards compatible, including both Numpy and XArray versions.

And again I mixed up Azure and Appveyor 🙄 And yeah, I didn't figure out what's wrong with the Azure builds.

@pnuu
Member Author

pnuu commented Sep 14, 2020

Looking at the logs again, it seems that Python 3.9 fails to install pyproj because the proj executable isn't available:

created virtual environment CPython3.9.0.candidate.1-64 in 430ms
  creator CPython3Posix(dest=/tmp/tmp.uuC6OgfBeC/venv, clear=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==20.2.2, setuptools==49.6.0, wheel==0.35.1
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
    + pip install /tmp/cibuildwheel/repaired_wheel/pyresample-1.16.0+96.g6e01eaa-cp39-cp39-manylinux2010_x86_64.whl
  ERROR: Command errored out with exit status 1:
   command: /tmp/tmp.uuC6OgfBeC/venv/bin/python /tmp/tmp.uuC6OgfBeC/venv/lib/python3.9/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpe3vvkoco
       cwd: /tmp/pip-install-vgtgu41m/pyproj
  Complete output (1 lines):
  proj executable not found. Please set the PROJ_DIR variable.For more information see: https://pyproj4.github.io/pyproj/stable/installation.html

@djhoese
Member

djhoese commented Sep 14, 2020

I think I posted it on slack but did you ever turn off python 3.9? Check the skip variable at the top of the azure config and add an entry for the cp39 equivalent.

@pnuu
Member Author

pnuu commented Sep 14, 2020

Oops, either didn't notice or completely forgot. Trying now.

@pnuu
Member Author

pnuu commented Sep 15, 2020

With these two additional da.compute() calls the overall compute calls drop from 74 to 33 for the initial run of the above example script. At the same time, the processing time drops to around 2 m 35 s, 15 s less than earlier.

@pnuu
Member Author

pnuu commented Oct 7, 2020

I ran the timings again with the current Satpy master branch and minimal load on the laptop. The updated timings are in the description.

Member

@djhoese djhoese left a comment

Nice job. It looks easier to follow and I like the documentation. I had a lot of suggestions for renaming and reordering, but mostly I felt like some of the refactoring went too far. A lot of the class methods seem unnecessary or don't seem to do that much. With all the class instances this may just appear this way, but I'm wondering if this can be avoided.

On a larger note, what do you and @mraspaud think about avoiding putting a lot of code in __init__.py modules? Could a lot of the stuff be put in bilinear/base.py and/or bilinear/npy.py?

@@ -268,10 +268,74 @@ Click images to see the full resolution versions.

The *perceived* sharpness of the bottom image is lower, but there is more detail present.


XArrayResamplerBilinear
Member

I personally think Resampler should be the last word: XArrayBilinearResampler

Member

Also, while I think this section is fine here for now, I think we need to refactor the documentation. This document is getting really long.

Member Author

I agree. This is the name Satpy currently expects, so for backwards compatibility it needs to be like this for now. I can change this in the follow-up changes I already have waiting (not yet PR'd) for both Pyresample and Satpy.

***********************

**bilinear.XArrayResamplerBilinear** is a class that handles bilinear interpolation for data in
`xarray.DataArray` arrays. The parallelisation is done automatically using `dask`.
Member

US English would be parallelization. 😉

Member Author

Updated.

>>> result = resampler.get_sample_from_bil_info(data)


NumpyResamplerBilinear
Member

Same here for the name, Resampler should be last in my opinion. Willing to debate it.

Member

Thoughts on swapping these sections around? Numpy first then xarray? The tests could even use the same lons/lats/data from the numpy section (I think doctest lets you do that).

Member Author

I'll update the naming in the follow-ups. The Xarray version is preferred for performance, so thought it should come first.
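
As a side note on the doctest question above, here is a minimal sketch of the behaviour. Examples that doctest parses as one test (a single docstring or text file) share one namespace, so names defined in an earlier section stay visible in a later one. The section titles and the `lons`/`lats` values are illustrative only and are not taken from the Pyresample documentation.

```python
import doctest

# Illustrative text standing in for a documentation page with two sections.
DOC = """
Numpy section
>>> lons = [10.0, 20.0, 30.0]
>>> lats = [60.0, 61.0, 62.0]

Xarray section, reusing the names defined above
>>> len(lons) == len(lats)
True
"""

parser = doctest.DocTestParser()
# All examples in DOC become one DocTest and therefore share one globals dict.
test = parser.get_doctest(DOC, {}, "shared_names", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

If the later section could not see `lons`, the last example would fail with a NameError and be counted in `results.failed`.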

>>> source_def = geometry.SwathDefinition(lons=lons, lats=lats)
>>> resampler = XArrayResamplerBilinear(source_def, target_def, 30e3)
>>> resampler.get_bil_info()
>>> result = resampler.get_sample_from_bil_info(data)
Member

get_sample_from_bil_info seems odd given that you don't actually give it bil_info. Why can't get_bil_info be called automatically if it hasn't been already?

Member Author

The separate get_bil_info() step is necessary when caching the resampling info. In the follow-ups I move the caching from Satpy to Pyresample, so with that the caching step can be shown here. In another follow-up I'll add resampler.resample() method that wraps resampler.get_bil_info() and resampler.get_sample_from_bil_info() together. It could also have the cache_dir kwarg so that there'd be no need to call it separately.
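
A rough sketch of what such a wrapper could look like, under stated assumptions: `ToyBilinearResampler` is a hypothetical stand-in and not the Pyresample class, the "interpolation" is a placeholder, and the `cache_dir` handling mentioned above is left out. It only shows `resample()` calling `get_bil_info()` on first use.

```python
class ToyBilinearResampler:
    """Hypothetical stand-in used for illustration; not the Pyresample class."""

    def __init__(self):
        self._bil_info = None
        self.info_calls = 0

    def get_bil_info(self):
        # Placeholder for the expensive neighbour search and weight calculation.
        self.info_calls += 1
        self._bil_info = {"weight": 0.5}

    def get_sample_from_bil_info(self, data):
        w = self._bil_info["weight"]
        # Placeholder for the interpolation: blend each value with the next one.
        return [(1 - w) * a + w * b for a, b in zip(data, data[1:])]

    def resample(self, data):
        # Compute the resampling info only if it has not been computed yet.
        if self._bil_info is None:
            self.get_bil_info()
        return self.get_sample_from_bil_info(data)


resampler = ToyBilinearResampler()
first = resampler.resample([0.0, 2.0, 4.0])
second = resampler.resample([10.0, 20.0, 30.0])
```

The second `resample()` call reuses the stored info, so `get_bil_info()` runs only once, while the two separate methods remain available for callers that cache the info themselves.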

- Function for resampling using bilinear interpolation for irregular source grids.
+ Convenience function for resampling using bilinear interpolation for irregular source grids.

..note:
Member

Does this syntax work? I think you need a space after .. and two colons :: and a blank line after.

Member Author

No it doesn't. Fixed.
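
For illustration, a docstring that follows the form suggested in the review comment (space after the two dots, two colons, blank line, indented body). The function name is hypothetical and the note text is a placeholder, not the wording used in the PR.

```python
def resample_bilinear_doc_example():
    """Convenience function for resampling using bilinear interpolation.

    .. note::

        Placeholder note text. The directive has a space after the two
        dots, two colons after its name, and an indented body.
    """


# The docstring is plain text until a reST processor such as Sphinx renders it.
docstring = resample_bilinear_doc_example.__doc__
```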

if self._resample_kdtree is None:
    return

self._get_target_lonlats()
Member

This chunk of code seems odd. Calling a series of methods sounds like it should maybe be some other method? Or even better, do these methods need to be creating/storing data in instance attributes? Could they instead return the values they are generating?

Member Author

Some of the instance attributes are re-used at least once, and the computations are not cheap (not sure how well Dask can handle them), so I think this is necessary. I'll move the single-use ones to where they are needed and remove the instance attributes.

Member Author

@pnuu pnuu Oct 9, 2020

self._valid_output_index attribute removed, it's the only one that wasn't re-used and/or needed for caching (public attributes).
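
One way to picture the trade-off discussed in this thread is sketched below. This is not how the PR implements it: the PR keeps plain instance attributes, while the sketch swaps in `functools.cached_property` for the reused value. The class, method names and values are hypothetical.

```python
from functools import cached_property


class ToyResampler:
    """Hypothetical example; not the Pyresample implementation."""

    def __init__(self, target_shape):
        self.target_shape = target_shape
        self.lonlat_computations = 0

    @cached_property
    def target_lonlats(self):
        # Placeholder for an expensive coordinate computation that is reused.
        self.lonlat_computations += 1
        rows, cols = self.target_shape
        lons = [float(c) for c in range(cols)]
        lats = [float(r) for r in range(rows)]
        return lons, lats

    def _get_valid_output_index(self):
        # Single-use value: returned to the caller instead of stored on self.
        lons, _ = self.target_lonlats
        return [lon >= 1.0 for lon in lons]

    def get_bil_info(self):
        valid = self._get_valid_output_index()
        lons, _ = self.target_lonlats  # second access hits the cache
        return [lon for lon, ok in zip(lons, valid) if ok]


resampler = ToyResampler((2, 3))
info = resampler.get_bil_info()
```

The expensive value is computed once even though two methods read it, and the single-use value never becomes instance state.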

self.mask_slices = self._index_array >= self._source_geo_def.size

def _get_target_lonlats(self):
    self._target_lons, self._target_lats = self._target_geo_def.get_lonlats()
Member

I'm not sure this method buys you anything in readability. I personally would remove it.

Member Author

Indeed. I had a separate version with chunks kwarg for the Xarray version, but the coordinates need to be computed for the query and the "laziness" just made things a lot slower. Removed the unnecessary method.

self._valid_output_index, self._resample_kdtree,
self._neighbours, self._epsilon,
self._radius_of_influence)
return res, None
Member

This method too seems odd as it doesn't really do much.

Member Author

Same as above, moved the query_no_distance() call to _get_index_array().

elif data.ndim == 3:
    return _slice3d
else:
    raise ValueError
Member

Should this have a message with it?

Member Author

Yes, added.

return _apply_fill_value_or_mask_data(
    self._reshape_to_target_area(res, data.ndim),
    fill_value
)
Member

Taking the inline call of _reshape_to_target_area out would make this clearer in my opinion. So _finalize_output_data would be a two step method (reshape, fill/mask).

Member Author

Yep, adjusted.
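
The two-step shape suggested above could look roughly like this. It is a pure-Python sketch on nested lists with hypothetical helper names; the real code works on arrays and its helpers have different signatures.

```python
import math


def reshape_to_target_area(flat, shape):
    """Reshape a flat list into rows x cols (stand-in for the array reshape)."""
    rows, cols = shape
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def apply_fill_value(grid, fill_value):
    """Replace NaN entries with the fill value."""
    return [[fill_value if math.isnan(v) else v for v in row] for row in grid]


def finalize_output_data(flat, shape, fill_value):
    # Step 1: reshape to the target area.
    reshaped = reshape_to_target_area(flat, shape)
    # Step 2: apply the fill value.
    return apply_fill_value(reshaped, fill_value)


result = finalize_output_data([1.0, float("nan"), 3.0, 4.0], (2, 2), -1.0)
```

Naming the intermediate result makes the order of the two steps explicit without changing the behaviour.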

Member

@mraspaud mraspaud left a comment

I agree with the comments on naming. Otherwise I can't see anything alarming. Some function/methods could be shrunk even more, but it's already a huge improvement over what we had before, great job!

@pnuu pnuu merged commit 8a7dbb2 into pytroll:master Oct 18, 2020
@pnuu pnuu deleted the refactor-bilinear branch October 18, 2020 16:34
@pnuu pnuu mentioned this pull request Oct 18, 2020
4 tasks
Successfully merging this pull request may close these issues.

Refactor bilinear interpolation
4 participants