Add a blocked option for train_test_split #253

leouieda · 2020-04-10T15:49:20Z

Add option blocked=False to verde.train_test_split that uses BlockShuffleSplit if True instead of the scikit-learn ShuffleSplit. Also add a gallery example comparing the two options.

Reminders:

Run make format and make check to make sure the code follows the style guide.
Add tests for new features or tests that would have caught the bug that you're fixing.
Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

Rework the doctests and add a failing version using a `blocked=True` argument.

santisoler

Awesome job!
I only left a minor fix on one of the gallery example comments.

examples/train_test_split.py

Co-Authored-By: Santiago Soler <santiago.r.soler@gmail.com>

leouieda · 2020-04-16T09:38:14Z

Thanks @santisoler! Done

* Bug fix in grid_to_table for wrong order of Dataset coords (fatiando#229) We were relying on the order of the coords attribute of the data set to meshgrid the coordinates. This is wrong because xarray takes the coordinate order from dims instead, which is what we should also do. Took this chance to refactor the code and included a test that would have caught the bug. * Update files from fatiando/contributing (fatiando#233) Changes have been made in https://github.com/fatiando/contributing to: MAINTENANCE.md Update the copies in this repository to match. See fatiando/community@91699a0 * Use newer Mac on Azure Pipelines (fatiando#234) Changed from 10.13 (High Sierra) to 10.15 (Catalina) because Azure will remove the 10.13 VMs soon. * Allow grid_to_table to take DataArrays (fatiando#235) Previously would fail because we rely on getting the data variable name from the Dataset. Need to check if it's a Dataset (has the `data_vars` attribute) and set the data column name to "scalars" if it's not. Fixes fatiando#225 * Add function for rolling windows on irregular data (fatiando#236) The `rolling_window` function returns the indices of points falling on rolling/moving windows. It works for irregularly sampled data and is a nice complement to `xarray.DataArray.rolling` (which only works on grids). Uses a KDTree to select points but is not compatible with pykdtree (it's missing the method we need). * Add function to split data on an expanding window (fatiando#238) For selecting data in windows of various size around a center point. Sizes do not have to be increasing necessarily. Returns only the indices of points falling inside each window. * Explicitly set PR trigger for all branches on Azure (fatiando#244) For some reason, Azure Pipelines CI wasn't starting up PRs, only on master and tags. Explicitly setting a trigger for PRs against any branch seems to work. * Fix bug for 3+ coordinates in windowing functions (fatiando#243) The `rolling_window` and `expanding_window` functions fail if given more 2 coordinates. If we had full support for N-D coordinates, they could be N-D windowing functions but until that time, they will have to be 2D (horizontal) only functions. Add test cases using `extra_coords` in `grid_coordinates` to generate a third coordinate. Fixes fatiando#242 * Generalize coordinate systems in BaseGridder docs (fatiando#240) Make docstrings of `BaseGridder` more general, without specifying a coordinate system. Add `dims` as a class variable that will be used by default on `grid`, `scatter` and `profile` methods. * Add a convex hull masking function (fatiando#237) Use scipy's Delaunay triangulation to determine points that are outside the convex hull of the data points. Useful to mask out wildly extrapolated points from grids. It's easier to use than `distance_mask` because it doesn't require distance calculations (and the associated projections). Use the new function in some examples and tutorials. Normalize the coordinates prior to calculations to avoid errors with floating point conversion when triangulating (as suggested by Ari Hartikainen). Fixes fatiando#126 * Fix bug with profile distances being unprojected (fatiando#231) `BaseGridder.profile` has the option to apply a projection to the coordinates before predicting so we can pass geographic coordinates to cartesian gridders. In these cases, the distance along the profile is calculated by `profile_coordinates` with the unprojected coordinates (in the geographic case it would be degrees). The profile point calculation is also done assuming that coordinates are Cartesian, which is clearly wrong if inputs are longitude and latitude. To fix this, project the input points prior to passing to `profile_coordinates`. This means that the distances are Cartesian and generation of profile points is also Cartesian (as is assumed by `profile_coordinates`). The generated coordinates are projected back so that the user gets longitude and latitude but distances are still projected Cartesian. Add a warning in the docstring of the modified behaviour due to this bug and a section in the tutorial on projections. * Update TravisCI configuration to new formats (fatiando#247) Travis started warning that some options in the config were deprecated. Update them to the new format to avoid crashing builds later on. * Add function to project 2D gridded data (fatiando#246) The `project_grid` function transforms a grid to the given projection. It re-samples the data using `ScipyGridder` (by default) and runs a blocked mean (optional) to avoid aliasing when the points aren't evenly distributed in the projected coordinates (like in polar projections). Applies a `convexhul_mask` to the grid always to avoid extrapolation to points that had no original data (this is done by scipy for linear and cubic interpolation already). The function only works for `xarray.DataArray`s (not `Dataset`) because I couldn't find an easy way to guarantee that a variables have the same dimensions. Move the new function and `project_region` to the new module `verde/projections.py`. * Changelog entry for v1.4.0 (fatiando#248) New minor release with new features and bug fixes. No breaking changes or deprecations. * Fix broken TravisCI deploy (fatiando#250) The docs and configuration validation tool had recommended using `cleanup: false` instead of `skip_cleanup: true` when deploying. Turns out Travis decided to ignore the new parameter and our Github Pages deploy was broken since the docs were deleted before we could deploy. Revert back to the old argument until Travis really removes it. * Add a block version of ShuffleSplit cross-validator (fatiando#251) Cross-validation of spatial data with random splits can often lead to overestimation of accuracy (see https://doi.org/10.1111/ecog.02881 for a nice overview). To account for this, the splits can be done by spatial blocks. In this case, the data are partitioned into blocks and blocks are split randomly. That guarantees some independence of measurements between blocks assigned to test/train sets. This change adds the `BlockShuffleSplit` cross-validator which does the random splitting of blocks in a scikit-learn compatible class. It can be used with `verde.cross_val_score` like any other scikit-learn cross- validator. In fact, it can be used entirely outside of Verde for any type of machine learning on spatial data. The class takes care of balancing the random splits so that the right amount of train/test data are included in each (since blocks can have wildly different numbers of points). * Add optional argument to least_squares to copy Jacobian matrix (fatiando#255) Add new copy_jacobian optional argument to least_squares to avoid scaling Jacobian matrix inplace. Add warning on its docstring warning about memory consumption when copy_jacobian is True. Add test function for this new feature. * Update files from fatiando/contributing (fatiando#256) Changes have been made in https://github.com/fatiando/contributing to: MAINTENANCE.md Update the copies in this repository to match. See fatiando/community@340dafd * Add a blocked option for train_test_split (fatiando#253) Add option `blocked=False` to `verde.train_test_split` to split along blocks instead of randomly. Uses `BlockShuffleSplit` if `True` instead of the scikit-learn `ShuffleSplit`. Add a gallery example comparing the two options. * Fix typo in README contributing section (fatiando#258) Equality -> Equally * Pin pylint to 2.4.* (fatiando#259) Version 2.5.0 introduced a bug with the multiprocessing option. This happens often enough that we should have pylint pinned and only upgrade when we choose to do it. * Update tests to latest release of Scikit Learn (fatiando#267) Change the expected representation of any gridder based on the new default behaviour of Scikit Learn. After 0.23, Scikit Learn only shows those parameters whose default value has been changed when giving a string representation of the gridder. See scikit-learn/scikit-learn#17061 and its changelog for more information. * Gridders apply projections using only the first two coordinates (fatiando#264) Create a private function used by BaseGridder that applies projections only on the first two elements of the coordinates, but returns the whole set of coordinates, replacing the first two with their projected versions. Add a test function that passes extra_coords to include more than two coordinates when predicting through the grid, scatter and profile methods. * Remove "blocked" argument from train_test_split (fatiando#257) It was a bit redundant since we have to specify `shape` or `spacing` along with it. So we can just ask for `spacing` or `shape` and assume `blocked=True` if they are provided. Also fix the missing description in the gallery example and make slight changes to its wording. Co-authored-by: Santiago Soler <santiago.r.soler@gmail.com> * Add extra_coords on BaseGridder generated objects (fatiando#265) After predicting through the profile, scatter and grid methods, BaseGridder returns generated objects like xr.Dataset or pd.DataFrame. Any extra coordinate passed through the extra_coords keyword argument is now included on those objects as new columns or non-dimension coordinates (depending on the generated object). * Add a blocked K-Fold cross-validator (fatiando#254) The `BlockKFold` class splits the data into spatial blocks and does K-fold cross-validation splits on the blocks. Refactor code from `BlockShuffleSplit` into a base class that can be shared with the new class. Folds are balanced by making splits based on the number of data points in each fold instead of the number of blocks. * Pin Cartopy to 0.17.* until we sort out problems (fatiando#270) For some reason, the coast lines features are plotting on the entire figure instead of only the land or ocean. Might be an error on our side. For now, pin cartopy on CI to fix our docs. * Changelog entry for v1.5.0 (fatiando#271) Includes the new cross-validators and fixes to our base classes to accommodate Harmonica. * Allow specifing the scoring function in cross_val_score (fatiando#273) Currently, `cross_val_score` only uses the estimator's `.score` method (which is R² for all our gridders). But sometimes we want to use a different metric, like MSE or MAE. These changes allow passing in any scikit-learn compatible scorer through a new `scoring` parameter. This can be a string or callable, like in scikit-learn's `cross_val_score`. Needed to delegate score computation to a separate function in `verde.base.utils` to reuse in `cross_val_score` and a dummy class to be able to use the scikit-learn scorers correctly. * Format the doc/conf.py file with Black (fatiando#275) Ran black on `doc/conf.py` and added it to the `black` and `flake8` runs in the main Makefile. Removed the setting of `sys.path` since we're installing the packaged in edit mode so it should be importable without that hack. * Require Black >=20.8b1 (fatiando#284) After version 19, black changed the formatting slightly in a few files. This updates the format and sets a minimum version of black in the development environment and the CI. * Fix Cartopy issue and require >= 0.18 (fatiando#283) Unpin Cartopy because Cartopy 0.17.0 has a compatibility issue with Matplotlib > 3.3.0. Remove the unneeded tight_layout that was causing errors. Replace the filled polygons used for plotting land and ocean for the coastlines. * Fix typos on docstrings (fatiando#281) Fix typo on description of data_names parameters on BaseGridder.grid method. Fix typo on description of BaseGridder.project_coordinates method. * Raise errors for invalid rolling windows arguments (fatiando#280) Raise a specific error if no spacing or shape is passed to rolling_window(). Raise a specific error if window size is larger than the smaller dimension of the region. Add tests for these new methods that include matching error messages to identify which error has been raised. * Remove configuration files for unused CI (fatiando#292) We no longer use Stickler, Codacy, or CodeClimate. So there is no reason to keep these files around. * Update files from fatiando/contributing (fatiando#294) Changes have been made in https://github.com/fatiando/contributing to: CONTRIBUTING.md Update the copies in this repository to match. See fatiando/community@e899f3f * Add function to convert grid to Dataset (fatiando#282) Add a function that converts coordinates and data as np.arrays to xarray.Dataset. This would make handling grids easier, with the option of efficiently saving them as netCDF files. Make BaseGridder.grid method to use this function instead of building the xarray.Dataset by itself. Add new check_data_names function that converts a str into a tuple with a single str. Make BaseGridder.grid, BaseGridder.scatter and BaseGridder.profile able to take a single str for the data_names argument. Add test function for default data_names. Change data_names on all examples and tutorials to a single str. * Refactor get_data_names and related check functions (fatiando#295) Refactor `get_data_names` as a private method of `BaseGridder`: simplify `get_data_names` logic, make `get_data_names` to use `check_data_names` to convert single string to a tuple and check that `data_names` has the same number of elements as data. Add `data_names_defaults` as a class attribute of `BaseGridder`. Move `check_extra_coords_names` to `verde/base/utils.py` along with the other check functions. Make `check_data_names` to raise error if `data_names` is `None`. But `BaseGridder.get_data_names` will return default values if `data_names` is `None`. Rename test functions for `make_xarray_grid`. Co-authored-by: Leonardo Uieda <leouieda@gmail.com> Co-authored-by: Fatiando a Terra Bot <50936856+fatiando-bot@users.noreply.github.com> Co-authored-by: Santiago Soler <santiago.r.soler@gmail.com> Co-authored-by: Rowan Cockett <rowanc1@gmail.com>

leouieda added 3 commits April 10, 2020 15:59

Add failing test for a blocked train_test_split

4e0f5c1

Rework the doctests and add a failing version using a `blocked=True` argument.

Add blocked option for train_test_split

5463846

Add gallery example for train_test_split

a9240ec

leouieda requested a review from santisoler April 10, 2020 16:06

santisoler requested changes Apr 15, 2020

View reviewed changes

examples/train_test_split.py Outdated Show resolved Hide resolved

leouieda and others added 2 commits April 16, 2020 10:37

Merge branch 'master' into block-train-test-split

0dd3fb5

Update examples/train_test_split.py

f463a59

Co-Authored-By: Santiago Soler <santiago.r.soler@gmail.com>

leouieda merged commit ba2131d into master Apr 28, 2020

leouieda deleted the block-train-test-split branch April 28, 2020 15:29

leouieda mentioned this pull request Apr 28, 2020

Remove "blocked" argument from train_test_split #257

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a blocked option for train_test_split #253

Add a blocked option for train_test_split #253

leouieda commented Apr 10, 2020 •

edited

Loading

santisoler left a comment

leouieda commented Apr 16, 2020

Add a blocked option for train_test_split #253

Add a blocked option for train_test_split #253

Conversation

leouieda commented Apr 10, 2020 • edited Loading

santisoler left a comment

Choose a reason for hiding this comment

leouieda commented Apr 16, 2020

leouieda commented Apr 10, 2020 •

edited

Loading