ENH: Add 2D versions of several baseline algorithms #27

Merged
merged 56 commits into from
Feb 13, 2024

derb12
Owner

@derb12 derb12 commented Feb 13, 2024

Description

Added two-dimensional versions of several baseline correction algorithms, with the main focus being on polynomials, Whittaker smoothing, and penalized splines. Documentation for the added 2D algorithms is complete. The only remaining to-do items are a 2D optimizer that applies a selected 1D baseline correction method along each individual row and/or column, and maybe an example or two in the documentation.
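
The row/column optimizer idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the planned implementation: `baseline_along_axes` and `poly_baseline_1d` are hypothetical names, and an ordinary least-squares polynomial stands in for a real 1D baseline method.

```python
import numpy as np


def poly_baseline_1d(y, poly_order=2):
    """Stand-in 1D method (hypothetical): ordinary least-squares polynomial fit."""
    x = np.linspace(-1, 1, y.shape[0])
    coef = np.polynomial.polynomial.polyfit(x, y, poly_order)
    return np.polynomial.polynomial.polyval(x, coef)


def baseline_along_axes(data, method, axes=(0, 1)):
    """Apply a 1D baseline method along each requested axis and average the fits."""
    data = np.asarray(data, dtype=float)
    # axis=0 fits every column, axis=1 fits every row
    fits = [np.apply_along_axis(method, axis, data) for axis in axes]
    return np.mean(fits, axis=0)
```

Passing `axes=(1,)` would fit only along rows; averaging the two directions is just one possible way to combine them.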

Type of Pull Request

  • Bug Fix
  • New Feature
  • Miscellaneous Changes (refactor, code improvements, etc.)
  • Documentation or Example Programs

Pull Request Checklist

  • New code and/or documentation is valid for use with the BSD 3-clause license.
  • New code is fully documented with docstrings that follow Numpy style,
    if applicable.
  • New code follows PEP 8 standards as closely as possible, if applicable.
  • Added/updated tests and ensured they pass locally, if applicable.
  • Verified that documentation builds locally, if applicable.

Currently implemented versions are mor, imor, rolling_ball, and tophat. Note that this is all experimental at this point. Design decision: no functional interface will be provided for the 2D versions.
Implemented the poly, modpoly, imodpoly, penalized_poly, and goldindec algorithms (didn't have to actually do anything outside of the polynomial setup). Had to skip several validations, so need to add that back in later.
Someone should give Paul Eilers the Nobel prize, dude is the GOAT. On a more serious note, the internals of the PSpline2D class will most likely change, but the external calls within the baseline algorithms should remain the same. Implemented the 2D versions of irsqr, pspline_asls, pspline_airpls, pspline_arpls, pspline_iarpls, and pspline_psalsa.
Switched to Eilers's alternate method of solving psplines, which is more memory efficient as long as the number of knots is relatively low; also about 50-90% faster than the previous method.
Foolish error from switching branches without thinking.
Added 2D versions of asls, airpls, arpls, iarpls, and psalsa. Extremely experimental. Note: currently using sparse implementation; a banded implementation is ~5 times faster but uses significantly more memory, so going with the sparse solution for now.
Also allow limiting the cross terms for simpler fits.
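
One way to picture limiting the cross terms is a 2D Vandermonde matrix that drops mixed terms `x**i * z**j` above a chosen degree. The sketch below is an assumption-laden illustration: `vander2d_limited` is a hypothetical helper, and the rule used here (a mixed term is kept only when both exponents are at most `max_cross`) is one plausible reading of the commit message rather than a confirmed description of the library's behavior.

```python
import numpy as np


def vander2d_limited(x, z, poly_order=(2, 2), max_cross=None):
    """Build 2D polynomial basis columns, optionally dropping high cross terms."""
    vx = np.polynomial.polynomial.polyvander(np.asarray(x, dtype=float), poly_order[0])
    vz = np.polynomial.polynomial.polyvander(np.asarray(z, dtype=float), poly_order[1])
    columns, powers = [], []
    for i in range(poly_order[0] + 1):
        for j in range(poly_order[1] + 1):
            is_cross = i > 0 and j > 0
            # assumed rule: skip mixed terms whose largest exponent exceeds max_cross
            if max_cross is not None and is_cross and max(i, j) > max_cross:
                continue
            columns.append(np.outer(vx[:, i], vz[:, j]).ravel())
            powers.append((i, j))
    return np.column_stack(columns), powers
```

With `poly_order=(2, 2)`, the full basis has nine columns, `max_cross=1` keeps only the `x * z` mixed term, and `max_cross=0` removes all mixed terms.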
Also fixed docstrings to correctly refer to numpy's Polynomial class.
Baseline objects will no longer apply sorting when the input x-values were already in the correct sort order.
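
A minimal sketch of the "skip sorting when already sorted" idea, assuming ascending order is the required sort order; `sort_order_if_needed` is a hypothetical helper, not the name used in the code base.

```python
import numpy as np


def sort_order_if_needed(x):
    """Return (sort_order, inverted_order), or (None, None) if x is already ascending."""
    x = np.asarray(x)
    if np.all(x[1:] >= x[:-1]):
        return None, None  # nothing to do, so callers can skip both sorts
    sort_order = np.argsort(x, kind='mergesort')
    inverted_order = np.argsort(sort_order)  # restores the original ordering
    return sort_order, inverted_order
```

Indexing with `sort_order` sorts the data, and indexing the sorted result with `inverted_order` restores the input order.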
Also removed the `axis` keyword for several functions since the data is always assumed to be in the last dimension.
Addressed initializing Baseline2D given optional x and z; sorting in 2D is finally supported using utils._sort_array2d, and the output parameters are now correctly reshaped and sorted.
Fixed missing sorting of output params and incorrect handling of input weights for 2D.
Also added some meta tests for the polynomial and weights testers.
Changed from z, x to x, z to represent the rows and columns, respectively. This mixup was causing some internal discrepancies for the 2D PSpline and Whittaker systems.
Only the 2D version of arpls was implemented correctly. All the others were just copy pasted from the 1d case and did not work properly in 2D. Wrote tests for setting up Whittaker systems.
Also added the 2D versions of iasls, drpls, and aspls.
yxz_arrays was not updated in the recent change that switched x and z to represent rows and columns of y. Added tests to ensure the output for the function is correct.
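
The convention that x indexes the rows and z indexes the columns of y can be sketched as a small validation helper. `default_xz` is a hypothetical name, and filling in missing values with `linspace(-1, 1, N)` is an assumption made for this illustration; it does not reproduce `yxz_arrays`.

```python
import numpy as np


def default_xz(data, x=None, z=None):
    """Validate that x matches the rows of data and z matches the columns."""
    data = np.asarray(data)
    if data.ndim != 2:
        raise ValueError('data must be two dimensional')
    rows, cols = data.shape
    # assumed defaults: evenly spaced values when x or z is not given
    x = np.linspace(-1, 1, rows) if x is None else np.asarray(x)
    z = np.linspace(-1, 1, cols) if z is None else np.asarray(z)
    if x.shape != (rows,) or z.shape != (cols,):
        raise ValueError('x must match the rows and z the columns of data')
    return data, x, z
```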
The pspline_arpls test is failing at avoiding non-finite values, so will need to check that later. Also added tests for the Baseline2D class.
Still need to work out minor details for it.
Bumped required numpy to 1.17 in order to use default_rng; will update all the other places with rng in the dev branch after merging this branch into it.
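
For reference, a short example of the `numpy.random.default_rng` generator API that motivated the version bump; the seed and array shape are arbitrary choices for illustration.

```python
import numpy as np

# a seeded Generator gives reproducible noise for tests
rng = np.random.default_rng(0)
noise = rng.normal(0, 0.1, size=(20, 30))
```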
Allowed setting up the 2D penalized system with banded matrices for the simpler Whittaker algorithms. Time reduction is significant, especially for larger matrices. Probably not worth the effort to convert the more complex ones from sparse to banded.
The small datasets reduced test time from 200 seconds to 100 seconds and still address what needs to be tested. Also skip the tests for spline non-finite weighting since they are very slow; may just take them out since they are covered by the 1D tests.
Also added additional tests to check more difference orders for pspline algorithms.
Fixed handling of array-like inputs for 2d morphological and smoothing algorithms. Also added extrapolation for 2D.
Also switched from napoleon to numpydoc for documenting docstrings. Fixed reference numbers throughout. Re-enabled the autosection label and just ignore the repeated header warnings from the changelog. Fixed method role to point to Baseline or Baseline2D. Deleted two_d.classification since I probably will not make any before the next release.
Same as 1d, optimizers no longer do two unnecessary sorts.
Also fixed the 1d docstring for rolling_ball which mentioned using array-like half windows, which was removed several versions ago.
Makes it clearer than using x and z.
The exit criterion is based on a value from the publication, which does not make sense to apply for 2D data.
The check fails only for a meta test, but not worth keeping.
The exception that was raised was indirectly because of 1d input, so made the error easier to trace. Also no longer allow values less than or equal to 0 in gaussian.
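
A sketch of what rejecting non-positive values in a Gaussian helper might look like. The commit message does not say which parameter is checked; this example assumes it is `sigma`, and `gaussian_checked` is a hypothetical name.

```python
import numpy as np


def gaussian_checked(x, height=1.0, center=0.0, sigma=1.0):
    """Evaluate a Gaussian, rejecting sigma <= 0 (assumed to be the checked value)."""
    if sigma <= 0:
        raise ValueError('sigma must be greater than 0')
    return height * np.exp(-0.5 * ((np.asarray(x) - center) / sigma)**2)
```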
Using eigendecomposition to solve 2D Whittaker baselines reduces the computation time significantly, and the computation time scales roughly linearly with data size since the number of eigenvalues depends only on baseline curvature and does not increase with size. Need to add some tests and explanations in docstrings and the main docs about the eigendecomposition. Also renamed solve_pspline to just solve.
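
The eigendecomposition idea can be illustrated for the simplest case of unit weights. This is a sketch under stated assumptions, not the implementation in this branch: `difference_penalty` and `whittaker2d_eigen` are hypothetical names, weights are ignored, and dense matrices are used for clarity. The baseline is written as `U_r @ A @ U_c.T`, where the columns of `U_r` and `U_c` are eigenvectors of the row and column difference penalties, so the penalized least-squares problem decouples into one scalar equation per coefficient.

```python
import numpy as np


def difference_penalty(n, diff_order=2):
    """Return D.T @ D for the finite-difference matrix D of the given order."""
    d = np.diff(np.eye(n), n=diff_order, axis=0)
    return d.T @ d


def whittaker2d_eigen(y, lam=(1e2, 1e2), diff_order=2, num_eigens=(None, None)):
    """Unweighted 2D Whittaker smoothing in a (possibly truncated) eigenbasis."""
    y = np.asarray(y, dtype=float)
    bases, values = [], []
    for size, k in zip(y.shape, num_eigens):
        eigvals, eigvecs = np.linalg.eigh(difference_penalty(size, diff_order))
        k = size if k is None else k
        # eigh sorts eigenvalues ascending; the smallest belong to the smoothest modes
        values.append(eigvals[:k])
        bases.append(eigvecs[:, :k])
    (u_r, u_c), (e_r, e_c) = bases, values
    # each coefficient solves a_ij * (1 + lam_r * e_i + lam_c * e_j) = (U_r.T y U_c)_ij
    coef = (u_r.T @ y @ u_c) / (1 + lam[0] * e_r[:, None] + lam[1] * e_c[None, :])
    return u_r @ coef @ u_c.T
```

With all eigenvectors kept, this matches the direct solution of the full system; keeping only a few per dimension restricts the fit to the smoothest modes, which is why the cost need not grow with the number of data points.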
Makes the sparse solver just a tad faster to better represent it. Also mention that the sparse solution could be sped up with CHOLMOD in case others are interested. Also re-enable autosectionlabel since the branch rebase must have undone that.
Also added a sanity check test for the 1D case to ensure the banded multiplication is the same as the matrix multiplication.
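
A sanity check of this kind could look like the following sketch, which compares a matrix-vector product computed from banded storage against the dense product. The helper names are hypothetical, the storage layout follows the convention `ab[upper + i - j, j] = a[i, j]` used by LAPACK-style banded solvers, and whether the actual test multiplies by a vector or by another banded matrix is not stated in the commit message.

```python
import numpy as np


def to_banded(a, lower, upper):
    """Store the bands of a square matrix as ab[upper + i - j, j] = a[i, j]."""
    n = a.shape[0]
    ab = np.zeros((lower + upper + 1, n))
    for i in range(n):
        for j in range(max(0, i - lower), min(n, i + upper + 1)):
            ab[upper + i - j, j] = a[i, j]
    return ab


def banded_dot(ab, lower, upper, v):
    """Multiply a banded matrix in the storage above by a vector."""
    n = ab.shape[1]
    out = np.zeros(n)
    for i in range(n):
        for j in range(max(0, i - lower), min(n, i + upper + 1)):
            out[i] += ab[upper + i - j, j] * v[j]
    return out
```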
Will use scipy's sparse arrays if the installed scipy version is 1.12 or newer. Thank goodness for unit tests catching the change in matrix multiplication.
Put all metadata into pyproject.toml, so setup.py and setup.cfg can be removed. Switched from flake8 to ruff and from bump2version to bump-my-version. Will need to update pinned requirements once ready to release. Made a separate CI job for linting so that linting can fail but will at least show up now instead of being ignored.
Ensures that any new changes in numpy or scipy will be caught early. Fixed one last place where numpy.trapz was used. Side note: pybaselines works with numpy 2.0, which is a huge relief.
Simplifies the checking of 2D variables. Also added checks to ensure polynomial orders are never negative.
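
A sketch of a polynomial order check for the 2D case, assuming a scalar order is applied to both dimensions; `check_poly_order` is a hypothetical helper, not the function added in this commit.

```python
import numpy as np


def check_poly_order(poly_order):
    """Return a (rows, columns) pair of polynomial orders, rejecting negatives."""
    orders = np.atleast_1d(poly_order)
    if orders.size == 1:
        orders = np.repeat(orders, 2)  # assumed: one value applies to both dimensions
    elif orders.size != 2:
        raise ValueError('poly_order must be a single value or a pair of values')
    if np.any(orders < 0):
        raise ValueError('polynomial orders must be greater than or equal to 0')
    return tuple(int(order) for order in orders)
```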
No longer mention editable installs in the contribution guide since it requires passing additional options to setuptools in order to work, which could be confusing for new contributors. Update min setuptools for building to allow editable installs, and update ruff settings.
Also updated pinned dependencies.
Increased min numpy version from 1.18 to 1.20 to allow using dtype within numpy.concatenate.
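
For context, the `dtype` keyword of `numpy.concatenate` (available from numpy 1.20) sets the output type directly instead of requiring a separate cast afterwards. The arrays here are arbitrary illustrations.

```python
import numpy as np

left = np.array([1, 2, 3])
right = np.array([4, 5])
# the output dtype is chosen during concatenation, with no extra astype call
joined = np.concatenate((left, right), dtype=float)
```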
@derb12
Owner Author

derb12 commented Feb 13, 2024

The one test failure is from pentapy v1.0 being unable to be installed from source (numpy not found?). Weird issue since I can't replicate locally on ubuntu or windows, but I may have to just bump the required version to one with an available wheel for python 3.8. Does not block merging this branch.

@derb12 derb12 merged commit c0f1057 into development Feb 13, 2024
7 of 8 checks passed
@derb12 derb12 deleted the 2D branch February 13, 2024 01:26