Refactor `filtering` module to take DataArrays as input #209

lochhh · 2024-06-06T11:28:46Z

Description

What is this PR

Bug fix
Addition of a new feature
Other

Why is this PR needed?
This PR closes #191 #138 and #235

What does this PR do?

This PR rewrites the filtering module to accept DataArrays (instead of Datasets) as inputs.
A major change from the previous Dataset median_filter and savgol_filter is the window_length (renamed as window to better align with xarray and pandas). Before this PR, the unit of the window is determined based on ds.time_unit. If the time_unit is in seconds, the window supplied will be treated as seconds and converted to frames using window = int(window * ds.fps). Now that we no longer have access to the ds.time_unit and ds.fps attributes, the definition of window is changed to "The size of the filter window, representing the fixed number of observations used for each window".
With use_coordinate=False in xarray.interpolate_na, interpolate_over_time now also operates based on number of consecutive (missing) observations (NaNs). Specifically, max_gap now specifies the maximum number of consecutive NaNs to fill, whilst noting that this differs slightly from max_gap in xarray.interpolate_na.
The filter_and_interpolate and smooth examples have been updated to use the move accessor methods and both "frames" and "time" are mentioned with window sizes.
Existing tests for the overwritten filters are also adapted for the new ones. These new unit tests use the simpler valid_poses_dataset and valid_poses_dataset_with_nan in place of the sample_dataset fixture (now removed).
The helper functions to count NaNs in conftest.py are simplified to calculate NaNs in the entire data array, rather than for a single dimension of a single keypoint from a single individual.
The report_nan_values function has also been adapted to report NaN stats for a single DataArray . This also accounts for cases when the individuals and/or keypoints dimension(s) are missing (e.g. ds.position.sel(individuals="ind1"), ds.position.sel(keypoints="snout").
Add accessor methods for filters (see summary)
To enable 8. new <module>_wrappers have been added to handle forwarding accessor method calls to the respective modules.
Reporting and logging functions in filtering.py have now been moved to reports.py and logging.py.
The usage of update() is documented in getting_started/movement_dataset.md.
Use sphinx.ext.autosectionlabel to allow referencing sections using its title - this is used to reference the Filtering multiple data variables section in the filtering example from the smooth example.

References

#191 , #138 , #235

How has this PR been tested?

Tests were written (and then removed) to first compare equality of the outputs from the initial Dataset filters and the newly added DataArray filters. Existing tests for the overwritten filters are then adapted for the new ones.

Is this a breaking change?

Yes. See points 1 and 2 above.

Does this PR require an update to the documentation?

Affected examples have been updated. Accessor methods are also added to api.rst

TO-DO

Discuss points 2 and 3?
Demonstrate accessor data_vars usage in examples (this is available as a "tip" in the filter_and_interpolate example); smaller examples are available in the filtering_wrapper docstrings.
Reporting functions should deal with cases with missing individuals and/or keypoints dims

Checklist:

The code has been tested locally
Tests have been added to cover all new functionality
The documentation has been updated to reflect any changes
The code has been formatted with pre-commit

codecov · 2024-06-06T11:37:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.71%. Comparing base (b98cac0) to head (f3f80c3).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #209   +/-   ##
=======================================
  Coverage   99.70%   99.71%           
=======================================
  Files          12       13    +1     
  Lines         681      702   +21     
=======================================
+ Hits          679      700   +21     
  Misses          2        2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

niksirbi

Thanks @lochhh for tackling this tricky job!

I think we have to somehow face and solve the issues you yourself have detected.

Windows, gaps and time units

I'm fine with renaming window_length to window. Alignment with pandas and xarray is a good point. That said, I'm not very happy with the introduced inconsistency in time units, between median_filter and savgol_filter on one hand, and interpolate_over_time on the other. Some thoughts on this:

you indeed don't have access to ds.fps and ds.time_unit if you only pass a DataArray. On the other hand when calling these filters through the accessor's filtering_wrapper, you do have access to these attributes, so you could do the unit conversion before passing onto the underlying filtering.py function. However, I think that too will be confusing, because then we'll end up with an inconsistency between using these functions directly vs via the accessor.
One way to achieve consistency would be to make max_gap in interpolate_over_time also be in frames (observations). I think this could be done by setting use_coordinate=False in xarray's interpolate_na. Alternatively, we could explicitly define an extra coordinate variable - called frames - along the time dimension, and set use_coordinate="frames". This is trickier to do, but it's something we planned to do anyway, and it will pave the way for choosing to work with "real time" or with "frames". Maybe in this PR we can just try to solve it by use_coordinate=False (if it indeed works) and update the docstring of interpolate_over_time accordingly? At least this way we would consistently operate in "frames"/observations throughout, and @sfmig will get her wish.
Another way to achieve consistency is to do the opposite, i.e. force median_filter and savgol_filter to also operate in "seconds". Although we don't have access to fps or time_unit, we have access to the time dimension coordinate values. If those are not consecutively increasing integers, we can assume "seconds" and derive the fps from the time delta between coordinate values. In the special case of fps=1, it anyway doesn't make any difference. This solution breaks down for variable time intervals between frames, but we anyway have the constant time sampling assumption baked in (for now), including in the way we compute time derivatives. That said, this solution is kind of hacky, so I'm not super happy about it.

Let me know what you think about these "solutions".

`sphinx-gallery` examples

If we end up adopting the 2nd solution from above (everything works in frames), we could update the examples such that all window lengths and gaps are mentioned in numbers of frames, and hence avoid the "awkwardness" of having to convert each time. If we adopt some version of the 3rd solution, the "awkwardness" will naturally go away.
In the kinematics example we present the accessor .move syntax as the preferred way of computing the variables, and only briefly mention what gets called under the hood (aka the "functional" way of doing things). In the filtering-related examples, we present the "functional" way as the only way. When we update these examples, in this or in a future PR, we should present both ways of doing things, but maybe present the .move way as the preferred way for consistency? My point is that we shouldn't let this inconsistency in documentation persist for long.
Another thought that came to mind is that in the examples you very often apply a filter to a data variable and then use ds.update to replace that variable in the dataset. If we expect this to be commonly done, should we provide an update or inplace argument directly in the ds.move. filtering calls? So if someone calls ds.move.median_filter(window=3, data_vars=["position", "velocity"], inplace=True), those variables will be updated directly in ds (but inplace=False should still be the default). Is that even possible? If it's too much of a hustle, or if you think it will cause more trouble, forget about it. We have discussed the inplace option before and decided against it, and so did xarray.

`report_nan_values` and squeezing

Do you think we should move the whole machinery about counting and reporting NaNs to a module in utils?
To avoid the whole awkwardness with squeezing, should we already "harden" report_nan_values against missing dimensions? So, if the individuals dimension doesn't exist, assume 1 individual, and report per keypoint only. Similarly, if keypoints doesn't exist, assume 1 keypoint and report per individual only. If neither exists, assume 1 individual and 1 keypoint and report the 1 value for the only existing track.

movement/filtering.py

examples/smooth.py

movement/move_accessor.py

docs/source/api_index.rst

movement/filtering.py

movement/move_accessor.py

movement/utils/reports.py

lochhh · 2024-07-08T16:58:38Z

Thanks for the thorough review @niksirbi !

Windows, gaps and time units

One way to achieve consistency would be to make max_gap in interpolate_over_time also be in frames (observations). I think this could be done by setting use_coordinate=False in xarray's interpolate_na. Alternatively, we could explicitly define an extra coordinate variable - called frames - along the time dimension, and set use_coordinate="frames". This is trickier to do, but it's something we planned to do anyway, and it will pave the way for choosing to work with "real time" or with "frames". Maybe in this PR we can just try to solve it by use_coordinate=False (if it indeed works) and update the docstring of interpolate_over_time accordingly? At least this way we would consistently operate in "frames"/observations throughout, and @sfmig will get her wish.

I've decided to go with the simpler use_coordinate=False.

sphinx-gallery examples

If we end up adopting the 2nd solution from above (everything works in frames), we could update the examples such that all window lengths and gaps are mentioned in numbers of frames, and hence avoid the "awkwardness" of having to convert each time. If we adopt some version of the 3rd solution, the "awkwardness" will naturally go away.

In the kinematics example we present the accessor .move syntax as the preferred way of computing the variables, and only briefly mention what gets called under the hood (aka the "functional" way of doing things). In the filtering-related examples, we present the "functional" way as the only way. When we update these examples, in this or in a future PR, we should present both ways of doing things, but maybe present the .move way as the preferred way for consistency? My point is that we shouldn't let this inconsistency in documentation persist for long.

All done. Wrtinplace and ds.move.update, I've chosen instead to document the use of xarray's own update in Getting started with modifying movement datasets, as you suggested.

report_nan_values and squeezing

Do you think we should move the whole machinery about counting and reporting NaNs to a module in utils?

To avoid the whole awkwardness with squeezing, should we already "harden" report_nan_values against missing dimensions? So, if the individuals dimension doesn't exist, assume 1 individual, and report per keypoint only. Similarly, if keypoints doesn't exist, assume 1 keypoint and report per individual only. If neither exists, assume 1 individual and 1 keypoint and report the 1 value for the only existing track.

I've moved the nan reporting to utils/reports.py as you suggested. The NaN-reporting utility also now handles data arrays with missing individuals and/or keypoints dimensions, by assuming that there is only a single individual or keypoint.

niksirbi

Thanks for comprehensively addressing all my comments @lochhh!

I only have some remaining suggestions on the examples, and there appears to be an issue with logging all filtering operation to ds.attrs (see my comment below).

Other than that, this PR is good to go.

examples/filter_and_interpolate.py

examples/smooth.py

Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>

sonarqubecloud · 2024-07-17T08:42:38Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

lochhh force-pushed the dataarray-accessor branch from 6856806 to 40d61bc Compare June 7, 2024 10:11

lochhh requested a review from a team June 7, 2024 16:37

lochhh marked this pull request as ready for review June 7, 2024 16:58

niksirbi requested changes Jun 12, 2024

View reviewed changes

lochhh force-pushed the dataarray-accessor branch from 157684b to f0d4f1f Compare July 3, 2024 10:24

VascoSch92 reviewed Jul 8, 2024

View reviewed changes

movement/utils/reports.py Outdated Show resolved Hide resolved

lochhh mentioned this pull request Jul 10, 2024

Log to the console rather than print() #235

Closed

lochhh requested a review from niksirbi July 10, 2024 14:37

This was linked to issues Jul 11, 2024

Log to the console rather than print() #235

Closed

More thorough tests for the filtering.py module #138

Closed

niksirbi approved these changes Jul 11, 2024

View reviewed changes

lochhh and others added 17 commits July 17, 2024 09:34

Draft dataarray accessor

386dc21

Move dataarray accessor methods to filtering

8d11695

Add dataarray functions, test equality

a8eb366

Add tests

2fbf9d3

Add integration test

9c18055

Remove filters taking Dataset as input

79b1a7b

Reorganise filtering module

e0950fc

Update filter and smooth examples

59d748c

Replace window_length with window

b4ec31a

Format assert string

a62a491

Remove old code accidentally reintroduced during rebase

99c106e

Update docstrings

64265a4

Add filtering methods to the move accessor

f370886

Add example to docstring

ec28361

Remove obsolete and unused function imports

c2dd2ea

Move util functions to reports.py and logging.py

4845655

Apply suggestions from code review

c5e8ddc

Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>

lochhh added 21 commits July 17, 2024 10:40

Update docstrings

e02a49c

Add missing docstring

4353e78

Add move accessor examples in docstrings

45664fa

Remove position check in kinematics wrapper

170c946

Changeinterpolate_over_time to operate on num of observations

b9860e7

Add test for different max_gap values

a58d52c

Update filter_and_interpolate.py example

4ff58f5

Fix filtering_wrapper bug

15a734a

Update filter examples

1e9ad9b

Use dictionary update in smooth example

f775815

Move logger assignment to top of file

971e2a2

Add update example to "getting started"

090349a

Cover both dataarray and dataset in test_log_to_attrs

5b06b90

Test that log contains the filtering method applied

9df68d6

Use :py:meth: syntax for xarray.DataArray.squeeze() in examples

a703828

Update reports.py docstrings

874725e

Handle missing individuals and keypoints dims in NaN-reports

d1005c8

Return str in report_nan_values

36b748f

Clean up examples

25ebc50

Convert filtering multiple data variables tip to section

26d9191

Use update() in filter_and_interpolate example

f3f80c3

lochhh force-pushed the dataarray-accessor branch from 1d7a5cc to f3f80c3 Compare July 17, 2024 08:42

lochhh added this pull request to the merge queue Jul 17, 2024

Merged via the queue into main with commit b27d7a2 Jul 17, 2024
27 checks passed

lochhh deleted the dataarray-accessor branch July 17, 2024 09:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor `filtering` module to take DataArrays as input #209

Refactor `filtering` module to take DataArrays as input #209

lochhh commented Jun 6, 2024 •

edited

Loading

codecov bot commented Jun 6, 2024 •

edited

Loading

niksirbi left a comment

lochhh commented Jul 8, 2024 •

edited

Loading

Windows, gaps and time units

`sphinx-gallery` examples

`report_nan_values` and squeezing

niksirbi left a comment

sonarqubecloud bot commented Jul 17, 2024

Refactor filtering module to take DataArrays as input #209

Refactor filtering module to take DataArrays as input #209

Conversation

lochhh commented Jun 6, 2024 • edited Loading

Description

References

How has this PR been tested?

Is this a breaking change?

Does this PR require an update to the documentation?

TO-DO

Checklist:

codecov bot commented Jun 6, 2024 • edited Loading

Codecov Report

niksirbi left a comment

Choose a reason for hiding this comment

Windows, gaps and time units

sphinx-gallery examples

report_nan_values and squeezing

lochhh commented Jul 8, 2024 • edited Loading

Windows, gaps and time units

sphinx-gallery examples

report_nan_values and squeezing

niksirbi left a comment

Choose a reason for hiding this comment

sonarqubecloud bot commented Jul 17, 2024

Quality Gate passed

Refactor `filtering` module to take DataArrays as input #209

Refactor `filtering` module to take DataArrays as input #209

lochhh commented Jun 6, 2024 •

edited

Loading

codecov bot commented Jun 6, 2024 •

edited

Loading

`sphinx-gallery` examples

`report_nan_values` and squeezing

lochhh commented Jul 8, 2024 •

edited

Loading

`sphinx-gallery` examples

`report_nan_values` and squeezing