Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor filtering module to take DataArrays as input #209

Merged
merged 38 commits into from
Jul 17, 2024
Merged

Conversation

lochhh
Copy link
Collaborator

@lochhh lochhh commented Jun 6, 2024

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
This PR closes #191 #138 and #235

What does this PR do?

  1. This PR rewrites the filtering module to accept DataArrays (instead of Datasets) as inputs.
  2. A major change from the previous Dataset median_filter and savgol_filter is the window_length (renamed as window to better align with xarray and pandas). Before this PR, the unit of the window is determined based on ds.time_unit. If the time_unit is in seconds, the window supplied will be treated as seconds and converted to frames using window = int(window * ds.fps). Now that we no longer have access to the ds.time_unit and ds.fps attributes, the definition of window is changed to "The size of the filter window, representing the fixed number of observations used for each window".
  3. With use_coordinate=False in xarray.interpolate_na, interpolate_over_time now also operates based on number of consecutive (missing) observations (NaNs). Specifically, max_gap now specifies the maximum number of consecutive NaNs to fill, whilst noting that this differs slightly from max_gap in xarray.interpolate_na.
  4. The filter_and_interpolate and smooth examples have been updated to use the move accessor methods and both "frames" and "time" are mentioned with window sizes.
  5. Existing tests for the overwritten filters are also adapted for the new ones. These new unit tests use the simpler valid_poses_dataset and valid_poses_dataset_with_nan in place of the sample_dataset fixture (now removed).
  6. The helper functions to count NaNs in conftest.py are simplified to calculate NaNs in the entire data array, rather than for a single dimension of a single keypoint from a single individual.
  7. The report_nan_values function has also been adapted to report NaN stats for a single DataArray . This also accounts for cases when the individuals and/or keypoints dimension(s) are missing (e.g. ds.position.sel(individuals="ind1"), ds.position.sel(keypoints="snout").
  8. Add accessor methods for filters (see summary)
  9. To enable 8. new <module>_wrappers have been added to handle forwarding accessor method calls to the respective modules.
  10. Reporting and logging functions in filtering.py have now been moved to reports.py and logging.py.
  11. The usage of update() is documented in getting_started/movement_dataset.md.
  12. Use sphinx.ext.autosectionlabel to allow referencing sections using its title - this is used to reference the Filtering multiple data variables section in the filtering example from the smooth example.

References

#191 , #138 , #235

How has this PR been tested?

Tests were written (and then removed) to first compare equality of the outputs from the initial Dataset filters and the newly added DataArray filters. Existing tests for the overwritten filters are then adapted for the new ones.

Is this a breaking change?

Yes. See points 1 and 2 above.

Does this PR require an update to the documentation?

Affected examples have been updated. Accessor methods are also added to api.rst

TO-DO

  • Discuss points 2 and 3?
  • Demonstrate accessor data_vars usage in examples (this is available as a "tip" in the filter_and_interpolate example); smaller examples are available in the filtering_wrapper docstrings.
  • Reporting functions should deal with cases with missing individuals and/or keypoints dims

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

Copy link

codecov bot commented Jun 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.71%. Comparing base (b98cac0) to head (f3f80c3).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #209   +/-   ##
=======================================
  Coverage   99.70%   99.71%           
=======================================
  Files          12       13    +1     
  Lines         681      702   +21     
=======================================
+ Hits          679      700   +21     
  Misses          2        2           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lochhh lochhh force-pushed the dataarray-accessor branch from 6856806 to 40d61bc Compare June 7, 2024 10:11
@lochhh lochhh requested a review from a team June 7, 2024 16:37
@lochhh lochhh marked this pull request as ready for review June 7, 2024 16:58
Copy link
Member

@niksirbi niksirbi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lochhh for tackling this tricky job!

I think we have to somehow face and solve the issues you yourself have detected.

Windows, gaps and time units

I'm fine with renaming window_length to window. Alignment with pandas and xarray is a good point. That said, I'm not very happy with the introduced inconsistency in time units, between median_filter and savgol_filter on one hand, and interpolate_over_time on the other. Some thoughts on this:

  • you indeed don't have access to ds.fps and ds.time_unit if you only pass a DataArray. On the other hand when calling these filters through the accessor's filtering_wrapper, you do have access to these attributes, so you could do the unit conversion before passing onto the underlying filtering.py function. However, I think that too will be confusing, because then we'll end up with an inconsistency between using these functions directly vs via the accessor.
  • One way to achieve consistency would be to make max_gap in interpolate_over_time also be in frames (observations). I think this could be done by setting use_coordinate=False in xarray's interpolate_na. Alternatively, we could explicitly define an extra coordinate variable - called frames - along the time dimension, and set use_coordinate="frames". This is trickier to do, but it's something we planned to do anyway, and it will pave the way for choosing to work with "real time" or with "frames". Maybe in this PR we can just try to solve it by use_coordinate=False (if it indeed works) and update the docstring of interpolate_over_time accordingly? At least this way we would consistently operate in "frames"/observations throughout, and @sfmig will get her wish.
  • Another way to achieve consistency is to do the opposite, i.e. force median_filter and savgol_filter to also operate in "seconds". Although we don't have access to fps or time_unit, we have access to the time dimension coordinate values. If those are not consecutively increasing integers, we can assume "seconds" and derive the fps from the time delta between coordinate values. In the special case of fps=1, it anyway doesn't make any difference. This solution breaks down for variable time intervals between frames, but we anyway have the constant time sampling assumption baked in (for now), including in the way we compute time derivatives. That said, this solution is kind of hacky, so I'm not super happy about it.

Let me know what you think about these "solutions".

sphinx-gallery examples

  • If we end up adopting the 2nd solution from above (everything works in frames), we could update the examples such that all window lengths and gaps are mentioned in numbers of frames, and hence avoid the "awkwardness" of having to convert each time. If we adopt some version of the 3rd solution, the "awkwardness" will naturally go away.
  • In the kinematics example we present the accessor .move syntax as the preferred way of computing the variables, and only briefly mention what gets called under the hood (aka the "functional" way of doing things). In the filtering-related examples, we present the "functional" way as the only way. When we update these examples, in this or in a future PR, we should present both ways of doing things, but maybe present the .move way as the preferred way for consistency? My point is that we shouldn't let this inconsistency in documentation persist for long.
  • Another thought that came to mind is that in the examples you very often apply a filter to a data variable and then use ds.update to replace that variable in the dataset. If we expect this to be commonly done, should we provide an update or inplace argument directly in the ds.move. filtering calls? So if someone calls ds.move.median_filter(window=3, data_vars=["position", "velocity"], inplace=True), those variables will be updated directly in ds (but inplace=False should still be the default). Is that even possible? If it's too much of a hustle, or if you think it will cause more trouble, forget about it. We have discussed the inplace option before and decided against it, and so did xarray.

report_nan_values and squeezing

  • Do you think we should move the whole machinery about counting and reporting NaNs to a module in utils?
  • To avoid the whole awkwardness with squeezing, should we already "harden" report_nan_values against missing dimensions? So, if the individuals dimension doesn't exist, assume 1 individual, and report per keypoint only. Similarly, if keypoints doesn't exist, assume 1 keypoint and report per individual only. If neither exists, assume 1 individual and 1 keypoint and report the 1 value for the only existing track.

movement/filtering.py Outdated Show resolved Hide resolved
movement/filtering.py Outdated Show resolved Hide resolved
movement/filtering.py Outdated Show resolved Hide resolved
movement/filtering.py Outdated Show resolved Hide resolved
movement/filtering.py Outdated Show resolved Hide resolved
examples/smooth.py Outdated Show resolved Hide resolved
movement/move_accessor.py Outdated Show resolved Hide resolved
docs/source/api_index.rst Outdated Show resolved Hide resolved
movement/filtering.py Outdated Show resolved Hide resolved
movement/move_accessor.py Show resolved Hide resolved
@lochhh lochhh force-pushed the dataarray-accessor branch from 157684b to f0d4f1f Compare July 3, 2024 10:24
@lochhh
Copy link
Collaborator Author

lochhh commented Jul 8, 2024

Thanks for the thorough review @niksirbi !

Windows, gaps and time units

  • One way to achieve consistency would be to make max_gap in interpolate_over_time also be in frames (observations). I think this could be done by setting use_coordinate=False in xarray's interpolate_na. Alternatively, we could explicitly define an extra coordinate variable - called frames - along the time dimension, and set use_coordinate="frames". This is trickier to do, but it's something we planned to do anyway, and it will pave the way for choosing to work with "real time" or with "frames". Maybe in this PR we can just try to solve it by use_coordinate=False (if it indeed works) and update the docstring of interpolate_over_time accordingly? At least this way we would consistently operate in "frames"/observations throughout, and @sfmig will get her wish.

I've decided to go with the simpler use_coordinate=False.

sphinx-gallery examples

  • If we end up adopting the 2nd solution from above (everything works in frames), we could update the examples such that all window lengths and gaps are mentioned in numbers of frames, and hence avoid the "awkwardness" of having to convert each time. If we adopt some version of the 3rd solution, the "awkwardness" will naturally go away.
  • In the kinematics example we present the accessor .move syntax as the preferred way of computing the variables, and only briefly mention what gets called under the hood (aka the "functional" way of doing things). In the filtering-related examples, we present the "functional" way as the only way. When we update these examples, in this or in a future PR, we should present both ways of doing things, but maybe present the .move way as the preferred way for consistency? My point is that we shouldn't let this inconsistency in documentation persist for long.

All done. Wrtinplace and ds.move.update, I've chosen instead to document the use of xarray's own update in Getting started with modifying movement datasets, as you suggested.

report_nan_values and squeezing

  • Do you think we should move the whole machinery about counting and reporting NaNs to a module in utils?
  • To avoid the whole awkwardness with squeezing, should we already "harden" report_nan_values against missing dimensions? So, if the individuals dimension doesn't exist, assume 1 individual, and report per keypoint only. Similarly, if keypoints doesn't exist, assume 1 keypoint and report per individual only. If neither exists, assume 1 individual and 1 keypoint and report the 1 value for the only existing track.

I've moved the nan reporting to utils/reports.py as you suggested. The NaN-reporting utility also now handles data arrays with missing individuals and/or keypoints dimensions, by assuming that there is only a single individual or keypoint.

Copy link
Member

@niksirbi niksirbi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comprehensively addressing all my comments @lochhh!

I only have some remaining suggestions on the examples, and there appears to be an issue with logging all filtering operation to ds.attrs (see my comment below).

Other than that, this PR is good to go.

examples/filter_and_interpolate.py Outdated Show resolved Hide resolved
examples/filter_and_interpolate.py Outdated Show resolved Hide resolved
examples/filter_and_interpolate.py Outdated Show resolved Hide resolved
examples/filter_and_interpolate.py Show resolved Hide resolved
examples/filter_and_interpolate.py Outdated Show resolved Hide resolved
examples/filter_and_interpolate.py Outdated Show resolved Hide resolved
examples/filter_and_interpolate.py Outdated Show resolved Hide resolved
examples/smooth.py Outdated Show resolved Hide resolved
examples/smooth.py Show resolved Hide resolved
examples/smooth.py Outdated Show resolved Hide resolved
@lochhh lochhh force-pushed the dataarray-accessor branch from 1d7a5cc to f3f80c3 Compare July 17, 2024 08:42
Copy link

@lochhh lochhh added this pull request to the merge queue Jul 17, 2024
Merged via the queue into main with commit b27d7a2 Jul 17, 2024
27 checks passed
@lochhh lochhh deleted the dataarray-accessor branch July 17, 2024 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants