Refactor filtering module to take DataArrays as input #209
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main     #209   +/- ##
=====================================
  Coverage   99.70%   99.71%
=====================================
  Files          12       13    +1
  Lines         681      702   +21
=====================================
+ Hits          679      700   +21
  Misses          2        2

View full report in Codecov by Sentry.
Thanks @lochhh for tackling this tricky job!
I think we have to somehow face and solve the issues you yourself have detected.

Windows, gaps and time units

I'm fine with renaming `window_length` to `window`. Alignment with `pandas` and `xarray` is a good point. That said, I'm not very happy with the introduced inconsistency in time units between `median_filter` and `savgol_filter` on one hand, and `interpolate_over_time` on the other. Some thoughts on this:
- You indeed don't have access to `ds.fps` and `ds.time_unit` if you only pass a `DataArray`. On the other hand, when calling these filters through the accessor's `filtering_wrapper`, you do have access to these attributes, so you could do the unit conversion before passing on to the underlying `filtering.py` function. However, I think that too will be confusing, because then we'll end up with an inconsistency between using these functions directly vs via the accessor.
- One way to achieve consistency would be to make `max_gap` in `interpolate_over_time` also be in frames (observations). I think this could be done by setting `use_coordinate=False` in xarray's `interpolate_na`. Alternatively, we could explicitly define an extra coordinate variable, called `frames`, along the `time` dimension, and set `use_coordinate="frames"`. This is trickier to do, but it's something we planned to do anyway, and it will pave the way for choosing to work with "real time" or with "frames". Maybe in this PR we can just try to solve it with `use_coordinate=False` (if it indeed works) and update the docstring of `interpolate_over_time` accordingly? At least this way we would consistently operate in "frames"/observations throughout, and @sfmig will get her wish.
- Another way to achieve consistency is to do the opposite, i.e. force `median_filter` and `savgol_filter` to also operate in "seconds". Although we don't have access to `fps` or `time_unit`, we have access to the `time` dimension coordinate values. If those are not consecutively increasing integers, we can assume "seconds" and derive the `fps` from the time delta between coordinate values. In the special case of `fps=1`, it anyway doesn't make any difference. This solution breaks down for variable time intervals between frames, but we anyway have the constant-time-sampling assumption baked in (for now), including in the way we compute time derivatives. That said, this solution is kind of hacky, so I'm not super happy about it.
Let me know what you think about these "solutions".
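The `use_coordinate=False` idea can be sketched roughly as below (the toy array and the 30 fps sampling rate are made up for illustration). Note that xarray measures an interior gap as the coordinate distance between the bounding valid points, so on an integer index a gap of n consecutive NaNs has length n + 1:

```python
import numpy as np
import xarray as xr

# Toy 1-D track with a 2-NaN gap and a 4-NaN gap (values are made up)
values = np.arange(12, dtype=float)
values[2:4] = np.nan   # short gap: 2 consecutive NaNs
values[6:10] = np.nan  # long gap: 4 consecutive NaNs
da = xr.DataArray(values, dims="time", coords={"time": np.arange(12) / 30})

# With use_coordinate=False, gaps are measured on the integer index
# instead of the time coordinate; an interior gap of n NaNs has
# length n + 1, so max_gap=3 fills gaps of up to 2 consecutive NaNs.
filled = da.interpolate_na(
    dim="time", method="linear", use_coordinate=False, max_gap=3
)
print(int(filled.isnull().sum()))  # only the 4-NaN gap remains unfilled
```

This is the "differs slightly" subtlety: a `max_gap` expressed as a number of NaNs needs +1 when handed to `interpolate_na`.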
`sphinx-gallery` examples
- If we end up adopting the 2nd solution from above (everything works in frames), we could update the examples such that all window lengths and gaps are mentioned in numbers of frames, and hence avoid the "awkwardness" of having to convert each time. If we adopt some version of the 3rd solution, the "awkwardness" will naturally go away.
- In the kinematics example we present the accessor `.move` syntax as the preferred way of computing the variables, and only briefly mention what gets called under the hood (aka the "functional" way of doing things). In the filtering-related examples, we present the "functional" way as the only way. When we update these examples, in this or in a future PR, we should present both ways of doing things, but maybe present the `.move` way as the preferred one for consistency? My point is that we shouldn't let this inconsistency in documentation persist for long.
- Another thought that came to mind is that in the examples you very often apply a filter to a data variable and then use `ds.update` to replace that variable in the dataset. If we expect this to be commonly done, should we provide an `update` or `inplace` argument directly in the `ds.move.*` filtering calls? So if someone calls `ds.move.median_filter(window=3, data_vars=["position", "velocity"], inplace=True)`, those variables will be updated directly in `ds` (but `inplace=False` should still be the default). Is that even possible? If it's too much of a hassle, or if you think it will cause more trouble, forget about it. We have discussed the `inplace` option before and decided against it, and so did `xarray`.
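A minimal sketch of how such an `inplace` flag could work via a registered accessor. The accessor name `move_demo`, the rolling-median implementation, and the variable names are all assumptions for illustration, not movement's actual code:

```python
import numpy as np
import xarray as xr


@xr.register_dataset_accessor("move_demo")  # hypothetical name, avoids clashing with .move
class MoveDemoAccessor:
    def __init__(self, ds):
        self._ds = ds

    def median_filter(self, window, data_vars=("position",), inplace=False):
        # Work on the dataset itself, or on a deep copy, depending on inplace
        target = self._ds if inplace else self._ds.copy(deep=True)
        for name in data_vars:
            target[name] = (
                target[name]
                .rolling(time=window, center=True, min_periods=1)
                .median()
            )
        return None if inplace else target


ds = xr.Dataset(
    {"position": ("time", np.array([0.0, 10.0, 2.0, 3.0, 4.0]))},
    coords={"time": np.arange(5)},
)
smoothed = ds.move_demo.median_filter(window=3)     # ds left untouched
ds.move_demo.median_filter(window=3, inplace=True)  # ds modified in place
```

Because accessor instances hold a reference to the (mutable) `Dataset`, in-place assignment of variables is technically possible; whether it is desirable is exactly the question raised above.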
`report_nan_values` and squeezing
- Do you think we should move the whole machinery about counting and reporting NaNs to a module in `utils`?
- To avoid the whole awkwardness with squeezing, should we already "harden" `report_nan_values` against missing dimensions? So, if the `individuals` dimension doesn't exist, assume 1 individual, and report per keypoint only. Similarly, if `keypoints` doesn't exist, assume 1 keypoint and report per individual only. If neither exists, assume 1 individual and 1 keypoint and report the 1 value for the only existing track.
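The "hardening" could amount to reducing over every dimension except whichever label dimensions are actually present. A sketch under assumed names (this is not the actual `report_nan_values` implementation):

```python
import numpy as np
import xarray as xr


def count_nans(da):
    """Hypothetical helper: count NaNs per label, tolerating missing dims.

    Reduces over every dimension except whichever of the two label
    dimensions (individuals, keypoints) exist on the input DataArray.
    """
    label_dims = [d for d in ("individuals", "keypoints") if d in da.dims]
    other_dims = [d for d in da.dims if d not in label_dims]
    # Result is 0-d if neither label dimension exists (a single track)
    return da.isnull().sum(other_dims)


data = np.zeros((4, 2))
data[0, 0] = np.nan
da = xr.DataArray(
    data,
    dims=("time", "keypoints"),
    coords={"keypoints": ["snout", "tail"]},
)
per_keypoint = count_nans(da)                         # per-keypoint counts
single_track = count_nans(da.sel(keypoints="snout"))  # 0-d: one value
```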
Thanks for the thorough review @niksirbi !
I've decided to go with the simpler
All done. Wrt
I've moved the nan reporting to
Thanks for comprehensively addressing all my comments @lochhh!
I only have some remaining suggestions on the examples, and there appears to be an issue with logging all filtering operations to `ds.attrs` (see my comment below).
Other than that, this PR is good to go.
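On logging to attrs: a common pattern for recording operations on an xarray object is a JSON-serialised log entry in `attrs`. This is a generic sketch; the function and attribute names are assumptions, not movement's actual logging code:

```python
import json
import numpy as np
import xarray as xr


def log_to_attrs(da, operation, **params):
    # Hypothetical helper: append a record of an operation to a
    # JSON-encoded "log" entry in the DataArray's attrs.
    log = json.loads(da.attrs.get("log", "[]"))
    log.append({"operation": operation, **params})
    out = da.copy()
    out.attrs["log"] = json.dumps(log)
    return out


da = xr.DataArray(np.zeros(5), dims="time")
da = log_to_attrs(da, "median_filter", window=3)
da = log_to_attrs(da, "savgol_filter", window=5, polyorder=2)
print(json.loads(da.attrs["log"]))
```

Serialising to a string keeps the log round-trippable through netCDF, which only allows simple attr types.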
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
Description
What is this PR?
Why is this PR needed?
This PR closes #191, #138 and #235.
What does this PR do?
1. Refactors the `filtering` module to accept `DataArray`s (instead of `Dataset`s) as inputs.
2. A notable breaking change in `median_filter` and `savgol_filter` is the `window_length` argument (renamed as `window` to better align with xarray and pandas). Before this PR, the unit of the `window` is determined based on `ds.time_unit`: if the `time_unit` is in seconds, the `window` supplied will be treated as seconds and converted to frames using `window = int(window * ds.fps)`. Now that we no longer have access to the `ds.time_unit` and `ds.fps` attributes, the definition of `window` is changed to "The size of the filter window, representing the fixed number of observations used for each window".
3. With `use_coordinate=False` in `xarray.interpolate_na`, `interpolate_over_time` now also operates based on the number of consecutive (missing) observations (NaNs). Specifically, `max_gap` now specifies the maximum number of consecutive NaNs to fill, whilst noting that this differs slightly from `max_gap` in `xarray.interpolate_na`.
4. The `filter_and_interpolate` and `smooth` examples have been updated to use the `move` accessor methods, and both "frames" and "time" are mentioned alongside window sizes.
5. Tests use `valid_poses_dataset` and `valid_poses_dataset_with_nan` in place of the `sample_dataset` fixture (now removed).
6. NaN-calculating helpers in `conftest.py` are simplified to calculate NaNs in the entire data array, rather than for a single dimension of a single keypoint from a single individual.
7. The `report_nan_values` function has also been adapted to report NaN stats for a single `DataArray`. This also accounts for cases when the `individuals` and/or `keypoints` dimension(s) are missing (e.g. `ds.position.sel(individuals="ind1")`, `ds.position.sel(keypoints="snout")`).
8. `<module>_wrapper`s have been added to handle forwarding accessor method calls to the respective modules.
9. NaN-reporting and logging functionalities in `filtering.py` have now been moved to `reports.py` and `logging.py`.
10. `update()` is documented in `getting_started/movement_dataset.md`.
11. Added `sphinx.ext.autosectionlabel` to allow referencing sections using their titles; this is used to reference the "Filtering multiple data variables" section in the `filtering` example from the `smooth` example.

References
#191, #138, #235
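For illustration, the pre-PR seconds-to-frames unit handling for `window` amounted to the following (the fps and window values here are made up):

```python
# Sketch of the pre-PR behaviour: when ds.time_unit was "seconds",
# a user-supplied window was converted to frames via ds.fps.
fps = 30              # assumed ds.fps
window_seconds = 0.5  # assumed user-supplied window, in seconds
window = int(window_seconds * fps)  # frames-based window used internally
print(window)
```

After this PR, callers pass the number of observations directly, so no such conversion happens inside the filters.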
How has this PR been tested?
Tests were written (and then removed) to first compare equality of the outputs from the initial `Dataset` filters and the newly added `DataArray` filters. Existing tests for the overwritten filters are then adapted for the new ones.

Is this a breaking change?
Yes. See points 1 and 2 above.
Does this PR require an update to the documentation?
Affected examples have been updated. Accessor methods are also added to `api.rst`.

TO-DO
- `data_vars` usage in examples (this is available as a "tip" in the `filter_and_interpolate` example); smaller examples are available in the `filtering_wrapper` docstrings.
- `individuals` and/or `keypoints` dims

Checklist: