# Pass data arrays to filtering functions instead of a dataset #191
An alternative is to separate filtering functions into multiple modules (which we'll end up doing anyway, eventually). The categories I can envision would be:

- Outlier detection: this could reside in its own module; future work on #152 and #151 can also go here and be named accordingly.
- Smoothing: this could reside in its own module.
- Interpolation: this could reside in its own module.
- Utils: this would contain utilities useful for all the above, like …
---

I agree with passing data arrays into the filters (and into utility functions in general), rather than the entire dataset. As for the special case of `filter_by_confidence()` …
---

I kind of imagined …

I like the explicitness. But we may want to leave flexibility for the case in which someone may want to filter by confidence a derived variable of `position` (e.g. velocity; sketched below). I know it is not the classic pipeline we usually think about (remove outliers -> then derive variables from position), but I like that …

On the categories of pre/post-processing, I think it may be early to constrain. I would suggest having loose categories at this point (maybe just one), since adapting things later shouldn't be too costly, right?
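A self-contained sketch of that derived-variable use case (toy data; the finite-difference velocity is only a stand-in for movement's kinematics helpers):

```python
import numpy as np
import xarray as xr

# Toy stand-in for a movement dataset: 10 time points, 2 spatial dims.
ds = xr.Dataset(
    {
        "position": (("time", "space"), np.random.rand(10, 2)),
        "confidence": (("time",), np.random.rand(10)),
    },
    coords={"time": np.arange(10.0), "space": ["x", "y"]},
)

# Derive velocity from position, then filter the *derived* variable
# by confidence, as suggested above.
velocity = ds.position.differentiate("time")
ds["velocity"] = velocity.where(ds.confidence >= 0.6)
```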
---

Yes, that is indeed true. I think the main problem is `filter_by_confidence()`. How about getting rid of it altogether? The only thing the function does is the following:

```python
ds_thresholded = ds.copy()
ds_thresholded.update(
    {"position": ds.position.where(ds.confidence >= threshold)}
)
if print_report:
    report_nan_values(ds, "input dataset")
    report_nan_values(ds_thresholded, "filtered dataset")
return ds_thresholded
```

So, by getting rid of it we would lose the convenient report printing, and the automatic logging of the operation provided by the …

Instead we could simply recommend doing the following:

```python
position_filtered = ds.position.where(ds.confidence >= threshold)
```

That's just native `xarray` syntax (illustrated below).

I do feel bad for all the hard work @b-peri did on the NaN reports, but these were not in vain, because they'll still work for the other filtering functions, and users can always call the `report_nan_values()` function directly. Regarding the …

What do you think about this more "radical" approach? To summarise: …
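A self-contained illustration of the native `.where` pattern recommended above (toy data; values chosen arbitrarily):

```python
import numpy as np
import xarray as xr

position = xr.DataArray(
    np.arange(8.0).reshape(4, 2),
    dims=("time", "space"),
    coords={"time": np.arange(4), "space": ["x", "y"]},
)
confidence = xr.DataArray(
    [0.9, 0.4, 0.8, 0.2], dims="time", coords={"time": np.arange(4)}
)

# .where keeps the array's shape and labels, masking low-confidence
# time points with NaN; (time,) broadcasts against (time, space).
position_filtered = position.where(confidence >= 0.6)
print(int(position_filtered.isnull().sum()))  # 4: two masked time points x two space dims
```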
---
We can continue using the … We didn't like the idea of a …

Again, this seems like overkill; as mentioned before, we can simply do: …

Plus, having the DataArray accessor, e.g. …, perhaps we could consider:

```python
ds["position"] = filter_by_other_ge_threshold(position, other=confidence, threshold=0.6)
ds["velocity"] = filter_by_other_ge_threshold(velocity, other=confidence, threshold=0.6)
# and for convenience
ds = ds.move.filter_by_confidence(threshold=0.6, data_vars=["position", "velocity"])
```

and likewise:

```python
ds["position"] = savgol_filter(position, window_length=1, polyorder=2)
ds["velocity"] = savgol_filter(velocity, window_length=1, polyorder=2)
# and for convenience
ds = ds.move.savgol_filter(window_length=1, polyorder=2, data_vars=["position", "velocity"])
```
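A minimal sketch of what the generic helper above could look like (the body is an assumption; only the name and call signature come from the prototype):

```python
import xarray as xr

def filter_by_other_ge_threshold(
    data: xr.DataArray, other: xr.DataArray, threshold: float
) -> xr.DataArray:
    """Return `data` with NaN wherever `other` is below `threshold`."""
    # xarray broadcasts `other` against `data`, so a per-timepoint
    # confidence array can mask a (time, space) position array.
    return data.where(other >= threshold)
```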
---

Comments to @niksirbi's message:

I am not sure; I do think it is useful for users to have these lines wrapped, even if they are just a few. They will be used quite a lot....

On the other hand, the …

I think logging for reproducibility is a cool idea that we should keep in mind, but I agree that maybe we can revisit this once movement is past the toddler phase :P Shall we make an issue to keep an eye on it?

100% yes 👍 👍
---

Comments to @lochhh:

I really like this prototyping approach, thanks for doing it.

I agree - it is quite clear when the syntax is compared directly.

Agreed (it confuses me a bit at least).

IIUC, we are considering three main syntaxes for filtering by confidence: …

So there is a choice in how much we constrain the function. My (mild) opinion is that we make "constrained" utils if they wrap a specific case that is used a lot, the justification being that it would save the user a lot of typing. With this in mind, my thoughts atm are: …

On the accessor method vs the functional approach, I like the functional one a bit more. I find it less confusing, but that may be down to my specific programming background (matlab cough cough).
---

To clarify, I was referring to having two main syntaxes for filtering: …
---

aah gotcha! I don't feel strongly either way tbh... 🤔 What about just having these options?

```python
# savgol & most filters:
# take a data array as first input; the rest of the inputs are filter-specific parameters
ds["position"] = savgol_filter(ds.position, window_length=1, polyorder=2)
ds["velocity"] = savgol_filter(ds.velocity, window_length=1, polyorder=2)

# filter by confidence following the same structure for the function signature
# (confidence as a filter-specific parameter)
ds["position"] = filter_by_confidence(ds.position, confidence_array=ds.confidence, threshold=0.6)

# then use .where for more flexible conditions
# (parentheses are required: `&` binds tighter than the comparison operators)
ds["position"] = ds.position.where(
    (ds.position.sel(space="x") <= 6000)
    & (ds.position.sel(space="x") >= 2000)
)
```
---

Thanks a lot @sfmig and @lochhh for the back and forth! These discussions are super useful.

**The thing we all agree on**

All functions in the filtering module should accept a `DataArray` and return the modified `DataArray`:

```python
ds["position"] = savgol_filter(ds.position, window_length=1, polyorder=2)
```

We won't plan to implement a …

**Accessor methods for filters**

I also like the idea of having accessor methods that would call the above filters, but can work on multiple arrays at once:

```python
ds = ds.move.savgol_filter(window_length=1, polyorder=2, data_vars=["position", "velocity"])
```

This might look like "two ways of doing the same thing", but I don't see it as such. The whole point of having datasets is that you can group multiple arrays together, so it makes sense to provide methods that simplify operations on multiple related arrays. So to summarise, functions in …

**What about …**
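A rough sketch of how such an accessor method could delegate to the array-level filters (a prototype under assumptions, not movement's actual implementation; the stand-in `savgol_filter` wraps scipy and smooths along the first axis):

```python
import xarray as xr
from scipy.signal import savgol_filter as _scipy_savgol

def savgol_filter(da: xr.DataArray, window_length: int, polyorder: int) -> xr.DataArray:
    # Array-level filter: smooth along the first (time) axis.
    return da.copy(data=_scipy_savgol(da.values, window_length, polyorder, axis=0))

@xr.register_dataset_accessor("move")
class MoveAccessor:
    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def savgol_filter(
        self, window_length: int, polyorder: int, data_vars=("position",)
    ) -> xr.Dataset:
        # Apply the array-level filter to each requested data variable,
        # returning a new dataset (the original is left untouched).
        ds = self._ds.copy()
        for var in data_vars:
            ds[var] = savgol_filter(ds[var], window_length, polyorder)
        return ds
```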
---
Thanks, very nicely summarised, @niksirbi!

I feel like there would be a "surprise" element here: why is …?

The main benefit of preferring the above functions to the simpler …

---
I'm fine with that solution!

---
The attributes are attached to the `DataArray` instead of the `Dataset`, so you would need to do …
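Presumably something along these lines (the `"log"` attribute name is an assumption for illustration):

```python
# The operation log lives on the data variable, not on the dataset:
print(ds["position"].attrs.get("log"))  # per-array provenance
print(ds.attrs.get("log"))              # absent at the dataset level
```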
---

Hmm nice idea, I could see that working...
---

**Is your feature request related to a problem? Please describe.**

Almost all filtering/smoothing operations are meant to operate on a single data array (in most cases the `position` data variable). However, they are currently designed such that they take in an entire `movement` dataset (so `position` + `confidence`) and return a different dataset in which only the `position` variable has been updated.

This made sense for our first filtering function - `filter_by_confidence()`, which needs both variables - but it no longer makes sense for the other filters like `median_filter()` and `savgol_filter()`.

**Describe the solution you'd like**

All filtering functions should explicitly accept a `DataArray` and return the modified `DataArray`. This is the transparent, no-magic option. For the special case of `filter_by_confidence()`, we will have to explicitly provide the `confidence` data array as a 2nd argument.

**Describe alternatives you've considered**

The alternative would be to leave `filter_by_confidence()` as is, and only change the other functions. This would make the interface somewhat inconsistent, though.

**Additional context**

Modifying `filter_by_confidence()` to accept both `position` and `confidence` data arrays as separate arguments would be a bit awkward. Under the hood, we would have to validate that those match (in shape and labels), and this is almost equivalent to re-building a `movement` dataset. It feels like we would be disassembling a dataset to pass it to this function, reassembling it internally to do the validation and thresholding, and then returning only `position`.

We've already touched on this topic in #162, but we didn't agree on a solution. @niksirbi and I agree that the current status quo feels counter-intuitive. Thoughts, @lochhh and @b-peri?
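A sketch of the proposed signatures under this design (illustrative only; parameter names, defaults, and bodies are assumptions):

```python
import xarray as xr

def filter_by_confidence(
    position: xr.DataArray, confidence: xr.DataArray, threshold: float = 0.6
) -> xr.DataArray:
    # Validate that the two arrays share labels, then mask low-confidence values.
    xr.align(position, confidence, join="exact")  # raises if labels don't match
    return position.where(confidence >= threshold)

def median_filter(data: xr.DataArray, window: int) -> xr.DataArray:
    # Rolling median along the time dimension (assumed to be named "time").
    return data.rolling(time=window, center=True).median()
```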