Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: return entire array if no outliers are found #181

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gengotech
Copy link

fix in the filter_outliers method, with previous behavior if no outliers were found, an empty array would have been returned. this results in adjusted ur calculation being None in some replays (while normal ur calculation would be fine)

this conflicts with the usage and documentation of the method, which states that it'd return the array with the outliers removed.

to fix this, we simply return the original array if no outliers were found.

@tybug
Copy link
Contributor

tybug commented Feb 16, 2024

am I misreading or doesn't this return the entire array if everything is an outlier? if nothing if an outlier, then this function still returns the entire array.

I think it's actually somewhat valid to return an invalid adjusted ur when everything is an outlier (because it's very unreliable if everything is an outlier), but open to opinions.

@gengotech
Copy link
Author

if everything is an outlier?

i don't see how that's possible with how interquartile range was implemented, could you provide an example?

@tybug
Copy link
Contributor

tybug commented Feb 16, 2024

I mean, I'm just reading the code as written:

    arr_without_outliers = [x for x in arr if lower_limit < x < upper_limit]
    return arr if not arr_without_outliers else arr_without_outliers

this returns the original array arr if arr_without_outliers is empty (i.e. the array was empty to begin with, or everything in the array is an outlier)

@tybug
Copy link
Contributor

tybug commented Feb 16, 2024

put another way: this function in master already does what this pr says, which is return the entire array when there are no outliers. so something is being lost in translation here

@gengotech
Copy link
Author

skipping semantics, as it is in master an .ur(adjusted=True) calculation will return None if there are no outliers

if that's an intended outcome then you can close the pr, per personal preference i need it to return the same value as adjusted=False instead of None and merged downstream incase it wasnt intended (wasnt clear to me)

@tybug
Copy link
Contributor

tybug commented Feb 16, 2024

skipping semantics, as it is in master an .ur(adjusted=True) calculation will return None if there are no outliers

do you have an example? I can't reproduce this. If this is the case, that's definitely a bug, like you mentioned.

@gengotech
Copy link
Author

do you have an example? I can't reproduce this. If this is the case, that's definitely a bug, like you mentioned.

it usually happens at very low cvUR, one example would be this replay

it has lower limit of 2 and upper limit of 2 so [x for x in arr if lower_limit < x < upper_limit] would return an empty array (and thus UR calc would return None)

another option could be including the boundary with lower_limit <= x <= upper_limit

image

@tybug
Copy link
Contributor

tybug commented Feb 16, 2024

That's a case where everything is an outlier, not nothing.

fwiw, I get nan and an error for normal and adjusted ur respectively for that replay:

r = cg.ReplayMap(1711983, 33839341)
print(cg.ur(r)) # nan
print(cg.ur(r, adjusted=True)) # error
...
  File "/opt/homebrew/lib/python3.12/site-packages/numpy/lib/function_base.py", line 4830, in _quantile
    slices_having_nans = np.isnan(arr[-1, ...])
                                  ~~~^^^^^^^^^
IndexError: index -1 is out of bounds for axis 0 with size 0

I'm fine with including the endpoints for outlier filtering to avoid this in the case of 0 iqr. We also probably want to raise an error when calculating the ur of a replay with < 3 hits (I thought we already did, to be honest).

@gengotech
Copy link
Author

fwiw, I get nan and an error for normal and adjusted ur respectively for that replay:

Huh, weird. I ran your script and could not reproduce the results on circleguard 5.4.1, I get an UR calculation of 4.705853985027059 and the adjusted UR d oes return nan with the following error

venv/lib/python3.11/site-packages/numpy/core/_methods.py:206: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
venv/lib/python3.11/site-packages/numpy/core/_methods.py:163: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean,
venv/lib/python3.11/site-packages/numpy/core/_methods.py:198: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
nan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants