Discretizer gives error with NaNs #39

Closed
JanBenisek opened this issue Mar 12, 2021 · 2 comments · Fixed by #52

@JanBenisek
Contributor

(reported by user, to be investigated)

When fitting the preprocessor, some continuous variables raised a ValueError.
I suspect this happens when a continuous variable of type np.float64 has only NaN values (overall or in one of the splits).
The error is probably raised because pandas cannot set the interval index properly.

To be investigated (attached picture and traceback in text file)
image
traceback.txt

@JanBenisek JanBenisek added the bug Something isn't working label Mar 12, 2021
@JanBenisek JanBenisek added this to the v1.0.2 milestone Mar 12, 2021
@JanBenisek JanBenisek added the help wanted Extra attention is needed label Mar 12, 2021
@JanBenisek
Contributor Author

JanBenisek commented Mar 16, 2021

This happens when a continuous variable contains np.inf or -np.inf.
A simple solution is to replace all such values with np.nan:

basetable = basetable.replace(-np.inf, np.nan)
basetable = basetable.replace(np.inf, np.nan)

This should be printed in the log (how many infs there were).
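
A minimal sketch of the replacement with logging, assuming the data is a pandas DataFrame. The helper name `replace_infs` and the logger setup are hypothetical and not part of the library:

```python
import logging

import numpy as np
import pandas as pd

log = logging.getLogger(__name__)


def replace_infs(basetable: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace +/-inf with NaN and log the count."""
    # Only numeric columns can hold infinite values.
    numeric = basetable.select_dtypes(include="number")
    n_infs = int(np.isinf(numeric.to_numpy(dtype=float)).sum())
    if n_infs:
        log.warning("Replaced %d infinite value(s) with NaN", n_infs)
    # replace() accepts a list, so both signs are handled in one call.
    return basetable.replace([np.inf, -np.inf], np.nan)
```

Calling `replace_infs(basetable)` once before fitting would cover both signs and leave non-numeric columns untouched.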

The replacement has to happen before any preprocessing starts (the error occurs because a pandas interval cannot handle these values), probably in this function:

def _fit_column(self, data: pd.DataFrame,

@JanBenisek JanBenisek self-assigned this Mar 16, 2021
@JanBenisek
Contributor Author

JanBenisek commented Mar 17, 2021

After more investigation, I was able to reproduce the bug, and the conclusion from the previous comment is wrong. The correct cause is described below.

Description

What actually happens is that, during binning, sorting the bin edges does not always produce the same ordering when the edges contain NaN.
Consider two variables that are binned (_compute_bin_edges()) with n_bins=2 (the problem also occurs with other numbers of bins):

df_base['var_A'].quantile(np.linspace(0, 1, n_bins + 1),interpolation='linear')
0.0       NaN
0.5   -0.0001
1.0       NaN
Name: var_A, dtype: float64

So the unsorted bins look like this:
[nan, -9.973934819479391e-05, nan]

df_base['var_B'].quantile(np.linspace(0, 1, n_bins + 1),interpolation='linear')
0.0    NaN
0.5    0.0
1.0    NaN
Name: var_B, dtype: float64

And here the unsorted bins look like this:
[nan, 0.0, nan]

However, after sorting, the results differ:

var_A:
[nan, nan, -9.973934819479391e-05]

var_B:
[nan, 0.0, nan]

You can easily test this:

test1 = [float('nan'), 0.0, float('nan')]
test2 = [float('nan'), -9.973934819479391e-05, float('nan')]

sorted(set(test1))
# gives [nan, 0.0, nan]

sorted(set(test2))
# gives [nan, nan, -9.973934819479391e-05]

Interestingly, when the float is smaller (exponent e-04 instead of e-05), the problem disappears:
test2 = [float('nan'), -9.973934819479391e-04, float('nan')]
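
The order dependence comes from the fact that every comparison involving NaN is False, so sorted() has no consistent ordering to work with. The result then depends on the order in which the set yields its elements, which in turn depends on their hashes. A small sketch of this, with a hypothetical helper `sort_without_nan` that drops NaN before sorting:

```python
import math

nan = float("nan")

# Every ordering or equality comparison that involves NaN is False,
# so NaN cannot be placed consistently relative to other values.
print(nan < 0.0, nan > 0.0, nan == nan)  # False False False


def sort_without_nan(values):
    """Hypothetical helper: deduplicate, drop NaN, sort ascending."""
    return sorted({v for v in values if not math.isnan(v)})


print(sort_without_nan([nan, -9.973934819479391e-05, nan]))
```

Once NaN is filtered out, the remaining values have a total order and the result no longer depends on the input arrangement.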

Now, when the interval index is constructed with pd.IntervalIndex.from_tuples(_intervals, closed) in _create_index(), var_B is fine:
(the output below is printed from _create_index() with print(f" bin: {intervals}, intervals: {_intervals}"))

bin: [(nan, 0.0), (0.0, nan)], intervals: [(-inf, 0.0), (0.0, inf)]

but for var_A, we get this:

bin: [(nan, nan), (nan, -0.0)], intervals: [(-inf, nan), (nan, inf)]

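This raises the following error during transform():
ValueError: missing values must be missing in the same location both left and right sides
The error occurs because, for var_A, NaN is the right edge of the first interval and the left edge of the second, so the missing values are not in the same position on both sides.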
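
The two cases can be compared directly against pandas. This is a sketch and not the fix that was merged. `first_nan_interval` is a hypothetical pre-check, and the exact point at which pandas validates the intervals is an assumption based on recent pandas versions, where the check runs when the index is constructed:

```python
import math

import numpy as np
import pandas as pd


def first_nan_interval(intervals):
    """Hypothetical pre-check: return the first (left, right) tuple that
    contains NaN, or None when no edge is NaN."""
    for left, right in intervals:
        if math.isnan(left) or math.isnan(right):
            return (left, right)
    return None


good = [(-np.inf, 0.0), (0.0, np.inf)]       # intervals printed for var_B
bad = [(-np.inf, np.nan), (np.nan, np.inf)]  # intervals printed for var_A

# The var_B intervals contain no NaN and build a two-bin index.
index = pd.IntervalIndex.from_tuples(good, closed="right")

# The var_A intervals have NaN on opposite sides and are rejected.
try:
    pd.IntervalIndex.from_tuples(bad, closed="right")
except ValueError as err:
    print(f"pandas rejected the intervals: {err}")
```

A check like `first_nan_interval` could be run on the intervals before the index is built, so that the offending variable is reported with a clearer message.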

Extra sources:

JanBenisek added a commit that referenced this issue Mar 17, 2021
@JanBenisek JanBenisek linked a pull request Mar 17, 2021 that will close this issue
JanBenisek added a commit that referenced this issue Mar 19, 2021
JanBenisek added a commit that referenced this issue Apr 2, 2021
JanBenisek added a commit that referenced this issue Apr 2, 2021