[BUG] Error with SMOTENC fit_resample: ValueError: could not broadcast input array from shape (137,12) into shape (272,12) #837

Closed
jox79 opened this issue May 10, 2021 · 18 comments · Fixed by #1015

Comments

@jox79

jox79 commented May 10, 2021

Describe the bug

Error with SMOTENC.fit_resample: ValueError: could not broadcast input array from shape (137,12) into shape (272,12)

Steps/Code to Reproduce

Using the two X and y csv dataset attached:

X.zip
y.zip

I'm running:

from imblearn.over_sampling import SMOTENC

# X and y are loaded from the attached CSV files.
smote = SMOTENC(
    categorical_features=[19],
    sampling_strategy="auto",
    random_state=0,
    n_jobs=8,
)
X, y = smote.fit_resample(X, y)

Expected Results

No error is thrown.

Actual Results

File "C:\Users\c42steguerri\PycharmProjects\StrategyLab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 577, in _generate_samples
    ] = self._X_categorical_minority_encoded
ValueError: could not broadcast input array from shape (137,12) into shape (272,12) 
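
The failing line assigns one array into a slice of another. The following NumPy-only sketch reproduces that kind of shape mismatch outside imblearn. The array names and the count of 19 numeric columns are assumptions for illustration; the real arrays are internal to the library.

```python
import numpy as np

# Hypothetical stand-in: 272 samples of the class being over-sampled,
# with 19 numeric columns followed by 12 one-hot encoded columns.
n_continuous = 19
nn_data = np.zeros((272, n_continuous + 12))

# Hypothetical stand-in for the stored encoding: only 137 minority-class rows.
categorical_minority_encoded = np.ones((137, 12))

try:
    # The row counts differ (137 vs. 272), so the assignment cannot broadcast.
    nn_data[:, n_continuous:] = categorical_minority_encoded
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The assignment only succeeds when the stored encoding has as many rows as the class currently being over-sampled.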

Versions

System:
    python: 3.7.7 (tags/v3.7.7:d7c567b08f, Mar 10 2020, 10:41:24) [MSC v.1900 64 bit (AMD64)]
executable: C:\Users\c42steguerri\PycharmProjects\StrategyLab\venv\Scripts\python.exe
   machine: Windows-10-10.0.16299-SP0

Python dependencies:
          pip: 19.0.3
   setuptools: 40.8.0
      sklearn: 0.24.1
        numpy: 1.18.4
        scipy: 1.4.1
       Cython: None
       pandas: 1.0.5
   matplotlib: None
       joblib: 0.14.1
threadpoolctl: 2.0.0

Built with OpenMP: True
@jox79 jox79 changed the title [BUG] [BUG]- error with SMOTENC fit_resample: ValueError: could not broadcast input array from shape (137,12) into shape (272,12 May 12, 2021
@SkylarTrigueiro

I'm having a similar issue with some code I'm testing. If I discover anything I'll let you know.

@chkoar
Member

chkoar commented Jun 1, 2021

What are your imbalanced-learn versions?

@chkoar
Member

chkoar commented Jun 1, 2021

@jox79 please post a code snippet in order to reproduce the error.

@jonasjostmann

jonasjostmann commented Jun 24, 2021

I'm having the same problem. I'm using imbalanced-learn version 0.8.0.

@jonasjostmann

jonasjostmann commented Jun 24, 2021

I have found a rather unattractive workaround for the meantime. I choose sampling_strategy='minority' and loop over all labels.

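import numpy as np
from imblearn.over_sampling import SMOTENC

smotenc = SMOTENC(
    categorical_features=[250],
    random_state=42,
    k_neighbors=5,
    sampling_strategy="minority",
)

# Each pass over-samples only the current minority class.
for label in np.unique(y):
    X, y = smotenc.fit_resample(X, y)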

Did I miss something?

@jox79
Author

jox79 commented Dec 2, 2021

I'm still getting this error with v0.8.1 as well.

File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\base.py", line 83, in fit_resample
    output = self._fit_resample(X, y)
  File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 518, in _fit_resample
    X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
  File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 311, in _fit_resample
    X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0
  File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 103, in _make_samples
    X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps)
  File "C:\CRIF\StrategyOne\S170\wspace\lab\venv\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 577, in _generate_samples
    ] = self._X_categorical_minority_encoded
Exception: could not broadcast input array from shape (6,154) into shape (455,154)

I have no idea how to solve it.

@glemaitre
Member

The issue here is that the internal algorithm was wrongly designed for binary classification only, in the case where the median of the std. dev. is 0. This needs to be adapted to multiclass. I assume that it boils down to storing _X_categorical_minority_encoded for all the classes to be over-sampled and not only for the minority class.
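
To check whether a dataset falls into this edge case, one could compute the median of the per-feature standard deviation over the minority-class rows of the numeric columns. The sketch below is an approximation of that check and not the library's own code; the function name minority_median_std and the toy data are assumptions.

```python
import math
from collections import Counter

import numpy as np


def minority_median_std(X_continuous, y):
    """Median of the per-feature std. dev. over the minority-class rows."""
    X_continuous = np.asarray(X_continuous, dtype=float)
    y = np.asarray(y)
    # The minority class is the label with the fewest samples.
    counts = Counter(y.tolist())
    class_minority = min(counts, key=counts.get)
    X_minority = X_continuous[y == class_minority]
    return float(np.median(X_minority.std(axis=0)))


# Toy data: the two minority rows (class 0) are identical in every column.
X_num = np.array([[1.0, 5.0], [1.0, 5.0], [2.0, 7.0], [3.0, 9.0], [4.0, 8.0]])
y = np.array([0, 0, 1, 1, 1])
hits_edge_case = math.isclose(minority_median_std(X_num, y), 0)
```

If hits_edge_case is true for the numeric columns of a multiclass dataset, the code path described above is the one being taken.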

@glemaitre
Member

In short:

        # we can replace the 1 entries of the categorical features with the
        # median of the standard deviation. It will ensure that whenever
        # distance is computed between 2 samples, the difference will be equal
        # to the median of the standard deviation as in the original paper.

        # In the edge case where the median of the std is equal to 0, the 1s
        # entries will be also nullified. In this case, we store the original
        # categorical encoding which will be later used for inversing the OHE
        if math.isclose(self.median_std_, 0):
            self._X_categorical_minority_encoded = _safe_indexing(
                X_ohe.toarray(), np.flatnonzero(y == class_minority)
            )

Here, we need to store the encoding not only for the minority class but for all classes to be resampled.
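
One way to picture that change is to keep the one-hot rows per class in a dictionary, so that each class being over-sampled can look up an encoding with a matching number of rows. This is only an illustration of the idea and not the code that was merged; encode_per_class and its arguments are hypothetical names.

```python
import numpy as np


def encode_per_class(X_ohe_dense, y, classes_to_resample):
    """Map each class label to the one-hot rows of its own samples."""
    y = np.asarray(y)
    return {
        label: X_ohe_dense[np.flatnonzero(y == label)]
        for label in classes_to_resample
    }


# Toy data: 7 samples, one categorical feature with 3 categories.
X_ohe_dense = np.eye(3)[[0, 1, 2, 0, 1, 2, 0]]
y = np.array(["a", "a", "b", "b", "b", "c", "c"])
stored = encode_per_class(X_ohe_dense, y, classes_to_resample=["a", "c"])
```

With this layout, stored[label] always has as many rows as the samples of that class, which avoids the row-count mismatch behind the broadcast error.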

@jox79
Author

jox79 commented Jan 31, 2022

Is there any way to have this issue fixed in one of the next releases? It is really important in my opinion. Thanks very much!

@glemaitre
Member

@jox79 feel free to open a PR to fix the bug

@freddyaboulton

I put up a fix here @jox79 #905

@kelvinheng92

Hi everyone, can I check the status of this MR? I am facing the same error. However, it's pretty random: sometimes it runs, sometimes it doesn't. Please see the error log below. Thanks a lot!
image

@lolloconsoli

I got the same error. This is the traceback:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_112/2018849994.py in <module>
      6 Y_validation = np.asarray(LabelEncoder().fit_transform(Y_validation))
      7 print(f"Y_type {type(Y_training)}\tshape Y_train {Y_training.shape}")
----> 8 X_training_rus, Y_training_rus = over_sampler.fit_resample(X_train_concat, Y_training)
      9 print("Sampled!")
     10 

/opt/conda/lib/python3.7/site-packages/imblearn/base.py in fit_resample(self, X, y)
     75         check_classification_targets(y)
     76         arrays_transformer = ArraysTransformer(X, y)
---> 77         X, y, binarize_y = self._check_X_y(X, y)
     78 
     79         self.sampling_strategy_ = check_sampling_strategy(

/opt/conda/lib/python3.7/site-packages/imblearn/over_sampling/_random_over_sampler.py in _check_X_y(self, X, y)
    144             accept_sparse=["csr", "csc"],
    145             dtype=None,
--> 146             force_all_finite=False,
    147         )
    148         return X, y, binarize_y

/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    800                     ensure_min_samples=ensure_min_samples,
    801                     ensure_min_features=ensure_min_features,
--> 802                     estimator=estimator)
    803     if multi_output:
    804         y = check_array(y, accept_sparse='csr', force_all_finite=True,

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    596                     array = array.astype(dtype, casting="unsafe", copy=False)
    597                 else:
--> 598                     array = np.asarray(array, order=order, dtype=dtype)
    599             except ComplexWarning:
    600                 raise ValueError("Complex data not supported\n"

/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81 
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84 
     85 

It looks like the problem occurs when it internally calls /opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)

There may be some parameter that needs to be reset: the error is thrown by numpy when it calls array = np.asarray(array, order=order, dtype=dtype)

I checked my input by calling the same np.asarray() function:

print(f"Y_type {type(Y_training)}\tshape Y_train {np.asarray(Y_training).shape}")

and it is:

Y_type <class 'numpy.ndarray'>	shape Y_train (56123,)

I was thinking that maybe the force_all_finite or the ensure_2d arguments are the issue, especially because we can read the lines:

/opt/conda/lib/python3.7/site-packages/imblearn/over_sampling/_random_over_sampler.py in _check_X_y(self, X, y)
    144             accept_sparse=["csr", "csc"],
    145             dtype=None,
--> 146             force_all_finite=False,
    147         )
    148         return X, y, binarize_y

from the traceback.

I don't know whether this makes sense or could be helpful, though. I desperately need a fix for this, hahaha.

@glemaitre
Member

It should be solved in #1015

@LukebethamStonehaven

LukebethamStonehaven commented Sep 7, 2023

Hi @glemaitre, just wondering when this change is going to be released. I think it didn't make it into 0.11.0, right? It seems like #1015 was merged a couple of days after the last release?

@glemaitre
Member

It should already be available in the latest release, 0.11.

@LukebethamStonehaven

Oh right I have updated to 0.11 and am still getting this error - it only seems to happen sometimes though...

@glemaitre
Member

It could be another bug with the same error.
Don't hesitate to open a new issue with a minimal example that triggers the error.
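
A minimal example for a new issue could be sketched as follows. It builds a small synthetic multiclass dataset whose minority class has constant numeric features, which is the zero-median-std edge case discussed earlier in this thread. The helper names (make_zero_std_dataset, run_smotenc), the class sizes, and the column layout are assumptions for illustration, and whether the call raises depends on the installed imbalanced-learn version.

```python
import numpy as np


def make_zero_std_dataset(seed=0):
    """Three classes; the minority class has constant numeric features."""
    rng = np.random.default_rng(seed)
    sizes = {0: 30, 1: 20, 2: 6}  # class 2 is the minority
    X_parts, y_parts = [], []
    for label, n in sizes.items():
        if label == 2:
            numeric = np.full((n, 2), 1.0)  # constant, so std. dev. is 0
        else:
            numeric = rng.normal(size=(n, 2))
        # One categorical column with 3 categories, stored as floats.
        categorical = rng.integers(0, 3, size=(n, 1)).astype(float)
        X_parts.append(np.hstack([numeric, categorical]))
        y_parts.append(np.full(n, label))
    return np.vstack(X_parts), np.concatenate(y_parts)


def run_smotenc(X, y):
    # Imported lazily so the dataset helper works without imbalanced-learn.
    from imblearn.over_sampling import SMOTENC

    sampler = SMOTENC(categorical_features=[2], random_state=0)
    return sampler.fit_resample(X, y)


X, y = make_zero_std_dataset()
```

Calling run_smotenc(X, y) and pasting the full traceback together with the installed versions gives maintainers something they can run directly.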
