[FEATURE] Ability to stratify on cols that contain some NaN values, so that people can hyperparameter tune the best imputation method #681

Open
3 tasks
dec1costello opened this issue Jun 26, 2024 · 1 comment
Labels
enhancement New feature or request

dec1costello commented Jun 26, 2024

Hello!

  • I have a training pipeline that hyperparameter tunes the choice of imputation method
  • My pipeline fails because sklearn's train_test_split(stratify=stratify_data) does not stratify adequately when the stratification cols contain NaN values
  • Curious whether this seems like a scikit-lego feature people would want

Here's my attempt to stratify on cols with some NaNs, for more context. I am a beginner, so I am open to better ideas, or to comments that this feature request is out of scope. Thanks in advance! I appreciate everyone's contributions to this package!

Stratification attempt:

from sklearn.model_selection import train_test_split

# result_df and feature_cols are defined earlier in the pipeline (not shown)
X = result_df[feature_cols]
y = result_df['strokes_to_hole_out']

# Extract the columns for stratification
stratify_cols = ['from_location_scorer', 'from_location_laser']
stratify_data = result_df[stratify_cols]

# Split the data, using 'stratify_data' for stratification
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=stratify_data
)

Error I receive during training: Trial failed with exception: Found unknown categories ['blue'] in column 9 during transform
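
For the stratification step itself, one possible workaround is to treat a missing value as its own stratum and merge the stratification columns into a single key. The sketch below is only an illustration under stated assumptions: the toy DataFrame, the "dist" feature column, and the "missing" placeholder label are all made up, and nothing here is an existing scikit-lego feature.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for result_df (all values are made up).
result_df = pd.DataFrame({
    "from_location_scorer": ["green", "fairway", np.nan, "rough"] * 25,
    "from_location_laser": ["green", np.nan] * 50,
    "dist": np.arange(100, dtype=float),
    "strokes_to_hole_out": np.arange(100) % 4 + 1,
})
feature_cols = ["from_location_scorer", "from_location_laser", "dist"]
stratify_cols = ["from_location_scorer", "from_location_laser"]

X = result_df[feature_cols]
y = result_df["strokes_to_hole_out"]

# Treat NaN as an explicit category, then join the columns into one key
# so that each combination (including "missing") becomes a stratum.
strat_key = (
    result_df[stratify_cols]
    .fillna("missing")
    .astype(str)
    .apply("|".join, axis=1)
)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=strat_key
)

# Only the key is modified: X still contains its NaNs for the imputer to handle.
valid_counts = strat_key.loc[X_valid.index].value_counts()
```

Because only the stratification key is filled, the NaNs in X_train and X_valid are left in place, so the imputation methods being tuned still see the original missing values.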

dec1costello added the enhancement label Jun 26, 2024
@FBruzzesi
Collaborator

Hey @dec1costello, thanks for the feature request. I have a few questions:

  • Could you provide some minimal input data?
  • Could you provide some minimal expected output data?
  • The error seems to be related to a transformer failing in the .transform(X_valid) step. How would the proposal fix that?
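
To make the last question concrete, here is a minimal sketch of how such an error can arise. It assumes the failing transformer is sklearn's OneHotEncoder with its default handle_unknown="error" (the issue does not say which transformer is used), and the data is made up so that 'blue' appears only in the validation rows.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up data: the category 'blue' never appears in the training rows.
X_train = np.array([["red"], ["green"], ["red"]], dtype=object)
X_valid = np.array([["green"], ["blue"]], dtype=object)

# Default behaviour: categories unseen during fit raise at transform time.
strict = OneHotEncoder()
strict.fit(X_train)
try:
    strict.transform(X_valid)
    message = None
except ValueError as exc:
    message = str(exc)

# With handle_unknown="ignore", an unseen category is encoded as all zeros.
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(X_train)
encoded = lenient.transform(X_valid).toarray()

print(message)
print(encoded)
```

Under this assumption, the failure depends on whether every category lands in the training split, so it can occur with or without NaN values in the stratification columns.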
