Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom CV #10

Open
wants to merge 14 commits into
base: dev
Choose a base branch
from

Conversation

AntonBiryukovUofC
Copy link

This PR adds ability to use a custom sklearn-compatible CV generator, as long as it compatible with cross_val_score.

I have also provided a test for test_score_cv that iterates over different cross-validation strategies implemented in sklearn.model_selection.
(branch has regression in the name, but I think the approach should work for both regression and classification problems)
I still need to carve time to devise a test for " feature leakage".

@AntonBiryukovUofC
Copy link
Author

@8080labs thoughts ?


# Crossvalidation is stratifiedKFold for classification, KFold for regression
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to keep the annotation somewhere when no explicit CV object is passed but just the number of folds

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is already in the score function. I also added it to the docstring

# if there is a strong pattern in the rows eg 0,0,0,0,1,1,1,1
# then this will lead to problems because the first cv sees mostly 0 and the later 1
# this approach might be wrong for timeseries because it might leak information
df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case that we use the default CV, I think that we still need this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think I meant to bring it back and forgot..

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, no worries :)

@@ -158,7 +185,7 @@ def _infer_task(df, x, y):
if category_count == 2:
return "classification"
if category_count == len(df[y]) and (
is_string_dtype(df[y]) or is_categorical_dtype(df[y])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using black as a default formatter

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take care of it :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you. We also have to document this somewhere

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you do not seem to use any specific settings for black, right ? (nothing I found in the repo at least)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, just the standard

@8080labs
Copy link
Owner

8080labs commented May 18, 2020

Thank you for your PR! I had a first glance on the changes and made some comments. But I definitely need some more time to have a deeper look at it and try it out myself

@AntonBiryukovUofC
Copy link
Author

AntonBiryukovUofC commented May 22, 2020

BS commit there is related to a bug in Github - for some reason commit #5 did not show up in the PR.

Otherwise @8080labs let me know what you think re:code changes.

cv = CV_ITERATIONS
df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a notion that this might be too easy. For example, when the user just wants to use 6 folds and passes cv=6, then we still need to perform the random resampling/shuffling in case of a normal KFold or stratifiedKFold.
Just in case of a CV strategy that needs to keep the order of the rows in the dataset, there should be no reshuffling.
What do you think about this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you wanted to keep it in the default case only, hence I added it back as per your comment above. Alternatively, one can pass KFold with shuffle either True, or False, thus explicitly setting a CV procedure.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, setting CV to a train-test index generator as opposed to an integer means the user is doing it willingly, with understanding on consequences and a specific purpose in mind

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe I just do not understand what you d like me to do here :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the most flexible and robust solution is to perform the default shuffling when the user passes no CV or an integer (assuming he wants to adjust the number of folds). In all the other cases, we do not have to shuffle as he has to pass a valid CV iterator.

What do you think about this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok i see. As of now it shuffles only if no CV is passed, and you're saying "let's shuffle the data if no CV is passed or if an integer is passed", is that correct ?

If the above is True, then I see your point and will adjust the if-statement above in the code accordingly.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks fine, thank you :)

@AntonBiryukovUofC
Copy link
Author

AntonBiryukovUofC commented May 29, 2020

Should we merge it then, or add a few specific CV tests (like timeseriees one)?

@8080labs
Copy link
Owner

Before merge I will have a look over everything again to make sure that it is consistent and we did not miss anything. Afterwards, I will get back to you

Thank you so much,
Florian

@AntonBiryukovUofC
Copy link
Author

Yeah, fair!

@tkrabel
Copy link
Collaborator

tkrabel commented Jun 9, 2020

@AntonBiryukovUofC FYI: we are currently reviewing everything. Just wanted to communicate that there is progress :)

@AntonBiryukovUofC
Copy link
Author

are you guys alive out there ? :)

@FlorianWetschoreck
Copy link
Collaborator

Hi Anton,

sorry for letting you wait so long. We hope that you (and potentially others) can still use your branch internally if you need it.
Your PR will need some extensive time for the final review in order to make sure that we did not miss something and thus we did not find the time yet because we are currently busy with many other tasks and deadlines.

All the best,
Florian

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants