-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Custom CV #10
base: dev
Are you sure you want to change the base?
Custom CV #10
Conversation
@8080labs thoughts ? |
|
||
# Crossvalidation is stratifiedKFold for classification, KFold for regression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to keep the annotation somewhere when no explicit CV object is passed but just the number of folds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is already in the score
function. I also added it to the docstring
# if there is a strong pattern in the rows eg 0,0,0,0,1,1,1,1 | ||
# then this will lead to problems because the first cv sees mostly 0 and the later 1 | ||
# this approach might be wrong for timeseries because it might leak information | ||
df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case that we use the default CV, I think that we still need this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think I meant to bring it back and forgot..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, no worries :)
src/ppscore/calculation.py
Outdated
@@ -158,7 +185,7 @@ def _infer_task(df, x, y): | |||
if category_count == 2: | |||
return "classification" | |||
if category_count == len(df[y]) and ( | |||
is_string_dtype(df[y]) or is_categorical_dtype(df[y]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using black as a default formatter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will take care of it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thank you. We also have to document this somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you do not seem to use any specific settings for black, right ? (nothing I found in the repo at least)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, just the standard
Thank you for your PR! I had a first glance on the changes and made some comments. But I definitely need some more time to have a deeper look at it and try it out myself |
cv = CV_ITERATIONS | ||
df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a notion that this might be too easy. For example, when the user just wants to use 6 folds and passes cv=6, then we still need to perform the random resampling/shuffling in case of a normal KFold or stratifiedKFold.
Just in case of a CV strategy that needs to keep the order of the rows in the dataset, there should be no reshuffling.
What do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you wanted to keep it in the default case only, hence I added it back as per your comment above. Alternatively, one can pass KFold with shuffle either True, or False, thus explicitly setting a CV procedure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, setting CV to a train-test index generator as opposed to an integer means the user is doing it willingly, with understanding on consequences and a specific purpose in mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe I just do not understand what you d like me to do here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the most flexible and robust solution is to perform the default shuffling when the user passes no CV or an integer (assuming he wants to adjust the number of folds). In all the other cases, we do not have to shuffle as he has to pass a valid CV iterator.
What do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok i see. As of now it shuffles only if no CV is passed, and you're saying "let's shuffle the data if no CV is passed or if an integer is passed", is that correct ?
If the above is True, then I see your point and will adjust the if-statement above in the code accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks fine, thank you :)
Should we merge it then, or add a few specific CV tests (like timeseriees one)? |
Before merge I will have a look over everything again to make sure that it is consistent and we did not miss anything. Afterwards, I will get back to you Thank you so much, |
Yeah, fair! |
@AntonBiryukovUofC FYI: we are currently reviewing everything. Just wanted to communicate that there is progress :) |
are you guys alive out there ? :) |
Hi Anton, sorry for letting you wait so long. We hope that you (and potentially others) can still use your branch internally if you need it. All the best, |
This PR adds ability to use a custom sklearn-compatible CV generator, as long as it compatible with
cross_val_score
.I have also provided a test for
test_score_cv
that iterates over different cross-validation strategies implemented insklearn.model_selection
.(branch has
regression
in the name, but I think the approach should work for both regression and classification problems)I still need to carve time to devise a test for " feature leakage".