Custom CV #10

AntonBiryukovUofC · 2020-05-13T21:14:19Z

This PR adds the ability to use a custom sklearn-compatible CV generator, as long as it is compatible with cross_val_score.

I have also provided a test, test_score_cv, that iterates over different cross-validation strategies implemented in sklearn.model_selection.
(The branch has "regression" in the name, but I think the approach should work for both regression and classification problems.)
I still need to carve out time to devise a test for "feature leakage".
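
For illustration, here is a minimal sketch of what "compatible with cross_val_score" can mean in practice: the `cv` argument accepts either an integer or a splitter object from sklearn.model_selection. The data, the `DecisionTreeRegressor` model, the scoring metric, and the `cv_options` dictionary are assumptions made up for this example, not code from the PR.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

model = DecisionTreeRegressor(random_state=0)

# Three ways to specify cross-validation for cross_val_score:
# a plain fold count, a shuffled KFold, and an order-preserving splitter.
cv_options = {
    "int": 4,
    "kfold": KFold(n_splits=4, shuffle=True, random_state=0),
    "timeseries": TimeSeriesSplit(n_splits=4),
}

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    for name, cv in cv_options.items()
}

for name, fold_scores in scores.items():
    print(name, fold_scores.mean())
```

Each entry yields one score per fold, so any object accepted here could in principle be forwarded by a `cv` parameter.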

AntonBiryukovUofC · 2020-05-15T23:36:57Z

@8080labs thoughts?

8080labs · 2020-05-18T05:30:36Z

src/ppscore/calculation.py


-    # Crossvalidation is stratifiedKFold for classification, KFold for regression


It would be great to keep the annotation somewhere for the case when no explicit CV object is passed, only the number of folds.

It is already in the score function. I also added it to the docstring.

8080labs · 2020-05-18T05:33:38Z

src/ppscore/calculation.py

-    # if there is a strong pattern in the rows eg 0,0,0,0,1,1,1,1
-    # then this will lead to problems because the first cv sees mostly 0 and the later 1
-    # this approach might be wrong for timeseries because it might leak information
-    df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False)


In case that we use the default CV, I think that we still need this

Yeah, I think I meant to bring it back and forgot.

Alright, no worries :)
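
To make the concern in the removed comment concrete, here is a small sketch (not code from the PR) of what an unshuffled KFold does with rows ordered as 0,0,0,0,1,1,1,1. The helper name `fold_classes` and the toy arrays are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Target with a strong pattern in the row order, as in the code comment.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.arange(len(y)).reshape(-1, 1)


def fold_classes(cv):
    """Return the sorted set of classes present in each test fold."""
    return [sorted(set(y[test_idx].tolist())) for _, test_idx in cv.split(X)]


# Without shuffling, folds are contiguous blocks of rows,
# so every test fold contains a single class.
ordered = fold_classes(KFold(n_splits=4, shuffle=False))
print(ordered)  # [[0], [0], [1], [1]]

# With shuffling, the classes are usually mixed across folds
# (the exact split depends on the seed).
shuffled = fold_classes(KFold(n_splits=4, shuffle=True, random_state=0))
print(shuffled)
```

Shuffling the DataFrame before a default KFold has a similar effect to `shuffle=True`, which is why the shuffle matters when only a number of folds is given.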

8080labs · 2020-05-18T05:34:02Z

src/ppscore/calculation.py

@@ -158,7 +185,7 @@ def _infer_task(df, x, y):
    if category_count == 2:
        return "classification"
    if category_count == len(df[y]) and (
-        is_string_dtype(df[y]) or is_categorical_dtype(df[y])


We are using black as a default formatter

Will take care of it :)

Great, thank you. We also have to document this somewhere

You do not seem to use any specific settings for black, right? (I found nothing in the repo, at least.)

Exactly, just the standard settings.

8080labs · 2020-05-18T05:36:59Z

Thank you for your PR! I had a first glance at the changes and made some comments, but I definitely need some more time to take a deeper look and try it out myself.

AntonBiryukovUofC · 2020-05-22T21:01:30Z

The BS commit there is related to a bug in GitHub: for some reason commit #5 did not show up in the PR.

Otherwise, @8080labs, let me know what you think re: code changes.

8080labs · 2020-05-23T17:08:47Z

src/ppscore/calculation.py

        cv = CV_ITERATIONS
+        df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False)


I have a feeling that this might be too simple. For example, when the user just wants to use 6 folds and passes cv=6, we still need to perform the random shuffling, because a normal KFold or StratifiedKFold is used.
Only for a CV strategy that needs to keep the order of the rows in the dataset should there be no reshuffling.
What do you think about this?

I thought you wanted to keep it in the default case only, hence I added it back as per your comment above. Alternatively, one can pass KFold with shuffle set to either True or False, thus explicitly choosing a CV procedure.

In fact, setting CV to a train-test index generator as opposed to an integer means the user is doing it deliberately, with an understanding of the consequences and a specific purpose in mind.

Or maybe I just do not understand what you'd like me to do here :)

I think the most flexible and robust solution is to perform the default shuffling when the user passes no CV or an integer (assuming they only want to adjust the number of folds). In all other cases we do not have to shuffle, because the user has to pass a valid CV iterator.

What do you think about this?

OK, I see. As of now it shuffles only if no CV is passed, and you're saying "let's shuffle the data if no CV is passed or if an integer is passed", is that correct?

If so, then I see your point and will adjust the if-statement in the code above accordingly.

That looks fine, thank you :)
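
The rule agreed on in this thread could be sketched roughly as follows. This is an illustration, not the code that was merged: the function name `prepare_cv` is hypothetical, and the values of `RANDOM_SEED` and `CV_ITERATIONS` are placeholders rather than the library's actual constants.

```python
import numbers

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

RANDOM_SEED = 0    # placeholder value, assumed for this sketch
CV_ITERATIONS = 4  # placeholder value, assumed for this sketch


def prepare_cv(df, cv=None):
    """Shuffle df only when cv is missing or a plain number of folds."""
    if cv is None:
        cv = CV_ITERATIONS
    if isinstance(cv, numbers.Integral):
        # Default KFold/StratifiedKFold do not shuffle, so shuffle the rows here.
        df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False)
    # Any other object is assumed to be a valid CV iterator chosen on purpose,
    # so the row order is left untouched.
    return df, cv


df = pd.DataFrame({"x": range(10), "y": range(10)})

shuffled_df, cv_default = prepare_cv(df)           # no cv: shuffle
shuffled_six, cv_six = prepare_cv(df, cv=6)        # integer: shuffle
ordered_df, cv_ts = prepare_cv(df, cv=TimeSeriesSplit(n_splits=3))  # keep order

print(list(ordered_df.index))
print(list(shuffled_df.index))
```

Keeping the order for explicit splitters is what makes something like TimeSeriesSplit usable without leaking future rows into the training folds.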

AntonBiryukovUofC · 2020-05-29T02:09:41Z

Should we merge it then, or add a few specific CV tests (like a time series one)?

8080labs · 2020-05-29T06:31:51Z

Before merging, I will look over everything again to make sure that it is consistent and that we did not miss anything. Afterwards, I will get back to you.

Thank you so much,
Florian

AntonBiryukovUofC · 2020-05-30T04:51:28Z

Yeah, fair!

tkrabel · 2020-06-09T20:33:39Z

@AntonBiryukovUofC FYI: we are currently reviewing everything. Just wanted to communicate that there is progress :)

AntonBiryukovUofC · 2020-07-09T04:34:24Z

Are you guys alive out there? :)

FlorianWetschoreck · 2020-07-09T07:43:31Z

Hi Anton,

Sorry for keeping you waiting so long. We hope that you (and potentially others) can still use your branch internally if you need it.
Your PR needs an extensive final review to make sure that we did not miss anything, and we have not found the time for it yet because we are currently busy with many other tasks and deadlines.

All the best,
Florian

AntonBiryukovUofC added 3 commits May 10, 2020 22:07

adding custom CV

22db80e

added tests with varying cv

c7c9e6e

merged upstream master

b56f494

8080labs reviewed May 18, 2020

AntonBiryukovUofC added 2 commits May 22, 2020 14:26

blackened and added annotation

af91343

added bs comment

aca6ca2

8080labs reviewed May 23, 2020

shuffle df in case of no cv passed or cv passed as integer

46b5086

tkrabel added 6 commits June 9, 2020 22:46

add settings.json to gitignore

6a6a8a9

reformat using black

d0cd2f2

set paths from root + added explicit n_splits in StratifiedKFold to d…

92c4ad2

…isable warning

add test case for classification with StratifiedKFold

df5ba0c

cv_list: change wording of comment

7fac496

test_score_cv_stable: init

e0286ee

FlorianWetschoreck added 2 commits July 21, 2020 15:21

first naive try to resolve merge conflicts

144c5eb

fixed minor issues to let tests pass

3aea711

fwetdb mentioned this pull request Dec 14, 2021

Readme / docs unclear about using ppscore on time series data #58

Open
