Stratify sampling when split train/test data #143

YuxinB · 2023-10-12T16:12:11Z

Fixes #

Changes proposed in this pull request:

Before submitting

I've read and followed all steps in the Making a pull request
section of the CONTRIBUTING docs.
I've updated or added any relevant docstrings following the syntax described in the
Writing docstrings section of the CONTRIBUTING docs.
If this PR fixes a bug, I've added a test that will fail without my fix.
If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

All GitHub Actions jobs for my pull request have passed.

adam2392

If the public API changes, we need a test to guard against accidental misuse.

sktree/stats/forestht.py

PSSF23

I have corrected the change regressions & formatted the file.

PSSF23

@adam2392 @YuxinB The failed checks are about train_test_samples_, which now takes the stratifier y as an argument and is no longer recognized as an attribute in the test, because the property label was removed. How do you think this should be resolved?

adam2392 · 2023-10-17T18:25:00Z

> @adam2392 @YuxinB The failed checks are about train_test_samples_, which now takes the stratifier y as an argument and is no longer recognized as an attribute in the test, because the property label was removed. How do you think this should be resolved?

I think the issue is that train_test_samples_ is meant to be a "fitted" property that can be recomputed after-the-fact. However, a reliance on y essentially means it is not replicable without access to y.

There are a few ideas that come to mind:

1. Add _y as a cached array: this increases RAM usage, and disk usage when pickling, but removes any reliance on keeping track of y.
2. Only pass y into a private property.
3. Rename train_test_samples_ and add a function outside the class, _compute_train_test_samples, so we don't expose the training/testing indices.

I am in favor of the first option, activated only when we need to stratify on y (i.e. classification).
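A minimal sketch of the first option, for readers following along. It is not the code merged in this PR: the class name `StratifiedSplitSketch` and the parameters `test_size` and `random_state` are assumptions made for illustration, and scikit-learn's `train_test_split` stands in for whatever splitting logic the real estimator uses. The point is only that caching `_y` at fit time lets `train_test_samples_` stay a recomputable property.

```python
import numpy as np
from sklearn.model_selection import train_test_split


class StratifiedSplitSketch:
    """Hypothetical stand-in for the estimator discussed above."""

    def __init__(self, test_size=0.2, stratify=False, random_state=None):
        self.test_size = test_size
        self.stratify = stratify
        self.random_state = random_state

    def fit(self, X, y):
        self._n_samples = X.shape[0]
        # Cache y only when stratification needs it, so the extra memory
        # and pickle size are paid only in the classification case.
        self._y = np.asarray(y).copy() if self.stratify else None
        return self

    @property
    def train_test_samples_(self):
        # Recomputed on demand; with a fixed integer random_state the
        # same indices come back every time, without the caller passing y.
        indices = np.arange(self._n_samples)
        train_idx, test_idx = train_test_split(
            indices,
            test_size=self.test_size,
            stratify=self._y,
            random_state=self.random_state,
        )
        return train_idx, test_idx
```

The trade-off is the one named in the comment: the labels are stored on the fitted object, but the property no longer depends on the caller supplying `y` again.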

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 · 2023-10-18T01:48:05Z

I've implemented said changes. @YuxinB you must add yourself to the contributors.rst file. You also need to test two additional if/else blocks.

Assuming CIs work, I'll let @sampan501 and @PSSF23 review and merge if they are happy.

codecov · 2023-10-18T02:04:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (983b846) 89.73% compared to head (3bc05b5) 89.66%.
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #143      +/-   ##
==========================================
- Coverage   89.73%   89.66%   -0.08%     
==========================================
  Files          41       41              
  Lines        3352     3367      +15     
==========================================
+ Hits         3008     3019      +11     
- Misses        344      348       +4

| Files | Coverage Δ | |
|---|---|---|
| sktree/stats/forestht.py | `95.31% <100.00%> (-1.31%)` | ⬇️ |
| sktree/stats/tests/test_forestht.py | `99.55% <100.00%> (+0.02%)` | ⬆️ |

examples/hypothesis_testing/plot_MI_gigantic_hypothesis_testing_forest.py

sktree/stats/forestht.py

sampan501

I think the other thing that is missing is documentation for the stratify parameter

PSSF23

Previously the tests for stratification were not working (they would always pass). I think I corrected them & added the new ones @adam2392 requested.
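To illustrate what a stratification test that can actually fail might look like, here is a hedged sketch. It is not the test added in this PR: the helper names `class_fractions` and `split_preserves_class_balance` and the tolerance `atol=0.05` are assumptions, and scikit-learn's `train_test_split` is used as a stand-in splitter. The negative control at the end is what keeps the check from passing vacuously.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def class_fractions(labels, classes):
    """Return the fraction of `labels` belonging to each entry of `classes`."""
    labels = np.asarray(labels)
    return np.array([np.mean(labels == c) for c in classes])


def split_preserves_class_balance(y, train_idx, test_idx, atol=0.05):
    """True if both splits keep roughly the overall class proportions."""
    y = np.asarray(y)
    classes = np.unique(y)
    overall = class_fractions(y, classes)
    train_ok = np.allclose(class_fractions(y[train_idx], classes), overall, atol=atol)
    test_ok = np.allclose(class_fractions(y[test_idx], classes), overall, atol=atol)
    return bool(train_ok and test_ok)


# Imbalanced labels: 90 samples of class 0, 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)
idx = np.arange(len(y))

# A stratified split should pass the check.
train_idx, test_idx = train_test_split(
    idx, test_size=0.2, stratify=y, random_state=0
)
assert split_preserves_class_balance(y, train_idx, test_idx)

# Negative control: a test split containing only class 0 must be rejected,
# otherwise the check would pass no matter what.
assert not split_preserves_class_balance(y, idx[20:], idx[:20])
```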

PSSF23

I removed duplicate checks as those would be covered by check_input anyway. Explicitly setting the parameter to False just to have new checks that check the same thing doesn't make sense to me.

PSSF23

@adam2392 I believe the tests should be excluded from codecov? And do you think we need more tests?

adam2392 · 2023-10-19T14:26:55Z

Can you add a docstring for stratify in the two classes?

PSSF23

I set the base forest default value to False and removed it from the regressor class init

sampan501 · 2023-10-19T14:49:17Z

What set the default of stratify to false? Isn't it better to stratify in general according to the sims? Or is it because of the variance in the performance?

adam2392 · 2023-10-19T14:50:20Z

Yeah just to give options.

PSSF23 · 2023-10-19T14:51:43Z

> What set the default of stratify to false? Isn't it better to stratify in general according to the sims? Or is it because of the variance in the performance?

I only set the base forest value to False for the regressor; the classifier still defaults to True.
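The arrangement described here (base default False, parameter not exposed by the regressor, classifier default True) can be sketched as follows, together with a numpydoc-style entry of the kind requested for stratify. All class names, the `test_size` parameter and its default, and the docstring wording are assumptions for illustration; they are not copied from sktree.

```python
class BaseForestHTSketch:
    """Hypothetical base class, not the sktree implementation.

    Parameters
    ----------
    test_size : float, default=0.2
        Fraction of samples held out for testing (assumed for this sketch).
    stratify : bool, default=False
        Whether to stratify the train/test split on the labels ``y`` so
        that both splits keep roughly the same class proportions.
    """

    def __init__(self, test_size=0.2, stratify=False):
        self.test_size = test_size
        self.stratify = stratify


class RegressorHTSketch(BaseForestHTSketch):
    """Regressor variant: ``stratify`` is not exposed, so the base default applies."""

    def __init__(self, test_size=0.2):
        super().__init__(test_size=test_size)


class ClassifierHTSketch(BaseForestHTSketch):
    """Classifier variant.

    Parameters
    ----------
    test_size : float, default=0.2
        Fraction of samples held out for testing (assumed for this sketch).
    stratify : bool, default=True
        Whether to stratify the train/test split on the class labels.
    """

    def __init__(self, test_size=0.2, stratify=True):
        super().__init__(test_size=test_size, stratify=stratify)
```

Keeping the False default on the base class means a subclass has to opt in explicitly, which matches the "just to give options" reasoning in the thread.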

PSSF23 · 2023-10-19T17:09:42Z

@adam2392 Do you know how to exclude experimental and all tests from codecov?

adam2392 · 2023-10-19T17:11:20Z

> @adam2392 Do you know how to exclude experimental and all tests from codecov?

Experimental we can include, but the tests should not be part of it. I think this might be a bug on codecov because it wasn't showing up before.

adam2392 · 2023-10-19T17:12:05Z

The build-docs are failing because @YuxinB needs to add herself to the contributors doc:

/home/circleci/project/doc/whats_new/v0.3.rst:18: ERROR: Unknown target name: "yuxin bai".
/home/circleci/project/doc/whats_new/v0.3.rst:28: ERROR: Unknown target name: "yuxin bai".

The codecov/project doesn't need to pass

adam2392

LGTM once CI green

adam2392 · 2023-10-19T17:34:08Z

You'll need to rename the references to the example file, since you changed the name of the example.

/home/circleci/project/doc/auto_examples/hypothesis_testing/plot_MI_imbalanced_hyppo_testing.rst:39: WARNING: undefined label: 'sphx_glr_auto_examples_hypothesis_testing_plot_mi_gigantic_hypothesis_testing_forest.py'

PSSF23 · 2023-10-19T17:43:49Z

> You'll need to rename the references to the example file, since you changed the name of the example.
> /home/circleci/project/doc/auto_examples/hypothesis_testing/plot_MI_imbalanced_hyppo_testing.rst:39: WARNING: undefined label: 'sphx_glr_auto_examples_hypothesis_testing_plot_mi_gigantic_hypothesis_testing_forest.py'

Done

Addressed

adam2392 · 2023-10-19T18:05:50Z

Thanks @YuxinB @PSSF23 and @sampan501

Startify sampling when split tran/test data

e030050

adam2392 requested changes Oct 12, 2023

sktree/stats/forestht.py

PSSF23 reviewed Oct 12, 2023

sktree/stats/forestht.py

PSSF23 reviewed Oct 12, 2023

sktree/stats/forestht.py

PSSF23 reviewed Oct 12, 2023

sktree/stats/forestht.py

YuxinB and others added 4 commits October 12, 2023 15:48

Stratified_Sample, Let startify = None for Regressor

5d60959

Merge branch 'main' into Stratified_Sample

c3df52e

Merge branch 'main' into Stratified_Sample

9a15c69

FIX correct changes & black format

78837d2

PSSF23 reviewed Oct 17, 2023

DOC modify warning text

4f88518

PSSF23 reviewed Oct 17, 2023

YuxinB and others added 4 commits October 17, 2023 13:09

Add unit test for verifying stratified sampling

ffb8136

Merge branch 'Stratified_Sample' of https://github.com/neurodata/scik…

8fa1277

…it-tree into Stratified_Sample

Correct Typo for Stratified

3ff6340

Merge branch 'main' into Stratified_Sample

3a67779

adam2392 added 2 commits October 17, 2023 21:46

Fixed example and whatsnew

70a14a5

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'main' into Stratified_Sample

98fbe5f

adam2392 mentioned this pull request Oct 18, 2023

[ENH] Add option to permute per forest fraction #145

Merged

5 tasks

adam2392 reviewed Oct 18, 2023

examples/hypothesis_testing/plot_MI_gigantic_hypothesis_testing_forest.py

adam2392 reviewed Oct 18, 2023

sktree/stats/forestht.py

adam2392 reviewed Oct 18, 2023

sktree/stats/forestht.py

sampan501 previously requested changes Oct 18, 2023

PSSF23 added 2 commits October 18, 2023 14:29

ENH correct tests & add coverage

f555e2c

FIX change n_samples for test to be valid

4595df3

PSSF23 reviewed Oct 18, 2023

PSSF23 added 4 commits October 18, 2023 20:17

FIX correct variable shape

e248a7c

FIX correct test method

8ba06ef

FIX disable check_input for correct error

5d516a7

FIX remove duplicate checks

735a10b

PSSF23 reviewed Oct 19, 2023

PSSF23 approved these changes Oct 19, 2023

DOC add docstring for stratify

47857c3

PSSF23 reviewed Oct 19, 2023

Merge branch 'main' into Stratified_Sample

3ce68e7

sampan501 and others added 3 commits October 19, 2023 13:12

Merge branch 'main' into Stratified_Sample

888cb42

Add contributor

35eb776

Merge branch 'Stratified_Sample' of https://github.com/neurodata/scik…

9e2ba9e

…it-tree into Stratified_Sample

adam2392 approved these changes Oct 19, 2023

PSSF23 added 2 commits October 19, 2023 13:43

DOC update reference

3332e9a

Merge branch 'Stratified_Sample' of https://github.com/neurodata/scik…

3bc05b5

…it-tree into Stratified_Sample

adam2392 approved these changes Oct 19, 2023

adam2392 enabled auto-merge (squash) October 19, 2023 18:05

adam2392 merged commit 359ea75 into main Oct 19, 2023
25 of 26 checks passed

adam2392 deleted the Stratified_Sample branch October 19, 2023 18:05
