Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stratify sampling when split train/test data #143

Merged
merged 28 commits into from
Oct 19, 2023
Merged

Conversation

YuxinB
Copy link
Member

@YuxinB YuxinB commented Oct 12, 2023

Fixes #

Changes proposed in this pull request:

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If public API changes we need a test to control accidental misuse

sktree/stats/forestht.py Outdated Show resolved Hide resolved
PSSF23

This comment was marked as resolved.

sktree/stats/forestht.py Outdated Show resolved Hide resolved
sktree/stats/forestht.py Outdated Show resolved Hide resolved
Copy link
Member

@PSSF23 PSSF23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have corrected the change regressions & formatted the file.

Copy link
Member

@PSSF23 PSSF23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adam2392 @YuxinB The failed checks are about train_test_samples_, which takes the argument of stratifier y and not recognized as an attribute in the test. The property label was removed. How do you think this should be resolved?

@adam2392
Copy link
Collaborator

adam2392 commented Oct 17, 2023

@adam2392 @YuxinB The failed checks are about train_test_samples_, which takes the argument of stratifier y and not recognized as an attribute in the test. The property label was removed. How do you think this should be resolved?

I think the issue is that train_test_samples_ is meant to be a "fitted" property that can be recomputed after-the-fact. However, a reliance on y essentially means it is not replicable without access to y.

There are a few ideas that come to mind:

  • Add _y as a cached array: this increases RAM usage and disc usage when picking, but removes any reliance on keeping track of y
  • Only pass y into a private property
  • Rename train_test_samples_ and add a function outside the class _compute_train_test_samples, so we don't expose what are the training/testing indices

I am in favor of the first option which is only activated when we need y to be stratified (i.e. classification)

@adam2392
Copy link
Collaborator

adam2392 commented Oct 18, 2023

I've implemented said changes. @YuxinB you must add yourself to the contributors.rst file. You also need to test an additional two if/else statement blocks.

Assuming CIs work, I'll let @sampan501 and @PSSF23 review and merge if they are happy.

@codecov
Copy link

codecov bot commented Oct 18, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (983b846) 89.73% compared to head (3bc05b5) 89.66%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #143      +/-   ##
==========================================
- Coverage   89.73%   89.66%   -0.08%     
==========================================
  Files          41       41              
  Lines        3352     3367      +15     
==========================================
+ Hits         3008     3019      +11     
- Misses        344      348       +4     
Files Coverage Δ
sktree/stats/forestht.py 95.31% <100.00%> (-1.31%) ⬇️
sktree/stats/tests/test_forestht.py 99.55% <100.00%> (+0.02%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

sktree/stats/forestht.py Outdated Show resolved Hide resolved
sktree/stats/forestht.py Outdated Show resolved Hide resolved
Copy link
Member

@sampan501 sampan501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other thing that is missing is documentation for the stratify parameter

Copy link
Member

@PSSF23 PSSF23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously the tests for stratification were not working (they will always pass). I think I corrected them & added new ones @adam2392 requested.

Copy link
Member

@PSSF23 PSSF23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed duplicate checks as those would be covered by check_input anyway. Explicitly setting the parameter to False just to have new checks that check the same thing doesn't make sense to me.

Copy link
Member

@PSSF23 PSSF23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adam2392 I believe the tests should be excluded from codecov? And do you think we need more tests?

@adam2392
Copy link
Collaborator

Can you add docstring for stratify in the two classes?

Copy link
Member

@PSSF23 PSSF23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set the base forest default value to False and removed it from the regressor class init

@sampan501
Copy link
Member

What set the default of stratify to false? Isn't it better to stratify in general according to the sims? Or is it because of the variance in the performance?

@adam2392
Copy link
Collaborator

Yeah just to give options.

@PSSF23
Copy link
Member

PSSF23 commented Oct 19, 2023

What set the default of stratify to false? Isn't it better to stratify in general according to the sims? Or is it because of the variance in the performance?

I only set the base forest value to false for regressor, classifier is still default true

@PSSF23
Copy link
Member

PSSF23 commented Oct 19, 2023

@adam2392 Do you know how to exclude experimental and all tests from codecov?

@adam2392
Copy link
Collaborator

@adam2392 Do you know how to exclude experimental and all tests from codecov?

Experimental we can include, but the tests should be not part of it. I think this might be a bug on codecov cuz it wasn't showing up before.

@adam2392
Copy link
Collaborator

The build-docs are failing because @YuxinB needs to add herself to the contributors doc:

/home/circleci/project/doc/whats_new/v0.3.rst:18: ERROR: Unknown target name: "yuxin bai".
/home/circleci/project/doc/whats_new/v0.3.rst:28: ERROR: Unknown target name: "yuxin bai".

The codecov/project doesn't need to pass

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once CI green

@adam2392
Copy link
Collaborator

You'll need to rename the references to the example file, since you changed the name of the example.

/home/circleci/project/doc/auto_examples/hypothesis_testing/plot_MI_imbalanced_hyppo_testing.rst:39: WARNING: undefined label: 'sphx_glr_auto_examples_hypothesis_testing_plot_mi_gigantic_hypothesis_testing_forest.py'

@PSSF23
Copy link
Member

PSSF23 commented Oct 19, 2023

You'll need to rename the references to the example file, since you changed the name of the example.

/home/circleci/project/doc/auto_examples/hypothesis_testing/plot_MI_imbalanced_hyppo_testing.rst:39: WARNING: undefined label: 'sphx_glr_auto_examples_hypothesis_testing_plot_mi_gigantic_hypothesis_testing_forest.py'

Done

@adam2392 adam2392 enabled auto-merge (squash) October 19, 2023 18:05
@adam2392 adam2392 merged commit 359ea75 into main Oct 19, 2023
25 of 26 checks passed
@adam2392 adam2392 deleted the Stratified_Sample branch October 19, 2023 18:05
@adam2392
Copy link
Collaborator

Thanks @YuxinB @PSSF23 and @sampan501

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants