-
-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] add PermutationForests and FeatureImportanceForests to sktree #125
Conversation
Co-Authored-By: Sambit Panda <36676569+sampan501@users.noreply.github.com> Co-Authored-By: Yuxin <99897042+YuxinB@users.noreply.github.com> Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>
Will this also have MIRF with Mutual Info as a stat? |
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #125 +/- ##
===========================================
- Coverage 87.71% 44.51% -43.21%
===========================================
Files 30 36 +6
Lines 2426 3116 +690
===========================================
- Hits 2128 1387 -741
- Misses 298 1729 +1431
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referring to this file in hyppo
. I removed everything related to hyppo
for now, including the mutual information function, to avoid confusion. The _might.py
file includes all other single-feature and multi-view methods we developed.
For now I will remove the multi-view stuff prolly and then also look at how we can introduce arbitrary metrics in here: e.g. MI, ROC_auc, etc. I'll take a look tonight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per @sampan501 , added stat="MI"
and stat="AUC"
as parameters for statistic()
.
@PSSF23 Can you also rename it to MIGHT? Might make all this name changing easier to follow - no pun intended |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Renamed to
MIGHT
andMIGHT_MV
- Added y-label permutation test to
MIGHT
(previously removed due to connection tohyppo
)
@sampan501 I might be misunderstanding the MI calculation, but why the original method had the |
|
Before we make any serious changes like the one above, we really need unit tests for this method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made an empty test file. Let's edit and add to this message about what we need to test:
- test on iris for mutual info for accuracy
- partial AUROC
- multiview splitter (not this time)
Let's do multi view in a sep PR. |
Signed-off-by: Adam Li <adam2392@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By unit test results, MIGHT seems to perform the worst when coupled with PatchObliqueDecisionTreeClassifier()
. I remember similar situations back with honest tree tests. Should I lower the passing threshold or remove the estimator option?
I would remove the second two bullets to prevent those issues from getting closed w/o actually resolving the issue raised. |
Signed-off-by: Adam Li <adam2392@gmail.com>
Add permute_stat to class variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In FI, I cleared all the duplicate variables like posteriors_final_
and observe_posteriors_
, and save the permuted statistic as class variable permute_stat_
(all other results of permutation are saved, so this one should be there as well). I also make all default statistic MI, but we might change it to pAUC later.
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
sktree/stats/tests/test_forestht.py
Outdated
@pytest.mark.parametrize("backend", ["loky", "threading"]) | ||
@pytest.mark.parametrize("n_jobs", [1, -1]) | ||
def test_parallelization(backend, n_jobs): | ||
"""Test parallelization of training forests.""" | ||
n_samples = 100 | ||
n_features = 5 | ||
X = rng.uniform(size=(n_samples, n_features)) | ||
y = rng.integers(0, 2, size=n_samples) # Binary classification | ||
|
||
def run_forest(covariate_index=None): | ||
clf = FeatureImportanceForestClassifier( | ||
estimator=HonestForestClassifier( | ||
n_estimators=10, random_state=seed, n_jobs=n_jobs, honest_fraction=0.2 | ||
), | ||
test_size=0.5, | ||
) | ||
pvalue = clf.test(X, y, covariate_index=[covariate_index], metric="mi") | ||
return pvalue | ||
|
||
out = Parallel(n_jobs=1, backend=backend)( | ||
delayed(run_forest)(covariate_index) for covariate_index in range(n_features) | ||
) | ||
assert len(out) == n_features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sampan501 to my knowledge, any issues w/ joblib should be fixed or are the result of some other issues perhaps?
Lmk if this unit-test sufficiently captures what the usage looks like in the power simulations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n_jobs = 1
should be n_jobs = n_jobs
in line 411, but otherwise good
@adam2392 can the print statements in FIClf be removed? |
Yeah feel free to push that. Im making some other changes tho so anything larger feel free to open a PR to this PR. |
* Update Signed-off-by: Adam Li <adam2392@gmail.com> * Fix submodule Signed-off-by: Adam Li <adam2392@gmail.com> * Possible change to might code Signed-off-by: Adam Li <adam2392@gmail.com> * Add fixes Signed-off-by: Adam Li <adam2392@gmail.com> * Fix style Signed-off-by: Adam Li <adam2392@gmail.com> --------- Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
sktree/stats/tests/test_forestht.py
Outdated
@@ -205,12 +209,12 @@ def test_linear_model(hypotester, model_kwargs, n_samples, n_repeats, test_size) | |||
n_jobs=-1, | |||
), | |||
"random_state": seed, | |||
"permute_per_tree": True, | |||
"sample_dataset_per_tree": True, | |||
"permute_per_tree": False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is testing MI Sep and not MI/Tree right? Where are we testing MI/Tree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This is MI Sep. We should test MI / Tree as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I will add a pytest.mark.parametrize in a bit. I think last night the pvalue behavior was not converging as well as the MI Sep.
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 @PSSF23 I'm still having an error with the sample size when using the parameters from our meeting today (n = 32, test_size = 0.2). Here's the trace:
|
Any chance you can reproduce the error w/ a small code snippet? The following code works for me:
|
Signed-off-by: Adam Li <adam2392@gmail.com>
FYI, I added a short unit-test to test small sample-sizes. |
I will once I find the simulation and sample size that's causing the issue |
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Changes proposed in this pull request:
#112 , #120 to be addressed in a future PR
Before submitting
section of the
CONTRIBUTING
docs.Writing docstrings section of the
CONTRIBUTING
docs.After submitting