-
-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH add treeple tutorials to website #256
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tutorial simulation needs to be modified after #254 is merged and the change is pushed to pip
.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #256 +/- ##
=======================================
Coverage 90.04% 90.04%
=======================================
Files 54 54
Lines 5105 5105
=======================================
Hits 4597 4597
Misses 508 508 ☔ View full report in Codecov by Sentry. |
if pvalue < 0.05: | ||
print("The null hypothesis is rejected.") | ||
else: | ||
print("The null hypothesis is not rejected.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very informative about how to compute a p-value, but not sure if we need all of this. I think that point of the tutorial is to show the user how to compute a p-value using the functions already in treeple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I would just include a sentence in the very top of the example stating how to interpret the pvalue and then remove these lines of code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: update p-value calculation with native function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall great start!
I think the summary is:
- Let's try to tighten up how each example is presented to the user. Assume the user knows very little. I.e. abbreviations should be defined. Basic concepts should be defined.
- We also maybe want to consider allowing metrics to be public functions so you can just import them? Assuming all public functions either take in a 2D posterior, this is very similar to scikitlearns metrics API. If we need the public functions to take a 3D posterior over trees that's fine too.
==================================== | ||
Treeple tutorial for calculating CMI | ||
==================================== | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a short paragraph describing how we actually do the estimate. And maybe what we will show. It will help users acquaint with the example.
TODO: also link reference to paper when we make preprint online.
# ----------------------- | ||
|
||
|
||
def Calculate_MI(y_true, y_pred_proba): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth making the existing function we have public so you don't need to redefine one here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be best.
pAUC = roc_auc_score(y_true, y_pred_proba[:, 1], max_fpr=max_fpr) | ||
|
||
pos = np.where(fpr == max_fpr)[0][-1] | ||
plt.fill_between( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to separate plotting and obtaining the statistic? Makes it a bit cleaner to understand from a user perspective
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be best if the local function could be hidden. But sphinx doesn't let me have a separate script file.
if pvalue < 0.05: | ||
print("The null hypothesis is rejected.") | ||
else: | ||
print("The null hypothesis is not rejected.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I would just include a sentence in the very top of the example stating how to interpret the pvalue and then remove these lines of code
# As we know the true priors of each class, we can generate a sufficient | ||
# amount of samples to estimate the true posteriors and corresponding | ||
# statistics like *MI*, *pAUC*, and *S@98*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# As we know the true priors of each class, we can generate a sufficient | |
# amount of samples to estimate the true posteriors and corresponding | |
# statistics like *MI*, *pAUC*, and *S@98*. | |
# In many simulation settings, we are interested in knowing the true statistics. Sometimes these statistics | |
# are not analytically computable, so we demonstrate how to approximate these numerically. Given enough | |
# samples, these numerical approximations converge to the true statistics. | |
# | |
# As we know the true priors of each class, we can generate a sufficient | |
# amount of samples to estimate the true posteriors and corresponding | |
# statistics like *MI*, *pAUC*, and *S@98*. |
# .. math:: pAUC@r = \frac{100}{100 - r} \int_{T_r}^\infty \int_{\mathcal{X}} \mathbb{I}\{\eta(X_1) > \eta(X_0) \} dF_1 dF_0 | ||
# | ||
# With a binary class simulation as an example, this tutorial will show | ||
# how to use ``treeple`` to calculate the statistic with 90% specificity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# how to use ``treeple`` to calculate the statistic with 90% specificity | |
# how to use ``treeple`` to calculate the AUC statistic at a specified | |
# 90% specificity |
# By computing the p-value using ``treeple``, we can test if :math:`H_0` | ||
# would be rejected, which confirms that X and Y are not independent. The p-value is | ||
# generated by comparing the observed statistic difference with permuted | ||
# differences, using mutual information as an example. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# differences, using mutual information as an example. | |
# differences, using mutual information as the test statistic in this example. |
if pvalue < 0.05: | ||
print("The null hypothesis is rejected.") | ||
else: | ||
print("The null hypothesis is not rejected.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove these LOC? https://github.com/neurodata/scikit-tree/pull/256/files/2c487755d8a0827b7344b2bee6da8eabef1d400e#r1567919332
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. It could act as a conditional statement for users if they decide to modify the data.
if pvalue < 0.05: | ||
print("The null hypothesis is rejected.") | ||
else: | ||
print("The null hypothesis is not rejected.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove these LOC? https://github.com/neurodata/scikit-tree/pull/256/files/2c487755d8a0827b7344b2bee6da8eabef1d400e#r1567919332
Co-authored-by: Adam Li <adam2392@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, pending the larger changes in #257
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll incorporate the comments in a new PR. Feel free to comment under #257 as well if you have new ideas.
Add treeple tutorials for:
Open to suggestions and feedback. I realized that some content overlaps with existing tutorials, which could be outdated and should be integrated/removed.