docs updates [skip ci] (1534)
Circle Ci committed Oct 16, 2023
1 parent 0f5d511 commit 1cd99ea
Showing 145 changed files with 7,309 additions and 975 deletions.
2 changes: 1 addition & 1 deletion dev/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 78ac69da5fa39adca3daf95ef6cb47e1
+config: 66daa7f80f7a79b1b0aa2c0d40b08166
tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"\n# Plot oblique forest and axis-aligned random forest predictions on cc18 datasets\n\nA performance comparison between an oblique forest and a standard axis-\naligned random forest using three datasets from OpenML benchmarking suites.\n\nTwo of these datasets, namely the\n[WDBC](https://www.openml.org/search?type=data&sort=runs&id=1510)\nand [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534)\ndatasets, consist of 31 features, where the former dataset is entirely numeric\nand the latter is entirely nominal. The third dataset, dubbed\n[cnae-9](https://www.openml.org/search?type=data&status=active&id=1468), is a\nnumeric dataset that has a notably large feature space of 857 features. As you\nwill notice, of these three datasets, the oblique forest outperforms the axis-aligned\nrandom forest on cnae-9 by utilizing a sparse random projection mechanism. All datasets\nare subsampled due to computational constraints.\n\nFor an example of using extra-oblique trees/forests in practice on data, see the following\nexample `sphx_glr_auto_examples_plot_extra_oblique_random_forest.py`.\n"
+"\n# Plot oblique forest and axis-aligned random forest predictions on cc18 datasets\n\nA performance comparison between an oblique forest and a standard axis-\naligned random forest using three datasets from OpenML benchmarking suites.\n\nTwo of these datasets, namely the\n[WDBC](https://www.openml.org/search?type=data&sort=runs&id=1510)\nand [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534)\ndatasets, consist of 31 features, where the former dataset is entirely numeric\nand the latter is entirely nominal. The third dataset, dubbed\n[cnae-9](https://www.openml.org/search?type=data&status=active&id=1468), is a\nnumeric dataset that has a notably large feature space of 857 features. As you\nwill notice, of these three datasets, the oblique forest outperforms the axis-aligned\nrandom forest on cnae-9 by utilizing a sparse random projection mechanism. All datasets\nare subsampled due to computational constraints.\n\nFor an example of using extra-oblique trees/forests in practice on data, see the following\nexample `sphx_glr_auto_examples_sparse_oblique_trees_plot_extra_oblique_random_forest.py`.\n"
]
},
{
Binary file not shown.
@@ -76,7 +76,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Discussion\nAs we can see, the (sparse) oblique splitter samples random features to\nlinearly combine to form candidate split dimensions.\n\nIn contrast, the normal splitter in :class:`sklearn.tree.DecisionTreeClassifier` samples\nrandomly across all ``n_features`` features.\n\nFor an example of using oblique trees/forests in practice on data, see the following\nexamples:\n\n- `sphx_glr_auto_examples_plot_oblique_forests_iris.py`\n- `sphx_glr_auto_examples_plot_oblique_random_forest.py`\n\n"
+"## Discussion\nAs we can see, the (sparse) oblique splitter samples random features to\nlinearly combine to form candidate split dimensions.\n\nIn contrast, the normal splitter in :class:`sklearn.tree.DecisionTreeClassifier` samples\nrandomly across all ``n_features`` features.\n\nFor an example of using oblique trees/forests in practice on data, see the following\nexamples:\n\n- `sphx_glr_auto_examples_sparse_oblique_trees_plot_oblique_forests_iris.py`\n- `sphx_glr_auto_examples_sparse_oblique_trees_plot_oblique_random_forest.py`\n\n"
]
}
],
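
To make the discussion cell above concrete, here is a hand-rolled sketch of the sparse-projection idea (an illustration only, not sktree's splitter code; the sizes and names are chosen for this example): a few features are sampled and combined with random +/-1 weights to form a single candidate split dimension.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_nonzero = 100, 20, 3
X = rng.standard_normal((n_samples, n_features))

# sample a sparse projection: a few feature indices with random +/-1 weights
feature_idx = rng.choice(n_features, size=n_nonzero, replace=False)
weights = rng.choice([-1.0, 1.0], size=n_nonzero)

# the candidate split dimension is a linear combination of the sampled
# features; an axis-aligned splitter would instead evaluate each of the
# n_features columns of X on its own
candidate_dim = X[:, feature_idx] @ weights
print(feature_idx, weights, candidate_dim[:3])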
@@ -0,0 +1,237 @@
"""
===============================================================================
Mutual Information for Gigantic Hypothesis Testing (MIGHT) with Imbalanced Data
===============================================================================

Here, we demonstrate how to perform hypothesis testing on data whose feature
sets have highly imbalanced dimensionalities, using mutual information as a
test statistic. We use the framework of :footcite:`coleman2022scalable` to
estimate p-values efficiently.

We simulate two feature sets, one of which is important for the target,
but significantly smaller in dimensionality than the other feature set, which
is unimportant for the target. We then use the MIGHT framework to test for
the importance of each feature set. Instead of leveraging a normal honest random
forest to estimate the posteriors, here we leverage a multi-view honest random
forest, with knowledge of the multi-view structure of the ``X`` data.

For other examples of hypothesis testing, see the following:

- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_MI_gigantic_hypothesis_testing_forest.py`
- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_might_auc.py`

For more information on the multi-view decision-tree, see
:ref:`sphx_glr_auto_examples_multiview_plot_multiview_dtc.py`.
"""

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

from sktree import HonestForestClassifier
from sktree.stats import FeatureImportanceForestClassifier
from sktree.tree import DecisionTreeClassifier, MultiViewDecisionTreeClassifier

seed = 12345
rng = np.random.default_rng(seed)

# %%
# Simulate data
# -------------
# We simulate the two feature sets, and the target variable. We then combine them
# into a single dataset to perform hypothesis testing.
seed = 12345
rng = np.random.default_rng(seed)


def make_multiview_classification(
    n_samples=100, n_features_1=10, n_features_2=1000, cluster_std=2.0, seed=None
):
    rng = np.random.default_rng(seed=seed)

    # Create a high-dimensional multiview dataset with a low-dimensional informative
    # subspace in one view of the dataset.
    X0_first, y0 = make_blobs(
        n_samples=n_samples,
        cluster_std=cluster_std,
        n_features=n_features_1 // 2,
        random_state=rng.integers(1, 10000),
        centers=1,
    )

    X1_first, y1 = make_blobs(
        n_samples=n_samples,
        cluster_std=cluster_std,
        n_features=n_features_1 // 2,
        random_state=rng.integers(1, 10000),
        centers=1,
    )

    # create the first views for y=0 and y=1
    X0_first = np.concatenate(
        (X0_first, rng.standard_normal(size=(n_samples, n_features_1 // 2))), axis=1
    )
    X1_first = np.concatenate(
        (X1_first, rng.standard_normal(size=(n_samples, n_features_1 // 2))), axis=1
    )
    y1[:] = 1

    # add the second view for y=0 and y=1, which is completely noise
    X0 = np.concatenate([X0_first, rng.standard_normal(size=(n_samples, n_features_2))], axis=1)
    X1 = np.concatenate([X1_first, rng.standard_normal(size=(n_samples, n_features_2))], axis=1)

    # combine the views and targets
    X = np.vstack((X0, X1))
    y = np.hstack((y0, y1)).T

    # add noise to the data
    X = X + rng.standard_normal(size=X.shape)

    return X, y


n_samples = 100
n_features = 10000
n_features_views = [10, n_features]

X, y = make_multiview_classification(
    n_samples=n_samples,
    n_features_1=10,
    n_features_2=n_features,
    cluster_std=2.0,
    seed=seed,
)
# %%
# Perform hypothesis testing using Mutual Information
# ---------------------------------------------------
# Here, we use :class:`~sktree.stats.FeatureImportanceForestClassifier` to perform the hypothesis
# test. The test statistic is computed by comparing the metric (i.e. mutual information) estimated
# between two forests. One forest is trained on the original dataset, and one forest is trained
# on a permuted dataset, where the rows of the ``covariate_index`` columns are shuffled randomly.
#
# The null distribution is then estimated in an efficient manner using the framework of
# :footcite:`coleman2022scalable`. The sample evaluations of each forest (i.e. the posteriors)
# are sampled randomly ``n_repeats`` times to generate a null distribution. The p-value is then
# computed as the proportion of samples in the null distribution that are at least as extreme
# as the observed test statistic (a minimal sketch of this computation follows).
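
# %%
# To make that computation concrete, here is a minimal sketch of a permutation
# p-value (an illustration of the logic described above, not the library's
# internal implementation), assuming a "larger is more extreme" test statistic
# and a stand-in null distribution:
null_dist = rng.standard_normal(size=1000)  # stand-in for the estimated null distribution
observed_stat = 2.5  # stand-in for an observed MI difference
pvalue_sketch = (1 + (null_dist >= observed_stat).sum()) / (1 + null_dist.size)
print(f"Sketch p-value: {pvalue_sketch:.3f}")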

n_estimators = 200
max_features = "sqrt"
test_size = 0.2
n_repeats = 1000
n_jobs = -1

est = FeatureImportanceForestClassifier(
    estimator=HonestForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        tree_estimator=MultiViewDecisionTreeClassifier(feature_set_ends=n_features_views),
        random_state=seed,
        honest_fraction=0.5,
        n_jobs=n_jobs,
    ),
    random_state=seed,
    test_size=test_size,
    permute_per_tree=False,
    sample_dataset_per_tree=False,
)

mv_results = dict()

print(
    f"Permutation per tree: {est.permute_per_tree} and sampling dataset per tree: "
    f"{est.sample_dataset_per_tree}"
)
# we test for the first feature set, which is important and thus should return a pvalue < 0.05
stat, pvalue = est.test(
    X, y, covariate_index=np.arange(10, dtype=int), metric="mi", n_repeats=n_repeats
)
mv_results["important_feature_stat"] = stat
mv_results["important_feature_pvalue"] = pvalue
print(f"Estimated MI difference: {stat} with Pvalue: {pvalue}")

# we test for the second feature set, which is unimportant and thus should return a pvalue > 0.05
stat, pvalue = est.test(
    X,
    y,
    covariate_index=np.arange(10, n_features, dtype=int),
    metric="mi",
    n_repeats=n_repeats,
)
mv_results["unimportant_feature_stat"] = stat
mv_results["unimportant_feature_pvalue"] = pvalue
print(f"Estimated MI difference: {stat} with Pvalue: {pvalue}")

# %%
# Let's investigate what happens when we do not use a multi-view decision tree.
# All other parameters are kept the same.

est = FeatureImportanceForestClassifier(
    estimator=HonestForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        tree_estimator=DecisionTreeClassifier(),
        random_state=seed,
        honest_fraction=0.5,
        n_jobs=n_jobs,
    ),
    random_state=seed,
    test_size=test_size,
    permute_per_tree=False,
    sample_dataset_per_tree=False,
)

rf_results = dict()

# we test for the first feature set, which is important and thus should return a pvalue < 0.05
stat, pvalue = est.test(
    X, y, covariate_index=np.arange(10, dtype=int), metric="mi", n_repeats=n_repeats
)
rf_results["important_feature_stat"] = stat
rf_results["important_feature_pvalue"] = pvalue
print(f"Estimated MI difference using regular decision-trees: {stat} with Pvalue: {pvalue}")

# we test for the second feature set, which is unimportant and thus should return a pvalue > 0.05
stat, pvalue = est.test(
    X,
    y,
    covariate_index=np.arange(10, n_features, dtype=int),
    metric="mi",
    n_repeats=n_repeats,
)
rf_results["unimportant_feature_stat"] = stat
rf_results["unimportant_feature_pvalue"] = pvalue
print(f"Estimated MI difference using regular decision-trees: {stat} with Pvalue: {pvalue}")

fig, ax = plt.subplots(figsize=(5, 3))

# plot pvalues
ax.bar(0, rf_results["important_feature_pvalue"], label="Important Feature Set (RF)")
ax.bar(1, rf_results["unimportant_feature_pvalue"], label="Unimportant Feature Set (RF)")
ax.bar(2, mv_results["important_feature_pvalue"], label="Important Feature Set (MV)")
ax.bar(3, mv_results["unimportant_feature_pvalue"], label="Unimportant Feature Set (MV)")
ax.axhline(0.05, color="k", linestyle="--", label="alpha=0.05")
ax.set(ylabel="p-value (log scale)", xlim=[-0.5, 3.5], yscale="log")
ax.legend()

fig.tight_layout()
plt.show()

# %%
# Discussion
# ----------
# We see that the multi-view decision tree is able to detect the important feature set,
# while the regular decision tree is not. This is because the regular decision tree
# is not aware of the multi-view structure of the data, and thus is challenged
# by the imbalanced dimensionality of the feature sets. That is, it rarely splits on
# the first, low-dimensional feature set, and thus is unable to detect its importance
# (see the back-of-the-envelope sketch below).
#
# Note that both approaches still fail to reject the null hypothesis (at an alpha of 0.05)
# when testing the unimportant feature set. The difference between the two approaches
# shows that the statistical power of the multi-view decision tree is higher than that
# of the regular decision tree in this simulation.
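
# %%
# A back-of-the-envelope calculation (a rough illustration using the quantities
# defined above, not something the library computes) makes the "rarely splits"
# point concrete: with ``max_features="sqrt"``, each split of the regular tree
# considers roughly ``sqrt(10 + 10000) ~ 100`` candidate features, so the chance
# that any of the 10 informative features is even considered at a given split
# is small.
from math import comb

n_total, n_informative, n_candidates = 10 + n_features, 10, 100
p_seen = 1 - comb(n_total - n_informative, n_candidates) / comb(n_total, n_candidates)
print(f"P(an informative feature is among a split's candidates) ~ {p_seen:.3f}")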

# %%
# References
# ----------
# .. footbibliography::
Binary file not shown.
@@ -59,7 +59,7 @@
range, hence the complexity is `O(n)`. This makes the algorithm more suitable for large datasets.
To see how sample-sizes affect the performance of Extra Oblique Trees vs regular Oblique Trees,
-see :ref:`sphx_glr_auto_examples_plot_extra_orf_sample_size.py`
+see :ref:`sphx_glr_auto_examples_sparse_oblique_trees_plot_extra_orf_sample_size.py`
References
----------
@@ -125,5 +125,5 @@
# For an example of using oblique trees/forests in practice on data, see the following
# examples:
#
-# - :ref:`sphx_glr_auto_examples_plot_oblique_forests_iris.py`
-# - :ref:`sphx_glr_auto_examples_plot_oblique_random_forest.py`
+# - :ref:`sphx_glr_auto_examples_sparse_oblique_trees_plot_oblique_forests_iris.py`
+# - :ref:`sphx_glr_auto_examples_sparse_oblique_trees_plot_oblique_random_forest.py`