docs updates [skip ci] (1534)
Circle Ci committed Oct 16, 2023
1 parent 0f5d511 commit 1cd99ea
Showing 145 changed files with 7,309 additions and 975 deletions.
2 changes: 1 addition & 1 deletion dev/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 78ac69da5fa39adca3daf95ef6cb47e1
+config: 66daa7f80f7a79b1b0aa2c0d40b08166
tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"\n# Plot oblique forest and axis-aligned random forest predictions on cc18 datasets\n\nA performance comparison between an oblique forest and a standard axis-\naligned random forest using three datasets from OpenML benchmarking suites.\n\nTwo of these datasets, namely the\n[WDBC](https://www.openml.org/search?type=data&sort=runs&id=1510)\nand [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534)\ndatasets, consist of 31 features, where the former dataset is entirely numeric\nand the latter is entirely nominal. The third dataset, dubbed\n[cnae-9](https://www.openml.org/search?type=data&status=active&id=1468), is a\nnumeric dataset that has a notably large feature space of 857 features. As you\nwill notice, of these three datasets, the oblique forest outperforms the axis-aligned\nrandom forest on cnae-9 by utilizing a sparse random projection mechanism. All datasets\nare subsampled due to computational constraints.\n\nFor an example of using extra-oblique trees/forests in practice on data, see the following\nexample `sphx_glr_auto_examples_plot_extra_oblique_random_forest.py`.\n"
+"\n# Plot oblique forest and axis-aligned random forest predictions on cc18 datasets\n\nA performance comparison between an oblique forest and a standard axis-\naligned random forest using three datasets from OpenML benchmarking suites.\n\nTwo of these datasets, namely the\n[WDBC](https://www.openml.org/search?type=data&sort=runs&id=1510)\nand [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534)\ndatasets, consist of 31 features, where the former dataset is entirely numeric\nand the latter is entirely nominal. The third dataset, dubbed\n[cnae-9](https://www.openml.org/search?type=data&status=active&id=1468), is a\nnumeric dataset that has a notably large feature space of 857 features. As you\nwill notice, of these three datasets, the oblique forest outperforms the axis-aligned\nrandom forest on cnae-9 by utilizing a sparse random projection mechanism. All datasets\nare subsampled due to computational constraints.\n\nFor an example of using extra-oblique trees/forests in practice on data, see the following\nexample `sphx_glr_auto_examples_sparse_oblique_trees_plot_extra_oblique_random_forest.py`.\n"
]
},
{
Binary file not shown.
@@ -76,7 +76,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Discussion\nAs we can see, the (sparse) oblique splitter samples random features to\nlinearly combine to form candidate split dimensions.\n\nIn contrast, the normal splitter in :class:`sklearn.tree.DecisionTreeClassifier` samples\nrandomly across all ``n_features`` features.\n\nFor an example of using oblique trees/forests in practice on data, see the following\nexamples:\n\n- `sphx_glr_auto_examples_plot_oblique_forests_iris.py`\n- `sphx_glr_auto_examples_plot_oblique_random_forest.py`\n\n"
+"## Discussion\nAs we can see, the (sparse) oblique splitter samples random features to\nlinearly combine to form candidate split dimensions.\n\nIn contrast, the normal splitter in :class:`sklearn.tree.DecisionTreeClassifier` samples\nrandomly across all ``n_features`` features.\n\nFor an example of using oblique trees/forests in practice on data, see the following\nexamples:\n\n- `sphx_glr_auto_examples_sparse_oblique_trees_plot_oblique_forests_iris.py`\n- `sphx_glr_auto_examples_sparse_oblique_trees_plot_oblique_random_forest.py`\n\n"
]
}
],
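
To make the discussion cell above concrete, here is a hand-rolled sketch of the sparse-projection idea (an illustration only, not sktree's splitter code; the sizes and names are chosen for this example): a few features are sampled and combined with random +/-1 weights to form a single candidate split dimension.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_nonzero = 100, 20, 3
X = rng.standard_normal((n_samples, n_features))

# sample a sparse projection: a few feature indices with random +/-1 weights
feature_idx = rng.choice(n_features, size=n_nonzero, replace=False)
weights = rng.choice([-1.0, 1.0], size=n_nonzero)

# the candidate split dimension is a linear combination of the sampled
# features; an axis-aligned splitter would instead evaluate each of the
# n_features columns of X on its own
candidate_dim = X[:, feature_idx] @ weights
print(feature_idx, weights, candidate_dim[:3])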
@@ -0,0 +1,237 @@
"""
===============================================================================
Mutual Information for Gigantic Hypothesis Testing (MIGHT) with Imbalanced Data
===============================================================================

Here, we demonstrate how to perform hypothesis testing on data whose feature
sets have highly imbalanced dimensionalities, using mutual information as a
test statistic. We use the framework of :footcite:`coleman2022scalable` to
estimate p-values efficiently.

We simulate two feature sets, one of which is important for the target,
but significantly smaller in dimensionality than the other feature set, which
is unimportant for the target. We then use the MIGHT framework to test for
the importance of each feature set. Instead of leveraging a normal honest random
forest to estimate the posteriors, here we leverage a multi-view honest random
forest, with knowledge of the multi-view structure of the ``X`` data.

For other examples of hypothesis testing, see the following:

- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_MI_gigantic_hypothesis_testing_forest.py`
- :ref:`sphx_glr_auto_examples_hypothesis_testing_plot_might_auc.py`

For more information on the multi-view decision-tree, see
:ref:`sphx_glr_auto_examples_multiview_plot_multiview_dtc.py`.
"""

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

from sktree import HonestForestClassifier
from sktree.stats import FeatureImportanceForestClassifier
from sktree.tree import DecisionTreeClassifier, MultiViewDecisionTreeClassifier

seed = 12345
rng = np.random.default_rng(seed)

# %%
# Simulate data
# -------------
# We simulate the two feature sets, and the target variable. We then combine them
# into a single dataset to perform hypothesis testing.
seed = 12345
rng = np.random.default_rng(seed)


def make_multiview_classification(
    n_samples=100, n_features_1=10, n_features_2=1000, cluster_std=2.0, seed=None
):
    rng = np.random.default_rng(seed=seed)

    # Create a high-dimensional multiview dataset with a low-dimensional informative
    # subspace in one view of the dataset.
    X0_first, y0 = make_blobs(
        n_samples=n_samples,
        cluster_std=cluster_std,
        n_features=n_features_1 // 2,
        random_state=rng.integers(1, 10000),
        centers=1,
    )

    X1_first, y1 = make_blobs(
        n_samples=n_samples,
        cluster_std=cluster_std,
        n_features=n_features_1 // 2,
        random_state=rng.integers(1, 10000),
        centers=1,
    )

    # create the first views for y=0 and y=1
    X0_first = np.concatenate(
        (X0_first, rng.standard_normal(size=(n_samples, n_features_1 // 2))), axis=1
    )
    X1_first = np.concatenate(
        (X1_first, rng.standard_normal(size=(n_samples, n_features_1 // 2))), axis=1
    )
    y1[:] = 1

    # add the second view for y=0 and y=1, which is completely noise
    X0 = np.concatenate([X0_first, rng.standard_normal(size=(n_samples, n_features_2))], axis=1)
    X1 = np.concatenate([X1_first, rng.standard_normal(size=(n_samples, n_features_2))], axis=1)

    # combine the views and targets
    X = np.vstack((X0, X1))
    y = np.hstack((y0, y1)).T

    # add noise to the data
    X = X + rng.standard_normal(size=X.shape)

    return X, y


n_samples = 100
n_features = 10000
n_features_views = [10, n_features]

X, y = make_multiview_classification(
    n_samples=n_samples,
    n_features_1=10,
    n_features_2=n_features,
    cluster_std=2.0,
    seed=seed,
)
# %%
# Perform hypothesis testing using Mutual Information
# ---------------------------------------------------
# Here, we use :class:`~sktree.stats.FeatureImportanceForestClassifier` to perform the hypothesis
# test. The test statistic is computed by comparing the metric (i.e. mutual information) estimated
# between two forests. One forest is trained on the original dataset, and one forest is trained
# on a permuted dataset, where the rows of the ``covariate_index`` columns are shuffled randomly.
#
# The null distribution is then estimated in an efficient manner using the framework of
# :footcite:`coleman2022scalable`. The sample evaluations of each forest (i.e. the posteriors)
# are sampled randomly ``n_repeats`` times to generate a null distribution. The p-value is then
# computed as the proportion of samples in the null distribution that are at least as extreme
# as the observed test statistic (a minimal sketch of this computation follows).
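
# %%
# To make that computation concrete, here is a minimal sketch of a permutation
# p-value (an illustration of the logic described above, not the library's
# internal implementation), assuming a "larger is more extreme" test statistic
# and a stand-in null distribution:
null_dist = rng.standard_normal(size=1000)  # stand-in for the estimated null distribution
observed_stat = 2.5  # stand-in for an observed MI difference
pvalue_sketch = (1 + (null_dist >= observed_stat).sum()) / (1 + null_dist.size)
print(f"Sketch p-value: {pvalue_sketch:.3f}")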

n_estimators = 200
max_features = "sqrt"
test_size = 0.2
n_repeats = 1000
n_jobs = -1

est = FeatureImportanceForestClassifier(
    estimator=HonestForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        tree_estimator=MultiViewDecisionTreeClassifier(feature_set_ends=n_features_views),
        random_state=seed,
        honest_fraction=0.5,
        n_jobs=n_jobs,
    ),
    random_state=seed,
    test_size=test_size,
    permute_per_tree=False,
    sample_dataset_per_tree=False,
)

mv_results = dict()

print(
    f"Permutation per tree: {est.permute_per_tree} and sampling dataset per tree: "
    f"{est.sample_dataset_per_tree}"
)
# we test for the first feature set, which is important and thus should return a pvalue < 0.05
stat, pvalue = est.test(
    X, y, covariate_index=np.arange(10, dtype=int), metric="mi", n_repeats=n_repeats
)
mv_results["important_feature_stat"] = stat
mv_results["important_feature_pvalue"] = pvalue
print(f"Estimated MI difference: {stat} with Pvalue: {pvalue}")

# we test for the second feature set, which is unimportant and thus should return a pvalue > 0.05
stat, pvalue = est.test(
    X,
    y,
    covariate_index=np.arange(10, n_features, dtype=int),
    metric="mi",
    n_repeats=n_repeats,
)
mv_results["unimportant_feature_stat"] = stat
mv_results["unimportant_feature_pvalue"] = pvalue
print(f"Estimated MI difference: {stat} with Pvalue: {pvalue}")

# %%
# Let's investigate what happens when we do not use a multi-view decision tree.
# All other parameters are kept the same.

est = FeatureImportanceForestClassifier(
    estimator=HonestForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        tree_estimator=DecisionTreeClassifier(),
        random_state=seed,
        honest_fraction=0.5,
        n_jobs=n_jobs,
    ),
    random_state=seed,
    test_size=test_size,
    permute_per_tree=False,
    sample_dataset_per_tree=False,
)

rf_results = dict()

# we test for the first feature set, which is important and thus should return a pvalue < 0.05
stat, pvalue = est.test(
    X, y, covariate_index=np.arange(10, dtype=int), metric="mi", n_repeats=n_repeats
)
rf_results["important_feature_stat"] = stat
rf_results["important_feature_pvalue"] = pvalue
print(f"Estimated MI difference using regular decision-trees: {stat} with Pvalue: {pvalue}")

# we test for the second feature set, which is unimportant and thus should return a pvalue > 0.05
stat, pvalue = est.test(
    X,
    y,
    covariate_index=np.arange(10, n_features, dtype=int),
    metric="mi",
    n_repeats=n_repeats,
)
rf_results["unimportant_feature_stat"] = stat
rf_results["unimportant_feature_pvalue"] = pvalue
print(f"Estimated MI difference using regular decision-trees: {stat} with Pvalue: {pvalue}")

fig, ax = plt.subplots(figsize=(5, 3))

# plot pvalues
ax.bar(0, rf_results["important_feature_pvalue"], label="Important Feature Set (RF)")
ax.bar(1, rf_results["unimportant_feature_pvalue"], label="Unimportant Feature Set (RF)")
ax.bar(2, mv_results["important_feature_pvalue"], label="Important Feature Set (MV)")
ax.bar(3, mv_results["unimportant_feature_pvalue"], label="Unimportant Feature Set (MV)")
ax.axhline(0.05, color="k", linestyle="--", label="alpha=0.05")
ax.set(ylabel="p-value (log scale)", xlim=[-0.5, 3.5], yscale="log")
ax.legend()

fig.tight_layout()
plt.show()

# %%
# Discussion
# ----------
# We see that the multi-view decision tree is able to detect the important feature set,
# while the regular decision tree is not. This is because the regular decision tree
# is not aware of the multi-view structure of the data, and thus is challenged
# by the imbalanced dimensionality of the feature sets. That is, it rarely splits on
# the first, low-dimensional feature set, and thus is unable to detect its importance
# (see the back-of-the-envelope sketch below).
#
# Note that both approaches still fail to reject the null hypothesis (at an alpha of 0.05)
# when testing the unimportant feature set. The difference between the two approaches
# shows that the statistical power of the multi-view decision tree is higher than that
# of the regular decision tree in this simulation.
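
# %%
# A back-of-the-envelope calculation (a rough illustration using the quantities
# defined above, not something the library computes) makes the "rarely splits"
# point concrete: with ``max_features="sqrt"``, each split of the regular tree
# considers roughly ``sqrt(10 + 10000) ~ 100`` candidate features, so the chance
# that any of the 10 informative features is even considered at a given split
# is small.
from math import comb

n_total, n_informative, n_candidates = 10 + n_features, 10, 100
p_seen = 1 - comb(n_total - n_informative, n_candidates) / comb(n_total, n_candidates)
print(f"P(an informative feature is among a split's candidates) ~ {p_seen:.3f}")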

# %%
# References
# ----------
# .. footbibliography::
Binary file not shown.
@@ -59,7 +59,7 @@
range, hence the complexity is `O(n)`. This makes the algorithm more suitable for large datasets.
To see how sample-sizes affect the performance of Extra Oblique Trees vs regular Oblique Trees,
-see :ref:`sphx_glr_auto_examples_plot_extra_orf_sample_size.py`
+see :ref:`sphx_glr_auto_examples_sparse_oblique_trees_plot_extra_orf_sample_size.py`
References
----------
@@ -125,5 +125,5 @@
# For an example of using oblique trees/forests in practice on data, see the following
# examples:
#
-# - :ref:`sphx_glr_auto_examples_plot_oblique_forests_iris.py`
-# - :ref:`sphx_glr_auto_examples_plot_oblique_random_forest.py`
+# - :ref:`sphx_glr_auto_examples_sparse_oblique_trees_plot_oblique_forests_iris.py`
+# - :ref:`sphx_glr_auto_examples_sparse_oblique_trees_plot_oblique_random_forest.py`