docs updates [skip ci] (1494)
Circle Ci committed Oct 12, 2023
1 parent 7c353a1 commit d42f474
Showing 141 changed files with 11,008 additions and 757 deletions.
2 changes: 1 addition & 1 deletion dev/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: ab004720845ed397b6242f73a218a6a4
+config: c5a20278bacebe8c104b122777cc5d51
tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -0,0 +1,151 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Plot the projection matrices of an oblique tree for sampling images, or time-series\n\nThis example shows how projection matrices are generated for an oblique tree,\nspecifically the :class:`sktree.tree.PatchObliqueDecisionTreeClassifier`.\n\nFor a tree, one can specify the structure of the data that it will be trained on\n(i.e. ``(X, y)``). This is done by specifying the ``data_dims`` parameter. For\nexample, if the data is 2D, then ``data_dims`` should be set to ``(n_rows, n_cols)``,\nwhere now each row of ``X`` is a 1D array of length ``n_rows * n_cols``. If the data\nis 3D, then ``data_dims`` should be set to ``(n_rows, n_cols, n_depth)``, where now\neach row of ``X`` is a 1D array of length ``n_rows * n_cols * n_depth``. This allows\nthe tree to be trained on data of any structured dimension, but still be compatible\nwith the robust sklearn API.\n\nThe projection matrices are used to generate patches of the data. These patches are\nused to calculate the feature values that are used during splitting. The patch is\ngenerated by sampling a hyperrectangle from the data. The hyperrectangle is defined\nby a starting point and a patch size. The starting point is sampled uniformly from\nthe structure of the data. For example, if each row of ``X`` has a 2D image structure\n``(n_rows, n_cols)``, then the starting point will be sampled uniformly from the square\ngrid. The patch size is sampled uniformly from the range ``min_patch_dims`` to\n``max_patch_dims``. The patch size is also constrained to be within the bounds of the\ndata structure. For example, if the patch size is ``(3, 3)`` and the data structure\nis ``(5, 5)``, then the patch will only sample indices within the data.\n\nWe also allow each dimension to be arbitrarily discontiguous.\n\nFor details on how to use the hyperparameters related to the patches, see\n:class:`sktree.tree.PatchObliqueDecisionTreeClassifier`.\n"
]
},
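{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the public API described above, the next cell is a\nminimal sketch of fitting a :class:`sktree.tree.PatchObliqueDecisionTreeClassifier`\non flattened 2D data. It assumes the estimator accepts ``data_dims``,\n``min_patch_dims`` and ``max_patch_dims`` as keyword arguments; the random data and\nparameter values here are purely illustrative.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Minimal sketch of the public API (assumed keyword arguments; illustrative data).\nimport numpy as np\n\nfrom sktree.tree import PatchObliqueDecisionTreeClassifier\n\nrng = np.random.default_rng(0)\n\n# 20 synthetic \"images\" of 5 x 5 pixels, flattened into rows of length 25\nX_img = rng.standard_normal((20, 25)).astype(np.float32)\ny_img = rng.integers(0, 2, size=20)\n\nclf = PatchObliqueDecisionTreeClassifier(\n    data_dims=(5, 5),  # each row of X_img is a flattened 5 x 5 image\n    min_patch_dims=(1, 1),  # smallest patch that may be sampled\n    max_patch_dims=(3, 3),  # largest patch that may be sampled\n    random_state=0,\n)\nclf.fit(X_img, y_img)\nprint(clf.predict(X_img[:5]))"
]
},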
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n\n# import modules\n# .. note:: We use a private Cython module here to demonstrate what the patches\n# look like. This is not part of the public API. The Cython module used\n# is just a Python wrapper for the underlying Cython code and is not the\n# same as the Cython splitter used in the actual implementation.\n# To use the actual splitter, one should use the public API for the\n# relevant tree/forests class.\nimport numpy as np\n\nfrom sktree._lib.sklearn.tree._criterion import Gini\nfrom sktree.tree.manifold._morf_splitter import BestPatchSplitterTester"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize patch splitter\nThe patch splitter is used to generate patches for the projection matrices.\nWe will initialize the patch with some dummy values for the sake of this\nexample.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"criterion = Gini(1, np.array((0, 1)))\nmax_features = 6\nmin_samples_leaf = 1\nmin_weight_leaf = 0.0\nrandom_state = np.random.RandomState(100)\n\nboundary = None\nfeature_weight = None\nmonotonic_cst = None\nmissing_value_feature_mask = None\n\n# initialize some dummy data\nX = np.repeat(np.arange(25).astype(np.float32), 5).reshape(5, -1)\ny = np.array([0, 0, 0, 1, 1]).reshape(-1, 1).astype(np.float64)\nsample_weight = np.ones(5)\n\nprint(\"The shape of our dataset is: \", X.shape, y.shape, sample_weight.shape)"
]
},
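{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the ``data_dims`` convention concrete, the next cell (purely illustrative)\nreshapes the first flattened sample of the dummy ``X`` defined above back into its\n``(5, 5)`` structure.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Each row of X is a flattened 5 x 5 structure; reshape one sample to inspect it.\nsample = X[0].reshape(5, 5)\nprint(\"Flattened sample of length 25:\")\nprint(X[0])\nprint(\"Reshaped to (5, 5):\")\nprint(sample)"
]
},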
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate 1D patches\nNow that we have th patch splitter initialized, we can generate some patches\nand visualize how they appear on the data. We will make the patch 1D, which\nsamples multiple rows contiguously. This is a 1D patch of size 3.\n\n<div class=\"alert alert-danger\"><h4>Warning</h4><p>Do not use this interface directly in practice.</p></div>\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"min_patch_dims = np.array((1, 1))\nmax_patch_dims = np.array((3, 1))\ndim_contiguous = np.array((True, True))\ndata_dims = np.array((5, 5))\n\n# Note: not used, but passed for API compatibility\nfeature_combinations = 1.5\n\nsplitter = BestPatchSplitterTester(\n criterion,\n max_features,\n min_samples_leaf,\n min_weight_leaf,\n random_state,\n monotonic_cst,\n feature_combinations,\n min_patch_dims,\n max_patch_dims,\n dim_contiguous,\n data_dims,\n boundary,\n feature_weight,\n)\nsplitter.init_test(X, y, sample_weight, missing_value_feature_mask)\n\n# sample the projection matrix that consists of 1D patches\nproj_mat = splitter.sample_projection_matrix_py()\nprint(proj_mat.shape)\n\n# Visualize 1D patches\nfig, axs = plt.subplots(nrows=2, ncols=3, figsize=(12, 8), sharex=True, sharey=True, squeeze=True)\naxs = axs.flatten()\nfor idx, ax in enumerate(axs):\n ax.imshow(proj_mat[idx, :].reshape(data_dims), cmap=\"viridis\")\n ax.set(\n xlim=(-1, data_dims[1]),\n ylim=(-1, data_dims[0]),\n title=f\"Patch {idx}\",\n )\n\nfig.suptitle(\"1D Patch Visualization\")\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate 2D patches\nWe will make the patch 2D, which samples multiple rows contiguously. This is\na 2D patch of size 3 in the columns and 2 in the rows.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"min_patch_dims = np.array((1, 1))\nmax_patch_dims = np.array((3, 3))\ndim_contiguous = np.array((True, True))\ndata_dims = np.array((5, 5))\n\nsplitter = BestPatchSplitterTester(\n criterion,\n max_features,\n min_samples_leaf,\n min_weight_leaf,\n random_state,\n monotonic_cst,\n feature_combinations,\n min_patch_dims,\n max_patch_dims,\n dim_contiguous,\n data_dims,\n boundary,\n feature_weight,\n)\nsplitter.init_test(X, y, sample_weight, missing_value_feature_mask)\n\n# sample the projection matrix that consists of 1D patches\nproj_mat = splitter.sample_projection_matrix_py()\n\n# Visualize 2D patches\nfig, axs = plt.subplots(nrows=2, ncols=3, figsize=(12, 8), sharex=True, sharey=True, squeeze=True)\naxs = axs.flatten()\nfor idx, ax in enumerate(axs):\n ax.imshow(proj_mat[idx, :].reshape(data_dims), cmap=\"viridis\")\n ax.set(\n xlim=(-1, data_dims[1]),\n ylim=(-1, data_dims[0]),\n title=f\"Patch {idx}\",\n )\n\nfig.suptitle(\"2D Patch Visualization\")\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate 3D patches\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# initialize some dummy data\nX = np.repeat(np.arange(25 * 5).astype(np.float32), 5).reshape(5, -1)\ny = np.array([0, 0, 0, 1, 1]).reshape(-1, 1).astype(np.float64)\nsample_weight = np.ones(5)\n\n# We will make the patch 3D, which samples multiple rows contiguously. This is\n# a 3D patch of size 3 in the columns and 2 in the rows.\nmin_patch_dims = np.array((1, 2, 1))\nmax_patch_dims = np.array((3, 2, 4))\ndim_contiguous = np.array((True, True, True))\ndata_dims = np.array((5, 5, 5))\n\nsplitter = BestPatchSplitterTester(\n criterion,\n max_features,\n min_samples_leaf,\n min_weight_leaf,\n random_state,\n monotonic_cst,\n feature_combinations,\n min_patch_dims,\n max_patch_dims,\n dim_contiguous,\n data_dims,\n boundary,\n feature_weight,\n)\nsplitter.init_test(X, y, sample_weight, missing_value_feature_mask)\n\n# sample the projection matrix that consists of 1D patches\nproj_mat = splitter.sample_projection_matrix_py()\nprint(proj_mat.shape)\n\nfig = plt.figure()\nfor idx in range(3 * 2):\n ax = fig.add_subplot(2, 3, idx + 1, projection=\"3d\")\n\n # Plot the surface.\n z, x, y = proj_mat[idx, :].reshape(data_dims).nonzero()\n ax.scatter(x, y, z, alpha=1, marker=\"o\", color=\"black\")\n\n # Customize the z axis.\n ax.set_zlim(-1.01, data_dims[2])\n ax.set(\n xlim=(-1, data_dims[1]),\n ylim=(-1, data_dims[0]),\n title=f\"Patch {idx}\",\n )\n\nfig.suptitle(\"3D Patch Visualization\")\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Discontiguous Patches\nWe can also generate patches that are not contiguous. This is useful for\nanalyzing data that is structured, but not necessarily contiguous in certain\ndimensions. For example, we can generate patches that sample the data in a\nmultivariate time series, where the data consists of ``(n_channels, n_times)``\nand the patches are discontiguous in the channel dimension, but contiguous\nin the time dimension. Here, we show an example patch.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# initialize some dummy data\nX = np.repeat(np.arange(25).astype(np.float32), 5).reshape(5, -1)\ny = np.array([0, 0, 0, 1, 1]).reshape(-1, 1).astype(np.float64)\nsample_weight = np.ones(5)\nmax_features = 9\n\n# We will make the patch 2D, which samples multiple rows contiguously. This is\n# a 2D patch of size 3 in the columns and 2 in the rows.\nmin_patch_dims = np.array((2, 2))\nmax_patch_dims = np.array((3, 4))\ndim_contiguous = np.array((False, True))\ndata_dims = np.array((5, 5))\n\nsplitter = BestPatchSplitterTester(\n criterion,\n max_features,\n min_samples_leaf,\n min_weight_leaf,\n random_state,\n monotonic_cst,\n feature_combinations,\n min_patch_dims,\n max_patch_dims,\n dim_contiguous,\n data_dims,\n boundary,\n feature_weight,\n)\nsplitter.init_test(X, y, sample_weight, missing_value_feature_mask)\n\n# sample the projection matrix that consists of 1D patches\nproj_mat = splitter.sample_projection_matrix_py()\n\n# Visualize 2D patches\nfig, axs = plt.subplots(nrows=3, ncols=3, figsize=(12, 8), sharex=True, sharey=True, squeeze=True)\naxs = axs.flatten()\nfor idx, ax in enumerate(axs):\n ax.imshow(proj_mat[idx, :].reshape(data_dims), cmap=\"viridis\")\n ax.set(\n xlim=(-1, data_dims[1]),\n ylim=(-1, data_dims[0]),\n title=f\"Patch {idx}\",\n )\n\nfig.suptitle(\"2D Discontiguous Patch Visualization\")\nplt.show()"
]
},
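{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a hedged sketch of how the multivariate time-series setting\ndescribed above might look with the public API. It assumes\n:class:`sktree.tree.PatchObliqueDecisionTreeClassifier` accepts ``data_dims``,\n``min_patch_dims``, ``max_patch_dims`` and ``dim_contiguous`` as keyword arguments;\nthe channel/time sizes and random data are purely illustrative.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch only: random multivariate time series with 4 channels and 20 time points.\nimport numpy as np\n\nfrom sktree.tree import PatchObliqueDecisionTreeClassifier\n\nrng = np.random.default_rng(0)\nn_samples, n_channels, n_times = 30, 4, 20\nX_ts = rng.standard_normal((n_samples, n_channels * n_times)).astype(np.float32)\ny_ts = rng.integers(0, 2, size=n_samples)\n\nclf_ts = PatchObliqueDecisionTreeClassifier(\n    data_dims=(n_channels, n_times),\n    min_patch_dims=(1, 2),\n    max_patch_dims=(2, 5),\n    dim_contiguous=(False, True),  # discontiguous across channels, contiguous in time\n    random_state=0,\n)\nclf_ts.fit(X_ts, y_ts)\nprint(clf_ts.score(X_ts, y_ts))"
]
},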
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will make the patch 2D, which samples multiple rows contiguously. This is\na 2D patch of size 3 in the columns and 2 in the rows.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dim_contiguous = np.array((False, False))\n\nsplitter = BestPatchSplitterTester(\n criterion,\n max_features,\n min_samples_leaf,\n min_weight_leaf,\n random_state,\n monotonic_cst,\n feature_combinations,\n min_patch_dims,\n max_patch_dims,\n dim_contiguous,\n data_dims,\n boundary,\n feature_weight,\n)\nsplitter.init_test(X, y, sample_weight, missing_value_feature_mask)\n\n# sample the projection matrix that consists of 1D patches\nproj_mat = splitter.sample_projection_matrix_py()\n\n# Visualize 2D patches\nfig, axs = plt.subplots(nrows=3, ncols=3, figsize=(12, 8), sharex=True, sharey=True, squeeze=True)\naxs = axs.flatten()\nfor idx, ax in enumerate(axs):\n ax.imshow(proj_mat[idx, :].reshape(data_dims), cmap=\"viridis\")\n ax.set(\n xlim=(-1, data_dims[1]),\n ylim=(-1, data_dims[0]),\n title=f\"Patch {idx}\",\n )\n\nfig.suptitle(\"2D Discontiguous In All Dims Patch Visualization\")\nplt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Binary file not shown.
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Plot oblique forest and axis-aligned random forest predictions on cc18 datasets\n\nA performance comparison between oblique forest and standard axis-\naligned random forest using three datasets from OpenML benchmarking suites.\n\nTwo of these datasets, namely\n[WDBC](https://www.openml.org/search?type=data&sort=runs&id=1510)\nand [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534)\ndatasets consist of 31 features where the former dataset is entirely numeric\nand the latter dataset is entirely norminal. The third dataset, dubbed\n[cnae-9](https://www.openml.org/search?type=data&status=active&id=1468), is a\nnumeric dataset that has notably large feature space of 857 features. As you\nwill notice, of these three datasets, the oblique forest outperforms axis-aligned\nrandom forest on cnae-9 utilizing sparse random projection mechanism. All datasets\nare subsampled due to computational constraints.\n"
"\n# Plot oblique forest and axis-aligned random forest predictions on cc18 datasets\n\nA performance comparison between oblique forest and standard axis-\naligned random forest using three datasets from OpenML benchmarking suites.\n\nTwo of these datasets, namely\n[WDBC](https://www.openml.org/search?type=data&sort=runs&id=1510)\nand [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534)\ndatasets consist of 31 features where the former dataset is entirely numeric\nand the latter dataset is entirely norminal. The third dataset, dubbed\n[cnae-9](https://www.openml.org/search?type=data&status=active&id=1468), is a\nnumeric dataset that has notably large feature space of 857 features. As you\nwill notice, of these three datasets, the oblique forest outperforms axis-aligned\nrandom forest on cnae-9 utilizing sparse random projection mechanism. All datasets\nare subsampled due to computational constraints.\n\nFor an example of using extra-oblique trees/forests in practice on data, see the following\nexample `sphx_glr_auto_examples_plot_extra_oblique_random_forest.py`.\n"
]
},
{
