Commit 67176a9
Circle Ci committed on Sep 10, 2023 (1 parent: cde40de)
Showing 37 changed files with 4,411 additions and 38 deletions.
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 3ab5d8a48b9804837cceda513431b7d3
+config: 96eabe40c6d49eaf7a05630f5ecc9e70
 tags: 645f666f9bcd5a90fca523b33c5a78b7
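As the comment in the file says, Sphinx compares this config hash against the current configuration and forces a full rebuild when they differ, which is why a configuration change in this commit rewrites the hash. The sketch below shows the general idea only (an assumed md5-over-normalized-config scheme, not Sphinx's actual implementation):

import hashlib


def stable_hash(obj) -> str:
    # Normalize nested config values into a deterministic string,
    # then digest it; any change to a value yields a new hash.
    # Illustrative sketch only, not Sphinx's real build-info code.
    if isinstance(obj, dict):
        obj = sorted(obj.items())
    if isinstance(obj, (list, tuple)):
        obj = [stable_hash(o) for o in obj]
    return hashlib.md5(str(obj).encode("utf-8")).hexdigest()


# A changed option produces a new digest, signalling that
# previously built output can no longer be reused.
old = stable_hash({"html_theme": "alabaster"})
new = stable_hash({"html_theme": "pydata_sphinx_theme"})
print(old != new)  # True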
Binary file modified (+17.8 KB, 170%): dev/_downloads/07fcc19ba03226cd3d83d4e40ec44385/auto_examples_python.zip (binary file not shown)
43 changes: 43 additions & 0 deletions
dev/_downloads/08f830cfb84719260d86f1ecbb717431/plot_extra_oblique_random_forest.ipynb
@@ -0,0 +1,43 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n# Compare extra oblique forest and oblique random forest predictions on cc18 datasets\n\nA performance comparison between extra oblique forest and standard oblique random\nforest using five datasets from the OpenML benchmarking suite.\n\nExtra oblique forest uses extra oblique trees as its base model, which differ from classic\ndecision trees in the way they are built. When looking for the best split to\nseparate the samples of a node into two groups, random splits are drawn for each\nof the `max_features` randomly selected features and the best split among those is\nchosen. This is in contrast with the greedy approach, which evaluates the best possible\nthreshold for each candidate feature. For details of the original extra-tree, see [1]_.\n\nThe datasets used in this example, taken from the OpenML benchmarking suite, are:\n\n* [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534)\n* [WDBC](https://www.openml.org/search?type=data&sort=runs&id=1510)\n* [Lsvt](https://www.openml.org/search?type=data&sort=runs&id=1484)\n* [har](https://www.openml.org/search?type=data&sort=runs&id=1478)\n* [cnae-9](https://www.openml.org/search?type=data&sort=runs&id=1468)\n\nLarge datasets are subsampled due to computational constraints for running\nthis example. Note that `cnae-9` is\na high-dimensional dataset with 856 very sparse features, mostly consisting of zeros.\n\n+------------------+-------------+--------------+----------+\n| Dataset          | # Samples   | # Features   | Datatype |\n+==================+=============+==============+==========+\n| Phishing Website | 2000        | 30           | nominal  |\n+------------------+-------------+--------------+----------+\n| WDBC             | 455         | 30           | numeric  |\n+------------------+-------------+--------------+----------+\n| Lsvt             | 100         | 310          | numeric  |\n+------------------+-------------+--------------+----------+\n| har              | 2000        | 561          | numeric  |\n+------------------+-------------+--------------+----------+\n| cnae-9           | 864         | 856          | numeric  |\n+------------------+-------------+--------------+----------+\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>In the following example, the parameters `max_depth` and `max_features` are\n    set deliberately low so that the example can run in our CI test suite.\n    For normal usage, these parameters should be set to appropriate values depending\n    on the dataset.</p></div>\n\n## Discussion\nExtra Oblique Tree demonstrates performance similar to that of regular Oblique Tree on average,\nwith some increase in variance. See [1]_ for a detailed discussion of the bias-variance tradeoff\nof extra-trees vs. normal trees.\n\nHowever, Extra Oblique Tree runs substantially faster than Oblique Tree on some datasets, due to\nthe random split process, which omits the computationally expensive search for the best split.\nThe main source of the speed-up is the omission of sorting samples during the\nsplitting of a node. In standard trees, samples are sorted in ascending order to determine the\nbest split, hence the complexity is `O(n\\log(n))`. In extra trees, samples\nare not sorted and the split is determined by randomly drawing a threshold from the feature's\nrange, hence the complexity is `O(n)`. This makes the algorithm more suitable for large datasets.\n\n## References\n.. [1] P. Geurts, D. Ernst, and L. Wehenkel, \"Extremely randomized trees\", Machine Learning, 63(1),\n    3-42, 2006.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from datetime import datetime\n\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport seaborn as sns\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.model_selection import RepeatedKFold, cross_validate\n\nfrom sktree import ExtraObliqueRandomForestClassifier, ObliqueRandomForestClassifier\n\n# Model parameters\nmax_depth = 3\nmax_features = \"sqrt\"\nmax_sample_size = 2000\nrandom_state = 123\nn_estimators = 50\n\n# Datasets\nphishing_website = 4534\nwdbc = 1510\nlsvt = 1484\nhar = 1478\ncnae_9 = 1468\n\ndata_ids = [phishing_website, wdbc, lsvt, har, cnae_9]\ndf = pd.DataFrame()\n\n\ndef load_cc18(data_id):\n    df = fetch_openml(data_id=data_id, as_frame=True, parser=\"pandas\")\n\n    # extract the dataset name\n    d_name = df.details[\"name\"]\n\n    # Subsampling large datasets\n    n = int(df.frame.shape[0] * 0.8)\n\n    if n > max_sample_size:\n        n = max_sample_size\n\n    df = df.frame.sample(n, random_state=random_state)\n    X, y = df.iloc[:, :-1], df.iloc[:, -1]\n\n    return X, y, d_name\n\n\ndef get_scores(X, y, d_name, n_cv=5, n_repeats=1, **kwargs):\n    clfs = [ExtraObliqueRandomForestClassifier(**kwargs), ObliqueRandomForestClassifier(**kwargs)]\n    dim = X.shape\n    tmp = []\n\n    for i, clf in enumerate(clfs):\n        t0 = datetime.now()\n        cv = RepeatedKFold(n_splits=n_cv, n_repeats=n_repeats, random_state=kwargs[\"random_state\"])\n        test_score = cross_validate(estimator=clf, X=X, y=y, cv=cv, scoring=\"accuracy\")\n        time_taken = datetime.now() - t0\n        # convert the time taken to seconds\n        time_taken = time_taken.total_seconds()\n\n        tmp.append(\n            [\n                d_name,\n                dim,\n                [\"EORF\", \"ORF\"][i],\n                test_score[\"test_score\"],\n                test_score[\"test_score\"].mean(),\n                time_taken,\n            ]\n        )\n\n    df = pd.DataFrame(tmp, columns=[\"dataset\", \"dimension\", \"model\", \"score\", \"mean\", \"time_taken\"])\n    df = df.explode(\"score\")\n    df[\"score\"] = df[\"score\"].astype(float)\n    df.reset_index(inplace=True, drop=True)\n\n    return df\n\n\nparams = {\n    \"max_features\": max_features,\n    \"n_estimators\": n_estimators,\n    \"max_depth\": max_depth,\n    \"random_state\": random_state,\n    \"n_cv\": 10,\n    \"n_repeats\": 1,\n}\n\nfor data_id in data_ids:\n    X, y, d_name = load_cc18(data_id=data_id)\n    tmp = get_scores(X=X, y=y, d_name=d_name, **params)\n    df = pd.concat([df, tmp])\n\n# Show the time taken to train each model\nprint(pd.DataFrame.from_dict(params, orient=\"index\", columns=[\"value\"]))\nprint(df.groupby([\"dataset\", \"dimension\", \"model\"])[[\"time_taken\"]].mean())\n\n# Draw a comparison plot\nd_names = df.dataset.unique()\nN = d_names.shape[0]\n\nfig, ax = plt.subplots(1, N)\nfig.set_size_inches(6 * N, 6)\n\nfor i, name in enumerate(d_names):\n    sns.stripplot(\n        data=df.query(f'dataset == \"{name}\"'),\n        x=\"model\",\n        y=\"score\",\n        ax=ax[i],\n        dodge=True,\n    )\n    sns.boxplot(\n        data=df.query(f'dataset == \"{name}\"'),\n        x=\"model\",\n        y=\"score\",\n        ax=ax[i],\n        color=\"white\",\n    )\n    ax[i].set_title(name)\n    if i != 0:\n        ax[i].set_ylabel(\"\")\n        ax[i].set_xlabel(\"\")\n# show the figure\nplt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
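The notebook above attributes the speed difference to the splitting rule: the greedy search sorts each candidate feature and scans every threshold, while the extra-trees rule draws one threshold at random from the feature's range. The sketch below illustrates that contrast on a single feature. It is illustrative code only, not part of this commit; the helpers gini, weighted_gini, best_split_greedy, and random_split are names invented for this sketch.

import numpy as np

rng = np.random.default_rng(0)


def gini(y):
    # Gini impurity of a label array.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)


def weighted_gini(y, mask):
    # Impurity of a split, weighted by child sizes; degenerate
    # splits (an empty child) are ruled out with +inf.
    n = len(y)
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return np.inf
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_split_greedy(feature, y):
    # Classic approach: sort the feature, then scan every midpoint
    # between consecutive distinct values; the sort alone is O(n log n).
    order = np.argsort(feature)
    f = feature[order]
    best_thr, best_imp = None, np.inf
    for i in range(1, len(f)):
        if f[i] == f[i - 1]:
            continue
        thr = (f[i] + f[i - 1]) / 2.0
        imp = weighted_gini(y, feature <= thr)
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr


def random_split(feature):
    # Extra-trees approach: draw one threshold uniformly from the
    # feature's range; only the min and max are needed, so O(n).
    return rng.uniform(feature.min(), feature.max())


feature = rng.normal(size=200)
y = (feature + rng.normal(scale=0.5, size=200) > 0).astype(int)
print("greedy threshold:", best_split_greedy(feature, y))
print("random threshold:", random_split(feature))

The greedy rule pays for an O(n log n) sort plus a full scan of midpoints; the random rule needs only the feature's minimum and maximum, which is the O(n) behavior described in the notebook.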
147 changes: 147 additions & 0 deletions
dev/_downloads/47a0f99ed91a41ad89e274c306824ef4/plot_extra_orf_sample_size.py
@@ -0,0 +1,147 @@
""" | ||
======================================================================================== | ||
Speed of Extra Oblique Random Forest vs Oblique Random Forest on different dataset sizes | ||
======================================================================================== | ||
A performance comparison between extra oblique forest and standard oblique random | ||
forest on different dataset sizes. The purpose of this comparison is to show the speed of | ||
changes for each models as dataset size increases. For more information, see [1]_. | ||
The datasets used in this example are from the OpenML benchmarking suite are: | ||
* [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534) | ||
* [har](https://www.openml.org/search?type=data&sort=runs&id=1478) | ||
+------------------+---------+----------+----------+ | ||
| dataset | samples | features | datatype | | ||
+------------------+---------+----------+----------+ | ||
| Phishing Website | 11055 | 30 | nominal | | ||
+------------------+---------+----------+----------+ | ||
| har | 10299 | 562 | numeric | | ||
+------------------+---------+----------+----------+ | ||
.. note:: In the following example, the parameters `max_depth` and 'max_features` are | ||
set deliberately low in order to pass the CI test suit. For normal usage, these parameters | ||
should be set to appropriate values depending on the dataset. | ||
Discussion | ||
---------- | ||
In this section, the focus is on the time taken to train each model. The results show | ||
that extra oblique random forest is faster than standard oblique random forest on all | ||
datasets. Notably, the speed of extra oblique random forest and oblique random forest | ||
grows linearly with the increase in sample size but grows faster for the oblique random | ||
forest. The difference between the two models is more significant on datasets with higher | ||
dimensions. | ||
References | ||
---------- | ||
.. [1] P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized trees", Machine Learning, 63(1), | ||
3-42, 2006. | ||
""" | ||

from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.model_selection import RepeatedKFold, cross_validate

from sktree import ExtraObliqueRandomForestClassifier, ObliqueRandomForestClassifier

# Model Parameters
max_depth = 3
max_features = "sqrt"
max_sample_size = 10000
random_state = 123
n_estimators = 50

# Datasets
phishing_website = 4534
har = 1478

data_ids = [phishing_website, har]
df = pd.DataFrame()


def load_cc18(data_id, sample_size):
    df = fetch_openml(data_id=data_id, as_frame=True, parser="pandas")

    # extract the dataset name
    d_name = df.details["name"]

    # Subsampling large datasets
    n = sample_size

    if n > max_sample_size:
        n = max_sample_size

    df = df.frame.sample(n, random_state=random_state)
    X, y = df.iloc[:, :-1], df.iloc[:, -1]

    return X, y, d_name


def get_scores(X, y, d_name, n_cv=5, n_repeats=1, **kwargs):
    # Fit both forests ("EORF" = extra oblique, "ORF" = oblique) with the
    # same hyperparameters and record cross-validated accuracy and wall time.
    clfs = [ExtraObliqueRandomForestClassifier(**kwargs), ObliqueRandomForestClassifier(**kwargs)]
    dim = X.shape
    tmp = []

    for i, clf in enumerate(clfs):
        t0 = datetime.now()
        cv = RepeatedKFold(n_splits=n_cv, n_repeats=n_repeats, random_state=kwargs["random_state"])
        test_score = cross_validate(estimator=clf, X=X, y=y, cv=cv, scoring="accuracy")
        time_taken = datetime.now() - t0
        # convert the time taken to seconds
        time_taken = time_taken.total_seconds()

        tmp.append(
            [
                d_name,
                dim,
                ["EORF", "ORF"][i],
                test_score["test_score"],
                test_score["test_score"].mean(),
                time_taken,
            ]
        )

    df = pd.DataFrame(tmp, columns=["dataset", "dimension", "model", "score", "mean", "time_taken"])
    df = df.explode("score")
    df["score"] = df["score"].astype(float)
    df.reset_index(inplace=True, drop=True)

    return df


params = {
    "max_features": max_features,
    "n_estimators": n_estimators,
    "max_depth": max_depth,
    "random_state": random_state,
    "n_cv": 10,
    "n_repeats": 1,
}

for data_id in data_ids:
    for n in np.linspace(1000, max_sample_size, 10).astype(int):
        X, y, d_name = load_cc18(data_id=data_id, sample_size=n)
        tmp = get_scores(X=X, y=y, d_name=d_name, **params)
        df = pd.concat([df, tmp])
df["n_row"] = [item[0] for item in df.dimension]
# Show the time taken to train each model
df_tmp = df.groupby(["dataset", "n_row", "model"])[["time_taken"]].mean()

# Draw a comparison plot
d_names = df.dataset.unique()
N = d_names.shape[0]

fig, ax = plt.subplots(1, N)
# plot the results with time taken on y axis and sample size on x axis
fig.set_size_inches(6 * N, 6)
for i, d_name in enumerate(d_names):
    df_tmp = df[df["dataset"] == d_name]
    # hue already colors one line per model
    sns.lineplot(data=df_tmp, x="n_row", y="time_taken", hue="model", ax=ax[i])
    ax[i].set_title(d_name)
plt.show()
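The script above times each model through repeated cross-validation. For a quick look at the same timing trend without the CV machinery, a single fit per sample size is enough. The following is a rough sketch, not part of this commit, assuming sktree is installed and using sklearn.datasets.make_classification as a stand-in for the OpenML datasets:

from time import perf_counter

from sklearn.datasets import make_classification

from sktree import ExtraObliqueRandomForestClassifier, ObliqueRandomForestClassifier

# Time one fit per model and sample size; hyperparameters mirror the script above.
for n in (1000, 2000, 4000):
    X, y = make_classification(n_samples=n, n_features=30, random_state=123)
    for name, Forest in (("EORF", ExtraObliqueRandomForestClassifier),
                         ("ORF", ObliqueRandomForestClassifier)):
        clf = Forest(n_estimators=50, max_depth=3, max_features="sqrt", random_state=123)
        t0 = perf_counter()
        clf.fit(X, y)
        print(f"n={n:5d}  {name}: {perf_counter() - t0:.2f}s")

If the discussion above holds, the gap between the two timings should widen as n grows, matching the linear-but-steeper growth reported for the oblique random forest.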
43 changes: 43 additions & 0 deletions
dev/_downloads/643a076e7aec94b455be6c93cfd7520f/plot_extra_orf_sample_size.ipynb
@@ -0,0 +1,43 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n# Speed of Extra Oblique Random Forest vs Oblique Random Forest on different dataset sizes\n\nA performance comparison between extra oblique forest and standard oblique random\nforest on different dataset sizes. The purpose of this comparison is to show how the\ntraining speed of each model changes as the dataset size increases. For more information, see [1]_.\n\nThe datasets used in this example, taken from the OpenML benchmarking suite, are:\n\n* [Phishing Website](https://www.openml.org/search?type=data&sort=runs&id=4534)\n* [har](https://www.openml.org/search?type=data&sort=runs&id=1478)\n\n+------------------+---------+----------+----------+\n| dataset          | samples | features | datatype |\n+------------------+---------+----------+----------+\n| Phishing Website | 11055   | 30       | nominal  |\n+------------------+---------+----------+----------+\n| har              | 10299   | 562      | numeric  |\n+------------------+---------+----------+----------+\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>In the following example, the parameters `max_depth` and `max_features` are\n    set deliberately low so that the example can run in the CI test suite. For normal usage, these parameters\n    should be set to appropriate values depending on the dataset.</p></div>\n\n## Discussion\nIn this section, the focus is on the time taken to train each model. The results show\nthat extra oblique random forest trains faster than standard oblique random forest on all\ndatasets. Notably, the training time of both models grows linearly with the sample size,\nbut it grows faster for the oblique random forest. The difference between the two\nmodels is more significant on datasets with higher dimensions.\n\n## References\n.. [1] P. Geurts, D. Ernst, and L. Wehenkel, \"Extremely randomized trees\", Machine Learning, 63(1),\n    3-42, 2006.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from datetime import datetime\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.model_selection import RepeatedKFold, cross_validate\n\nfrom sktree import ExtraObliqueRandomForestClassifier, ObliqueRandomForestClassifier\n\n# Model Parameters\nmax_depth = 3\nmax_features = \"sqrt\"\nmax_sample_size = 10000\nrandom_state = 123\nn_estimators = 50\n\n# Datasets\nphishing_website = 4534\nhar = 1478\n\ndata_ids = [phishing_website, har]\ndf = pd.DataFrame()\n\n\ndef load_cc18(data_id, sample_size):\n    df = fetch_openml(data_id=data_id, as_frame=True, parser=\"pandas\")\n\n    # extract the dataset name\n    d_name = df.details[\"name\"]\n\n    # Subsampling large datasets\n    n = sample_size\n\n    if n > max_sample_size:\n        n = max_sample_size\n\n    df = df.frame.sample(n, random_state=random_state)\n    X, y = df.iloc[:, :-1], df.iloc[:, -1]\n\n    return X, y, d_name\n\n\ndef get_scores(X, y, d_name, n_cv=5, n_repeats=1, **kwargs):\n    clfs = [ExtraObliqueRandomForestClassifier(**kwargs), ObliqueRandomForestClassifier(**kwargs)]\n    dim = X.shape\n    tmp = []\n\n    for i, clf in enumerate(clfs):\n        t0 = datetime.now()\n        cv = RepeatedKFold(n_splits=n_cv, n_repeats=n_repeats, random_state=kwargs[\"random_state\"])\n        test_score = cross_validate(estimator=clf, X=X, y=y, cv=cv, scoring=\"accuracy\")\n        time_taken = datetime.now() - t0\n        # convert the time taken to seconds\n        time_taken = time_taken.total_seconds()\n\n        tmp.append(\n            [\n                d_name,\n                dim,\n                [\"EORF\", \"ORF\"][i],\n                test_score[\"test_score\"],\n                test_score[\"test_score\"].mean(),\n                time_taken,\n            ]\n        )\n\n    df = pd.DataFrame(tmp, columns=[\"dataset\", \"dimension\", \"model\", \"score\", \"mean\", \"time_taken\"])\n    df = df.explode(\"score\")\n    df[\"score\"] = df[\"score\"].astype(float)\n    df.reset_index(inplace=True, drop=True)\n\n    return df\n\n\nparams = {\n    \"max_features\": max_features,\n    \"n_estimators\": n_estimators,\n    \"max_depth\": max_depth,\n    \"random_state\": random_state,\n    \"n_cv\": 10,\n    \"n_repeats\": 1,\n}\n\nfor data_id in data_ids:\n    for n in np.linspace(1000, max_sample_size, 10).astype(int):\n        X, y, d_name = load_cc18(data_id=data_id, sample_size=n)\n        tmp = get_scores(X=X, y=y, d_name=d_name, **params)\n        df = pd.concat([df, tmp])\ndf[\"n_row\"] = [item[0] for item in df.dimension]\n# Show the time taken to train each model\ndf_tmp = df.groupby([\"dataset\", \"n_row\", \"model\"])[[\"time_taken\"]].mean()\n\n# Draw a comparison plot\nd_names = df.dataset.unique()\nN = d_names.shape[0]\n\nfig, ax = plt.subplots(1, N)\n# plot the results with time taken on y axis and sample size on x axis\nfig.set_size_inches(6 * N, 6)\nfor i, d_name in enumerate(d_names):\n    df_tmp = df[df[\"dataset\"] == d_name]\n    sns.lineplot(data=df_tmp, x=\"n_row\", y=\"time_taken\", hue=\"model\", ax=ax[i])\n    ax[i].set_title(d_name)\nplt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
Binary file modified (+20.3 KB, 160%): dev/_downloads/6f1e7a639e0699d6164445b55e6c116d/auto_examples_jupyter.zip (binary file not shown)