[ENH] Add dataset generators (#169)

* Add datasets generating functions

Signed-off-by: Adam Li <adam2392@gmail.com>
neurodata · Nov 13, 2023 · 030a064
1 parent e4728fa
Showing 17 changed files with 661 additions and 40 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -3,13 +3,13 @@
 Thanks for considering contributing! Please read this document to learn the various ways you can contribute to this project and how to go about doing it.
 
 **Submodule dependency on a fork of scikit-learn**
-Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a maintained fork of scikit-learn at https://github.com/neurodata/scikit-learn, where specifically, the `submodulev2` branch is used to build and install this repo. We keep that fork well-maintained and up-to-date with respect to the main sklearn repo. The only difference is the refactoring of the `tree/` submodule. This fork is used internally under the namespace ``sktree._lib.sklearn``. It is necessary to use this fork for anything related to:
+Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a maintained fork of scikit-learn at <https://github.com/neurodata/scikit-learn>, where specifically, the `submodulev3` branch is used to build and install this repo. We keep that fork well-maintained and up-to-date with respect to the main sklearn repo. The only difference is the refactoring of the `tree/` submodule. This fork is used internally under the namespace ``sktree._lib.sklearn``. It is necessary to use this fork for anything related to:
 
 - `RandomForest*`
 - `ExtraTrees*`
 - or any importable items from the `tree/` submodule, whether it is a Cython or Python object
 
-If you are developing for scikit-tree, we will always depend on the most up-to-date commit of `https://github.com/neurodata/scikit-learn/submodulev2` as a submodule within scikit-tree. This branch is consistently maintained for changes upstream that occur in the scikit-learn tree submodule. This ensures that our fork maintains consistency and robustness due to bug fixes and improvements upstream.
+If you are developing for scikit-tree, we will always depend on the most up-to-date commit of `https://github.com/neurodata/scikit-learn/submodulev3` as a submodule within scikit-tree. This branch is consistently maintained for changes upstream that occur in the scikit-learn tree submodule. This ensures that our fork maintains consistency and robustness due to bug fixes and improvements upstream.
 
 ## Bug reports and feature requests
 
@@ -27,16 +27,16 @@ code sample or an executable test case demonstrating the expected behavior.
 
 We use GitHub issues to track feature requests. Before you create a feature request:
 
-* Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing
+- Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing
 it first on a GitHub issue.
-* Check the documentation to make sure your feature does not already exist.
-* Do [a quick search](https://github.com/neurodata/scikit-tree/issues) to see whether your feature has already been suggested.
+- Check the documentation to make sure your feature does not already exist.
+- Do [a quick search](https://github.com/neurodata/scikit-tree/issues) to see whether your feature has already been suggested.
 
 When creating your request, please:
 
-* Provide a clear title and description.
-* Explain why the enhancement would be useful. It may be helpful to highlight the feature in other libraries.
-* Include code examples to demonstrate how the enhancement would be used.
+- Provide a clear title and description.
+- Explain why the enhancement would be useful. It may be helpful to highlight the feature in other libraries.
+- Include code examples to demonstrate how the enhancement would be used.
 
 ## Making a pull request
 
@@ -52,7 +52,7 @@ When you're ready to contribute code to address an open issue, please follow the
 
         git clone https://github.com/USERNAME/scikit-tree.git
 
-    or 
+    or
 
         git clone git@github.com:USERNAME/scikit-tree.git
 
@@ -142,6 +142,7 @@ When you're ready to contribute code to address an open issue, please follow the
     </details>
 
 ### Installing locally with Meson
+
 Meson is a modern build system with a lot of nice features, which is why we use it for our build system to compile the Cython/C++ code.
 However, there are some intricacies that might be new to a pure Python developer.
 
@@ -151,7 +152,7 @@ In general, the steps to build scikit-tree are:
 - build and install scikit-tree locally using `spin`
 
 Example would be:
-        
+
         pip uninstall scikit-learn
 
         # install the fork of scikit-learn
@@ -172,13 +173,13 @@ The most common errors come from the following:
 
 The CI files for GitHub Actions show how to build and install for each OS.
 
-
 ### Writing docstrings
 
 We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings
 of public classes and methods. All docstrings should adhere to the [Numpy styling convention](https://www.sphinx-doc.org/en/master/usage/extensions/example_numpy.html).
 
 ### Testing Changes Locally With Poetry
+
 With poetry installed, we have included a few convenience functions to check your code. These checks must pass and will be checked by the PR's continuous integration services. You can install the various different developer dependencies with poetry:
 
     poetry install --with style,docs,test
@@ -217,6 +218,22 @@ If you need to add new, or remove old dependencies, then you need to modify the
 
 To update the lock file.
 
+## Developing a new Tree model
+
+Here, we define some high-level procedures for how to best approach implementing a new decision-tree model that is not supported yet in scikit-tree.
+
+1. First-pass on implementation:
+
+    Implement a Cython splitter class and expose it in Python afterwards. Follow the framework for PatchObliqueSplitter and ObliqueSplitter and their respective decision-tree models: PatchObliqueDecisionTreeClassifier and ObliqueDecisionTreeClassifier.
+
+2. Second-pass on implementation:
+
+    This involves extending relevant API beyond just the Splitter in Cython. This requires maintaining some degree of backwards-compatibility. Extend the existing API for Tree, TreeBuilder, Criterion, or ObliqueSplitter to enable whatever functionality you desire.
+
+3. Third-pass on implementation:
+
+    This is the most complex implementation and should, in theory, rarely be needed. It involves designing a change in the scikit-learn fork submodule as well as the relevant changes in scikit-tree itself. Extend the scikit-learn fork API. This requires maintaining some degree of backwards-compatibility and testing the proposed changes with respect to whatever changes you then make in scikit-tree.
+
 ---
 
 The Project abides by the Organization's [code of conduct](https://github.com/py-why/governance/blob/main/CODE-OF-CONDUCT.md) and [trademark policy](https://github.com/py-why/governance/blob/main/TRADEMARKS.md).

diff --git a/DEVELOPING.md b/DEVELOPING.md
@@ -6,11 +6,11 @@
 - [Development Tasks](#development-tasks)
         - [Basic Verification](#basic-verification)
         - [Docsite](#docsite)
-    - [Details](#details)
-        - [Coding Style](#coding-style)
-        - [Lint](#lint)
-        - [Type checking](#type-checking)
-        - [Unit tests](#unit-tests)
+  - [Details](#details)
+    - [Coding Style](#coding-style)
+    - [Lint](#lint)
+    - [Type checking](#type-checking)
+    - [Unit tests](#unit-tests)
 - [Advanced Updating submodules](#advanced-updating-submodules)
 - [Cython and C++](#cython-and-c)
 - [Making a Release](#making-a-release)
@@ -19,16 +19,16 @@
 
 # Requirements
 
-* Python 3.9+
-* numpy>=1.25
-* scipy>=1.11
-* scikit-learn>=1.3.1
+- Python 3.9+
+- numpy>=1.25
+- scipy>=1.11
+- scikit-learn>=1.3.1
 
 For the other requirements, inspect the ``pyproject.toml`` file.
 
 # Setting up your development environment
 
-We recommend using miniconda, as Python virtual environments may not properly set up the compilers necessary for our compiled code. For detailed information on setting up and managing conda environments, see https://conda.io/docs/test-drive.html.
+We recommend using miniconda, as Python virtual environments may not properly set up the compilers necessary for our compiled code. For detailed information on setting up and managing conda environments, see <https://conda.io/docs/test-drive.html>.
 
 <!-- Setup a conda env -->
 
@@ -38,7 +38,7 @@ We recommend using miniconda, as python virtual environments may not setup prope
 **Make sure you specify a Python version if your system defaults to anything less than Python 3.9.**
 
 **Any commands should ALWAYS be after you have activated your conda environment.**
-Next, install necessary build dependencies. For more information, see https://scikit-learn.org/stable/developers/advanced_installation.html.
+Next, install necessary build dependencies. For more information, see <https://scikit-learn.org/stable/developers/advanced_installation.html>.
 
     conda install -c conda-forge joblib threadpoolctl pytest compilers llvm-openmp
 
@@ -77,7 +77,7 @@ For other commands, see
 
 Note at this stage, you will be unable to run Python commands directly. For example, ``pytest ./sktree`` will not work.
 
-However, after installing and building the project from source using meson, you can leverage editable installs to make testing code changes much faster. For more information on meson-python's progress supporting editable installs in a better fashion, see https://meson-python.readthedocs.io/en/latest/how-to-guides/editable-installs.html.
+However, after installing and building the project from source using meson, you can leverage editable installs to make testing code changes much faster. For more information on meson-python's progress supporting editable installs in a better fashion, see <https://meson-python.readthedocs.io/en/latest/how-to-guides/editable-installs.html>.
 
     pip install --no-build-isolation --editable .
 
@@ -88,6 +88,7 @@ However, after installing and building the project from source using meson, you
 the unit-tests should run.
 
 # Development Tasks
+
 There are a series of top-level tasks available through Poetry. If you have updated the dependencies, please run `poetry update` to update the lock file. These can each be run via
 
  `poetry run poe <taskname>`
@@ -99,16 +100,18 @@ To do so, first install poetry and poethepoet.
 Now, you are ready to run quick commands to format the codebase, lint the codebase and type-check the codebase.
 
 ### Basic Verification
+
 * **format** - runs the suite of formatting tools, applying changes to make the code compliant
-* **format_check** - runs the suite of formatting tools checking for compliance
-* **lint** - runs the suite of linting tools
-* **type_check** - performs static typechecking of the codebase using mypy
-* **unit_test** - executes fast unit tests
-* **verify** - executes the basic PR verification suite, which includes all the tasks listed above
+- **format_check** - runs the suite of formatting tools checking for compliance
+- **lint** - runs the suite of linting tools
+- **type_check** - performs static typechecking of the codebase using mypy
+- **unit_test** - executes fast unit tests
+- **verify** - executes the basic PR verification suite, which includes all the tasks listed above
 
 ### Docsite
+
 * **build_docs** - build the API documentation site
-* **build_docs_noplot** - build the API documentation site without explicitly running any of the examples, for faster local checks of any documentation updates.
+- **build_docs_noplot** - build the API documentation site without explicitly running any of the examples, for faster local checks of any documentation updates.
 
 ## Details
 
@@ -144,8 +147,8 @@ In order for any code to be added to the repository, we require unit tests to pa
 
 # (Advanced) Updating submodules
 
-Scikit-tree relies on a submodule of a forked-version of scikit-learn for certain Python and Cython code that extends the ``DecisionTree*`` models. Usually, if a developer is making changes, they should go over to the ``submodulev3`` branch on ``https://github.com/neurodata/scikit-learn`` and 
-submit a PR to make changes to the submodule. 
+Scikit-tree relies on a submodule of a forked-version of scikit-learn for certain Python and Cython code that extends the ``DecisionTree*`` models. Usually, if a developer is making changes, they should go over to the ``submodulev3`` branch on ``https://github.com/neurodata/scikit-learn`` and
+submit a PR to make changes to the submodule.
 
 This should **ALWAYS** be supported by some use-case in scikit-tree. We want the minimal amount of code-change in our forked version of scikit-learn to make it very easy to merge in upstream changes, bug fixes and features for tree-based code.
 
@@ -160,6 +163,7 @@ Now, you can re-build the project using the latest submodule changes.
     spin build --clean
 
 # Cython and C++
+
 The general design of scikit-tree follows that of the tree models inside scikit-learn, where tree-based models are written in Cython or C++. The actual forest (e.g. RandomForest or ExtraTrees) is then just a Python API wrapper that creates an ensemble of the trees.
 
 In order to develop new tree models, generally Cython and C++ code will need to be written in order to optimize the tree building process, otherwise fitting a single forest model would take very long.
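
The split between a compiled tree and a Python forest wrapper described above can be illustrated with a minimal pure-Python sketch. It is not scikit-tree or scikit-learn code: `Stump` and `ToyForest` are hypothetical names, and the stump only stands in for the Cython/C++ tree so that the wrapper logic (bootstrap, fit one tree per sample, majority vote) is visible.

```python
import random
from collections import Counter


class Stump:
    """Hypothetical stand-in for a compiled tree: a single threshold split."""

    def fit(self, X, y):
        majority = Counter(y).most_common(1)[0][0]
        # Fallback when no useful split exists: always predict the majority class.
        self.feature, self.threshold = 0, float("inf")
        self.left, self.right = majority, majority
        best_errors = sum(label != majority for label in y)
        for j in range(len(X[0])):
            values = sorted({row[j] for row in X})
            # Candidate thresholds are midpoints between distinct sorted values.
            for lo, hi in zip(values, values[1:]):
                thr = (lo + hi) / 2.0
                left = [lab for row, lab in zip(X, y) if row[j] <= thr]
                right = [lab for row, lab in zip(X, y) if row[j] > thr]
                l_lab = Counter(left).most_common(1)[0][0]
                r_lab = Counter(right).most_common(1)[0][0]
                errors = sum(lab != l_lab for lab in left)
                errors += sum(lab != r_lab for lab in right)
                if errors < best_errors:
                    best_errors = errors
                    self.feature, self.threshold = j, thr
                    self.left, self.right = l_lab, r_lab
        return self

    def predict_one(self, row):
        return self.left if row[self.feature] <= self.threshold else self.right


class ToyForest:
    """Pure-Python wrapper: bootstrap the data, fit one tree per sample, vote."""

    def __init__(self, n_trees=15, seed=0):
        self.n_trees = n_trees
        self.seed = seed

    def fit(self, X, y):
        rng = random.Random(self.seed)
        n = len(X)
        self.trees_ = []
        for _ in range(self.n_trees):
            idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
            tree = Stump().fit([X[i] for i in idx], [y[i] for i in idx])
            self.trees_.append(tree)
        return self

    def predict(self, X):
        votes = [[tree.predict_one(row) for tree in self.trees_] for row in X]
        return [Counter(v).most_common(1)[0][0] for v in votes]


X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = ToyForest(n_trees=15, seed=0).fit(X, y)
predictions = forest.predict([[1.5], [11.5]])
```

In this sketch all the expensive work sits in `Stump.fit`, which is the part a real implementation would move into Cython or C++, while `ToyForest` stays plain Python.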
@@ -170,7 +174,7 @@ Scikit-tree is in-line with scikit-learn and thus relies on each new version rel
 
 1. Download wheels from GH Actions and put all wheels into a ``dist/`` folder
 
-https://github.com/neurodata/scikit-tree/actions/workflows/build_wheels.yml will have all the wheels for common OSes built for each Python version.
+<https://github.com/neurodata/scikit-tree/actions/workflows/build_wheels.yml> will have all the wheels for common OSes built for each Python version.
 
 2. Upload wheels to test PyPi
 
@@ -186,10 +190,10 @@ Verify that installations work as expected on your machine.
 twine upload dist/*
 ```
 
-or if you have two-factor authentication enabled: https://pypi.org/help/#apitoken
+or if you have two-factor authentication enabled: <https://pypi.org/help/#apitoken>
 
     twine upload dist/* --repository scikit-tree
 
 4. Update version number on ``meson.build`` and ``pyproject.toml`` to the relevant version.
 
-See https://github.com/neurodata/scikit-tree/pull/160 as an example.
+See https://github.com/neurodata/scikit-tree/pull/160 as an example.
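
For step 4 of the release procedure, a small consistency check can catch a version bump that was applied to only one of the two files. The sketch below is an illustration, not part of the repository: it assumes `meson.build` declares the version as `version: '...'` and `pyproject.toml` as `version = "..."`, and the helper names and the `1.2.3` version string are hypothetical.

```python
import re


def read_versions(meson_text, pyproject_text):
    """Extract the version strings declared in the two build files."""
    # \b keeps `meson_version: '...'` from matching, since `_` is a word character.
    meson = re.search(r"\bversion\s*:\s*'([^']+)'", meson_text)
    pyproject = re.search(
        r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE
    )
    if meson is None or pyproject is None:
        raise ValueError("could not find a version declaration in both files")
    return meson.group(1), pyproject.group(1)


def versions_match(meson_text, pyproject_text):
    """Return True when both files declare the same version string."""
    meson_version, pyproject_version = read_versions(meson_text, pyproject_text)
    return meson_version == pyproject_version


# Hypothetical file contents, inlined so the example runs without the repo.
meson_text = (
    "project(\n"
    "  'scikit-tree',\n"
    "  meson_version: '>= 0.64.0',\n"
    "  version: '1.2.3',\n"
    ")\n"
)
pyproject_text = '[project]\nname = "scikit-tree"\nversion = "1.2.3"\n'
in_sync = versions_match(meson_text, pyproject_text)
```

In practice the two strings would be read from the real files with `pathlib.Path(...).read_text()` before tagging a release.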
diff --git a/README.md b/README.md
@@ -17,14 +17,16 @@ Tree-models have withstood the test of time, and are consistently used for moder
 Documentation
 =============
 
-See here for the documentation for our dev version: https://docs.neurodata.io/scikit-tree/dev/index.html
+See here for the documentation for our dev version: <https://docs.neurodata.io/scikit-tree/dev/index.html>
 
 Why oblique trees and why trees beyond those in scikit-learn?
 =============================================================
+
 In 2001, Leo Breiman proposed two types of Random Forests. One was known as ``Forest-RI``, which is the axis-aligned traditional random forest. The other was known as ``Forest-RC``, which is the random oblique linear combinations random forest. This leveraged random combinations of features to perform splits. [MORF](1) builds upon ``Forest-RC`` by proposing additional functions to combine features. Other modern tree variants such as Canonical Correlation Forests (CCF), Extended Isolation Forests, Quantile Forests, or unsupervised random forests are also important for solving real-world problems using robust decision tree models.
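
To make the difference between an axis-aligned and an oblique split concrete, here is a minimal pure-Python sketch. It is not how scikit-tree implements its splitters (those live in Cython); the function names are hypothetical, and drawing weights uniformly from [-1, 1] follows our reading of the ``Forest-RC`` description and should be treated as an assumption.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best split on 1-D values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split: impurity of the whole node
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = ((lo + hi) / 2.0, impurity)
    return best


def project(X, weights):
    """Project each row onto a sparse weight vector {feature_index: weight}."""
    return [sum(w * row[j] for j, w in weights.items()) for row in X]


def random_combination(n_features, n_combined, rng):
    """Candidate oblique direction: a few features, weights in [-1, 1]."""
    chosen = rng.sample(range(n_features), n_combined)
    return {j: rng.uniform(-1.0, 1.0) for j in chosen}


# Toy data: the label is 1 exactly when x0 + x1 > 1 (a diagonal boundary).
X = [(0.1, 0.2), (0.9, 0.0), (0.0, 0.8), (0.4, 0.4),
     (0.9, 0.9), (0.2, 0.95), (0.95, 0.3), (0.6, 0.6)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

axis_aligned = best_threshold([row[0] for row in X], y)    # split on x0 only
oblique = best_threshold(project(X, {0: 1.0, 1: 1.0}), y)  # split on x0 + x1
candidate = random_combination(n_features=2, n_combined=2, rng=random.Random(0))
```

On this toy data no threshold on a single feature separates the classes, while the projection onto `x0 + x1` does. A forest would evaluate many random candidates such as `candidate` at each node and keep the one with the lowest impurity.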
 
 Installation
 ============
+
 Our installation will try to follow the scikit-learn installation as closely as possible, as we contain Cython code subclassed from, or inspired by, the scikit-learn tree submodule.
 
 Dependencies
@@ -37,18 +39,20 @@ We minimally require:
     * scipy
     * scikit-learn >= 1.3
 
-Installation with Pip (https://pypi.org/project/scikit-tree/)
+Installation with Pip (<https://pypi.org/project/scikit-tree/>)
 -------------------------------------------------------------
+
 Installing with pip on a conda environment is the recommended route.
 
     pip install scikit-tree
 
 Building locally with Meson (For developers)
 --------------------------------------------
+
 Make sure you have the necessary packages installed
 
     # install build dependencies
-    pip install numpy scipy meson ninja meson-python Cython scikit-learn scikit-learn-tree
+    pip install -r build_requirements.txt
 
     # you may need these optional dependencies to build scikit-learn locally
     conda install -c conda-forge joblib threadpoolctl pytest compilers llvm-openmp
@@ -102,11 +106,13 @@ After building locally, you can use editable installs (warning: this only regist
 
 Development
 ===========
+
 We welcome contributions for modern tree-based algorithms. We use Cython to achieve fast C/C++ speeds, while abiding by a scikit-learn compatible (tested) API. Moreover, our Cython internals are easily extensible because they follow the internal Cython API of scikit-learn as well.
 
-Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a fork of scikit-learn at https://github.com/neurodata/scikit-learn when
+Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a fork of scikit-learn at <https://github.com/neurodata/scikit-learn> when
 extending the decision tree model API of scikit-learn. Specifically, we extend the Python and Cython API of the tree submodule in scikit-learn in our submodule, so we can introduce the tree models housed in this package. These models thus extend the functionality of decision-tree based models in a way that is not yet possible in scikit-learn itself. As one example, we introduce an abstract API to allow users to implement their own oblique splits. Our plan in the future is to benchmark these functionalities and introduce them upstream to scikit-learn where applicable and where the inclusion criteria are met.
 
 References
 ==========
+
 [1]: [`Li, Adam, et al. "Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks" SIAM Journal on Mathematics of Data Science, 5(1), 77-96, 2023`](https://doi.org/10.1137/21M1449117)