AMFRegressor (#1166)
* AMF Classifier & Mondrian Tree Classifier implementation

* [Pull request Update]
- Adding a "mondrian" folder in the "tree" folder for better file structure
- Using "random.choices" instead of the "sample_discrete" function in "utils.py", and removing "sample_discrete" from "utils.py"

* [Pull Request]
- Removing the "__repr__" method of AMF
- Removing the @setter and @getter
- Removing the "loss" parameter of the classifiers since only the "log-loss" is being used in the end

* Updating docstring

* [Pull request]
- Making `learn_one` and `predict_proba_one` accept all kinds of supported labels for `y` as input
- `predict_proba_one` outputs a dictionary of scores with matching labels

* [Fix] Readability

Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>

* [Fix] Language

Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>

* [Fix] Language

Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>

* [Fix] math package implementation usage

Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>

* [Pull request]
- Leaving `__all__` in alphabetical order for the classifiers
- Removing type parameters in the description of `log_2_sum` of math utils
- Replacing Java-like getters and setters with Python-like properties and setters

* - Adding support for a random state (seed)
- Replacing overflow to infinity with the largest possible float (so computations remain possible)

* [Ignoring testing environment]

* Fixing style & typos

Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>

* [Pull request]
- Fixing import order in __init__ file of ensemble
- Using LaTeX formulation in AMFClassifier description
- Making all node-related methods private (they shouldn't be used outside)
- Docstring syntax update and fixes
- Importing river.base instead of typing module for better readability
- Adding a short description to the MondrianTreeClassifier
- Renaming MondrianTreeLeaf into MondrianLeaf
- Reordering functions in MondrianTreeClassifier for better readability

* Pre-commit clean up

* Pre-commit clean up

* [MyPy issue]
- Trying to fix the left/right upcast issue (it shouldn't normally be a problem, but mypy keeps complaining)
- Fixing assignment issue to the parent during upward procedure
- Fixing type assignment to the root branch of the tree
- Fixing arg-type for list of intensities
- Fixing arg-type issue with current samples proceeding
- Fixing dirichlet arg-type issue
- Fixing some typing issues
- Removing call-overload as int in the memories features range list
- Correcting output of predict function

* Fixing MyPy issues (detyping)

* suggestions and style issues fix

* adding necessary files, classes and methods for the regressor

* minor import modifications

* minor `list` to `typing.List` and `dict` to `typing.Dict` modifications

* minor modifs to pass tests

* minor changes

* changing names

* Fixing predict function to support the "model not trained" situation instead of raising an exception

* more style suggestions

* testing

* regressor fix

* fixing docstring

* [Pull request Update]
- Fixing some TODOs from Mastelini suggestions
- Factorizing a bit of code from nodes that should be shared with regressor
- Removing branch structure as of now for future changes

* Removing all "array-like" structure for full dict support

* Pre-commit hookups fixes

* regressor fix

* Delete tests.py

* [Pull request]
- Adding suggestions from Mastelini on keys usage
- Removing useless initialization of scores in the MondrianTreeClassifier

* bug fix

* fix conflicts

* refactored, but has bugs

* remove mypy skip

* tests

* tests

* cleanup

* better, but not fixed

* minor fix

* [Fixes]
- Fixing scoring bug (no propagation of counts)
- Removing unused parameters in docs
- Replacing Python 3.10 type unions with 3.9-compatible annotations
- Adding little description for MondrianBranch

* Pre-commit hookups fixes

* fix some tests

* Reworking intensities

* fix remaining tests and remove duplicated method call

* [Pull request]
- Adding examples for AMF & Mondrian Tree Classifiers
- Reordering __init__ in alphabetical order
- Cleaning the comments
- Adding string representation for nodes

* Hiding MondrianTree from user visibility

* Fixing import on Mondrian Tree example

Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>

* tests

* merge fix

* merge fix

* docstring fixes

---------

Co-authored-by: AlexandreChaussard <alexandre.chaussard@telecom-sudparis.eu>
Co-authored-by: Alexandre Chaussard <78101027+AlexandreChaussard@users.noreply.github.com>
Co-authored-by: Saulo Martiello Mastelini <mastelini@usp.br>
Co-authored-by: Kenza Ben jelloun <kenza.ben_jelloun@telecom-sudparis.eu>
Co-authored-by: Saulo Martiello Mastelini <saulomastelini@gmail.com>
6 people committed Jul 6, 2023
1 parent 10a2028 commit 0386737
Showing 5 changed files with 697 additions and 5 deletions.
113 changes: 112 additions & 1 deletion river/forest/aggregated_mondrian_forest.py
@@ -4,7 +4,7 @@
import random

from river import base
from river.tree.mondrian import MondrianTreeClassifier
from river.tree.mondrian import MondrianTreeClassifier, MondrianTreeRegressor


class AMFLearner(base.Ensemble, abc.ABC):
@@ -217,3 +217,114 @@ def predict_proba_one(self, x):
@property
def _multiclass(self):
return True


class AMFRegressor(AMFLearner, base.Regressor):
"""Aggregated Mondrian Forest regressor for online learning.

This algorithm is truly online, in the sense that a single pass is performed, and that
predictions can be produced anytime.

Each node in a tree predicts according to the average of the labels it contains. The
prediction for a sample is computed as the aggregated predictions of all the subtrees
along the path leading to the leaf node containing the sample. The aggregation weights
are exponential weights with learning rate ``step``, based on the least-squares loss,
when ``use_aggregation`` is ``True``.
This computation is performed exactly thanks to a context tree weighting algorithm.
More details can be found in the paper cited in the references below.

The final prediction is the average of the predictions of each of the
``n_estimators`` trees in the forest.

Parameters
----------
n_estimators
The number of trees in the forest.
step
Step-size for the aggregation weights.
use_aggregation
Controls if aggregation is used in the trees. It is highly recommended to
leave it as `True`.
split_pure
Controls whether nodes that contain only samples with the same target value should be
split ("pure" nodes). Default is `False`, meaning pure nodes are not split, but
`True` can sometimes be better.
seed
Random seed for reproducibility.

Note
----
All the parameters of ``AMFRegressor`` become **read-only** after the first call
to ``learn_one``.

References
----------
[^1]: J. Mourtada, S. Gaïffas and E. Scornet, *AMF: Aggregated Mondrian Forests for Online Learning*, arXiv:1906.10529, 2019
"""

def __init__(
self,
n_estimators: int = 10,
step: float = 1.0,
use_aggregation: bool = True,
split_pure: bool = False,
seed: int | None = None,
):

super().__init__(
n_estimators=n_estimators,
step=step,
loss="least-squares",
use_aggregation=use_aggregation,
split_pure=split_pure,
seed=seed,
)

self.iteration = 0

def _initialize_trees(self):
"""Initialize the forest."""

self.data: list[MondrianTreeRegressor] = []
for _ in range(self.n_estimators):
# Each tree must have its own stochastic scheme, otherwise the randomness would break
# Hence a new seed is derived for each tree from the given seed, deterministically
seed = self._rng.randint(0, 9999999)

tree = MondrianTreeRegressor(
self.step,
self.use_aggregation,
self.split_pure,
self.iteration,
seed,
)
self.data.append(tree)
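
The seed-derivation idea above can be sketched on its own: one master RNG deterministically produces a seed per tree, so the forest stays reproducible while each tree gets distinct randomness. This is a minimal illustrative sketch, not the library's API (`derive_seeds` is a hypothetical helper name):

```python
import random


def derive_seeds(master_seed, n_estimators):
    """One deterministic seed per tree, all derived from a single master seed."""
    rng = random.Random(master_seed)
    # Same range as in _initialize_trees above
    return [rng.randint(0, 9999999) for _ in range(n_estimators)]


# Same master seed -> same per-tree seeds -> a reproducible forest
assert derive_seeds(42, 5) == derive_seeds(42, 5)
print(derive_seeds(42, 5))
```

Seeding each tree directly with the master seed instead would make every tree identical, defeating the purpose of the ensemble.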

def learn_one(self, x, y):
# Checking if the forest has been created
if not self.is_trained():
self._initialize_trees()

# we fit all the trees using the new sample
for tree in self:
tree.learn_one(x, y)

self.iteration += 1

return self

def predict_one(self, x):

# Checking that the model has been trained once at least
if not self.is_trained():
return None

prediction = 0
for tree in self:
tree.use_aggregation = self.use_aggregation
prediction += tree.predict_one(x)
prediction = prediction / self.n_estimators

return prediction
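
The forest-level prediction in `predict_one` above is a plain average of the per-tree predictions. Here is a minimal self-contained sketch of that aggregation step, with a hypothetical `StubTree` standing in for `MondrianTreeRegressor` (names are illustrative, not the library's API):

```python
class StubTree:
    """Illustrative stand-in for a Mondrian tree regressor."""

    def __init__(self, offset):
        self.offset = offset

    def predict_one(self, x):
        # A real Mondrian tree would aggregate node predictions along the
        # path to the leaf containing x; here we fake a simple prediction.
        return x["value"] + self.offset


def forest_predict_one(trees, x):
    """Average the predictions of every tree in the forest."""
    prediction = 0.0
    for tree in trees:
        prediction += tree.predict_one(x)
    return prediction / len(trees)


trees = [StubTree(offset) for offset in (-1.0, 0.0, 1.0)]
print(forest_predict_one(trees, {"value": 5.0}))  # offsets cancel out: 5.0
```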
5 changes: 3 additions & 2 deletions river/tree/mondrian/__init__.py
@@ -3,12 +3,13 @@
implementations for the Mondrian trees.
Note that this module is not exposed in the tree module, and is instead used by the
AMFClassifier class in the ensemble module.
AMFClassifier and AMFRegressor classes in the ensemble module.
"""
from __future__ import annotations

from .mondrian_tree import MondrianTree
from .mondrian_tree_classifier import MondrianTreeClassifier
from .mondrian_tree_regressor import MondrianTreeRegressor

__all__ = ["MondrianTree", "MondrianTreeClassifier"]
__all__ = ["MondrianTree", "MondrianTreeClassifier", "MondrianTreeRegressor"]
1 change: 1 addition & 0 deletions river/tree/mondrian/mondrian_tree_classifier.py
@@ -464,6 +464,7 @@ def predict_proba_one(self, x):

# If the tree hasn't seen any sample, then it should return
# the default empty dict

if not self._is_initialized:
return {}

151 changes: 149 additions & 2 deletions river/tree/mondrian/mondrian_tree_nodes.py
Expand Up @@ -288,9 +288,9 @@ def update_downwards(
Parameters
----------
x
Sample to proceed (as a list).
Sample to proceed.
y
Class of the sample x_t.
Class of the sample x.
dirichlet
Dirichlet parameter of the tree.
use_aggregation
@@ -366,3 +366,150 @@ class MondrianBranchClassifier(MondrianNodeClassifier, MondrianBranch):

def __init__(self, parent, time, depth, feature, threshold, *children):
super().__init__(parent, time, depth, feature, threshold, *children)


class MondrianNodeRegressor(MondrianNode):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

self.n_samples = 0
self.mean = 0.0

def replant(self, leaf: MondrianNodeRegressor, copy_all: bool = False):
"""Transfer information from a leaf to a new branch."""
self.weight = leaf.weight # type: ignore
self.log_weight_tree = leaf.log_weight_tree # type: ignore
self.mean = leaf.mean

if copy_all:
self.memory_range_min = leaf.memory_range_min
self.memory_range_max = leaf.memory_range_max
self.n_samples = leaf.n_samples

def predict(self) -> base.typing.RegTarget:
"""Return the prediction of the node."""
return self.mean

def loss(self, sample_value: base.typing.RegTarget) -> float:
"""Compute the squared loss of the node's prediction for a given target value.

Parameters
----------
sample_value
The observed target value.
"""

r = self.predict() - sample_value
return r * r / 2

def update_weight(
self,
sample_value: base.typing.RegTarget,
use_aggregation: bool,
step: float,
) -> float:
"""Update the weight of the node given a target value and the method used.

Parameters
----------
sample_value
Target value of a given sample.
use_aggregation
Whether to use aggregation or not during computation (given by the tree).
step
Step parameter of the tree.
"""

loss_t = self.loss(sample_value)
if use_aggregation:
self.weight -= step * loss_t
return loss_t
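
The weight handled by `update_weight` lives in the log domain: subtracting `step * loss` from the log-weight is the classical exponential-weights update w ← w · exp(−step · loss). A minimal self-contained sketch of that update, with illustrative names (not the library's API):

```python
import math


def update_log_weight(log_weight, prediction, target, step):
    """Exponential-weights update carried out in the log domain."""
    r = prediction - target
    loss = r * r / 2  # least-squares loss, as in MondrianNodeRegressor.loss
    return log_weight - step * loss, loss


log_w = 0.0  # effective weight exp(0) = 1
log_w, loss = update_log_weight(log_w, prediction=2.0, target=1.0, step=1.0)
# loss = 0.5, so the effective weight becomes exp(-0.5)
print(loss, math.exp(log_w))
```

Working in the log domain avoids numerical underflow when many small weights are multiplied along a path.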

def update_downwards(
self,
x,
sample_value: base.typing.RegTarget,
use_aggregation: bool,
step: float,
do_update_weight: bool,
):
"""Update the node when running a downward procedure updating the tree.

Parameters
----------
x
Sample to process.
sample_value
Target value of the sample x.
use_aggregation
Whether to use aggregation or not.
step
Step parameter of the tree.
do_update_weight
Whether to update the weight of the node as well.
"""

# Updating the range of the feature values known by the node
# If it is the first sample, we copy the features vector into the min and max range
if self.n_samples == 0:
for feature in x:
x_f = x[feature]
self.memory_range_min[feature] = x_f
self.memory_range_max[feature] = x_f
# Otherwise, we update the range
else:
for feature in x:
x_f = x[feature]
if x_f < self.memory_range_min[feature]:
self.memory_range_min[feature] = x_f
if x_f > self.memory_range_max[feature]:
self.memory_range_max[feature] = x_f

# One more sample in the node
self.n_samples += 1

if do_update_weight:
self.update_weight(sample_value, use_aggregation, step)

# Update the running mean of the targets in the node; n_samples already
# includes the new sample, so this matches the batch average exactly
self.mean += (sample_value - self.mean) / self.n_samples
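
The node's mean is maintained incrementally, one sample at a time. The standard online update `mean += (value - mean) / n` (with `n` counting the new sample) reproduces the batch average without storing past targets, which a quick self-contained check confirms:

```python
def running_mean(values):
    """Incrementally maintained mean, one sample at a time."""
    mean, n = 0.0, 0
    for v in values:
        n += 1
        mean += (v - mean) / n  # standard online mean update
    return mean


data = [3.0, 1.0, 4.0, 1.0, 5.0]
# Agrees with the batch mean up to floating-point rounding
assert abs(running_mean(data) - sum(data) / len(data)) < 1e-12
print(running_mean(data))
```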


class MondrianLeafRegressor(MondrianNodeRegressor, MondrianLeaf):
"""Mondrian Tree Regressor leaf node.

Parameters
----------
parent
Parent node.
time
Split time of the node.
depth
The depth of the leaf.
"""

def __init__(self, parent, time, depth):
super().__init__(parent, time, depth)


class MondrianBranchRegressor(MondrianNodeRegressor, MondrianBranch):
"""Mondrian Tree Regressor branch node.

Parameters
----------
parent
Parent node of the branch.
time
Split time characterizing the branch.
depth
Depth of the branch in the tree.
feature
Feature used by the branch's split.
threshold
Split threshold of the branch.
*children
Child nodes of the branch.
"""

def __init__(self, parent, time, depth, feature, threshold, *children):
super().__init__(parent, time, depth, feature, threshold, *children)