Fix bug in AMFClassifier and finish AMFRegressor (#1166) #1281

Merged · 15 commits · Jul 11, 2023
8 changes: 7 additions & 1 deletion docs/releases/unreleased.md
@@ -1,6 +1,6 @@
# Unreleased

-Calling `learn_one` in a pipeline will now update each part of the pipeline in turn. Before, the unsupervised parts of the pipeline were updated during `predict_one`. This is more intuitive for new users. The old behavior, which yields better results, can be restored by wrapping calls in the new `compose.pure_inference_mode` context manager.
+Calling `learn_one` in a pipeline will now update each part of the pipeline in turn. Before, the unsupervised parts of the pipeline were updated during `predict_one`. This is more intuitive for new users. The old behavior, which yields better results, can be restored by wrapping calls in the new `compose.learn_during_predict` context manager.
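
As a reading aid, here is a minimal sketch of how the new context manager is used; the pipeline and data are made up for illustration, only the context manager name comes from the changelog line above:

```python
from river import compose, linear_model, preprocessing

# A pipeline with an unsupervised part (the scaler) and a supervised part.
model = preprocessing.StandardScaler() | linear_model.LinearRegression()

x, y = {"x": 1.0}, 2.5

# New default behaviour: every part of the pipeline updates during learn_one.
model.learn_one(x, y)

# Old behaviour: the unsupervised parts also update during predict_one.
with compose.learn_during_predict():
    model.predict_one(x)
```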

## compose

@@ -15,7 +15,13 @@ Calling `learn_one` in a pipeline will now update each part of the pipeline in t
## forest

- Fixed issue with `forest.ARFClassifier` which couldn't be passed a `CrossEntropy` metric.
- Fixed a bug in `forest.AMFClassifier` which slightly improves predictive accuracy.
- Added `forest.AMFRegressor`.

## preprocessing

- Added `preprocessing.OrdinalEncoder`, to map string features to integers.
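
A hedged usage sketch for the new encoder, assuming it follows river's usual transformer API (`learn_one`/`transform_one`); the exact integers assigned are an assumption:

```python
from river import preprocessing

encoder = preprocessing.OrdinalEncoder()

for x in [{"colour": "blue"}, {"colour": "red"}, {"colour": "blue"}]:
    encoder.learn_one(x)
    print(encoder.transform_one(x))
# Possible output: {'colour': 1}, {'colour': 2}, {'colour': 1}
```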

## utils

- Added `utils.random.exponential` to retrieve random samples following an exponential distribution.
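
Since this helper is what the Mondrian bug fix further down relies on, here is a rough pure-Python equivalent using inverse-transform sampling; the parameter name and the mean-versus-rate convention of the real `river.utils.random` function are assumptions:

```python
import math
import random

def exponential(mean: float = 1.0, rng: random.Random | None = None) -> float:
    # Inverse-CDF sampling: if U ~ Uniform(0, 1), then -mean * ln(1 - U)
    # follows an exponential distribution with the given mean.
    u = (rng or random).random()
    return -mean * math.log(1 - u)
```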
3 changes: 2 additions & 1 deletion river/forest/__init__.py
@@ -2,12 +2,13 @@
from __future__ import annotations

from .adaptive_random_forest import ARFClassifier, ARFRegressor
-from .aggregated_mondrian_forest import AMFClassifier
+from .aggregated_mondrian_forest import AMFClassifier, AMFRegressor
from .online_extra_trees import OXTRegressor

__all__ = [
"ARFClassifier",
"ARFRegressor",
"AMFClassifier",
"AMFRegressor",
"OXTRegressor",
]
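
With this re-export in place, both Mondrian forests are reachable from the public `forest` namespace; a minimal smoke test:

```python
from river import forest

clf = forest.AMFClassifier(seed=42)
reg = forest.AMFRegressor(seed=42)
print(type(clf).__name__, type(reg).__name__)
```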
122 changes: 118 additions & 4 deletions river/forest/aggregated_mondrian_forest.py
@@ -4,7 +4,7 @@
import random

from river import base
-from river.tree.mondrian import MondrianTreeClassifier
+from river.tree.mondrian import MondrianTreeClassifier, MondrianTreeRegressor


class AMFLearner(base.Ensemble, abc.ABC):
@@ -71,7 +71,7 @@ def _min_number_of_models(self):
class AMFClassifier(AMFLearner, base.Classifier):
"""Aggregated Mondrian Forest classifier for online learning.

-This implementation is truly online, in the sense that a single pass is performed, and that
+This implementation is truly online[^1], in the sense that a single pass is performed, and that
predictions can be produced anytime.

Each node in a tree predicts according to the distribution of the labels
@@ -139,11 +139,12 @@ class AMFClassifier(AMFLearner, base.Classifier):
>>> metric = metrics.Accuracy()

>>> evaluate.progressive_val_score(dataset, model, metric)
-Accuracy: 84.97%
+Accuracy: 85.37%

References
----------
-J. Mourtada, S. Gaiffas and E. Scornet, *AMF: Aggregated Mondrian Forests for Online Learning*, arXiv:1906.10529, 2019.
+[^1]: Mourtada, J., Gaïffas, S., & Scornet, E. (2021). AMF: Aggregated Mondrian forests for online
+learning. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(3), 505-533.

"""

@@ -217,3 +218,116 @@ def predict_proba_one(self, x):
@property
def _multiclass(self):
return True


class AMFRegressor(AMFLearner, base.Regressor):
"""Aggregated Mondrian Forest regressor for online learning.

This algorithm is truly online, in the sense that a single pass is performed, and that
predictions can be produced anytime.

Each node in a tree predicts according to the average of the labels it contains.
The prediction for a sample is computed as the aggregated predictions of all the subtrees
along the path leading to the leaf node containing the sample. The aggregation weights are
exponential weights with learning rate `step` using a squared loss when `use_aggregation`
is `True`.

This computation is performed exactly thanks to a context tree weighting algorithm.
More details can be found in the original paper[^1].

The final predictions are the average of the predictions of each of the
``n_estimators`` trees in the forest.

Parameters
----------
n_estimators
The number of trees in the forest.
step
Step-size for the aggregation weights.
use_aggregation
Controls if aggregation is used in the trees. It is highly recommended to
leave it as `True`.
seed
Random seed for reproducibility.

Examples
--------

>>> from river import datasets
>>> from river import evaluate
>>> from river import forest
>>> from river import metrics

>>> dataset = datasets.TrumpApproval()
>>> model = forest.AMFRegressor(seed=42)
>>> metric = metrics.MAE()

>>> evaluate.progressive_val_score(dataset, model, metric)
MAE: 0.268533

References
----------
[^1]: Mourtada, J., Gaïffas, S., & Scornet, E. (2021). AMF: Aggregated Mondrian forests for online
learning. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(3), 505-533.

"""

def __init__(
self,
n_estimators: int = 10,
step: float = 1.0,
use_aggregation: bool = True,
seed: int | None = None,
):
super().__init__(
n_estimators=n_estimators,
step=step,
loss="least-squares",
use_aggregation=use_aggregation,
seed=seed,
)

self.iteration = 0

def _initialize_trees(self):
"""Initialize the forest."""

self.data: list[MondrianTreeRegressor] = []
for _ in range(self.n_estimators):
# We don't want the same stochastic scheme for each tree, as that would break the randomness.
# Hence we introduce a new seed for each tree, derived from the given seed by a deterministic process.
seed = self._rng.randint(0, 9999999)

tree = MondrianTreeRegressor(
self.step,
self.use_aggregation,
self.iteration,
seed,
)
self.data.append(tree)

def learn_one(self, x, y):
# Checking if the forest has been created
if not self._is_initialized:
self._initialize_trees()

# we fit all the trees using the new sample
for tree in self:
tree.learn_one(x, y)

self.iteration += 1

return self

def predict_one(self, x):
# Check that the model has been trained at least once
if not self._is_initialized:
return None

prediction = 0
for tree in self:
tree.use_aggregation = self.use_aggregation
prediction += tree.predict_one(x)
prediction = prediction / self.n_estimators

return prediction
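
For readers who want the aggregation rule that both docstrings allude to written out: with $\eta$ the `step` parameter and $L_t$ the cumulative loss (log loss for the classifier, squared loss for the regressor) of subtree $t$ along the path to the leaf containing $x$, the aggregated prediction is the exponentially weighted average

$$\hat{y}(x) = \frac{\sum_t e^{-\eta L_t} \, \hat{y}_t(x)}{\sum_t e^{-\eta L_t}},$$

which context tree weighting evaluates exactly in time linear in the path depth. This is a paraphrase of Mourtada et al. (2021), not text from the diff.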
5 changes: 3 additions & 2 deletions river/tree/mondrian/__init__.py
@@ -3,12 +3,13 @@
implementations for the Mondrian trees.

Note that this module is not exposed in the tree module, and is instead used by the
-AMFClassifier class in the ensemble module.
+AMFClassifier and AMFRegressor classes in the forest module.

"""
from __future__ import annotations

from .mondrian_tree import MondrianTree
from .mondrian_tree_classifier import MondrianTreeClassifier
from .mondrian_tree_regressor import MondrianTreeRegressor

-__all__ = ["MondrianTree", "MondrianTreeClassifier"]
+__all__ = ["MondrianTree", "MondrianTreeClassifier", "MondrianTreeRegressor"]
9 changes: 3 additions & 6 deletions river/tree/mondrian/mondrian_tree.py
@@ -16,32 +16,29 @@ class MondrianTree(abc.ABC):
step
Step parameter of the tree.
loss
-Loss to minimize for each node of the tree
-Pick between: "log", ...
+Loss to minimize for each node of the tree. At the moment it is a placeholder.
+In the future, different optimization metrics might become available.
use_aggregation
Whether or not the tree should use aggregation.
split_pure
Whether or not the tree should split pure leaves when training.
iteration
Number of training iterations already performed (used to initialize the tree's internal counter).
seed
Random seed for reproducibility.

"""

def __init__(
self,
step: float = 0.1,
loss: str = "log",
use_aggregation: bool = True,
split_pure: bool = False,
iteration: int = 0,
seed: int | None = None,
):
# Properties common to all the Mondrian Trees
self.step = step
self.loss = loss
self.use_aggregation = use_aggregation
self.split_pure = split_pure
self.iteration = iteration

# Controls the randomness in the tree
23 changes: 15 additions & 8 deletions river/tree/mondrian/mondrian_tree_classifier.py
@@ -1,7 +1,6 @@
from __future__ import annotations

import math
-import sys

from river import base, utils
from river.tree.mondrian.mondrian_tree import MondrianTree
@@ -54,7 +53,7 @@ class MondrianTreeClassifier(MondrianTree, base.Classifier):
>>> metric = metrics.Accuracy()

>>> evaluate.progressive_val_score(dataset, model, metric)
Accuracy: 57.52%
Accuracy: 58.52%

References
----------
@@ -76,11 +75,12 @@ def __init__(
step=step,
loss="log",
use_aggregation=use_aggregation,
+split_pure=split_pure,
iteration=iteration,
seed=seed,
)

self.dirichlet = dirichlet
-self.split_pure = split_pure

# Training attributes
# The previously observed classes set
@@ -107,6 +107,7 @@ def _score(self, node: MondrianNodeClassifier) -> float:
----------
node
Node to evaluate the score.

"""

return node.score(self._y, self.dirichlet, len(self._classes))
@@ -118,6 +119,7 @@ def _predict(self, node: MondrianNodeClassifier) -> dict[base.typing.ClfTarget,
----------
node
Node to make predictions.

"""

return node.predict(self.dirichlet, self._classes, len(self._classes))
@@ -129,6 +131,7 @@ def _loss(self, node: MondrianNodeClassifier) -> float:
----------
node
Node to evaluate the loss.

"""

return node.loss(self._y, self.dirichlet, len(self._classes))
@@ -140,6 +143,7 @@ def _update_weight(self, node: MondrianNodeClassifier) -> float:
----------
node
Node to update the weight.

"""

return node.update_weight(
@@ -154,6 +158,7 @@ def _update_count(self, node: MondrianNodeClassifier):
----------
node
Target node.

"""

node.update_count(self._y)
@@ -169,6 +174,7 @@ def _update_downwards(
Target node.
do_weight_update
Whether we should update the weights or not.

"""

return node.update_downwards(
@@ -193,6 +199,7 @@ def _compute_split_time(
----------
node
Target node.

"""

# Don't split if the node is pure: all labels are equal to the one of y_t
@@ -202,11 +209,7 @@
# If x_t extends the current range of the node
if extensions_sum > 0:
# Sample an exponential with intensity = extensions_sum
-# try catch to handle the Overflow situation in the exponential
-try:
-    T = math.exp(1 / extensions_sum)
-except OverflowError:
-    T = sys.float_info.max  # we get the largest possible output instead
+T = utils.random.exponential(1 / extensions_sum, rng=self._rng)
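# Note: the deleted branch above computed `math.exp(1 / extensions_sum)`, a
# deterministic value, rather than drawing a random sample; replacing it with a
# genuine exponential draw of intensity `extensions_sum` is the AMFClassifier
# bug fix referred to in the PR title.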

time = node.time
# Splitting time of the node (if splitting occurs)
@@ -246,6 +249,7 @@ def _split(
Feature of the node.
is_right_extension
Should we extend the tree in the right or left direction.

"""

new_depth = node.depth + 1
@@ -420,6 +424,7 @@ def _go_upwards(self, leaf: MondrianLeafClassifier):
----------
leaf
Leaf to start from when going upward.

"""

current_node = leaf
@@ -460,10 +465,12 @@ def predict_proba_one(self, x):
----------
x
Feature vector.

"""

# If the tree hasn't seen any sample, then it should return
# the default empty dict

if not self._is_initialized:
return {}
