Update docs for typos and wording (#1544)
stereosky authored May 14, 2024
1 parent c5f0023 commit 5c9156e
Showing 10 changed files with 23 additions and 23 deletions.
6 changes: 3 additions & 3 deletions docs/examples/batch-to-online.ipynb
@@ -67,7 +67,7 @@
"scorer = metrics.make_scorer(metrics.roc_auc_score)\n",
"scores = model_selection.cross_val_score(model, X, y, scoring=scorer, cv=cv)\n",
"\n",
"# Display the average score and it's standard deviation\n",
"# Display the average score and its standard deviation\n",
"print(f'ROC AUC: {scores.mean():.3f} (± {scores.std():.3f})')"
]
},
@@ -94,7 +94,7 @@
"source": [
"## A hands-on introduction to incremental learning\n",
"\n",
"Incremental learning is also often called *online learning* or *stream learning*, but if you [google online learning](https://www.google.com/search?q=online+learning) a lot of the results will point to educational websites. Hence, the terms \"incremental learning\" and \"stream learning\" (from which River derives it's name) are prefered. The point of incremental learning is to fit a model to a stream of data. In other words, the data isn't available in it's entirety, but rather the observations are provided one by one. As an example let's stream through the dataset used previously."
"Incremental learning is also often called *online learning* or *stream learning*, but if you [google online learning](https://www.google.com/search?q=online+learning) a lot of the results will point to educational websites. Hence, the terms \"incremental learning\" and \"stream learning\" (from which River derives its name) are preferred. The point of incremental learning is to fit a model to a stream of data. In other words, the data isn't available in its entirety, but rather the observations are provided one by one. As an example let's stream through the dataset used previously."
]
},
{
@@ -484,7 +484,7 @@
"# We compute the CV scores using the same CV scheme and the same scoring\n",
"scores = model_selection.cross_val_score(model, X, y, scoring=scorer, cv=cv)\n",
"\n",
"# Display the average score and it's standard deviation\n",
"# Display the average score and its standard deviation\n",
"print(f'ROC AUC: {scores.mean():.3f} (± {scores.std():.3f})')"
]
},
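As an aside, the predict-then-learn loop this notebook builds toward can be sketched in a few lines. The Phishing dataset and logistic regression here are illustrative assumptions, not the notebook's exact setup:

```python
from river import datasets, linear_model, metrics

# A minimal sketch of incremental learning: one sample at a time.
model = linear_model.LogisticRegression()
metric = metrics.ROCAUC()

for x, y in datasets.Phishing():
    y_pred = model.predict_proba_one(x)  # predict before the model sees y
    model.learn_one(x, y)                # then update the model in place
    metric.update(y, y_pred)             # progressive validation

print(metric)
```
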
12 changes: 6 additions & 6 deletions docs/introduction/basic-concepts.md
@@ -22,27 +22,27 @@ The challenge for machine learning is to ensure models you train offline on proa

## Online processing

-Online processing is the act of processing a data stream one element at a time. In the case of machine learning, that means training a model by teaching it one sample at a time. This is completely opposite to the traditional way of doing machine learning, which is to train a model on a whole batch data at a time.
+Online processing is the act of processing a data stream one element at a time. In the case of machine learning, that means training a model by teaching it one sample at a time. This is completely opposite to the traditional way of doing machine learning, which is to train a model on whole batches of data at a time.

An online model is therefore a stateful, dynamic object. It keeps learning and doesn't have to revisit past data. It's a different way of doing things, and therefore has its own set of pros and cons.

## Tasks

-Machine learning encompasses many different tasks: classification, regression, anomaly detection, time series forecasting, etc. The ideology behind River is to be a generic machine learning which allows to perform these tasks in a streaming manner. Indeed, many batch machine learning algorithms have online equivalents.
+Machine learning encompasses many different tasks: classification, regression, anomaly detection, time series forecasting, etc. The ideology behind River is to be a generic machine learning approach which allows these tasks to be performed in a streaming manner. Indeed, many batch machine learning algorithms have online equivalents.

Note that River also supports some more basic tasks. For instance, you might just want to calculate a running average of a data stream. These are usually smaller parts of a whole stream processing pipeline.
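A hedged sketch of such a basic task, with made-up numbers:

```python
from river import stats

# A running average over a stream, updated one value at a time.
mean = stats.Mean()
for x in [5, 10, 15]:
    mean.update(x)
    print(mean.get())  # 5.0, then 7.5, then 10.0
```
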

## Dictionaries everywhere

River is a Python library. It is composed of a bunch of classes which implement various online processing algorithms. Most of these classes are machine learning models which can process a single sample, be it for learning or for inference.

-We made the choice to use dictionaries as the basic building block. First of all, online processing is different to batch processing, in that vectorization doesn't bring any speedup. Therefore numeric processing libraries such as numpy and PyTorch actually bring too much overhead. Using native Python data structures is faster.
+We made the choice to use dictionaries as the basic building block. First of all, online processing is different to batch processing, in that vectorization doesn't bring any speed-up. Therefore numeric processing libraries such as NumPy and PyTorch actually bring too much overhead. Using native Python data structures is faster.

-Dictionaries are therefore a perfect fit. They're native to Python and have excellent support in the standard library. They allow naming each feature. They can hold any kind of data type. They allow transparent support of JSON payloads, allowing seemless integration with web apps.
+Dictionaries are therefore a perfect fit. They're native to Python and have excellent support in the standard library. They allow the naming of each feature. They can hold any kind of data type. They allow transparent support of JSON payloads, allowing seamless integration with web apps.
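A minimal sketch of this convention, with illustrative feature names and values:

```python
from river import preprocessing

# A sample is just a plain dict mapping feature names to values.
x = {"temperature": 21.5, "humidity": 0.42}

scaler = preprocessing.StandardScaler()
scaler.learn_one(x)             # update the running feature statistics
print(scaler.transform_one(x))  # scale the sample using those statistics
```
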

## Datasets

-In production, you're almost always going to face data streams which you have to react to. Such as users visiting your website. The advantage of online machine learning is that you can design models which make predictions as well as learn from this data stream as it flows.
+In production, you're almost always going to face data streams which you have to react to, such as users visiting your website. The advantage of online machine learning is that you can design models that make predictions as well as learn from this data stream as it flows.

But of course, when you're developping a model, you don't usually have access to a real-time feed on which to evaluate your model. You usually have an offline dataset which you want to evaluate your model on. River provides some datasets which can be read in online manner, one sample at a time. It is however crucial to keep in mind that the goal is to reproduce a production scenario as closely as possible, in order to ensure your model will perform just as well in production.

@@ -58,4 +58,4 @@ This is what makes online machine learning powerful. By replaying datasets in th

The main reason why an offline model might not perform as expected in production is because of concept drift. But this is true for all machine learning models, be they offline or online.

-The advantage of online models over offline models is that they can cope with drift. Indeed, because they can keep learning, they usually adapt to concept drift in a seemless manner. As opposed to batch models which have to be retrained from scratch.
+The advantage of online models over offline models is that they can cope with drift. Indeed, because they can keep learning, they usually adapt to concept drift in a seamless manner. As opposed to batch models which have to be retrained from scratch.
@@ -20,9 +20,9 @@
"\n",
"Concept drifts might happen in the electricity demand across the year, in the stock market, in buying preferences, and in the likelihood of a new movie's success, among others.\n",
"\n",
"Let us consider the movie example: two movies made at different epochs can have similar features such as famous actors/directors, storyline, production budget, marketing campaigns, etc., yet it is not certain that both will be similarly successful. What the target audience *considers* is worth watching (and their money) is constantly changing, and production companies must adapt accordingly to avoid \"box office flops\".\n",
"Let us consider the movie example: two movies made at different epochs can have similar features such as famous actors/directors, storyline, production budget, marketing campaigns, etc., yet it is not certain that both will be similarly successful. What the target audience *considers* is worth watching (and their money worth spending) is constantly changing, and production companies must adapt accordingly to avoid \"box office flops\".\n",
"\n",
"Prior to the pandemics, the usage of hand sanitizers and facial masks was not widespread. When the cases of COVID-19 started increasing, there was a lack of such products for the final consumer. Imagine a batch-learning model deciding how much of each product a supermarket should stock during those times. What a mess!\n",
"Prior to the pandemic, the usage of hand sanitizers and facial masks was not widespread. When the cases of COVID-19 started increasing, there was a lack of such products for the end consumer. Imagine a batch-learning model deciding how much of each product a supermarket should stock during those times. What a mess!\n",
"\n",
"## Impact of drift on learning\n",
"\n",
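To make the drift discussion concrete, here is a hedged sketch using one of River's drift detectors. ADWIN is an assumption on my part (the notebook may use a different detector), and the `drift_detected` attribute reflects recent River versions:

```python
from river import drift

detector = drift.ADWIN()
stream = [1.0] * 500 + [5.0] * 500  # abrupt shift halfway through

for i, x in enumerate(stream):
    detector.update(x)
    if detector.drift_detected:
        print(f"Drift detected around index {i}")
```
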
2 changes: 1 addition & 1 deletion docs/introduction/why-use-river.md
@@ -10,7 +10,7 @@ In the streaming setting, data can evolve. Adaptive methods are specifically des

## General purpose

-River supports different machine learning tasks, including regression, classification, and unsupervised learning. It can also be used for adhoc tasks, such as computing online metrics, as well as concept drift detection.
+River supports different machine learning tasks, including regression, classification, and unsupervised learning. It can also be used for ad hoc tasks, such as computing online metrics, as well as concept drift detection.

## User experience

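A small sketch of such an ad hoc task, maintaining a metric online with made-up values:

```python
from river import metrics

# Update an accuracy metric incrementally as predictions arrive.
metric = metrics.Accuracy()
for y_true, y_pred in [(True, True), (True, False), (False, False)]:
    metric.update(y_true, y_pred)

print(metric)  # Accuracy: 66.67%
```
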
2 changes: 1 addition & 1 deletion docs/releases/0.4.1.md
@@ -19,7 +19,7 @@

## ensemble

-- Removed `ensemble.HedgeBinaryClassifier` because it's performance was subpar.
+- Removed `ensemble.HedgeBinaryClassifier` because its performance was subpar.
- Removed `ensemble.GroupRegressor`, as this should be a special case of `ensemble.StackingRegressor`.

## feature_extraction
2 changes: 1 addition & 1 deletion river/compose/pipeline.py
@@ -122,7 +122,7 @@ class Pipeline(base.Estimator):
"""A pipeline of estimators.
Pipelines allow you to chain different steps into a sequence. Typically, when doing supervised
-learning, a pipeline contains one ore more transformation steps, whilst it's is a regressor or
+learning, a pipeline contains one or more transformation steps, whilst it's a regressor or
a classifier. It is highly recommended to use pipelines with River. Indeed, in an online
learning setting, it is very practical to have a model defined as a single object. Take a look
at the [user guide](/recipes/pipelines) for further information and
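A hedged usage sketch of the pattern this docstring describes, with one transformation step followed by a final estimator; the specific steps are illustrative assumptions:

```python
from river import compose, linear_model, preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LinearRegression(),
)
# Equivalent shorthand using the | operator:
# model = preprocessing.StandardScaler() | linear_model.LinearRegression()

x = {"x1": 1.0, "x2": 2.0}
model.learn_one(x, 3.0)      # the scaler transforms x, then the regressor learns
print(model.predict_one(x))
```
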
4 changes: 2 additions & 2 deletions river/imblearn/hard_sampling.py
@@ -78,7 +78,7 @@ class HardSamplingRegressor(HardSampling, base.Regressor):
The hardness of an observation is evaluated with a loss function that compares the sample's
ground truth with the wrapped model's prediction. If the buffer is not full, then the sample
is added to the buffer. If the buffer is full and the new sample has a bigger loss than the
-lowest loss in the buffer, then the sample takes it's place.
+lowest loss in the buffer, then the sample takes its place.
Parameters
----------
@@ -159,7 +159,7 @@ class HardSamplingClassifier(HardSampling, base.Classifier):
The hardness of an observation is evaluated with a loss function that compares the sample's
ground truth with the wrapped model's prediction. If the buffer is not full, then the sample
is added to the buffer. If the buffer is full and the new sample has a bigger loss than the
-lowest loss in the buffer, then the sample takes it's place.
+lowest loss in the buffer, then the sample takes its place.
Parameters
----------
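A hedged usage sketch of the wrapper this docstring describes; the parameter values are illustrative assumptions:

```python
from river import imblearn, linear_model

model = imblearn.HardSamplingClassifier(
    classifier=linear_model.LogisticRegression(),
    size=40,  # capacity of the buffer of hard samples
    p=0.1,    # probability of replaying a buffered sample instead of the new one
    seed=42,
)
model.learn_one({"x": 1.0}, True)
```
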
2 changes: 1 addition & 1 deletion river/neighbors/knn_classifier.py
@@ -24,7 +24,7 @@ class KNNClassifier(base.Classifier):
documentation of each available search engine for more details on its usage.
By default, use the `SWINN` search engine for approximate search queries.
weighted
-Weight the contribution of each neighbor by it's inverse distance.
+Weight the contribution of each neighbor by its inverse distance.
cleanup_every
This determines at which rate old classes are cleaned up. Classes that
have been seen in the past but that are not present in the current
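A minimal sketch of the `weighted` option in use, with made-up features and labels:

```python
from river import neighbors

# With weighted=True, each neighbour's vote is scaled by its inverse distance.
model = neighbors.KNNClassifier(n_neighbors=3, weighted=True)
model.learn_one({"x": 1.0, "y": 2.0}, "a")
model.learn_one({"x": 3.0, "y": 4.0}, "b")
print(model.predict_one({"x": 1.1, "y": 1.9}))  # the closer class, likely "a"
```
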
8 changes: 4 additions & 4 deletions river/optim/losses.py
@@ -67,7 +67,7 @@ class Absolute(RegressionLoss):
$$L = |p_i - y_i|$$
-It's gradient w.r.t. to $p_i$ is
+Its gradient w.r.t. to $p_i$ is
$$\\frac{\\partial L}{\\partial p_i} = sgn(p_i - y_i)$$
@@ -203,7 +203,7 @@ class Hinge(BinaryLoss):
$$L = max(0, 1 - p_i * y_i)$$
-It's gradient w.r.t. to $p_i$ is
+Its gradient w.r.t. to $p_i$ is
$$
\\frac{\\partial L}{\\partial y_i} = \\left\{
@@ -404,7 +404,7 @@ class Squared(RegressionLoss):
$$L = (p_i - y_i) ^ 2$$
-It's gradient w.r.t. to $p_i$ is
+Its gradient w.r.t. to $p_i$ is
$$\\frac{\\partial L}{\\partial p_i} = 2 (p_i - y_i)$$
@@ -539,7 +539,7 @@ class Poisson(RegressionLoss):
$$L = exp(p_i) - y_i \\times p_i$$
-It's gradient w.r.t. to $p_i$ is
+Its gradient w.r.t. to $p_i$ is
$$\\frac{\\partial L}{\\partial p_i} = exp(p_i) - y_i$$
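A quick numeric check of the squared-loss formulas above, assuming the usual call/gradient interface of `optim.losses`:

```python
from river import optim

loss = optim.losses.Squared()
print(loss(y_true=1.0, y_pred=3.0))           # (3 - 1) ** 2 = 4
print(loss.gradient(y_true=1.0, y_pred=3.0))  # 2 * (3 - 1) = 4
```
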
4 changes: 2 additions & 2 deletions river/stats/link.py
@@ -37,7 +37,7 @@ class Link(stats.base.Univariate):
>>> stat.update(1)
The output from `get` will still be 0. The reason is that `stats.Shift` has not enough
-values, and therefore outputs it's default value, which is `None`. The `stats.Mean`
+values, and therefore outputs its default value, which is `None`. The `stats.Mean`
instance is therefore not updated.
>>> stat.get()
@@ -57,7 +57,7 @@
>>> stat.get()
2.0
-Note that composing statistics returns a new statistic with it's own name.
+Note that composing statistics returns a new statistic with its own name.
>>> stat.name
'mean_of_shift_1'
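A hedged reconstruction of the composition this docstring walks through:

```python
from river import stats

# Pipe a shifted view of the stream into a running mean.
stat = stats.Shift(1) | stats.Mean()
for x in [1, 2, 3]:
    stat.update(x)

print(stat.get())   # mean of the shifted values 1 and 2, i.e. 1.5
print(stat.name)    # 'mean_of_shift_1'
```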
