add preprocessing + pipeline modules #234

Merged 4 commits on Jan 2, 2023
155 changes: 108 additions & 47 deletions README.md
@@ -35,7 +35,8 @@
<a href="#examples">Examples</a> ·
<a href="#acknowledgments">Acknowledgments</a> ·
<a href="#references">References</a> ·
<a href="#contributors">Contributors</a>
<a href="#contributors">Contributors</a> ·
<a href="#licensing">Licensing</a>
</sup>
</p>

@@ -47,8 +48,7 @@ Some examples of how Sequentia can be used on sequence data include:

- determining a spoken word based on its audio signal or alternative representations such as MFCCs,
- predicting motion intent for gesture control from sEMG signals,
- classifying hand-written characters according to their pen-tip trajectories,
- predicting the gene family that a DNA sequence belongs to.
- classifying hand-written characters according to their pen-tip trajectories.

## Build Status

@@ -58,27 +58,48 @@ Some examples of how Sequentia can be used on sequence data include:

## Features

### Models

The following models provided by Sequentia all support variable length sequences.

- [x] [Dynamic Time Warping + k-Nearest Neighbors](https://sequentia.readthedocs.io/en/latest/sections/classifiers/knn.html) (via [`dtaidistance`](https://github.com/wannesm/dtaidistance))
- [x] Classification
- [x] Regression
- [x] Multivariate real-valued observations
- [x] Sakoe–Chiba band global warping constraint
- [x] Dependent and independent feature warping (DTWD/DTWI)
- [x] Custom distance-weighted predictions
- [x] Multi-processed predictions
- [x] [Hidden Markov Models](https://sequentia.readthedocs.io/en/latest/sections/classifiers/gmmhmm.html) (via [`hmmlearn`](https://github.com/hmmlearn/hmmlearn))<br/><em>Parameter estimation with the Baum-Welch algorithm and prediction with the forward algorithm</em> [[1]](#references)
- [x] Classification
- [x] Multivariate real-valued observations (Gaussian mixture model emissions)
- [x] Univariate categorical observations (discrete emissions)
- [x] Linear, left-right and ergodic topologies
- [x] Multi-processed predictions

<p align="center">
<img src="https://raw.githubusercontent.com/eonu/sequentia/master/docs/_static/images/classifier.png" width="80%"/><br/>
HMM Sequence Classifier
</p>
#### [Dynamic Time Warping + k-Nearest Neighbors](https://sequentia.readthedocs.io/en/latest/sections/models/knn/index.html) (via [`dtaidistance`](https://github.com/wannesm/dtaidistance))

- [x] Classification
- [x] Regression
- [x] Multivariate real-valued observations
- [x] Sakoe–Chiba band global warping constraint
- [x] Dependent and independent feature warping (DTWD/DTWI)
- [x] Custom distance-weighted predictions
- [x] Multi-processed predictions

#### [Hidden Markov Models](https://sequentia.readthedocs.io/en/latest/sections/models/hmm/index.html) (via [`hmmlearn`](https://github.com/hmmlearn/hmmlearn))

Parameter estimation with the Baum-Welch algorithm and prediction with the forward algorithm [[1]](#references)

- [x] Classification
- [x] Multivariate real-valued observations (Gaussian mixture model emissions)
- [x] Univariate categorical observations (discrete emissions)
- [x] Linear, left-right and ergodic topologies
- [x] Multi-processed predictions

### Scikit-Learn compatibility

Sequentia aims to follow the Scikit-Learn interface for estimators and transformations,
as well as to be largely compatible with three core Scikit-Learn modules to improve the ease of model development:
[`preprocessing`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), [`model_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) and [`pipeline`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline).

Scikit-Learn has many other important modules, but full compatibility is challenging and many of its features do not apply to sequential data, so Sequentia focuses only on these relevant core modules.

Despite some deviation from the Scikit-Learn interface in order to accommodate sequences, the following features are currently compatible with Sequentia.

- [x] [`preprocessing`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- [x] [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) — via an adapted class definition
- [x] Function-based transformations (stateless)
- [x] Class-based transformations (stateful)
- [ ] [`pipeline`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline)
- [x] [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) — via an adapted class definition
- [ ] [`FeatureUnion`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion)
- [ ] [`model_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

## Installation

@@ -118,40 +139,73 @@ Documentation for the package is available on [Read The Docs](https://sequentia.

## Examples

This example demonstrates multivariate sequences classified into classes `0`/`1` using the `KNNClassifier`.
This example demonstrates multivariate sequence classification with two features and two classes, using the `KNNClassifier`.

This example also shows a typical preprocessing workflow, as well as compatibility with Scikit-Learn.

```python
import numpy as np
from sequentia.models import KNNClassifier

# Generate training sequences and labels
X = [
np.array([[1., 0., 5., 3., 7., 2., 2., 4., 9., 8., 7.],
[3., 8., 4., 0., 7., 1., 1., 3., 4., 2., 9.]]).T,
np.array([[2., 1., 4., 6., 5., 8.],
[5., 3., 9., 0., 8., 2.]]).T,
np.array([[5., 8., 0., 3., 1., 0., 2., 7., 9.],
[0., 2., 7., 1., 2., 9., 5., 8., 1.]]).T
]
y = [0, 1, 1]

# Sequentia expects a concatenated array of sequences (and their corresponding lengths)
X, lengths = np.vstack(X), [len(x) for x in X]

# Create and fit the classifier
clf = KNNClassifier(k=1).fit(X, y, lengths)

# Make a prediction for a new observation sequence
x_new = np.array([[0., 3., 2., 7., 9., 1., 1.],
[2., 5., 7., 4., 2., 0., 8.]]).T
y_new = clf.predict(x_new)
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

from sequentia.models import KNNClassifier
from sequentia.pipeline import Pipeline
from sequentia.preprocessing import IndependentFunctionTransformer, mean_filter

# Create input data
# - Sequentia expects sequences to be concatenated into a single array
# - Sequence lengths are provided separately and used to decode the sequences when needed
# - This avoids the need for complex structures such as lists of arrays with different lengths

# Sequences
X = np.array([
# Sequence 1 - Length 3
[1.2 , 7.91],
[1.34, 6.6 ],
[0.92, 8.08],
# Sequence 2 - Length 5
[2.11, 6.97],
[1.83, 7.06],
[1.54, 5.98],
[0.86, 6.37],
[1.21, 5.8 ],
# Sequence 3 - Length 2
[1.7 , 6.22],
[2.01, 5.49]
])

# Sequence lengths
lengths = np.array([3, 5, 2])

# Sequence classes
y = np.array([0, 1, 1])

# Create a transformation pipeline that feeds into a KNNClassifier
# 1. Individually denoise each sequence by applying a mean filter for each feature
# 2. Individually standardize each sequence by subtracting the mean and dividing by the standard deviation for each feature
# 3. Reduce the dimensionality of the data to a single feature by using PCA
# 4. Pass the resulting transformed data into a KNNClassifier
pipeline = Pipeline([
('denoise', IndependentFunctionTransformer(mean_filter)),
('scale', IndependentFunctionTransformer(scale)),
('pca', PCA(n_components=1)),
('knn', KNNClassifier(k=1))
])

# Fit the pipeline to the data - lengths must be provided
pipeline.fit(X, y, lengths)

# Predict classes for the sequences and calculate accuracy - lengths must be provided
y_pred = pipeline.predict(X, lengths)
acc = pipeline.score(X, y, lengths)
```
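
The concatenated layout used in the example above can be decoded back into individual sequences with NumPy alone. The following is a minimal sketch of that idea, not Sequentia's internal implementation; the placeholder data and the `boundaries` and `sequences` names are assumptions made for illustration.

```python
import numpy as np

# Placeholder data in the same layout as the example above:
# 10 concatenated rows with 2 features each (values are arbitrary)
X = np.arange(20, dtype=float).reshape(10, 2)
lengths = np.array([3, 5, 2])

# Boundaries between sequences are the cumulative sums of the lengths
# (the final cumulative sum equals len(X), so it is dropped)
boundaries = np.cumsum(lengths)[:-1]

# Split the concatenated array back into one array per sequence
sequences = np.split(X, boundaries)

for i, seq in enumerate(sequences, start=1):
    print(f"Sequence {i}: shape {seq.shape}")
```

Stacking the pieces again with `np.vstack(sequences)` recovers the original array, which is why a single array plus a `lengths` vector is enough to represent sequences of different lengths.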

## Acknowledgments

In earlier versions of the package, an approximate DTW implementation [`fastdtw`](https://github.com/slaypni/fastdtw) was used in hopes of speeding up k-NN predictions, as the authors of the original FastDTW paper [[2]](#references) claim that approximated DTW alignments can be computed in linear memory and time, compared to the O(N<sup>2</sup>) runtime complexity of the usual exact DTW implementation.

I was contacted by [Prof. Eamonn Keogh](https://www.cs.ucr.edu/~eamonn/) whose work [[3]](#references) makes the surprising revelation that FastDTW is generally slower than the exact DTW algorithm that it approximates. Upon switching from the `fastdtw` package to [`dtaidistance`](https://github.com/wannesm/dtaidistance) (a very solid implementation of exact DTW with fast pure C compiled functions), DTW k-NN prediction times were indeed reduced drastically.
I was contacted by [Prof. Eamonn Keogh](https://www.cs.ucr.edu/~eamonn/) whose work makes the surprising revelation that FastDTW is generally slower than the exact DTW algorithm that it approximates [[3]](#references). Upon switching from the `fastdtw` package to [`dtaidistance`](https://github.com/wannesm/dtaidistance) (a very solid implementation of exact DTW with fast pure C compiled functions), DTW k-NN prediction times were indeed reduced drastically.

I would like to thank Prof. Eamonn Keogh for directly reaching out to me regarding this finding.

@@ -216,9 +270,16 @@ All contributions to this repository are greatly appreciated. Contribution guide
</thead>
</table>

## Licensing

Sequentia is released under the [MIT](https://opensource.org/licenses/MIT) license.

Certain parts of the source code are heavily adapted from [Scikit-Learn](https://scikit-learn.org/).
Such files contain a copy of [their license](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING).

---

<p align="center">
<b>Sequentia</b> &copy; 2019-2023, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> License.<br/>
<b>Sequentia</b> &copy; 2019-2023, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
<em>Authored and maintained by Edwin Onuonga.</em>
</p>
1 change: 1 addition & 0 deletions docs/index.rst
@@ -21,6 +21,7 @@ Features
:titlesonly:

sections/models/index
sections/preprocessing/index
sections/datasets/index

Documentation Search and Index
1 change: 1 addition & 0 deletions docs/sections/models/hmm/classifier.rst
@@ -53,6 +53,7 @@ Methods
~sequentia.models.hmm.classifier.HMMClassifier.add_model
~sequentia.models.hmm.classifier.HMMClassifier.add_models
~sequentia.models.hmm.classifier.HMMClassifier.fit
~sequentia.models.hmm.classifier.HMMClassifier.fit_predict
~sequentia.models.hmm.classifier.HMMClassifier.predict
~sequentia.models.hmm.classifier.HMMClassifier.predict_proba
~sequentia.models.hmm.classifier.HMMClassifier.predict_scores
5 changes: 3 additions & 2 deletions docs/sections/models/knn/classifier.rst
@@ -6,10 +6,10 @@ The KNN Classifier is a classifier that uses the :math:`k`-NN algorithm with DTW
To classify a sequence :math:`O'`, the :class:`.KNNClassifier` works by:

1. | Calculating the **DTW distance** between :math:`O'` and every training sequence.

2. | Forming a **k-neighborhood** :math:`\mathcal{K}'=\left\{O^{(1)},\ldots,O^{(k)}\right\}` of the :math:`k` nearest training sequences to :math:`O'`.

3. | Calculating a **distance weighting** for each sequence in :math:`\mathcal{K}'`.
3. | Calculating a **distance weighting** for each sequence in :math:`\mathcal{K}'`.
| A uniform weighting of 1 is used by default, meaning that all sequences in :math:`\mathcal{K}'` have equal influence on the predicted class. However, custom functions such as :math:`e^{-x}` (where :math:`x` is the DTW distance) can be specified to increase classification weight on training sequences that are more similar to :math:`O'`.

4. | Calculating a **score** for each of the unique classes corresponding to the sequences in :math:`\mathcal{K}'`.
@@ -37,6 +37,7 @@ Methods
~sequentia.models.knn.classifier.KNNClassifier.compute_distance_matrix
~sequentia.models.knn.classifier.KNNClassifier.dtw
~sequentia.models.knn.classifier.KNNClassifier.fit
~sequentia.models.knn.classifier.KNNClassifier.fit_predict
~sequentia.models.knn.classifier.KNNClassifier.load
~sequentia.models.knn.classifier.KNNClassifier.plot_dtw_histogram
~sequentia.models.knn.classifier.KNNClassifier.plot_warping_path_1d
7 changes: 4 additions & 3 deletions docs/sections/models/knn/regressor.rst
@@ -6,14 +6,14 @@ The KNN Regressor is a regressor that uses the :math:`k`-NN algorithm with DTW a
To predict an output :math:`y'\in\mathbb{R}` for a sequence :math:`O'`, the :class:`.KNNRegressor` works by:

1. | Calculating the **DTW distance** between :math:`O'` and every training sequence.

2. | Forming a **k-neighborhood** :math:`\mathcal{K}'=\left\{O^{(1)},\ldots,O^{(k)}\right\}` of the :math:`k` nearest training sequences to :math:`O'`.

3. | Calculating a **distance weighting** :math:`w^{(1)},\ldots,w^{(k)}` for each sequence in :math:`\mathcal{K}'`.
3. | Calculating a **distance weighting** :math:`w^{(1)},\ldots,w^{(k)}` for each sequence in :math:`\mathcal{K}'`.
| A uniform weighting of 1 is used by default, meaning that all sequences in :math:`\mathcal{K}'` have equal influence on the predicted output :math:`y'`. However, custom functions such as :math:`e^{-x}` (where :math:`x` is the DTW distance) can be specified to increase weight on training sequences that are more similar to :math:`O'`.

4. | Calculating :math:`y'` as the **distance weighted mean of the outputs** :math:`y^{(1)},\ldots,y^{(k)}` of sequences in :math:`\mathcal{K}'`.

.. math::

y' = \frac{\sum_{i=1}^{k}w^{(i)}y^{(i)}}{\sum_{i=1}^{k}w^{(i)}}
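
The four steps above can be sketched with NumPy as follows. This is an illustrative sketch only: it assumes the DTW distances to the training sequences have already been computed, and ``weighted_knn_output`` is a hypothetical helper, not part of the :class:`.KNNRegressor` API.

```python
import numpy as np

def weighted_knn_output(distances, outputs, k, weighting=None):
    """Hypothetical helper: distance-weighted mean of the k nearest outputs."""
    distances = np.asarray(distances, dtype=float)
    outputs = np.asarray(outputs, dtype=float)

    # Step 2: indices of the k nearest training sequences
    nearest = np.argsort(distances)[:k]

    # Step 3: uniform weights of 1 by default, otherwise a function of the distance
    if weighting is None:
        weights = np.ones(len(nearest))
    else:
        weights = weighting(distances[nearest])

    # Step 4: weighted mean of the neighbors' outputs
    return np.sum(weights * outputs[nearest]) / np.sum(weights)

# Precomputed DTW distances (step 1) and training outputs, made up for illustration
distances = [0.5, 2.0, 1.0, 4.0]
outputs = [10.0, 20.0, 30.0, 40.0]

uniform = weighted_knn_output(distances, outputs, k=3)
weighted = weighted_knn_output(distances, outputs, k=3, weighting=lambda x: np.exp(-x))
```

With uniform weights the result is the plain mean of the three nearest outputs. With :math:`e^{-x}` weighting the prediction moves towards the output of the closest training sequence.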
@@ -41,6 +41,7 @@ Methods
~sequentia.models.knn.regressor.KNNRegressor.compute_distance_matrix
~sequentia.models.knn.regressor.KNNRegressor.dtw
~sequentia.models.knn.regressor.KNNRegressor.fit
~sequentia.models.knn.regressor.KNNRegressor.fit_predict
~sequentia.models.knn.regressor.KNNRegressor.load
~sequentia.models.knn.regressor.KNNRegressor.plot_dtw_histogram
~sequentia.models.knn.regressor.KNNRegressor.plot_warping_path_1d
20 changes: 20 additions & 0 deletions docs/sections/preprocessing/index.rst
@@ -0,0 +1,20 @@
Preprocessing
=============

.. toctree::
:titlesonly:

pipeline
transforms/index

----

Sequentia provides an adapted version of the :mod:`sklearn.preprocessing` interface,
modified to support sequential data while continuing to support most Scikit-Learn transformations out of the box.

Transformations can be applied to all of the input sequences collectively (treated as a single array),
or on an individual basis by using the :class:`.IndependentFunctionTransformer`.

Transformation steps can be combined together with an estimator in a :class:`.Pipeline` which follows the Scikit-Learn interface.

Additional transformations specific to sequences are also provided, such as :ref:`filters <filters>` for signal data.
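
The difference between collective and individual transformation can be illustrated with plain NumPy. The sketch below does not use the Sequentia API; ``standardize`` is a hypothetical stand-in for a transformation, and the data is made up for illustration.

```python
import numpy as np

def standardize(x):
    """Hypothetical transformation: zero mean and unit variance per feature (column)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Two concatenated univariate sequences with very different scales
X = np.array([[1.0], [2.0], [3.0], [10.0], [20.0]])
lengths = np.array([3, 2])

# Collective: statistics are computed over all rows as a single array
X_collective = standardize(X)

# Individual: statistics are computed separately for each sequence,
# and the results are concatenated back into a single array
boundaries = np.cumsum(lengths)[:-1]
X_independent = np.vstack([standardize(seq) for seq in np.split(X, boundaries)])
```

In the collective case only the whole array has zero mean, so the first sequence ends up entirely below zero. In the individual case each sequence is centred on its own mean.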
41 changes: 41 additions & 0 deletions docs/sections/preprocessing/pipeline.rst
@@ -0,0 +1,41 @@
Pipeline
========

Before fitting and using a model, it is common to apply a sequence of preprocessing steps to data.

Pipelines can be used to wrap preprocessing transformations as well as a model into a single estimator,
making it more convenient to reapply the transformations and make predictions on new data.

The :class:`.Pipeline` class implements this feature and is based on :class:`sklearn.pipeline.Pipeline`.

API reference
-------------

Class
^^^^^

.. autosummary::

~sequentia.pipeline.Pipeline

Methods
^^^^^^^

.. autosummary::

~sequentia.pipeline.Pipeline.__init__
~sequentia.pipeline.Pipeline.fit
~sequentia.pipeline.Pipeline.fit_predict
~sequentia.pipeline.Pipeline.fit_transform
~sequentia.pipeline.Pipeline.inverse_transform
~sequentia.pipeline.Pipeline.predict
~sequentia.pipeline.Pipeline.predict_proba
~sequentia.pipeline.Pipeline.score
~sequentia.pipeline.Pipeline.transform

|

.. autoclass:: sequentia.pipeline.Pipeline
:members:
:inherited-members:
:exclude-members: decision_function, get_feature_names_out, get_params, set_params, set_output, predict_log_proba, score_samples, feature_names_in_
27 changes: 27 additions & 0 deletions docs/sections/preprocessing/transforms/filters.rst
@@ -0,0 +1,27 @@
.. _filters:

Filters
=======

Filters are a common preprocessing method for reducing noise in signal processing.

:func:`.mean_filter` and :func:`.median_filter` can be applied to individual sequences.

.. seealso::
Consider using :class:`.IndependentFunctionTransformer` to apply these filters to multiple sequences.

API reference
-------------

Methods
^^^^^^^

.. autosummary::

~sequentia.preprocessing.transforms.mean_filter
~sequentia.preprocessing.transforms.median_filter

|

.. autofunction:: sequentia.preprocessing.transforms.mean_filter
.. autofunction:: sequentia.preprocessing.transforms.median_filter
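
As a rough illustration of what a mean filter does to a single sequence, the sketch below implements a moving average with NumPy. It is not the implementation of :func:`.mean_filter`: the ``moving_average`` name, the ``k`` window parameter and the edge-padding behaviour are all assumptions made for this example.

```python
import numpy as np

def moving_average(x, k=3):
    """Hypothetical mean filter: smooth each feature (column) of one sequence
    using a sliding window of odd width k."""
    x = np.asarray(x, dtype=float)
    pad = k // 2

    # Assumption: repeat the edge values so the output keeps the input length
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")

    # Averaging kernel applied independently to each feature
    kernel = np.ones(k) / k
    return np.column_stack([
        np.convolve(padded[:, j], kernel, mode="valid")
        for j in range(x.shape[1])
    ])

# A univariate sequence with a single noisy spike
x = np.array([[1.0], [1.0], [10.0], [1.0], [1.0]])
smoothed = moving_average(x, k=3)
```

The spike of 10 is spread across its neighbors, which is the noise-reduction effect described above. A median filter would instead replace each value with the median of its window.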