feat!: merge to release v0.3 (#119)
* feat!: disentangle threshold selection from the main model (#89)
* threshold estimators as separate models (see the sketch below)
* remove threshold estimating from autoencoders
* simplify mlflow model saving
* mlflow now only supports saving per artifact
* registry load function now returns a dataclass instead of dict
* replace mlflow with mlflow-skinny to reduce unwanted dependencies

Signed-off-by: Avik Basu <avikbasu93@gmail.com>
Co-authored-by: s0nicboOm <i.kushalbatra@gmail.com>
Co-authored-by: Vigith Maurice <vigith@gmail.com>
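
A minimal sketch of the new threshold-estimator flow (mirroring the updated quick-start in this commit; the `std_factor` value is illustrative):

```python
import numpy as np
from numalogic.models.threshold._std import StdDevThreshold

# Threshold selection is now a separate, sklearn-style model,
# decoupled from the autoencoder itself.
thresh_clf = StdDevThreshold(std_factor=1.2)
thresh_clf.fit(np.random.randn(100, 1))              # fit on training data
scores = thresh_clf.predict(np.random.randn(10, 1))  # recalibrate raw scores
```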
3 people authored Jan 6, 2023
1 parent 4301147 commit dca9a7a
Showing 88 changed files with 5,056 additions and 3,422 deletions.
2 changes: 1 addition & 1 deletion .coveragerc
@@ -2,4 +2,4 @@
branch = True
parallel = True
source = numalogic
omit = numalogic/tests/*
omit = tests/*
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
[flake8]
ignore = E203, F821
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist,venv
max-complexity = 10
max-line-length = 100
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -30,7 +30,7 @@ jobs:
- name: Install dependencies
run: |
poetry env use ${{ matrix.python-version }}
poetry install --all-extras
poetry install --all-extras --with dev,torch
- name: Test with pytest
run: make test
4 changes: 2 additions & 2 deletions .github/workflows/coverage.yml
@@ -30,11 +30,11 @@ jobs:
- name: Install dependencies
run: |
poetry env use ${{ matrix.python-version }}
poetry install --all-extras
poetry install --all-extras --with dev,torch
- name: Run Coverage
run: |
poetry run pytest --cov-report=xml --cov=numalogic --cov-config .coveragerc numalogic/tests/ -sq
poetry run pytest --cov-report=xml --cov=numalogic --cov-config .coveragerc tests/ -sq
- name: Upload Coverage
uses: codecov/codecov-action@v3
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -30,7 +30,7 @@ jobs:
- name: Install dependencies
run: |
poetry env use ${{ matrix.python-version }}
poetry install
poetry install --with dev
- name: Black format check
run: poetry run black --check .
2 changes: 1 addition & 1 deletion .github/workflows/pypi.yml
@@ -30,7 +30,7 @@ jobs:
- name: Install dependencies
run: |
poetry env use ${{ matrix.python-version }}
poetry install --all-extras
poetry install
- name: Build dist
run: poetry build
2 changes: 2 additions & 0 deletions .gitignore
@@ -165,3 +165,5 @@ cython_debug/

# Mac related
*.DS_Store

.python-version
5 changes: 2 additions & 3 deletions Makefile
@@ -16,8 +16,7 @@ clean:
@find . -type f -name "*.py[co]" -exec rm -rf {} +

format: clean
poetry run black numalogic/*
poetry run black examples/*
poetry run black numalogic/ examples/ tests/

lint: format
poetry run flake8 .
@@ -28,7 +27,7 @@ setup:

# test your application (tests in the tests/ directory)
test:
poetry run pytest numalogic/tests/
poetry run pytest tests/

publish:
@rm -rf dist
17 changes: 16 additions & 1 deletion README.md
@@ -35,6 +35,21 @@ the result further or drop it after a trigger request.

## Installation

Numalogic requires Python 3.8 or higher.

### Prerequisites
Numalogic needs [PyTorch](https://pytorch.org/) and
[PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) to work.
Since these packages are platform dependent, they are not included in the
numalogic package itself; please install them first (a sample command is shown below).

Numalogic supports the following PyTorch versions:
- 1.11.x
- 1.12.x
- 1.13.x

Other versions may work as well; they are just not tested.
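
For example, a typical CPU-only setup might look like this (the exact command depends on your platform and package manager; consult the PyTorch installation guide):
```shell
pip install "torch>=1.11,<1.14" pytorch-lightning
```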

numalogic can be installed using pip.
```shell
pip install numalogic
@@ -57,7 +72,7 @@ pip install numalogic[mlflow]
```
3. To install dependencies:
```
poetry install
poetry install --with dev,torch
```
If extra dependencies are needed:
```
poetry install --all-extras
```
73 changes: 43 additions & 30 deletions docs/autoencoders.md
@@ -2,47 +2,60 @@

An Autoencoder is a type of Artificial Neural Network, used to learn efficient data representations (encoding) of unlabeled data.

It mainly consist of 2 components: an encoder and a decoder. The encoder compresses the input into a lower dimensional code, the decoder then reconstructs the input only using this code.
It mainly consists of 2 components: an encoder and a decoder. The encoder compresses the input into a lower dimensional code, the decoder then reconstructs the input only using this code.

### Autoencoder Pipelines
## Datamodules
PyTorch Lightning datamodules abstract the data-handling functionality and keep it separate from the model and the training loop.
Numalogic provides `TimeseriesDataModule` to help set up and load dataloaders.

Numalogic provides two types of pipelines for Autoencoders. These pipelines serve as a wrapper around the base network models, making it easier to train, predict and generate scores. Also, this module follows the sklearn API.
```python
import numpy as np
from numalogic.tools.data import TimeseriesDataModule

train_data = np.random.randn(100, 3)
datamodule = TimeseriesDataModule(12, train_data, batch_size=128)
```
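
A hedged usage sketch for the datamodule above, assuming it follows the standard LightningDataModule interface (the exact batch shape may vary by version):

```python
# prepare the dataloaders and peek at one training batch
datamodule.setup(stage="fit")
batch = next(iter(datamodule.train_dataloader()))
print(batch.shape)  # expected: (batch_size, 12, 3), i.e. windows of length 12 over 3 features
```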

#### AutoencoderPipeline
## Autoencoder Trainer

Here we are using `VanillAE`, a Vanilla Autoencoder model.
Numalogic provides a subclass of the PyTorch Lightning Trainer specifically for Autoencoders.
This trainer supports training, validation and inference, and accepts all the parameters supported by the Lightning Trainer.

Here we are using `VanillaAE`, a Vanilla Autoencoder model.

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder import SparseAEPipeline
from numalogic.models.autoencoder.variants import VanillaAE
from numalogic.models.autoencoder import AutoencoderTrainer

model = AutoencoderPipeline(
model=VanillaAE(signal_len=12, n_features=3), seq_len=seq_len
)
model.fit(X_train)
model = VanillaAE(seq_len=12, n_features=3)
trainer = AutoencoderTrainer(max_epochs=50, enable_progress_bar=True)
trainer.fit(model, datamodule=datamodule)
```
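
For inference, here is a hedged sketch assuming the standard Lightning `predict` flow (the exact dataloader wiring and return value may differ across numalogic versions):

```python
import numpy as np
from numalogic.tools.data import TimeseriesDataModule

test_data = np.random.randn(20, 3)
test_dm = TimeseriesDataModule(12, test_data, batch_size=128)

# assumed to yield per-window outputs, e.g. reconstructions or errors
outputs = trainer.predict(model, datamodule=test_dm)
```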

#### SparseAEPipeline
## Autoencoder Variants

A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck. Specifically the loss function is constructed so that activations are penalized within a layer.
Numalogic currently supports two variants of Autoencoders.
More details can be found [here](https://www.deeplearningbook.org/contents/autoencoders.html).

So, by adding a sparsity regularization, we will be able to stop the neural network from copying the input and reduce overfitting.
### 1. Undercomplete autoencoders

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder import SparseAEPipeline
This is the simplest version of an autoencoder, where the latent dimension is
kept smaller than the encoding and decoding dimensions.

model = SparseAEPipeline(
model=VanillaAE(signal_len=12, n_features=3), seq_len=36, num_epochs=30
)
model.fit(X_train)
```
Examples are `VanillaAE`, `Conv1dAE`, `LSTMAE` and `TransformerAE`.

### 2. Sparse autoencoders
A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck.
Specifically, the loss function is constructed so that activations within a layer are penalized.
By adding this sparsity regularization, we prevent the neural network from simply copying the input, which reduces overfitting.

Examples are `SparseVanillaAE`, `SparseConv1dAE`, `SparseLSTMAE` and `SparseTransformerAE`; a minimal usage sketch follows.
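
A minimal sketch, assuming `beta` weights the sparsity penalty in the same way as in the `SparseConv1dAE` example further below:

```python
from numalogic.models.autoencoder.variants import SparseVanillaAE

# beta (assumed) controls the strength of the sparsity regularization
model = SparseVanillaAE(seq_len=12, n_features=3, beta=1e-3)
```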

### Autoencoder Variants
## Network architectures

Numalogic supports the following variants of Autoencoders
Numalogic currently supports the following architectures.

#### VanillaAE
#### Fully Connected

Vanilla Autoencoder model comprising only fully connected layers.

@@ -52,17 +65,17 @@ from numalogic.models.autoencoder.variants import VanillaAE
model = VanillaAE(seq_len=12, n_features=2)
```

#### Conv1dAE
#### 1d Convolutional

Conv1dAE is a one dimensional Convolutional Autoencoder with multichannel support.

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder.variants import SparseConv1dAE

model=Conv1dAE(in_channels=3, enc_channels=8)
model = SparseConv1dAE(beta=1e-2, seq_len=12, in_channels=3, enc_channels=8)
```

#### LSTMAE
#### LSTM

An LSTM (Long Short-Term Memory) Autoencoder is an implementation of an autoencoder for sequence data using an Encoder-Decoder LSTM architecture.

Expand All @@ -73,7 +86,7 @@ model = LSTMAE(seq_len=12, no_features=2, embedding_dim=15)

```

#### TransformerAE
#### Transformer

The transformer-based Autoencoder model was inspired by the [Attention is all you need](https://arxiv.org/abs/1706.03762) paper.

20 changes: 10 additions & 10 deletions docs/ml-flow.md
@@ -19,34 +19,34 @@ Once the mlflow server has been started, you can navigate to http://127.0.0.1:50

### Model saving

Numalogic provides `MLflowRegistrar`, to save and load models to/from MLflow.
Numalogic provides `MLflowRegistry`, to save and load models to/from MLflow.

Here, `tracking_uri` is the URI where the mlflow server is running. The `static_keys` and `dynamic_keys` together form a unique key for the model.

The `primary_artifact` would be the main model, and `secondary_artifacts` can be used to save any pre-processing models like scalers.

```python
from numalogic.registry import MLflowRegistrar
from numalogic.registry import MLflowRegistry

# static and dynamic keys are used to look up a model
static_keys = ["synthetic", "3ts"]
dynamic_keys = ["minmaxscaler", "sparseconv1d"]

registry = MLflowRegistrar(tracking_uri="http://0.0.0.0:5000", artifact_type="pytorch")
registry = MLflowRegistry(tracking_uri="http://0.0.0.0:5000", artifact_type="pytorch")
registry.save(
skeys=static_keys,
dkeys=dynamic_keys,
primary_artifact=model,
secondary_artifacts={"preproc": scaler}
)
```

### Model loading

Once, the models are save to MLflow, the `load` function of `MLflowRegistrar` can be used to load the model.
Once the models are saved to MLflow, the `load` function of `MLflowRegistry` can be used to load the model.

```python
registry = MLflowRegistrar(tracking_uri="http://0.0.0.0:8080")
registry = MLflowRegistry(tracking_uri="http://0.0.0.0:8080")
artifact_dict = registry.load(
skeys=static_keys, dkeys=dynamic_keys
)
```
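
Per the commit notes above, `load` now returns a dataclass rather than a dict; here is a hedged sketch of unpacking it (the attribute names are assumptions, not confirmed API):

```python
# attribute names are assumed for illustration
model = artifact_dict.artifact
metadata = artifact_dict.metadata
```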
18 changes: 16 additions & 2 deletions docs/post-processing.md
@@ -3,6 +3,20 @@
Post-processing is again an optional step, where we normalize the anomaly scores to the 0-10 range. This mostly makes the scores easier to interpret.

```python
from numalogic.scores import tanh_norm
test_anomaly_score_norm = tanh_norm(test_anomaly_score)
import numpy as np
from numalogic.postprocess import tanh_norm

raw_anomaly_score = np.random.randn(10, 2)
test_anomaly_score_norm = tanh_norm(raw_anomaly_score)
```

A scikit-learn compatible API is also available.
```python
import numpy as np
from numalogic.postprocess import TanhNorm

raw_score = np.random.randn(10, 2)

norm = TanhNorm(scale_factor=10, smooth_factor=10)
norm_score = norm.fit_transform(raw_score)
```
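
Conceptually, the normalization squashes raw scores into the 0-10 range with a scaled tanh. Below is a sketch of the idea, not the library's exact implementation:

```python
import numpy as np

def tanh_norm_sketch(scores: np.ndarray, scale_factor: float = 10.0, smooth_factor: float = 10.0) -> np.ndarray:
    # parameter names mirror TanhNorm above; the exact formula may differ
    return scale_factor * np.tanh(scores / smooth_factor)
```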
45 changes: 28 additions & 17 deletions docs/quick-start.md
@@ -18,22 +18,33 @@ In this example, the train data set has numbers ranging from 1-10. Whereas in the
import numpy as np
from numalogic.models.autoencoder import AutoencoderPipeline
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.scores import tanh_norm
from numalogic.models.threshold._std import StdDevThreshold
from numalogic.postprocess import tanh_norm
from numalogic.preprocess.transformer import LogTransformer

X_train = np.array([1, 3, 5, 2, 5, 1, 4, 5, 1, 4, 5, 8, 9, 1, 2, 4, 5, 1, 3]).reshape(-1, 1)
X_test = np.array([-20, 3, 5, 40, 5, 10, 4, 5, 100]).reshape(-1,1)
X_test = np.array([-20, 3, 5, 40, 5, 10, 4, 5, 100]).reshape(-1, 1)

model = AutoencoderPipeline(
# preprocess step
clf = LogTransformer()
train_data = clf.fit_transform(X_train)
test_data = clf.transform(X_test)

# Define threshold estimator and call fit()
thresh_clf = StdDevThreshold(std_factor=1.2)
thresh_clf.fit(train_data)

ae_pl = AutoencoderPipeline(
model=Conv1dAE(in_channels=1, enc_channels=4), seq_len=8, num_epochs=30
)
# fit method trains the model on train data set
model.fit(X_train)
ae_pl.fit(train_data)

# predict method returns the reconstruction error
recon = model.predict(X_test)
# score method returns the reconstruction error
anomaly_score = ae_pl.score(test_data)

# score method returns the anomaly score computed on test data set
anomaly_score = model.score(X_test)
# recalibrate score based on threshold estimator
anomaly_score = thresh_clf.predict(anomaly_score)

# normalizing scores to range between 0-10
anomaly_score_norm = tanh_norm(anomaly_score)
@@ -43,15 +54,15 @@
print("Anomaly Scores:", anomaly_score_norm)
```
Below is the sample output, with training logs and anomaly scores printed. Notice that the anomaly scores for points -20, 40 and 100 in `X_test` are high.
```shell
...snip training logs...
Anomaly Scores: [[2.70173135]
[0.22298803]
[0.01045979]
[3.66973793]
[0.12931582]
[0.53661316]
[0.10056313]
[0.2634344 ]
[7.76317209]]
Anomaly Scores: [[6.4051428 ]
[5.56049277]
[6.17384938]
[9.3043446 ]
[0.22345986]
[0.48584632]
[3.18197182]
[6.29744181]
[9.99937961]]
```

Replace `X_train` and `X_test` with your own data, and see the anomaly scores generated.