feat!: merge to release v0.3 (#119)
* feat!: disentangle threshold selection from the main model (#89)
* threshold estimators as separate models (see the sketch below)
* remove threshold estimating from autoencoders
* simplify mlflow model saving
* mlflow now only supports saving per artifact
* registry load function now returns a dataclass instead of dict
* replace mlflow with mlflow-skinny to reduce unwanted dependencies

Signed-off-by: Avik Basu <avikbasu93@gmail.com>
Co-authored-by: s0nicboOm <i.kushalbatra@gmail.com>
Co-authored-by: Vigith Maurice <vigith@gmail.com>
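
A minimal sketch of the new threshold-estimator flow (mirroring the updated quick-start in this commit; the `std_factor` value is illustrative):

```python
import numpy as np
from numalogic.models.threshold._std import StdDevThreshold

# Threshold selection is now a separate, sklearn-style model,
# decoupled from the autoencoder itself.
thresh_clf = StdDevThreshold(std_factor=1.2)
thresh_clf.fit(np.random.randn(100, 1))              # fit on training data
scores = thresh_clf.predict(np.random.randn(10, 1))  # recalibrate raw scores
```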
3 people authored Jan 6, 2023
1 parent 4301147 commit dca9a7a
Showing 88 changed files with 5,056 additions and 3,422 deletions.
2 changes: 1 addition & 1 deletion .coveragerc
@@ -2,4 +2,4 @@
branch = True
parallel = True
source = numalogic
omit = numalogic/tests/*
omit = tests/*
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
[flake8]
ignore = E203, F821
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist,venv
max-complexity = 10
max-line-length = 100
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -30,7 +30,7 @@ jobs:
- name: Install dependencies
run: |
poetry env use ${{ matrix.python-version }}
poetry install --all-extras
poetry install --all-extras --with dev,torch
- name: Test with pytest
run: make test
4 changes: 2 additions & 2 deletions .github/workflows/coverage.yml
@@ -30,11 +30,11 @@ jobs:
- name: Install dependencies
run: |
poetry env use ${{ matrix.python-version }}
poetry install --all-extras
poetry install --all-extras --with dev,torch
- name: Run Coverage
run: |
poetry run pytest --cov-report=xml --cov=numalogic --cov-config .coveragerc numalogic/tests/ -sq
poetry run pytest --cov-report=xml --cov=numalogic --cov-config .coveragerc tests/ -sq
- name: Upload Coverage
uses: codecov/codecov-action@v3
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -30,7 +30,7 @@ jobs:
- name: Install dependencies
run: |
poetry env use ${{ matrix.python-version }}
poetry install
poetry install --with dev
- name: Black format check
run: poetry run black --check .
2 changes: 1 addition & 1 deletion .github/workflows/pypi.yml
@@ -30,7 +30,7 @@ jobs:
- name: Install dependencies
run: |
poetry env use ${{ matrix.python-version }}
poetry install --all-extras
poetry install
- name: Build dist
run: poetry build
2 changes: 2 additions & 0 deletions .gitignore
@@ -165,3 +165,5 @@ cython_debug/

# Mac related
*.DS_Store

.python-version
5 changes: 2 additions & 3 deletions Makefile
@@ -16,8 +16,7 @@ clean:
@find . -type f -name "*.py[co]" -exec rm -rf {} +

format: clean
poetry run black numalogic/*
poetry run black examples/*
poetry run black numalogic/ examples/ tests/

lint: format
poetry run flake8 .
@@ -28,7 +27,7 @@ setup:

# test your application (tests in the tests/ directory)
test:
poetry run pytest numalogic/tests/
poetry run pytest tests/

publish:
@rm -rf dist
17 changes: 16 additions & 1 deletion README.md
@@ -35,6 +35,21 @@ the result further or drop it after a trigger request.

## Installation

Numalogic requires Python 3.8 or higher.

### Prerequisites
Numalogic needs [PyTorch](https://pytorch.org/) and
[PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) to work.
Since these packages are platform dependent, they are not included in the
numalogic package itself; please install them first (a sample command is shown below).

Numalogic supports the following PyTorch versions:
- 1.11.x
- 1.12.x
- 1.13.x

Other versions may work as well; they are just not tested.
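
For example, a typical CPU-only setup might look like this (the exact command depends on your platform and package manager; consult the PyTorch installation guide):
```shell
pip install "torch>=1.11,<1.14" pytorch-lightning
```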

numalogic can be installed using pip.
```shell
pip install numalogic
@@ -57,7 +72,7 @@ pip install numalogic[mlflow]
```
3. To install dependencies:
```
poetry install
poetry install --with dev,torch
```
If extra dependencies are needed:
```
poetry install --all-extras
```
73 changes: 43 additions & 30 deletions docs/autoencoders.md
@@ -2,47 +2,60 @@

An Autoencoder is a type of Artificial Neural Network, used to learn efficient data representations (encoding) of unlabeled data.

It mainly consist of 2 components: an encoder and a decoder. The encoder compresses the input into a lower dimensional code, the decoder then reconstructs the input only using this code.
It mainly consists of 2 components: an encoder and a decoder. The encoder compresses the input into a lower dimensional code, the decoder then reconstructs the input only using this code.

### Autoencoder Pipelines
## Datamodules
PyTorch Lightning datamodules abstract the data-handling functionality and keep it separate from the model and the training loop.
Numalogic provides `TimeseriesDataModule` to help set up and load dataloaders.

Numalogic provides two types of pipelines for Autoencoders. These pipelines serve as a wrapper around the base network models, making it easier to train, predict and generate scores. Also, this module follows the sklearn API.
```python
import numpy as np
from numalogic.tools.data import TimeseriesDataModule

train_data = np.random.randn(100, 3)
datamodule = TimeseriesDataModule(12, train_data, batch_size=128)
```
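
A hedged usage sketch for the datamodule above, assuming it follows the standard LightningDataModule interface (the exact batch shape may vary by version):

```python
# prepare the dataloaders and peek at one training batch
datamodule.setup(stage="fit")
batch = next(iter(datamodule.train_dataloader()))
print(batch.shape)  # expected: (batch_size, 12, 3), i.e. windows of length 12 over 3 features
```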

#### AutoencoderPipeline
## Autoencoder Trainer

Here we are using `VanillAE`, a Vanilla Autoencoder model.
Numalogic provides a subclass of the PyTorch Lightning Trainer specifically for Autoencoders.
This trainer supports training, validation and inference, and accepts all the parameters supported by the Lightning Trainer.

Here we are using `VanillaAE`, a Vanilla Autoencoder model.

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder import SparseAEPipeline
from numalogic.models.autoencoder.variants import VanillaAE
from numalogic.models.autoencoder import AutoencoderTrainer

model = AutoencoderPipeline(
model=VanillaAE(signal_len=12, n_features=3), seq_len=seq_len
)
model.fit(X_train)
model = VanillaAE(seq_len=12, n_features=3)
trainer = AutoencoderTrainer(max_epochs=50, enable_progress_bar=True)
trainer.fit(model, datamodule=datamodule)
```
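
For inference, here is a hedged sketch assuming the standard Lightning `predict` flow (the exact dataloader wiring and return value may differ across numalogic versions):

```python
import numpy as np
from numalogic.tools.data import TimeseriesDataModule

test_data = np.random.randn(20, 3)
test_dm = TimeseriesDataModule(12, test_data, batch_size=128)

# assumed to yield per-window outputs, e.g. reconstructions or errors
outputs = trainer.predict(model, datamodule=test_dm)
```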

#### SparseAEPipeline
## Autoencoder Variants

A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck. Specifically the loss function is constructed so that activations are penalized within a layer.
Numalogic currently supports two variants of Autoencoders.
More details can be found [here](https://www.deeplearningbook.org/contents/autoencoders.html).

So, by adding a sparsity regularization, we will be able to stop the neural network from copying the input and reduce overfitting.
### 1. Undercomplete autoencoders

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder import SparseAEPipeline
This is the simplest version of an autoencoder, where the latent dimension is
kept smaller than the encoding and decoding dimensions.

model = SparseAEPipeline(
model=VanillaAE(signal_len=12, n_features=3), seq_len=36, num_epochs=30
)
model.fit(X_train)
```
Examples are `VanillaAE`, `Conv1dAE`, `LSTMAE` and `TransformerAE`.

### 2. Sparse autoencoders
A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck.
Specifically, the loss function is constructed so that activations within a layer are penalized.
By adding this sparsity regularization, we prevent the neural network from simply copying the input, which reduces overfitting.

Examples are `SparseVanillaAE`, `SparseConv1dAE`, `SparseLSTMAE` and `SparseTransformerAE`; a minimal usage sketch follows.
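
A minimal sketch, assuming `beta` weights the sparsity penalty in the same way as in the `SparseConv1dAE` example further below:

```python
from numalogic.models.autoencoder.variants import SparseVanillaAE

# beta (assumed) controls the strength of the sparsity regularization
model = SparseVanillaAE(seq_len=12, n_features=3, beta=1e-3)
```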

### Autoencoder Variants
## Network architectures

Numalogic supports the following variants of Autoencoders
Numalogic currently supports the following architectures.

#### VanillaAE
#### Fully Connected

Vanilla Autoencoder model comprising only fully connected layers.

@@ -52,17 +65,17 @@ from numalogic.models.autoencoder.variants import VanillaAE
model = VanillaAE(seq_len=12, n_features=2)
```

#### Conv1dAE
#### 1d Convolutional

Conv1dAE is a one dimensional Convolutional Autoencoder with multichannel support.

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder.variants import SparseConv1dAE

model=Conv1dAE(in_channels=3, enc_channels=8)
model = SparseConv1dAE(beta=1e-2, seq_len=12, in_channels=3, enc_channels=8)
```

#### LSTMAE
#### LSTM

An LSTM (Long Short-Term Memory) Autoencoder is an implementation of an autoencoder for sequence data using an Encoder-Decoder LSTM architecture.

Expand All @@ -73,7 +86,7 @@ model = LSTMAE(seq_len=12, no_features=2, embedding_dim=15)

```

#### TransformerAE
#### Transformer

The transformer-based Autoencoder model was inspired by the [Attention is all you need](https://arxiv.org/abs/1706.03762) paper.

20 changes: 10 additions & 10 deletions docs/ml-flow.md
@@ -19,34 +19,34 @@ Once the mlflow server has been started, you can navigate to http://127.0.0.1:50

### Model saving

Numalogic provides `MLflowRegistrar`, to save and load models to/from MLflow.
Numalogic provides `MLflowRegistry`, to save and load models to/from MLflow.

Here, `tracking_uri` is the URI where the mlflow server is running. The `static_keys` and `dynamic_keys` together form a unique key for the model.

The `primary_artifact` would be the main model, and `secondary_artifacts` can be used to save any pre-processing models like scalers.

```python
from numalogic.registry import MLflowRegistrar
from numalogic.registry import MLflowRegistry

# static and dynamic keys are used to look up a model
static_keys = ["synthetic", "3ts"]
dynamic_keys = ["minmaxscaler", "sparseconv1d"]

registry = MLflowRegistrar(tracking_uri="http://0.0.0.0:5000", artifact_type="pytorch")
registry = MLflowRegistry(tracking_uri="http://0.0.0.0:5000", artifact_type="pytorch")
registry.save(
skeys=static_keys,
dkeys=dynamic_keys,
primary_artifact=model,
secondary_artifacts={"preproc": scaler}
)
```

### Model loading

Once, the models are save to MLflow, the `load` function of `MLflowRegistrar` can be used to load the model.
Once the models are saved to MLflow, the `load` function of `MLflowRegistry` can be used to load the model.

```python
registry = MLflowRegistrar(tracking_uri="http://0.0.0.0:8080")
registry = MLflowRegistry(tracking_uri="http://0.0.0.0:8080")
artifact_dict = registry.load(
skeys=static_keys, dkeys=dynamic_keys
)
```
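
Per the commit notes above, `load` now returns a dataclass rather than a dict; here is a hedged sketch of unpacking it (the attribute names are assumptions, not confirmed API):

```python
# attribute names are assumed for illustration
model = artifact_dict.artifact
metadata = artifact_dict.metadata
```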
18 changes: 16 additions & 2 deletions docs/post-processing.md
@@ -3,6 +3,20 @@
Post-processing is again an optional step, where we normalize the anomaly scores to the 0-10 range. This mostly makes the scores easier to interpret.

```python
from numalogic.scores import tanh_norm
test_anomaly_score_norm = tanh_norm(test_anomaly_score)
import numpy as np
from numalogic.postprocess import tanh_norm

raw_anomaly_score = np.random.randn(10, 2)
test_anomaly_score_norm = tanh_norm(raw_anomaly_score)
```

A scikit-learn compatible API is also available.
```python
import numpy as np
from numalogic.postprocess import TanhNorm

raw_score = np.random.randn(10, 2)

norm = TanhNorm(scale_factor=10, smooth_factor=10)
norm_score = norm.fit_transform(raw_score)
```
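
Conceptually, the normalization squashes raw scores into the 0-10 range with a scaled tanh. Below is a sketch of the idea, not the library's exact implementation:

```python
import numpy as np

def tanh_norm_sketch(scores: np.ndarray, scale_factor: float = 10.0, smooth_factor: float = 10.0) -> np.ndarray:
    # parameter names mirror TanhNorm above; the exact formula may differ
    return scale_factor * np.tanh(scores / smooth_factor)
```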
45 changes: 28 additions & 17 deletions docs/quick-start.md
@@ -18,22 +18,33 @@ In this example, the train data set has numbers ranging from 1-10. Whereas in the
import numpy as np
from numalogic.models.autoencoder import AutoencoderPipeline
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.scores import tanh_norm
from numalogic.models.threshold._std import StdDevThreshold
from numalogic.postprocess import tanh_norm
from numalogic.preprocess.transformer import LogTransformer

X_train = np.array([1, 3, 5, 2, 5, 1, 4, 5, 1, 4, 5, 8, 9, 1, 2, 4, 5, 1, 3]).reshape(-1, 1)
X_test = np.array([-20, 3, 5, 40, 5, 10, 4, 5, 100]).reshape(-1,1)
X_test = np.array([-20, 3, 5, 40, 5, 10, 4, 5, 100]).reshape(-1, 1)

model = AutoencoderPipeline(
# preprocess step
clf = LogTransformer()
train_data = clf.fit_transform(X_train)
test_data = clf.transform(X_test)

# Define threshold estimator and call fit()
thresh_clf = StdDevThreshold(std_factor=1.2)
thresh_clf.fit(train_data)

ae_pl = AutoencoderPipeline(
model=Conv1dAE(in_channels=1, enc_channels=4), seq_len=8, num_epochs=30
)
# fit method trains the model on train data set
model.fit(X_train)
ae_pl.fit(train_data)

# predict method returns the reconstruction error
recon = model.predict(X_test)
# score method returns the reconstruction error
anomaly_score = ae_pl.score(test_data)

# score method returns the anomaly score computed on test data set
anomaly_score = model.score(X_test)
# recalibrate score based on threshold estimator
anomaly_score = thresh_clf.predict(anomaly_score)

# normalizing scores to range between 0-10
anomaly_score_norm = tanh_norm(anomaly_score)
@@ -43,15 +54,15 @@
print("Anomaly Scores:", anomaly_score_norm)
```
Below is the sample output, with training logs and anomaly scores printed. Notice that the anomaly scores for points -20, 40 and 100 in `X_test` are high.
```shell
...snip training logs...
Anomaly Scores: [[2.70173135]
[0.22298803]
[0.01045979]
[3.66973793]
[0.12931582]
[0.53661316]
[0.10056313]
[0.2634344 ]
[7.76317209]]
Anomaly Scores: [[6.4051428 ]
[5.56049277]
[6.17384938]
[9.3043446 ]
[0.22345986]
[0.48584632]
[3.18197182]
[6.29744181]
[9.99937961]]
```

Replace `X_train` and `X_test` with your own data, and see the anomaly scores generated.