Feature: Score for change in the mean and/or covariance matrix #16

Merged Oct 3, 2024 · 39 commits

Commits
916d57d
Add covariance Frobenius norm diff score, and ipykernel as 'dev' depe…
johannvk Jul 2, 2024
4e2142f
Experimented with precision covariance score, but was not able to sol…
johannvk Jul 3, 2024
a7f4670
Testing pseudo-determinant based MV-Normal Likelihood ratio CPD.
johannvk Aug 28, 2024
a32745c
Added covariance difference operator norm score.
johannvk Aug 28, 2024
f01ef6e
Tested opnorm covariance score.
johannvk Oct 1, 2024
b624259
Cleaned up and removed unused functions from multivariate-mean-var sc…
johannvk Oct 1, 2024
c7e7178
Removed precision score using GraphicalLasso, and scikit-learn depend…
johannvk Oct 1, 2024
7415799
Added 'multivariate_meanvar_score' to score factory function.
johannvk Oct 1, 2024
021043c
Renamed multivariate_meanvar_score to mean_cov_score
johannvk Oct 1, 2024
81ceb4b
Merge branch 'main' into task/cov-matrix-cpd
johannvk Oct 1, 2024
7e943eb
Checked tests complete successfully, and plurialized 'start', 'end', …
johannvk Oct 1, 2024
d406461
feat(api)!: use "mean_var" rather than "meanvar" as score name
Tveten Oct 1, 2024
ed51441
Merge branch 'task/cov-matrix-cpd' of https://github.com/NorskRegnese…
Tveten Oct 1, 2024
0dc355a
feat(api)!: use "mean_var" rather than "meanvar" as score name
Tveten Oct 1, 2024
a08925d
Made tests pass when running without numba-JIT as well. Removed asser…
johannvk Oct 1, 2024
462c9d0
Pre-commit Formatting fixes
johannvk Oct 1, 2024
af86977
Fixx ruff formatting errors.
johannvk Oct 1, 2024
8763571
docs: improve score_factory documentation
Tveten Oct 1, 2024
11c31b0
docs: unify score argument documentation
Tveten Oct 1, 2024
4179df9
Merge branch 'task/cov-matrix-cpd' of https://github.com/NorskRegnese…
Tveten Oct 1, 2024
6385b7e
removed experimental covariance changepoint scores.
johannvk Oct 1, 2024
010b530
Add self to authors list
Tveten Oct 1, 2024
3937cdc
Add self to author of 'mean_cov_score.py' and maintainer.
johannvk Oct 1, 2024
32a4355
Rename author mtveten -> Tveten
Tveten Oct 1, 2024
a2b2bf4
Merge branch 'task/cov-matrix-cpd' of https://github.com/NorskRegnese…
Tveten Oct 1, 2024
bd10edd
Improved citation for the mean-cov-score methodology.
johannvk Oct 1, 2024
ea06337
Merge branch 'task/cov-matrix-cpd' of github.com:NorskRegnesentral/sk…
johannvk Oct 1, 2024
3bac97d
Reference in APA style.
johannvk Oct 1, 2024
22934fe
remove inputs checks
Tveten Oct 2, 2024
e10fbce
Add log_det_covariance utility and positive definite error handling
Tveten Oct 2, 2024
4318aca
fix: add missing njit decorator
Tveten Oct 2, 2024
3a82860
doc: fix mean_cov_score docs
Tveten Oct 2, 2024
1e48578
fix: ensure cov is 2-dimensional
Tveten Oct 2, 2024
d60cae6
test: add test for error when mean_cov score encounters negative defi…
Tveten Oct 2, 2024
a1a3a5f
Improve docs two places.
johannvk Oct 3, 2024
9943714
Bumped minor version.
johannvk Oct 3, 2024
4524e7f
Merge branch 'main' of https://github.com/NorskRegnesentral/skchange …
Tveten Oct 3, 2024
e90e4e8
Merge branch 'task/cov-matrix-cpd' of https://github.com/NorskRegnese…
Tveten Oct 3, 2024
36c51cb
Bump version in __init__
Tveten Oct 3, 2024
2 changes: 1 addition & 1 deletion interactive/explore_moscore.py
@@ -31,7 +31,7 @@
 df = generate_teeth_data(
     n_segments=2, variance=16, segment_length=100, p=1, random_state=1
 )
-detector = Moscore(score="meanvar")
+detector = Moscore(score="mean_var")
 changepoints = detector.fit_predict(df)
 px.scatter(df)
 px.scatter(detector.scores)
2 changes: 1 addition & 1 deletion interactive/explore_moscore_anomaly.py
@@ -15,7 +15,7 @@
 px.scatter(df)
 
 detector = MoscoreAnomaly(
-    score="meanvar",
+    score="mean_var",
     min_anomaly_length=10,
     max_anomaly_length=100,
     threshold_scale=3.0,
2 changes: 1 addition & 1 deletion interactive/explore_stat_threshold_anomaliser.py
@@ -12,7 +12,7 @@
     n, anomalies=[(100, 119), (250, 299)], means=[10.0, 5.0], random_state=1
 )
 
-change_detector = Moscore("meanvar", bandwidth=20)
+change_detector = Moscore("mean_var", bandwidth=20)
 change_detector = Pelt("mean", min_segment_length=5)
 detector = StatThresholdAnomaliser(
     change_detector, stat=np.mean, stat_lower=-1.0, stat_upper=1.0
4 changes: 3 additions & 1 deletion pyproject.toml
@@ -1,12 +1,13 @@
 [project]
 name = "skchange"
-version = "0.7.0"
+version = "0.8.0"
 description = "Sktime-compatible change and anomaly detection"
 authors = [
     {name = "Martin Tveten", email = "tveten@nr.no"},
 ]
 maintainers = [
     {name = "Martin Tveten", email = "tveten@nr.no"},
+    {name = "Johannes Voll Kolstø", email = "jvkolsto@nr.no"},
 ]
 readme = "README.md"
 keywords = [
@@ -55,6 +56,7 @@ dev = [
     "pre-commit",
     "pytest",
     "pytest-cov",
+    "ipykernel",
 ]
 
 [build-system]
2 changes: 1 addition & 1 deletion skchange/__init__.py
@@ -1,3 +1,3 @@
 """skchange."""
 
-__version__ = "0.7.0"
+__version__ = "0.8.0"
2 changes: 1 addition & 1 deletion skchange/anomaly_detectors/capa.py
@@ -1,6 +1,6 @@
 """The collective and point anomalies (CAPA) algorithm."""
 
-__author__ = ["mtveten"]
+__author__ = ["Tveten"]
 __all__ = ["Capa"]
 
 from typing import Callable, Optional, Union
46 changes: 32 additions & 14 deletions skchange/anomaly_detectors/circular_binseg.py
@@ -1,6 +1,6 @@
 """Circular binary segmentation algorithm for multiple changepoint detection."""
 
-__author__ = ["mtveten"]
+__author__ = ["Tveten"]
 __all__ = ["CircularBinarySegmentation"]
 
 from typing import Callable, Optional, Union
@@ -107,30 +107,48 @@ class CircularBinarySegmentation(CollectiveAnomalyDetector):
 
 Parameters
 ----------
-score: str, tuple[Callable, Callable], optional (default="mean")
+score: {"mean", "mean_var", "mean_cov"}, tuple[Callable, Callable], default="mean"
 Test statistic to use for changepoint detection.
-* If "mean", the difference-in-mean statistic is used,
-* If "var", the difference-in-variance statistic is used,
-* If a tuple, it must contain two functions: The first function is the scoring
-function, which takes in the output of the second function as its first
-argument, and start, end and split indices as the second, third and fourth
-arguments. The second function is the initializer, which precomputes quantities
-that should be precomputed. See skchange/scores/score_factory.py for examples.
-threshold_scale : float, optional (default=2.0)
+
+* "mean": The CUSUM statistic for a change in mean (this is equivalent to a
+likelihood ratio test for a change in the mean of Gaussian data). For
+multivariate data, the sum of the CUSUM statistics for each dimension is used.
+* "mean_var": The likelihood ratio test for a change in the mean and/or variance
+of Gaussian data. For multivariate data, the sum of the likelihood ratio
+statistics for each dimension is used.
+* "mean_cov": The likelihood ratio test for a change in the mean and/or
+covariance matrix of multivariate Gaussian data.
+* If a tuple, it must contain two numba jitted functions:
+
+1. The first function is the scoring function, which takes four arguments:
+
+1. The output of the second function.
+2. Start indices of the intervals to score for a change
+3. End indices of the intervals to score for a change
+4. Split indices of the intervals to score for a change.
+
+For each start, split and end, the score should be calculated for the
+data intervals [start:split] and [split+1:end], meaning that both the
+starts and ends are inclusive, while split is included in the left
+interval.
+2. The second function is the initializer, which takes the data matrix as
+input and returns precomputed quantities that may speed up the score
+calculations. If not relevant, just return the data matrix.
+threshold_scale : float, default=2.0
 Scaling factor for the threshold. The threshold is set to
 'threshold_scale * 2 * p * np.sqrt(np.log(n))', where 'n' is the sample size
 and 'p' is the number of variables. If None, the threshold is tuned on the data
 input to .fit().
-level : float, optional (default=0.01)
+level : float, default=0.01
 If `threshold_scale` is None, the threshold is set to the (1-`level`)-quantile
 of the changepoint scores of all the seeded intervals on the training data.
 For this to be correct, the training data must contain no changepoints.
-min_segment_length : int, optional (default=5)
+min_segment_length : int, default=5
 Minimum length between two changepoints. Must be greater than or equal to 1.
-max_interval_length : int (default=100)
+max_interval_length : int, default=100
 The maximum length of an interval to estimate a changepoint in. Must be greater
 than or equal to '2 * min_segment_length'.
-growth_factor : float (default = 1.5)
+growth_factor : float, default=1.5
 The growth factor for the seeded intervals. Intervals grow in size according to
 'interval_len=max(interval_len + 1, np.floor(growth_factor * interval_len))',
 starting at 'interval_len'='min_interval_length'. It also governs the amount
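The custom-score interface documented above (a scoring function plus an initializer, both numba-jitted) can be sketched as follows. The function names and the toy "absolute difference in means" statistic are illustrative, not part of skchange; the argument order (precomputed data, starts, ends, splits) and the inclusive-interval convention follow the docstring.

```python
import numpy as np

try:
    from numba import njit  # skchange expects numba-jitted functions
except ImportError:  # fall back to a no-op decorator so the sketch still runs
    def njit(func=None, **kwargs):
        if func is None:
            return lambda f: f
        return func


@njit
def init_abs_mean_score(X: np.ndarray) -> np.ndarray:
    # Initializer: precompute quantities that speed up scoring.
    # Nothing to precompute for this toy score, so just return the data matrix.
    return X


@njit
def abs_mean_score(X, starts, ends, splits):
    # For each (start, split, end), score a change between the inclusive
    # intervals [start:split] and [split+1:end].
    scores = np.zeros(len(starts))
    for i in range(len(starts)):
        left = X[starts[i] : splits[i] + 1]
        right = X[splits[i] + 1 : ends[i] + 1]
        left_mean = np.sum(left, axis=0) / left.shape[0]
        right_mean = np.sum(right, axis=0) / right.shape[0]
        # Sum the absolute mean differences over the dimensions.
        scores[i] = np.sum(np.abs(left_mean - right_mean))
    return scores
```

A tuple `(abs_mean_score, init_abs_mean_score)` in this shape is what the `score` parameter would accept in place of a named statistic.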
2 changes: 1 addition & 1 deletion skchange/anomaly_detectors/moscore_anomaly.py
@@ -1,6 +1,6 @@
 """The Moving Score algorithm for multiple collective anomaly detection."""
 
-__author__ = ["mtveten"]
+__author__ = ["Tveten"]
 __all__ = ["MoscoreAnomaly"]
 
 from typing import Callable, Optional, Union
2 changes: 1 addition & 1 deletion skchange/anomaly_detectors/mvcapa.py
@@ -1,6 +1,6 @@
 """The subset multivariate collective and point anomalies (MVCAPA) algorithm."""
 
-__author__ = ["mtveten"]
+__author__ = ["Tveten"]
 __all__ = ["Mvcapa"]
 
 from typing import Callable, Optional, Union
6 changes: 3 additions & 3 deletions skchange/base.py
@@ -33,7 +33,7 @@ class name: BaseDetector
     fitted state flag - check_is_fitted()
 """
 
-__author__ = ["mtveten"]
+__author__ = ["Tveten"]
 __all__ = ["BaseDetector"]
 
 import pandas as pd
@@ -66,8 +66,8 @@ class BaseDetector(BaseEstimator):
 
     _tags = {
         "object_type": "detector",  # type of object
-        "authors": "mtveten",  # author(s) of the object
-        "maintainers": "mtveten",  # current maintainer(s) of the object
+        "authors": "Tveten",  # author(s) of the object
+        "maintainers": "Tveten",  # current maintainer(s) of the object
     }  # for unit test cases
 
     def __init__(self):
46 changes: 32 additions & 14 deletions skchange/change_detectors/moscore.py
@@ -1,6 +1,6 @@
 """The Moving Score algorithm for multiple changepoint detection."""
 
-__author__ = ["mtveten"]
+__author__ = ["Tveten"]
 __all__ = ["Moscore"]
 
 from typing import Callable, Optional, Union
@@ -58,30 +58,48 @@ class Moscore(ChangeDetector):
 
 Parameters
 ----------
-score: str, tuple[Callable, Callable], optional (default="mean")
+score: {"mean", "mean_var", "mean_cov"}, tuple[Callable, Callable], default="mean"
 Test statistic to use for changepoint detection.
-* If "mean", the difference-in-mean statistic is used,
-* If "var", the difference-in-variance statistic is used,
-* If a tuple, it must contain two functions: The first function is the scoring
-function, which takes in the output of the second function as its first
-argument, and start, end and split indices as the second, third and fourth
-arguments. The second function is the initializer, which precomputes quantities
-that should be precomputed. See skchange/scores/score_factory.py for examples.
-bandwidth : int, optional (default=30)
+
+* "mean": The CUSUM statistic for a change in mean (this is equivalent to a
+likelihood ratio test for a change in the mean of Gaussian data). For
+multivariate data, the sum of the CUSUM statistics for each dimension is used.
+* "mean_var": The likelihood ratio test for a change in the mean and/or variance
+of Gaussian data. For multivariate data, the sum of the likelihood ratio
+statistics for each dimension is used.
+* "mean_cov": The likelihood ratio test for a change in the mean and/or
+covariance matrix of multivariate Gaussian data.
+* If a tuple, it must contain two numba jitted functions:
+
+1. The first function is the scoring function, which takes four arguments:
+
+1. The output of the second function.
+2. Start indices of the intervals to score for a change
+3. End indices of the intervals to score for a change
+4. Split indices of the intervals to score for a change.
+
+For each start, split and end, the score should be calculated for the
+data intervals [start:split] and [split+1:end], meaning that both the
+starts and ends are inclusive, while split is included in the left
+interval.
+2. The second function is the initializer, which takes the data matrix as
+input and returns precomputed quantities that may speed up the score
+calculations. If not relevant, just return the data matrix.
+bandwidth : int, default=30
 The bandwidth is the number of samples on either side of a candidate
 changepoint. The minimum bandwidth depends on the
 test statistic. For "mean", the minimum bandwidth is 1.
-threshold_scale : float, optional (default=2.0)
+threshold_scale : float, default=2.0
 Scaling factor for the threshold. The threshold is set to
 'threshold_scale * default_threshold', where the default threshold depends on
 the number of samples, the number of variables, `bandwidth` and `level`.
 If None, the threshold is tuned on the data input to .fit().
-level : float, optional (default=0.01)
+level : float, default=0.01
 If `threshold_scale` is None, the threshold is set to the (1-`level`)-quantile
 of the changepoint score on the training data. For this to be correct, the
 training data must contain no changepoints. If `threshold_scale` is a number,
 `level` is used in the default threshold, _before_ scaling.
-min_detection_interval : int, optional (default=1)
+min_detection_interval : int, default=1
 Minimum number of consecutive scores above the threshold to be considered a
 changepoint. Must be between 1 and `bandwidth`/2.
 
@@ -305,6 +323,6 @@ def get_test_params(cls, parameter_set="default"):
     """
     params = [
         {"score": "mean", "bandwidth": 5},
-        {"score": "meanvar", "bandwidth": 5},
+        {"score": "mean_var", "bandwidth": 5},
     ]
     return params
2 changes: 1 addition & 1 deletion skchange/change_detectors/pelt.py
@@ -1,6 +1,6 @@
 """The pruned exact linear time (PELT) algorithm."""
 
-__author__ = ["mtveten"]
+__author__ = ["Tveten"]
 __all__ = ["Pelt"]
 
46 changes: 32 additions & 14 deletions skchange/change_detectors/seeded_binseg.py
@@ -1,6 +1,6 @@
 """Seeded binary segmentation algorithm for multiple changepoint detection."""
 
-__author__ = ["mtveten"]
+__author__ = ["Tveten"]
 __all__ = ["SeededBinarySegmentation"]
 
 from typing import Callable, Optional, Union
@@ -108,30 +108,48 @@ class SeededBinarySegmentation(ChangeDetector):
 
 Parameters
 ----------
-score: str, tuple[Callable, Callable], optional (default="mean")
+score: {"mean", "mean_var", "mean_cov"}, tuple[Callable, Callable], default="mean"
 Test statistic to use for changepoint detection.
-* If "mean", the difference-in-mean statistic is used,
-* If "var", the difference-in-variance statistic is used,
-* If a tuple, it must contain two functions: The first function is the scoring
-function, which takes in the output of the second function as its first
-argument, and start, end and split indices as the second, third and fourth
-arguments. The second function is the initializer, which precomputes quantities
-that should be precomputed. See skchange/scores/score_factory.py for examples.
-threshold_scale : float, optional (default=2.0)
+
+* "mean": The CUSUM statistic for a change in mean (this is equivalent to a
+likelihood ratio test for a change in the mean of Gaussian data). For
+multivariate data, the sum of the CUSUM statistics for each dimension is used.
+* "mean_var": The likelihood ratio test for a change in the mean and/or variance
+of Gaussian data. For multivariate data, the sum of the likelihood ratio
+statistics for each dimension is used.
+* "mean_cov": The likelihood ratio test for a change in the mean and/or
+covariance matrix of multivariate Gaussian data.
+* If a tuple, it must contain two numba jitted functions:
+
+1. The first function is the scoring function, which takes four arguments:
+
+1. The output of the second function.
+2. Start indices of the intervals to score for a change
+3. End indices of the intervals to score for a change
+4. Split indices of the intervals to score for a change.
+
+For each start, split and end, the score should be calculated for the
+data intervals [start:split] and [split+1:end], meaning that both the
+starts and ends are inclusive, while split is included in the left
+interval.
+2. The second function is the initializer, which takes the data matrix as
+input and returns precomputed quantities that may speed up the score
+calculations. If not relevant, just return the data matrix.
+threshold_scale : float, default=2.0
 Scaling factor for the threshold. The threshold is set to
 'threshold_scale * 2 * p * np.sqrt(np.log(n))', where 'n' is the sample size
 and 'p' is the number of variables. If None, the threshold is tuned on the data
 input to .fit().
-level : float, optional (default=0.01)
+level : float, default=0.01
 If `threshold_scale` is None, the threshold is set to the (1-`level`)-quantile
 of the changepoint scores of all the seeded intervals on the training data.
 For this to be correct, the training data must contain no changepoints.
-min_segment_length : int, optional (default=5)
+min_segment_length : int, default=5
 Minimum length between two changepoints. Must be greater than or equal to 1.
-max_interval_length : int (default=200)
+max_interval_length : int, default=200
 The maximum length of an interval to estimate a changepoint in. Must be greater
 than or equal to '2 * min_segment_length'.
-growth_factor : float (default = 1.5)
+growth_factor : float, default=1.5
 The growth factor for the seeded intervals. Intervals grow in size according to
 'interval_len=max(interval_len + 1, np.floor(growth_factor * interval_len))',
 starting at 'interval_len'='min_interval_length'. It also governs the amount
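The default threshold formula quoted in the binary segmentation docstrings, `threshold_scale * 2 * p * np.sqrt(np.log(n))`, can be sketched as a standalone helper (the function name is illustrative, not a skchange API):

```python
import numpy as np


def default_threshold(n: int, p: int, threshold_scale: float = 2.0) -> float:
    """Default detection threshold: threshold_scale * 2 * p * sqrt(log(n)),
    where n is the sample size and p the number of variables."""
    return threshold_scale * 2 * p * np.sqrt(np.log(n))
```

Note the threshold scales linearly in the number of variables `p` and only very slowly (as `sqrt(log n)`) in the sample size `n`.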
2 changes: 1 addition & 1 deletion skchange/change_detectors/tests/test_moscore.py
@@ -14,7 +14,7 @@ def test_moscore_changepoint(score):
     n_segments = 2
     seg_len = 50
     df = generate_teeth_data(
-        n_segments=n_segments, mean=10, segment_length=seg_len, p=1, random_state=2
+        n_segments=n_segments, mean=15, segment_length=seg_len, p=1, random_state=2
     )
     detector = Moscore(score)
     changepoints = detector.fit_predict(df)
2 changes: 2 additions & 0 deletions skchange/costs/cost_factory.py
@@ -14,6 +14,8 @@
 
 """
 
+__author__ = ["Tveten"]
+
 from skchange.costs.mean_cost import init_mean_cost, mean_cost
 
 VALID_COSTS = ["mean"]
2 changes: 2 additions & 0 deletions skchange/costs/mean_cost.py
@@ -1,5 +1,7 @@
 """Gaussian mean likelihood cost function for change point detection."""
 
+__author__ = ["Tveten"]
+
 import numpy as np
 from numba import njit
 
2 changes: 2 additions & 0 deletions skchange/costs/mean_saving.py
@@ -1,5 +1,7 @@
 """Mean saving for CAPA type anomaly detection."""
 
+__author__ = ["Tveten"]
+
 import numpy as np
 from numba import njit
 
2 changes: 2 additions & 0 deletions skchange/costs/saving_factory.py
@@ -16,6 +16,8 @@
 
 """
 
+__author__ = ["Tveten"]
+
 from skchange.costs.mean_saving import init_mean_saving, mean_saving
 
 VALID_SAVINGS = ["mean"]
2 changes: 2 additions & 0 deletions skchange/datasets/generate.py
@@ -1,5 +1,7 @@
 """Data generators."""
 
+__author__ = ["Tveten"]
+
 from numbers import Number
 from typing import Union
 