
Release q42021 #121

Merged
merged 42 commits on Jan 28, 2022

Commits (42 total; diff shown from 32 commits)
50e848b
new changes
rportilla-databricks Aug 10, 2021
30492b5
updated upsample
rportilla-databricks Aug 12, 2021
0980ad0
updated upsample
rportilla-databricks Aug 12, 2021
6579868
updated upsample
rportilla-databricks Aug 13, 2021
a1b2f21
committing read_yaml
rportilla-databricks Aug 13, 2021
0c13bb8
adding class1 with stacking
rportilla-databricks Aug 16, 2021
f043b87
adding class1 with stacking
rportilla-databricks Aug 16, 2021
c874d1c
removing streams
rportilla-databricks Aug 16, 2021
32f5fc3
removing streams
rportilla-databricks Aug 16, 2021
ef90f31
adding anomaly detection yaml support
rportilla-databricks Sep 8, 2021
bb2e34c
fixing merge conflict
rportilla-databricks Sep 8, 2021
c671764
making database configurable
rportilla-databricks Sep 10, 2021
7e5fd1b
making database configurable
rportilla-databricks Sep 10, 2021
cd853b7
making database configurable
rportilla-databricks Sep 10, 2021
2c2ed63
added option for empty string prefix
rportilla-databricks Sep 11, 2021
b49ef5f
added option for empty string prefix
rportilla-databricks Sep 12, 2021
d2f5a34
added option for empty string prefix
rportilla-databricks Sep 12, 2021
d7de342
removing anomaly detection in branch
rportilla-databricks Sep 12, 2021
cd16c10
remove anomaly detection code test file
rportilla-databricks Sep 12, 2021
73cb447
merging resample
rportilla-databricks Sep 12, 2021
b127082
removing dbl tempo egg files
rportilla-databricks Sep 12, 2021
bd05931
removing dbl tempo egg files
rportilla-databricks Sep 13, 2021
30377cd
removing dbl tempo egg files
rportilla-databricks Sep 13, 2021
3c5debf
removing dbl tempo egg files
rportilla-databricks Sep 13, 2021
d58f89e
removing dbl tempo egg files
rportilla-databricks Sep 14, 2021
5eb532f
merging results
rportilla-databricks Nov 27, 2021
58707a9
Fourier transform functionality release Q42021 (#111)
Spratiher9 Dec 27, 2021
3327792
Update README.md
rportilla-databricks Jan 9, 2022
ae7d023
committing latest fixes to release branch
rportilla-databricks Jan 12, 2022
4eb3e8a
fixing workflows
rportilla-databricks Jan 12, 2022
c43888f
feature: add interpolation functionality (#109)
guanjieshen Jan 20, 2022
ddb8135
commiting release file
rportilla-databricks Jan 25, 2022
fbb5519
removed unused code
rportilla-databricks Jan 27, 2022
784e1d8
make the sql opt optional
rportilla-databricks Jan 27, 2022
298886c
pushing prefix change
rportilla-databricks Jan 28, 2022
77fb538
pushing prefix change
rportilla-databricks Jan 28, 2022
d1b86a3
pushing prefix change
rportilla-databricks Jan 28, 2022
14a1789
pushing prefix change
rportilla-databricks Jan 28, 2022
34e9748
adding files
rportilla-databricks Jan 28, 2022
f5d92d9
adding files
rportilla-databricks Jan 28, 2022
e4cb60d
adding files
rportilla-databricks Jan 28, 2022
0832eeb
updating asof prefix logic for sql optimization
rportilla-databricks Jan 28, 2022
6 changes: 3 additions & 3 deletions .github/workflows/test.yml
@@ -1,7 +1,7 @@
name: build
on:
push:
branches: [master, scala_refactor]
branches: [master]
jobs:
run:
runs-on: ${{ matrix.os }}
@@ -20,13 +20,13 @@ jobs:
- name: Set Spark env
run: |
export SPARK_LOCAL_IP=127.0.0.1
export SPARK_SUBMIT_OPTS="--illegal-access=permit -Dio.netty.tryReflectionSetAccessible=true"
- name: Generate coverage report
working-directory: ./python
run: |
pip install -r requirements.txt
pip install coverage
coverage run -m unittest
coverage run -m unittest discover -s tests -p '*_tests.py'
coverage xml

- name: Publish test coverage
uses: codecov/codecov-action@v1
5 changes: 4 additions & 1 deletion .gitignore
@@ -1,6 +1,8 @@
# ignore IntelliJ/PyCharm IDE files
# ignore IntelliJ/PyCharm/VSCode IDE files
.idea
*.iml
.vscode


# coverage files
.coverage
@@ -10,6 +12,7 @@ coverage.xml
scala/tempo/target
scala/tempo/project/target/
scala/tempo/project/project/target/
scala/target/stream/*
.bsp

# local delta tables
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -51,4 +51,4 @@ Authorized Users (please list Github usernames):**


* Required field
** Please note that Authorized Users may not be immediately be granted authorization to submit Contributions; should more than one individual attempt to sign a CLA on behalf of a Corporation, the first such CLA will apply and later CLAs will be deemed void.
** Please note that Authorized Users may not be immediately be granted authorization to submit Contributions; should more than one individual attempt to sign a CLA on behalf of a Corporation, the first such CLA will apply and later CLAs will be deemed void.
89 changes: 89 additions & 0 deletions README.md
@@ -8,7 +8,10 @@
## Project Description
The purpose of this project is to make time series manipulation with Spark simpler. Operations covered under this package include AS OF joins, rolling statistics with user-specified window lengths, featurization of time series using lagged values, and Delta Lake optimization on time and partition fields.

[![image](https://github.com/databrickslabs/tempo/workflows/build/badge.svg)](https://github.com/databrickslabs/tempo/actions?query=workflow%3Abuild)
[![codecov](https://codecov.io/gh/databrickslabs/tempo/branch/master/graph/badge.svg)](https://codecov.io/gh/databrickslabs/tempo)
[![Downloads](https://pepy.tech/badge/dbl-tempo/month)](https://pepy.tech/project/dbl-tempo)
[![PyPI version](https://badge.fury.io/py/dbl-tempo.svg)](https://badge.fury.io/py/dbl-tempo)

## Using the Project

@@ -164,6 +167,92 @@ moving_avg.select('event_ts', 'x', 'y', 'z', 'mean_y').show(10, False)
```


#### 6 - Fourier Transform

Transforms the time series into the frequency domain based on the specified data column.

Parameters:

- `timestep`: the sampling interval used to derive the frequency scale
- `valueCol`: name of the time-domain data column to be transformed

```python
ft_df = tsdf.fourier_transform(timestep=1, valueCol="data_col")
display(ft_df)
```
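
A slightly fuller sketch of the same call, assuming an evenly sampled input DataFrame (the `partition_a`, `event_ts`, and `data_col` names and the tiny sample data below are illustrative, not part of the project):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from tempo import TSDF
from tempo.utils import display

spark = SparkSession.builder.getOrCreate()

# Illustrative input: one evenly sampled numeric signal per partition
df = spark.createDataFrame(
    [
        ("a", "2021-01-01 00:00:00", 1.0),
        ("a", "2021-01-01 00:00:01", 2.0),
        ("a", "2021-01-01 00:00:02", 3.0),
        ("a", "2021-01-01 00:00:03", 2.0),
    ],
    ["partition_a", "event_ts", "data_col"],
).withColumn("event_ts", F.to_timestamp("event_ts"))

tsdf = TSDF(df, partition_cols=["partition_a"], ts_col="event_ts")

# timestep matches the sampling interval of the signal (1 second here);
# valueCol names the time-domain column to transform
ft_df = tsdf.fourier_transform(timestep=1, valueCol="data_col")
display(ft_df)
```
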
#### 7 - Interpolation

Interpolate a series to fill in missing values using a specified function. The following interpolation methods are supported:

- Zero Fill : `zero`
- Null Fill: `null`
- Backwards Fill: `bfill`
- Forwards Fill: `ffill`
- Linear Fill: `linear`

The `interpolate` method can be used either in conjunction with `resample` or independently.

If `interpolate` is not chained after a `resample` operation, the method automatically re-samples the input dataset at the given frequency first, then performs interpolation on the re-sampled time-series dataset.

Possible values for `freq` include patterns such as `1 minute`, `4 hours`, `2 days`, or simply `sec`, `min`, `day`. The accepted aggregation functions (`func`) are `floor`, `ceil`, `min`, `max`, and `mean`.
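
For instance, a stand-alone down-sampling step with a coarser frequency and a different aggregation function might look like this (a minimal sketch reusing the `input_tsdf` constructed in the example further below):

```python
# Down-sample to 2-day intervals, keeping the per-interval maximum
two_day_max_tsdf = input_tsdf.resample(freq="2 days", func="max")

# Down-sample to 1-minute intervals, keeping the per-interval mean
per_minute_mean_tsdf = input_tsdf.resample(freq="1 minute", func="mean")
```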

`NULL` values after re-sampling are treated the same as missing values. The ability to treat `NULL` as a valid value is not currently supported.

Valid column data types for interpolation are: `["int", "bigint", "float", "double"]`.

```python
# Create an instance of the TSDF class
input_tsdf = TSDF(
    input_df,
    partition_cols=["partition_a", "partition_b"],
    ts_col="event_ts",
)


# The following chain of operations:
# 1. Aggregates all valid numeric columns using mean into 30-second intervals
# 2. Interpolates any missing intervals or null values using linear fill
# Note: when chaining interpolate after a resample, there is no need to provide
# freq or func parameters; only method is required.
interpolated_tsdf = input_tsdf.resample(freq="30 seconds", func="mean").interpolate(
    method="linear"
)

# The following interpolation call:
# 1. Aggregates columnA and columnB using mean into 30-second intervals
# 2. Interpolates any missing intervals or null values using linear fill
interpolated_tsdf = input_tsdf.interpolate(
    freq="30 seconds",
    func="mean",
    target_cols=["columnA", "columnB"],
    method="linear"
)

# Alternatively, it is also possible to override default TSDF parameters,
# e.g. partition_cols and ts_col.
interpolated_tsdf = input_tsdf.interpolate(
    partition_cols=["partition_c"],
    ts_col="other_event_ts",
    freq="30 seconds",
    func="mean",
    target_cols=["columnA", "columnB"],
    method="linear"
)

# The show_interpolated flag can be set to True to add a boolean column per
# target column, indicating for each row whether that value was interpolated.
interpolated_tsdf = input_tsdf.interpolate(
    partition_cols=["partition_c"],
    ts_col="other_event_ts",
    freq="30 seconds",
    func="mean",
    method="linear",
    target_cols=["columnA", "columnB"],
    show_interpolated=True,
)

```

## Project Support
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
109 changes: 109 additions & 0 deletions python/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,9 @@
## Project Description
The purpose of this project is to make time series manipulation with Spark simpler. Operations covered under this package include AS OF joins, rolling statistics with user-specified window lengths, featurization of time series using lagged values, and Delta Lake optimization on time and partition fields.

[![image](https://github.com/databrickslabs/tempo/workflows/build/badge.svg)](https://github.com/databrickslabs/tempo/actions?query=workflow%3Abuild)
[![codecov](https://codecov.io/gh/databrickslabs/tempo/branch/master/graph/badge.svg)](https://codecov.io/gh/databrickslabs/tempo)
[![Downloads](https://pepy.tech/badge/dbl-tempo/month)](https://pepy.tech/project/dbl-tempo)

## Using the Project

@@ -144,7 +146,114 @@ moving_avg = watch_accel_tsdf.withRangeStats("y", rangeBackWindowSecs=600)
moving_avg.select('event_ts', 'x', 'y', 'z', 'mean_y').show(10, False)
```

#### 6 - Anomaly Detection

First, create a local YAML file listing the tables you wish to detect anomalies for.

Note: use `%sh` in a Databricks notebook, or run the following bash command in your local working directory:

```
echo """
table1:
database : "default"
name : "revenue_hourly_2021"
ts_col : "timestamp"
lookback_window : "84600"
mode : "new"
# include any grouping columns or metrics you wish to detect anomalies on
partition_cols : ["winner"]
metrics : ["advertiser_impressions", "publisher_net_revenue"]
""" > ad.yaml
```

The anomaly-detection job then reads this configuration to produce a stacked table of anomalies.
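
As a minimal sketch of the configuration side only (assuming PyYAML is installed; the detection entry point itself is not reproduced here), the file written above parses into a plain dictionary keyed by table name:

```python
import yaml  # PyYAML, assumed to be available

# Each top-level key (e.g. table1) describes one table to scan for anomalies.
with open("ad.yaml") as f:
    config = yaml.safe_load(f)

table_cfg = config["table1"]
print(table_cfg["name"])            # revenue_hourly_2021
print(table_cfg["partition_cols"])  # ['winner']
print(table_cfg["metrics"])         # ['advertiser_impressions', 'publisher_net_revenue']
```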

#### 7 - Fourier Transform

Transforms the time series into the frequency domain based on the specified data column.

Parameters:

- `timestep`: the sampling interval used to derive the frequency scale
- `valueCol`: name of the time-domain data column to be transformed

```python
ft_df = tsdf.fourier_transform(timestep=1, valueCol="data_col")
display(ft_df)
```

#### 8 - Interpolation

Interpolate a series to fill in missing values using a specified function. The following interpolation methods are supported:

- Zero Fill : `zero`
- Null Fill: `null`
- Backwards Fill: `bfill`
- Forwards Fill: `ffill`
- Linear Fill: `linear`

The `interpolate` method can be used either in conjunction with `resample` or independently.

If `interpolate` is not chained after a `resample` operation, the method automatically re-samples the input dataset at the given frequency first, then performs interpolation on the re-sampled time-series dataset.

Possible values for `freq` include patterns such as `1 minute`, `4 hours`, `2 days`, or simply `sec`, `min`, `day`. The accepted aggregation functions (`func`) are `floor`, `ceil`, `min`, `max`, and `mean`.

`NULL` values after re-sampling are treated the same as missing values. The ability to treat `NULL` as a valid value is not currently supported.

Valid column data types for interpolation are: `["int", "bigint", "float", "double"]`.

```python
# Create an instance of the TSDF class
input_tsdf = TSDF(
    input_df,
    partition_cols=["partition_a", "partition_b"],
    ts_col="event_ts",
)


# The following chain of operations:
# 1. Aggregates all valid numeric columns using mean into 30-second intervals
# 2. Interpolates any missing intervals or null values using linear fill
# Note: when chaining interpolate after a resample, there is no need to provide
# freq or func parameters; only method is required.
interpolated_tsdf = input_tsdf.resample(freq="30 seconds", func="mean").interpolate(
    method="linear"
)

# The following interpolation call:
# 1. Aggregates columnA and columnB using mean into 30-second intervals
# 2. Interpolates any missing intervals or null values using linear fill
interpolated_tsdf = input_tsdf.interpolate(
    freq="30 seconds",
    func="mean",
    target_cols=["columnA", "columnB"],
    method="linear"
)

# Alternatively, it is also possible to override default TSDF parameters,
# e.g. partition_cols and ts_col.
interpolated_tsdf = input_tsdf.interpolate(
    partition_cols=["partition_c"],
    ts_col="other_event_ts",
    freq="30 seconds",
    func="mean",
    target_cols=["columnA", "columnB"],
    method="linear"
)

# The show_interpolated flag can be set to True to add a boolean column per
# target column, indicating for each row whether that value was interpolated.
interpolated_tsdf = input_tsdf.interpolate(
    partition_cols=["partition_c"],
    ts_col="other_event_ts",
    freq="30 seconds",
    func="mean",
    method="linear",
    target_cols=["columnA", "columnB"],
    show_interpolated=True,
)

```
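
Any of the listed fill methods can be substituted for `linear`; here is a brief sketch of two variants on the same assumed `input_tsdf`:

```python
# Forward-fill gaps after down-sampling to 30-second intervals
ffill_tsdf = input_tsdf.resample(freq="30 seconds", func="mean").interpolate(
    method="ffill"
)

# Zero-fill instead: missing intervals and null values become 0
zero_fill_tsdf = input_tsdf.resample(freq="30 seconds", func="mean").interpolate(
    method="zero"
)
```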

## Project Support
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
5 changes: 4 additions & 1 deletion python/requirements.txt
@@ -1,10 +1,13 @@
ipython==7.28.0
numpy==1.19.1
chispa==0.8.2
pandas==1.1.0
py4j==0.10.9
pyarrow==6.0.1
pyspark==3.0.0
pyspark-stubs==3.0.0
python-dateutil==2.8.1
pytz==2020.1
scipy==1.7.2
six==1.15.0
wheel==0.34.2
ipython==7.28.0
5 changes: 3 additions & 2 deletions python/setup.py
@@ -6,7 +6,7 @@

setuptools.setup(
name='dbl-tempo',
version='0.1.2',
version='0.1.3',
author='Ricardo Portilla, Tristan Nixon, Max Thone, Sonali Guleria',
author_email='labs@databricks.com',
description='Spark Time Series Utility Package',
@@ -16,7 +16,8 @@
packages=find_packages(where=".", include=["tempo"]),
install_requires=[
'ipython',
'pandas'
'pandas',
'scipy'
],
extras_require=dict(tests=["pytest"]),
classifiers=[
2 changes: 1 addition & 1 deletion python/tempo/__init__.py
@@ -1,2 +1,2 @@
from tempo.tsdf import TSDF
from tempo.utils import display
from tempo.utils import display