
[FEAT] Efficient Schafer-Strimmer for MinT #280

Merged · 25 commits · Sep 3, 2024

Conversation

@elephaint (Contributor) commented Aug 16, 2024

This PR implements an optimized version of the Schafer-Strimmer shrunk empirical covariance algorithm for MinT.
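For context on what is being optimized: the Schafer-Strimmer estimator shrinks the empirical correlation matrix toward the identity, with an analytically derived shrinkage intensity. The following is a minimal NumPy sketch of the idea, not the PR's implementation; the function name is illustrative and the variance formula follows Schäfer and Strimmer (2005) for the diagonal shrinkage target:

```python
import numpy as np

def shrunk_covariance(residuals):
    """Minimal sketch of a Schafer-Strimmer shrunk covariance.

    residuals: (n_obs, n_series) array without NaNs.
    Shrinks off-diagonal correlations toward zero (diagonal target).
    """
    n, p = residuals.shape
    X = residuals - residuals.mean(axis=0)
    emp_cov = X.T @ X / n                        # biased empirical covariance
    std = np.sqrt(np.diag(emp_cov))
    Xs = X / std                                 # standardised residuals
    corr = Xs.T @ Xs / n                         # empirical correlations
    # Per-observation products; their mean over t equals corr.
    w = np.einsum("ti,tj->tij", Xs, Xs)
    # Estimated variance of each correlation coefficient.
    var_corr = n / (n - 1) ** 3 * ((w - corr) ** 2).sum(axis=0)
    # Shrinkage intensity: ratio of estimation noise to signal, clipped to [0, 1].
    off = ~np.eye(p, dtype=bool)
    lam = np.clip(var_corr[off].sum() / (corr[off] ** 2).sum(), 0.0, 1.0)
    shrunk_corr = (1.0 - lam) * corr
    np.fill_diagonal(shrunk_corr, 1.0)
    return shrunk_corr * np.outer(std, std), lam
```

The closed-form intensity is what makes the estimator cheap: no cross-validation is needed, and the whole computation is a handful of matrix products, which is why vectorizing it (as this PR does) pays off so much.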

1. Up to 60x faster

First, the result is that we can perform MinT reconciliation much faster, as demonstrated in the table below, where mint_legacy is the old version and mint is the new version. The value in brackets denotes whether the forecasts contain NaNs (True or False), and the trailing number is the count of bottom-level time series reconciled (20, 200, or 2000). An improvement of up to 60x (on my machine, an i7-12700K) is seen for the no-NaN, 2000-series case.

Note that the NaN version is about 3x slower than the version that doesn't have to deal with NaNs, due to all the masking involved. However, it is still significantly faster than the legacy version (which doesn't handle NaNs properly).

[Screenshot 2024-09-02 113916: timing comparison of mint vs. mint_legacy]

2. Forecasting performance

Forecasting performance is similar:

[Screenshot 2024-09-02 113002: forecasting performance comparison]

3. Improved NaN handling

This PR improves NaN handling (thanks to @christophertitchen for the suggestions!), enlarging the set of cases that MinT can handle. For example, the old version could not handle 20% NaN values introduced into the Australian Tourism forecasts, whereas the new version gives results nearly identical to the no-NaN case:

[Screenshot 2024-09-02 113218: reconciliation results with 20% NaNs vs. no NaNs]

Code for introducing NaNs in the Australian Tourism example:

import numpy as np

# Randomly set NaNs in a fraction of the rows
nan_fraction = 0.2
Y_fitted_df = Y_fitted_df.reset_index()
# Sample rows and blank out both the actuals and the ETS fitted values
Y_fitted_df_sample_idx = Y_fitted_df.sample(frac=nan_fraction).index
Y_fitted_df.loc[Y_fitted_df_sample_idx, ["y", "ETS"]] = np.nan
Y_fitted_df = Y_fitted_df.set_index("unique_id")


@elephaint elephaint marked this pull request as ready for review August 20, 2024 08:32
@elephaint (Contributor, Author)

@christophertitchen Again, thanks for the remarks. I made a few more micro-optimizations, in total further reducing computation time by ~30%.

@christophertitchen (Contributor) left a comment

I am happy the comments were helpful!

The only further change I thought of would be to potentially modify the naïve approach when computing covariance and correlation to a pairwise approach for each time series, like what is done in stats::cov for R. This could better describe the relationship between pairs of time series, but honestly, it would slow down the generation of the shrunk $W$ and go against the spirit of what you were trying to achieve here by implementing a fast method, so I think it is perfectly fine as you have done it.

As a quick example to demonstrate, imagine monthly time series representing the units sold of products over the past six months. We take two of those products, $X$ and $Y$, with $Y$ being an NPI (new product introduction) which was implemented only four months prior. The residuals of a naïve forecast would be:

$X = [(x_{t-5} - x_{t-6}), (x_{t-4} - x_{t-5}), (x_{t-3} - x_{t-4}), (x_{t-2} - x_{t-3}), (x_{t-1} - x_{t-2}), (x_{t} - x_{t-1})]$

$Y = [-, -, -, (y_{t-2} - y_{t-3}), (y_{t-1} - y_{t-2}), (y_{t} - y_{t-1})]$

This approach would use an n_samples of $6$ and the mean of $X$ would include all six observations. However, a pairwise approach would only use pairs of temporally aligned observations, so an n_samples of $3$, and the mean of $X$ would be $\overline{X} = ((x_{t-2} - x_{t-3}) + (x_{t-1} - x_{t-2}) + (x_{t} - x_{t-1})) / 3$, which would obviously lead to a different $\text{cov}(X, Y)$, $\text{corr}(X, Y)$, $\sigma_{x}$ for standardisation and so on.

not_mask = ~np.isnan(residuals)
...
for i in prange(n_timeseries):
    x = residuals[i]
    for j in range(i + 1):
        y = residuals[j]
        # Get the temporally aligned pairs of observations.
        mask = not_mask[i] & not_mask[j]
        n_samples = np.sum(mask)  # count the aligned pairs, not the full length
        # Check for insufficient observations for the unbiased estimator.
        if n_samples < 2:
            ...
        # Use fresh names so the full series are kept for the next j.
        x_ij = x[mask]
        y_ij = y[mask]
        # Calculate the pairwise sample means.
        mean_x = np.mean(x_ij)
        mean_y = np.mean(y_ij)
        ...

Anyway, I am not familiar with any literature on this topic or if there is any evidence to suggest it is significantly worse, and the legacy version also takes the same approach, so as I said, I favour the fast approach you have implemented here. Good job!
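Filling in the elided parts of the sketch above, a complete runnable version of the pairwise idea could look like this. The function name and exact layout are illustrative (this is not the PR's code); it mirrors R's stats::cov with use="pairwise.complete.obs":

```python
import numpy as np

def pairwise_cov(residuals):
    """Pairwise-complete covariance over possibly-NaN residuals.

    residuals: (n_series, n_obs) array; NaNs mark missing observations.
    Each entry uses only the temporally aligned pairs of observations,
    including their means, like stats::cov(..., use="pairwise.complete.obs").
    """
    n_series = residuals.shape[0]
    not_mask = ~np.isnan(residuals)
    cov = np.full((n_series, n_series), np.nan)
    for i in range(n_series):
        for j in range(i + 1):
            # Temporally aligned pairs of observations.
            mask = not_mask[i] & not_mask[j]
            n_samples = int(np.sum(mask))
            if n_samples < 2:
                continue  # too few pairs for the unbiased estimator
            x = residuals[i, mask]
            y = residuals[j, mask]
            # Pairwise sample means, computed over the aligned pairs only.
            c = np.sum((x - x.mean()) * (y - y.mean())) / (n_samples - 1)
            cov[i, j] = cov[j, i] = c
    return cov
```

On the NPI example above, the (X, Y) entry is computed from only the last three aligned residuals, with means over those three, while the (X, X) entry still uses all six, which is exactly the discrepancy being discussed.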

@elephaint (Contributor, Author)

> The only further change I thought of would be to potentially modify the naïve approach when computing covariance and correlation to a pairwise approach for each time series, like what is done in stats::cov for R. […]

Didn't think about this, but it's a good point. I'll have a try and play around a bit with it :)

@elephaint (Contributor, Author)

elephaint commented Aug 27, 2024

@christophertitchen Thanks for the suggestion on temporal alignment. I included two versions: one that can handle NaNs in the way you describe, and a faster one that doesn't have to deal with NaNs. The NaN version is a bit slower (among other things due to the masking involved), but it significantly enlarges the scope of MinT, as it can handle many more cases without producing ill-conditioned W matrices (see the example in the top post). Thanks!

@christophertitchen (Contributor) left a comment

I added a few more thoughts, but I am honestly not too familiar with the Schäfer and Strimmer paper, so I will leave the rest in your more capable hands! 😅

@elephaint (Contributor, Author)

@christophertitchen Again thanks for the thoughtful comments.

I think I'll be deprecating the legacy version now that I get equivalent results across datasets.

@christophertitchen (Contributor)

> @christophertitchen Again thanks for the thoughtful comments.
>
> I think I'll be deprecating the legacy version now that I get equivalent results across datasets.

@elephaint great job!

Also, regarding Hyndman's implementation, you were right.

https://github.com/earowang/hts/blob/1408ab2c5c40f1022e6957bcf8438aaefc8464bf/R/MinT.R#L24

  covm <- crossprod(x) / n
  v <- (1/(n * (n - 1))) * (crossprod(xs^2) - 1/n * (crossprod(xs))^2)

It uses the biased estimator, i.e. normalisation by $N$. We could have changed ddof in np.ma.cov to $0$ to copy this approach, but deprecating the legacy approach in favour of your superior one makes sense. 😉
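To make the normalisation difference concrete, here is a small illustrative NumPy check (not project code) showing that dividing the centred cross-product by $N$, as hts does, matches `np.cov` with `ddof=0`, while NumPy's default divides by $N - 1$:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(100, 3))  # 100 observations, 3 series

# hts-style biased estimator: covm <- crossprod(x) / n, on centred data
xc = x - x.mean(axis=0)
covm = xc.T @ xc / xc.shape[0]

# NumPy equivalents: ddof=0 normalises by N, the default ddof=1 by N - 1
cov_biased = np.cov(x, rowvar=False, ddof=0)
cov_unbiased = np.cov(x, rowvar=False)

assert np.allclose(covm, cov_biased)
assert np.allclose(cov_biased * 100 / 99, cov_unbiased)
```

The two estimators differ only by the constant factor $N / (N - 1)$, which is why the choice rarely changes reconciliation results materially, but matching conventions matters when comparing against the R implementation.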

jmoralez previously approved these changes Sep 3, 2024
@elephaint elephaint merged commit 2e788e4 into main Sep 3, 2024
19 checks passed
@elephaint elephaint deleted the feature/efficient_mint branch September 3, 2024 15:34
3 participants