backtest() works, but backtest_many() gives MemoryError #177

Closed · quant5 opened this issue Sep 3, 2024 · 11 comments
quant5 commented Sep 3, 2024

Specifications

Windows machine (x86). Python 3.11.5.
Fresh install of cvxportfolio at the latest version (1.3.2) in a fresh virtual environment.

Description

I am trying to run the hello_world.py example here: https://github.com/cvxgrp/cvxportfolio/blob/master/examples/hello_world.py

  • The script gets data correctly and the data files are fine.
  • The script fails on simulator.backtest_many([policy, cvx.Uniform()], start_time="2020-01-01"):
    • This is very strange, since my machine is quite large (64 GB RAM) and the memory allocation in the error is only about 1 MB.

[screenshot: MemoryError traceback]

  • If I replace backtest_many() with a simple backtest(), the script succeeds. Example: simulator.backtest(policy, start_time="2020-01-01") or simulator.backtest(cvx.Uniform(), start_time="2020-01-01")

Other remarks

  • Switching the solver to something else (e.g., "ECOS" or "CLARABEL") did not fix the issue.
  • Downgrading numpy to 1.5.x did not fix the issue.
  • Task manager does not indicate any huge spike in memory use.
  • The test suite (e.g., python -m cvxportfolio.tests) worked fine (see below).
    [screenshot: test suite output]

My thought is that a compatibility problem between multiprocess and NumPy could be the issue, but I'm not seeing anything from a quick search. Any help would be appreciated.
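For reference, here is a condensed sketch of what works vs. what fails, using the same objects as hello_world.py (the policy construction below is paraphrased from that example rather than copied verbatim, and I shortened the universe):

import cvxportfolio as cvx

if __name__ == "__main__":  # guard needed on Windows, since backtest_many spawns worker processes
    # Same objective and constraints as hello_world.py (GAMMA=2.5, KAPPA=0.05).
    objective = (
        cvx.ReturnsForecast()
        - 2.5 * (cvx.FullCovariance() + 0.05 * cvx.RiskForecastError())
        - cvx.StocksTransactionCost()
    )
    policy = cvx.MultiPeriodOptimization(
        objective, [cvx.LeverageLimit(3)], planning_horizon=2)
    simulator = cvx.StockMarketSimulator(
        universe=["AAPL", "AMZN", "GOOG", "TSLA"])

    # This works:
    simulator.backtest(policy, start_time="2020-01-01")

    # This raises MemoryError on my machine:
    simulator.backtest_many([policy, cvx.Uniform()], start_time="2020-01-01")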


enzbus commented Sep 3, 2024

Thanks for reporting this. Yes, it looks like an incompatibility between multiprocess and something on your system. Since you're on Windows, in the worst case you can try switching to a Linux virtual machine. I'll work on adding an option to revert to standard Python multiprocessing: as of a few releases ago multiprocess is no longer necessary, but I left it in because it looked more robust. Could you check, in the next ~24 hrs, whether a patch that drops multiprocess from the required dependencies and makes it optional (use it if installed, otherwise don't) fixes this issue for you? PS: I assume everything works fine if you set parallel=False in the options to backtest_many?
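For reference, the serial fallback I mean looks like this (a minimal sketch, reusing the simulator and policy objects from the snippet in the issue description above):

# parallel=False runs the back-tests sequentially in the main process,
# bypassing the multiprocessing pool entirely.
results = simulator.backtest_many(
    [policy, cvx.Uniform()],
    start_time="2020-01-01",
    parallel=False,
)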

enzbus added the bug label on Sep 3, 2024

quant5 commented Sep 3, 2024

Of course. I'm experimenting with this on my end too, but if you push out a patch I'll happily try it.


quant5 commented Sep 3, 2024

And yes, no problem with parallel=False, as expected. I should have put that in the original issue.


enzbus commented Sep 3, 2024

Ok, working on it in PR #178


quant5 commented Sep 3, 2024

Simply replacing multiprocess with multiprocessing here
https://github.com/cvxgrp/cvxportfolio/blob/master/cvxportfolio/simulator.py#L53

did not work on my end:
[screenshot: error traceback]
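Concretely, the swap I tried at the linked line was just the import (rough sketch; the exact form of the import in simulator.py may differ):

# cvxportfolio/simulator.py, around the linked line:
# from multiprocess import Lock, Pool      # current
from multiprocessing import Lock, Pool     # attempted drop-in replacement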

I am digging into the code further.


enzbus commented Sep 3, 2024

Looks like the fix passes tests; I'm going to merge, and then you can try it by installing the development version (https://www.cvxportfolio.com/en/master/#advanced-install-development-version). I'm not sure which environment manager you're using, but the syntax should be similar to pip's. It's rather strange: I had made a test case specifically for multiprocessing; maybe it passes because GitHub only provides single-processor test runners? Let me merge and we'll see.


enzbus commented Sep 3, 2024

Ok, in any case I've merged (it's always good to minimize dependencies; multiprocess was there for advanced use cases, like custom forecasters using difficult third-party libraries).


enzbus commented Sep 3, 2024

Just a thought. Perhaps @quant5 you have some Windows-specific limit on processes spawned by a master process? It could be that GitHub test runners are sanitized against this sort of thing. I recently found out that new versions of macOS have all sorts of limits on "unsigned" code, and you need to manually unset a bunch of code attributes to get a lot of open-source code running. In any case I'm happy with having dropped a dependency :)


quant5 commented Sep 3, 2024

I reverse engineered the call stack and got something to work. It turns out to be simple: setting processes=n within multiprocessing.Pool.

import pandas as pd

import cvxportfolio as cvx
from multiprocessing import Lock, Pool
from cvxportfolio.cache import _mp_init  # only the pool initializer is used below


def _worker(policy, simulator, start_time, end_time, h):
    # Module-level wrapper so Pool.starmap can dispatch the simulator's
    # internal _backtest to the worker processes.
    return simulator._backtest(policy, start_time, end_time, h)


if __name__ == "__main__":
    # risk aversion parameter (Chapter 4.2)
    # chosen to match resulting volatility with the
    # uniform portfolio (for illustrative purpose)
    GAMMA = 2.5

    # covariance forecast error risk parameter (Chapter 4.3)
    # this can help regularize a noisy covariance estimate
    KAPPA = 0.05

    objective = (
        cvx.ReturnsForecast()
        - GAMMA * (cvx.FullCovariance() + KAPPA * cvx.RiskForecastError())
        - cvx.StocksTransactionCost()
    )

    constraints = [cvx.LeverageLimit(3)]
    universe = ["AAPL", "AMZN", "UBER", "ZM", "CVX", "TSLA", "GM", "ABNB", "CTAS", "GOOG"]

    policy = cvx.MultiPeriodOptimization(objective, constraints, planning_horizon=2)
    simulator = cvx.StockMarketSimulator(universe=universe)

    policies = [policy, cvx.Uniform()]
    initial_value = 1e6
    market_data = simulator.market_data
    tz = market_data.trading_calendar().tz

    start_time = pd.Timestamp("2020-01-01").tz_localize(tz)
    end_time = pd.Timestamp("2021-01-01").tz_localize(tz)

    trading_calendar_inclusive = market_data.trading_calendar(
        start_time, end_time, include_end=True
    )
    if len(trading_calendar_inclusive) < 1:
        raise ValueError("There are no trading days between the provided times.")
    start_time_t = trading_calendar_inclusive[0]
    end_time_t = trading_calendar_inclusive[-1]

    initial_universe = market_data.universe_at_time(start_time_t)
    h = [None] * len(policies)
    for i in range(len(policies)):
        if h[i] is None:
            h[i] = pd.Series(0.0, initial_universe)
            h[i].iloc[-1] = initial_value

    n = len(policies)
    zip_args = zip(policies, [simulator] * n, [start_time] * n, [end_time] * n, h)

    # Setting processes=n explicitly (instead of the default, cpu_count())
    # is what avoids the MemoryError on this machine; _mp_init presumably
    # installs the shared lock used by cvxportfolio's cache helpers.
    with Pool(processes=n, initializer=_mp_init, initargs=(Lock(),)) as p:
        result = p.starmap(_worker, zip_args)

    print(list(result))


quant5 commented Sep 3, 2024

Obviously I'm not sure whether this is specific to my machine, Windows, etc., but if it makes it into any fix (e.g., a num_processes kwarg that can be set, defaulting to cpu_count()), let me know; otherwise I will work with setting up my own pool like this.

As a side note, if I remove the initializer=_mp_init, initargs=(Lock(),) part at the Pool call site, the example still works. I don't know enough about multiprocessing, so I'm just curious why that is necessary.
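My rough understanding, and this is an assumption about cvxportfolio internals that I have not verified, is that the lock passed through the initializer is there to serialize access to the on-disk cache across worker processes. The generic pattern looks something like this:

from multiprocessing import Lock, Pool

_LOCK = None


def _init(lock):
    # Runs once per worker process; stashes the shared lock in a module-level
    # global so that worker functions can use it.
    global _LOCK
    _LOCK = lock


def _task(i):
    # Hold the lock around the critical section (e.g., reading or updating a
    # shared cache file).
    with _LOCK:
        return i * i


if __name__ == "__main__":
    with Pool(processes=2, initializer=_init, initargs=(Lock(),)) as pool:
        print(pool.map(_task, range(4)))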


enzbus commented Sep 3, 2024

Sounds reasonable. I'll delve into the multiprocessing docs just to be sure, but it seems an innocuous addition. Feel free to open a PR yourself if you have it worked out. (Looks like you do.) Thanks!

enzbus closed this as completed in 31663bd on Sep 4, 2024
enzbus added a commit that referenced this issue Oct 24, 2024
This minor release contains various new features and fixes.

Features:
- new constraints Min/MaxHoldings, Min/MaxTradeWeights,
  Min/MaxTrades, FixedImbalance and NoCash (GH issue #180);
- improved ParticipationRateLimit constraint;
- improved exception reporting, now giving full path in
  evaluation tree where exception was raised (GH PR #176);
- redesigned forecast.py, no API changes, still work in progress
  for full support for regularized regression;
- added market_data = None option in Policy.execute; now Cvxportfolio
  policies can be executed (e.g., for online usage) without
  a MarketData server; all data needs to be provided separately
  to each individual object;
- AnnualizedVolatility utility object, for usage in risk constraints;
- added reject_trades_below and max_fraction_liquidity options
  to MarketSimulator, allowing filtering of both too-small and too-large
  trades (in simulation, using realized daily volumes);
- minor updates in examples;
- a few new sections in documentation manual;
- changed license from APACHE2 to GPLv3, see explanation in GH
  issue #166;
- moved documentation website to pydata-sphinx theme, and redesigned
  it a little;

Fixes:
- GH issue #146, cache file invalidation on user interrupt;
- GH issue #177, now using default Python multiprocessing; moved
  multiprocess to optional dependency;
- various smaller ones.

This release took a bit longer than the usual 2-3 months. We hope to
release 1.5.0 in about 2-3 months from now.