
sep-(a)CMA-ES does escape the solution basin on fsphere in high dimension with gradient injection #97

Closed
beniz opened this issue Jan 8, 2015 · 50 comments

Comments

@beniz commented Jan 8, 2015

First, comparing sep-aCMA-ES with and without gradient injection (GI), GI allows fast convergence:
sepacma_fsphere_1m_comp
The difference in initial value is due to not showing iteration 0 (will be ticketed and fixed).

However, a closer look shows that while nearing the optimum, sep-aCMA-ES with GI escapes the solution basin and stagnates around 5000 or so in 1M-D. Below is a log-scale plot of the objective function value from the same run as above:
sepacma_fsphere_1m_gradient_pb
The same behavior is observed with sep-CMA-ES.

To reproduce:

./test_functions -fname fsphere -alg sepacmaes -ftarget 1e-8 -dim 1000000 -with_gradient -no_stagnation -max_hist 1000 -x0 1 -sigma0 0.1

This might be a bug; in any case, this is odd behavior.

@nikohansen

It must already be a bug in "pure" separable CMA-ES, as it is supposed to converge. I also suggest always plotting the y-axis in log scale. For diagnosis, the standard output with step-size(s), eigenvalues, etc. is advisable.

EDIT: I didn't consider that it was 1M-D, in which case this should be considered expected behavior, if the initial step-size is large.

@beniz commented Jan 8, 2015

Yes, sigma moves down by ~1e-5, and the condition number up by ~1e-7 in both cases. I can't see any difference before and after step 60 when GI is activated.

Since this is in 1M-D, I will try to reproduce in lower dimension, which would also simplify plotting the eigenvalues and parameter values.

Sep was tested on BBOB and on neural nets of up to ~400-D without experiencing this behavior.

I will report here as it goes...

@loshchil commented Jan 8, 2015

Modification of the initial step-size and a longer run should help.


@beniz commented Jan 8, 2015

Modification of the initial step-size and a longer run should help

Playing with the initial step-size certainly triggers the problem:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 1 -no_stagnation

immediately diverges, while

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation

converges.

I'm investigating whether this could be due to a bug. I am worried that users might try high dimensions and reject the whole thing when a change in sigma would do.

@beniz commented Jan 8, 2015

So, a few more experiments in lower dimensions show that on sufficiently long runs, sigma decreases and the search takes the right direction again.

So I am thinking it could make sense to have a trick that applies a larger-than-usual change to sigma when the search goes in the wrong direction for a certain number of steps?

@nikohansen

I see, then "initial elitism" might be a good try, and you are right, additionally decreasing sigma by a factor of 2 until the first success is observed is probably quite effective.

One general problem is that cumulative step-size adaptation has a horizon that depends on the dimension (though it is possible to modify this dependency down to sqrt dimension by setting cs accordingly). The parameters or even the method might not be perfectly suited for 1M-D problems. Two-point adaptation is a promising alternative candidate, where it should also be simpler to adjust the parameters.

To get the full picture: what is the population size?

@beniz commented Jan 9, 2015

To get the full picture: what is the population size?

lambda = 45

then "initial elitism" might be a good try

The current implementation of initial elitism only kicks in after a run ends. In high dimension, an earlier trigger could be useful.

@nikohansen

To me it seems generally useful even without restarts. Actually, it seems particularly useful for a user-supplied initial point to be "elitist" for a single run.

@nikohansen

It remains to be understood why the gradient approach stagnates in the end.

@beniz commented Jan 9, 2015

Yes, I may need to run it longer. I will post here results from a set of experiments.

@beniz commented Jan 10, 2015

So, I am able to report on three small fronts:
1- the stagnation of the gradient approach after a good dip toward the optimum (i.e. the original problem this ticket started with);
2- an elitist x0 scheme where x0 is reinjected as long as half of the candidates are above f(x0);
3- a sigma step-control scheme (halving sigma) whenever the search appears to be going in the wrong direction for a number of steps.

First, elitist x0 (scheme 2) does not always work well: in some reproducible cases, the search immediately stalls at the f(x0) value unless lambda is increased. It may be that the reinjection somehow prevents the covariance matrix from taking a shape in which a search direction emerges?

Second, I have tried scheme 3 on problem 1, with mixed results, which may indicate either a bug in gradient injection and/or that scheme 3 needs more careful control.

Due to the linearity of GitHub comments, I will split the results into separate comments.

@beniz commented Jan 10, 2015

1- gradient injection scheme stagnation

  • with sufficient steps, on a 5000-D problem (instead of the initial 1M-D problem), the original issue can be observed until further into the run:
    sepcma_fsphere_5kd_gradient
  • on the 1M-D initial problem, a longer run shows that after the bump the search starts converging again, though extremely slowly.

beniz pushed a commit that referenced this issue Jan 10, 2015
@beniz commented Jan 10, 2015

2-elitist x0
This is a simple modification of 'initial elitism', available for testing on branch 'sep_97', that applies elitism on the first run to x0, i.e. it reinjects x0 until half of the candidate f-values are < f(x0). It is activated with -elitist 2, while -elitist 1 applies the 'original' behavior, i.e. re-injection of the previous run's best candidate.
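
A minimal sketch of the reinjection rule described above (Python, with a hypothetical helper name; not the library's actual API):

```python
def keep_reinjecting_x0(f_candidates, f_x0):
    # Hypothetical sketch of the '-elitist 2' rule described above: keep
    # reinjecting x0 as long as at least half of the candidate f-values
    # are still above (worse than) f(x0).
    worse = sum(1 for f in f_candidates if f > f_x0)
    return worse >= len(f_candidates) / 2.0
```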

While the scheme sometimes works, it appears to stagnate indefinitely in certain cases. Typically, taking the diverging case from a few comments above and applying the new scheme:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 1 -no_stagnation -elitist 2

which yields:

INFO - iter=245 / evals=7105 / f-value=5000 / sigma=0.628006445685068 / last_iter=8
INFO - iter=246 / evals=7134 / f-value=5000 / sigma=0.62681772456298 / last_iter=8
INFO - iter=247 / evals=7163 / f-value=5000 / sigma=0.62563139059145 / last_iter=8
INFO - iter=248 / evals=7192 / f-value=5000 / sigma=0.62444528509015 / last_iter=8
...

The scheme even appears to block convergence on an otherwise converging example:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation -elitist 2

@beniz commented Jan 10, 2015

3- sigma (crude) step-control
This scheme halves sigma if the search goes in the wrong direction for a number of steps. In the following, the number of steps has been arbitrarily set to 10. It is activated by using the -sigma_stepc option in 'test_functions'.

I tested two triggers for the scheme:
1- a test of whether f(x_cand) > f(x0): this works but can fail when the test triggers repeatedly and lowers sigma too much, to the point where f(x_cand) stagnates while still above f(x0).
2- a test of whether f(x_{t-1}) < f(x_t) for a number of steps: this appears to work better on my small set of experiments, though there is always the risk of lowering sigma to the point of stagnation (a rough sketch is given below).
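
A rough sketch of trigger 2 (names and data layout are illustrative only, not the actual implementation):

```python
def sigma_step_control(sigma, f_history, patience=10):
    # Halve sigma when the recorded f-value has worsened for `patience`
    # consecutive iterations (trigger 2 above); f_history holds one
    # f-value per iteration.
    recent = f_history[-(patience + 1):]
    if len(recent) == patience + 1 and all(a < b for a, b in zip(recent, recent[1:])):
        return sigma / 2.0
    return sigma
```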

Using test 2 above:

  • Results on the non-converging example of a few comments above:
    sepacmaes_fsphere_50k_sigmastepc

to reproduce:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 1 -no_stagnation -sigma_stepc
  • Results on the initial problem of gradient injection dip and stagnation:
    sepacma_fsphere_1md_gradient_sigmaslash
    Here there is obviously the problem that sigma becomes too low and stagnation occurs. The trigger for this behavior seems to be the odd behavior of the search, which goes back up after moving toward the optimum. This behavior very clearly repeats every time sigma is halved, with a dip and an upward shot. This might indicate a bug in gradient injection.
    To reproduce:
./test_functions -fname fsphere -alg sepacmaes -ftarget 1e-8 -dim 1000000 -with_gradient -no_stagnation -max_hist 1000 -x0 1 -sigma0 0.1

In all cases, the latter behavior is my next point of investigation. Any tip on how to understand this better is welcome.

@beniz commented Jan 10, 2015

One general problem is that cumulative step-size adaptation has a horizon that depends on the dimension (though it is possible to modify this dependency down to sqrt dimension by setting cs accordingly). The parameters or even the method might not be perfectly suited for 1M-D problems. Two-point adaptation is a promising alternative candidate, where it should also be simpler to adjust the parameters.

Shall we think of a 'set_high' function (much like 'set_noisy') that would attempt to set a slightly improved set of parameters for high dimensions? If this appears feasible, I could open a dedicated ticket and prepare the required range of simulations for selecting the parameter values...

@nikohansen

1- gradient injection scheme stagnation: Try different initial step-sizes. My prediction is that the y-location of the spike corresponds to the step-size. This would mean that the reason is that the step-size cannot converge to zero as fast as the gradient can point the algorithm to the optimum. Not a surprise, because the gradient effectively makes 5000 evaluations in one iteration.

The best resolution is probably to apply TPA instead of CSA and use a constant damping parameter instead of a dimension-dependent setting.

@beniz commented Jan 10, 2015

Not a surprise, because the gradient effectively makes 5000 evaluations in one iteration.

Understood. Just FTR, I am using the function derivative to avoid the cost per iteration.

The best resolution is probably to apply TPA instead of CSA and use a constant damping parameter instead of a dimension-dependent setting.

OK, I'll make that ticket higher on my list.

@nikohansen

3- sigma (crude) step-control: I don't see anything wrong with the test cases. It is just the same "problem" as before: the expected convergence rate (steepness of the f-graph) of the default strategy is roughly one order of magnitude per 10*dimension evaluations (i.e. about 5e5 evaluations per order of magnitude in 50000-D).

In the second example, the step-size is in the end IMHO far too large, because the gradient quickly shifted the x-value much closer to the optimum, which renders the previously suitable step-size too large. Same as the spike effect seen in the other experiments.

@nikohansen

instead of

cs = (mueff + 2) / (N + mueff + 3)

we could use

cs = (sqrt(mueff) + 2) / (sqrt(N) + sqrt(mueff) + 3)

as an alternative setting. It reduces the time horizon to sqrt(dimension), but I don't think it solves the problem of facilitating quicker step-size decrements.
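
To illustrate what the change does to the cumulation time horizon (roughly 1/cs), here is a small back-of-the-envelope computation; the mueff value is only an illustrative guess for lambda=45:

```python
import math

def cs_default(N, mueff):
    # Default setting quoted above: time horizon grows roughly linearly with N.
    return (mueff + 2.0) / (N + mueff + 3.0)

def cs_sqrt(N, mueff):
    # Alternative setting quoted above: time horizon grows roughly like sqrt(N).
    return (math.sqrt(mueff) + 2.0) / (math.sqrt(N) + math.sqrt(mueff) + 3.0)

N, mueff = 1e6, 12.0                  # illustrative values (1M-D, lambda=45)
print(1.0 / cs_default(N, mueff))     # time horizon ~7e4 iterations
print(1.0 / cs_sqrt(N, mueff))        # time horizon ~1.8e2 iterations
```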

@nikohansen

2-elitist x0: I can't reproduce the "blocking" behavior.

INFO - iter=100 / evals=3600 / f-value=50000 / sigma=0.0976089129404275 / last_iter=92
.
.
.
INFO - iter=8100 / evals=291600 / f-value=9053.62436620463 / sigma=0.0435240965572909 / last_iter=106

It seems just to be about as slow as I expect (I didn't check the rate precisely).

@beniz commented Jan 12, 2015

2-elitist x0: I can't reproduce the "blocking" behavior.

OK, thanks for pointing this out; my sense of the expected convergence rate is still rather blurry.

At first I thought that elitist-x0 could be a default option, but then I observed that it slows (or hampers) convergence on the same run with sigma=0.1:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation -elitist 2

compared to:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation

Initially I thought that increasing lambda in this particular case would limit the impact of the reinjected x0, but it doesn't appear to do so. I still have difficulties grasping this one.

Note that elitist-x0 does re-inject a single instance of the initial candidate.

@beniz commented Jan 12, 2015

1- gradient injection scheme stagnation: Try different initial step-sizes. My prediction is that the y-location of the spike corresponds to the step-size. This would mean that the reason is that the step-size cannot converge to zero as fast as the gradient can point the algorithm to the optimum. Not a surprise, because the gradient effectively makes 5000 evaluations in one iteration.

I believe you are correct, see below:
sep_97_gi_comp

@beniz commented Jan 12, 2015

FYI, this whole thing started because I had noted from our chat to double-check whether there was a bug in GI in high dimension, as I could not see any difference with and without GI when solving neural nets, for instance.

In practice, my opinion is that if the gradient is available for a high-dimensional problem and can be injected even every x steps, it would make sense to mitigate the problem above.

I can try to see what comes out of injecting the gradient less often, which would also be a computational gain in practice. There are other simple things I am willing to try out, such as reinjecting the best solution so far (instead of waiting until the end of the run).

@nikohansen

Right, exactly as predicted (the kink at the end of the right flank is supposed to correspond to step-size).

The most useful mitigation is to enable a quick decrease of the step-size. I think reinjection of the best will only help if mu=1, and probably not even then so much. Of course, in any case the graphs for the current best might look better (depending on what exactly is plotted), but this is rather of a cosmetic nature.

@beniz commented Jan 12, 2015

we could use

cs = (sqrt(mueff) + 2) / (sqrt(N) + sqrt(mueff) + 3)

Interesting. It does appear to work better in general on the initial problem. In practice it seems to shift the low y-spike to the first couple of steps of the algorithm. With initial sigma0=1 this early spike is so acute that it triggers the stopping criterion (e.g. going < 1e-8).

Same experiments as before:

sep_97_gi_comp_ncs

EDIT: fixed the log-scale

@beniz commented Jan 12, 2015

Of course in any case the graphs for the current best might look better (depending on what exactly is plotted), but this is rather of cosmetic nature.

Yes, I expected that as well. The bumps in the curves might simply be replaced by flat (reinjected) values.

@nikohansen

Very interesting and somewhat surprising; it looks almost as if the problem is virtually solved, given we see about 1e4 evaluations in 1M-D. We could probably still improve quite a bit with TPA (I will try to find some time to cross-check this with my implementation, though in smaller dimension).

EDIT: it was 5K-D, not 1M-D, so no conclusion can be drawn for the latter case.

@beniz commented Jan 12, 2015

very interesting and somewhat surprising, looks almost like problem virtually solved, given we see about 1e4 evaluations in 1M-D.

Hmm, if you are referring to the change of the csigma 'horizon', the last graph reports on 5K-D, unfortunately. I've tried 1M-D with the updated csigma, and while the spike is closer to the optimum, the rebound appears to jump much higher. I will post a graph of a re-run. Here:
sepacma_1m_gi_ncs

with sigma (green) and condition number (red):
sepacma_fsphere_1md_gradient_ncs

I will focus on the TPA scheme as a next step.

Also, I ran into this yesterday; I don't believe it is of direct help here, but FTR:
http://www.reddit.com/r/MachineLearning/comments/2rzmj4/3_new_adaptive_sgd_papers_in_iclr_2015/

EDIT: added graphs; sigma seems to move faster now.

@beniz commented Jan 14, 2015

Below is a run with TPA (see #88 for updates); it is to be compared with the set of runs on fsphere in 5K-D from two days ago.

sepcma_fsphere_5k_tpa

It is reproducible from branch 'tpa_88' with:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation -with_gradient -tpa

@beniz commented Jan 14, 2015

And this is the 1M-D run with TPA, to be compared with the last 1M-D run from two days ago.
sepacma_fsphere_1md_gradient_tpa

@nikohansen

The convergence speed seems rather independent of dimension. Do I interpret this correctly? Generally: could you add an x-annotation, such that it is clear whether it is evaluations or iterations?

@nikohansen

I suggest two experiments: (1) setting c_sigma=1 (instead of 0.3), to check whether this is the source of the oscillations; (2) setting d_sigma=10 (instead of sqrt(dimension)); and (2a) maybe both at the same time.

@beniz commented Jan 14, 2015

Generally: could you add an x-annotation, such that it is clear whether it is evaluations or iterations?

Yes, I've committed a fix. These are all evaluations, except for the two initial plots the ticket was created from and the multi-plots with various values of sigma. Sorry about that; I'm not using the same tools.

EDIT: fix

@nikohansen

When I run TPA-ES (without CMA, with gradient injection) in dimension 5K with my Python code, I get:
screen shot 2015-01-14 at 18 24 54
and with d_sigma=5:
screen shot 2015-01-14 at 18 03 23
In particular, I don't see these oscillations. Did you check that in small dimension everything is fine?

@beniz commented Jan 14, 2015

I am puzzled, in 5K it looks worse (than with CSA and modified damping), in 1M it looks better. Also seems the convergence speed rather independent of dimension. Do I interpret this correctly?

My understanding is that it is much better with TPA in both cases.

For clarity, below are the two plots in 5K-D with sep-CMA-ES:

  • with csigma for high dimension:
    sepcma_fsphere_5k_ncs_notpa
  • with TPA:

sepcma_fsphere_5k_tpa2

EDIT: reproduce TPA on 5K-D from branch 'tpa_88' with:

./test_functions -fname fsphere -alg sepcmaes -ftarget 1e-8 -dim 50000 -x0 1 -sigma0 0.1 -no_stagnation -with_gradient -tpa

@nikohansen

Agreed. The oscillations are worrying; they might indicate a bug.

@beniz commented Jan 14, 2015

In particular I don't see these oscillations.

My code is young; there could be a bug in the current TPA. I was not worried about the oscillations because they looked just like a mitigation of the original problem... I will continue to investigate.

EDIT: also, I am already making sure that the gradient injection does not mix with the points used for the TPA.

@beniz commented Jan 14, 2015

Here TPA / dsigma=5 / 5K-D / sepcma / gradient injection:
sepcma_fsphere_5k_tpa_dsig5

EDIT: could the oscillations be due to the scaling factor applied to z? I am using E||N(0,I)||.
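
For reference, the approximation of E||N(0,I)|| commonly used in CMA-ES implementations is shown below; whether and how it should enter the TPA z-normalization is exactly the question raised above:

```python
import math

def expected_norm(N):
    # Common approximation of E||N(0, I_N)|| used in CMA-ES codes.
    return math.sqrt(N) * (1.0 - 1.0 / (4.0 * N) + 1.0 / (21.0 * N ** 2))
```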

@nikohansen

Cross-check whether they are related to dimension, i.e. run in 20-D. Then see the experiment suggested above with c_sigma=1. I cannot remember ever having seen them, and I don't see how they could be related to a particular normalization.

@nikohansen

FYI, I get something very similar to yours if I set d_sigma to 1.

@beniz commented Jan 16, 2015

FYI, I get something very similar to yours if I set d_sigma to 1.

Right to the point. I fixed the initialization of d_sigma; the bug was that it used the CMA value of d_sigma. I can now reproduce your code's results (branch 'tpa_88').

I investigated several small changes in the scaling of the z points and in d_sigma, but did not find anything very exciting, only that changing d_sigma sometimes strongly affects the convergence speed.

I am wondering if it'd make sense to have a dedicated default value of d_sigma when gradient injection is active.

Also, I believe I will be able to close this ticket. For now, the default with CMA is to activate the updated value of csigma above 1000-D.

Finally, I can run tests in high dimension for TPA to see how best to set d_sigma in this case. I will report on this in ticket #88.

@beniz commented Jan 19, 2015

FTR, sepacmaes / 5K-D/ TPA / Gradient injection:

  • dsigma default:
    sepacmaes_fsphere_5k_dsigma_default
  • dsigma=5:
    sepacmaes_fsphere_5k_dsigma5

@beniz commented Jan 19, 2015

I'm bringing the new csigma value (active above 1000-D) to the 'dev' branch (readying it for the next release).

I will close once TPA makes it to the 'dev' branch as well.

@nikohansen

I did a few experiments and

d_sigma = 4 - 3.6 / sqrt(dim) 

seems a valid choice for TPA, which (a) reflects the experiments in the original paper up to 200-D about as well as sqrt(dim) and (b) is a more sensible choice with popsize=dim or with gradient injections, because it reflects the possible convergence rate in this case.
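
For illustration, a hedged sketch of how this damping could enter a two-point adaptation step-size update (the parameter names and the rank-based normalization are assumptions, not the library's exact code; c_sigma=0.3 is the value discussed earlier in this thread):

```python
import math

def tpa_sigma_update(sigma, s, rank_plus, rank_minus, lam, dim, c_sigma=0.3):
    # rank_plus / rank_minus: ranks (0 = best) of the two mirrored test points
    # along the previous mean shift, within a population of size lam.
    d_sigma = 4.0 - 3.6 / math.sqrt(dim)                # damping proposed above
    alpha = (rank_minus - rank_plus) / float(lam - 1)   # in [-1, 1]
    s = (1.0 - c_sigma) * s + c_sigma * alpha           # smoothed success measure
    sigma *= math.exp(s / d_sigma)
    return sigma, s
```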

@nikohansen

For the record, I don't see anything wrong making

cs = (sqrt(mueff) + 2) / (sqrt(N) + sqrt(mueff) + 3)

the general default setting. AFAIK there is no empirical or theoretical evidence that ~ N^-1/2 is a better or worse choice than ~ N^-1 other than what we have found here.

@beniz commented Jan 20, 2015

Thanks. FYI, I have a set of experiments still running to determine the best d_sigma in high dimensions. I will post the plots in the TPA ticket for cma/sepcma with/without gradient injection. This will also allow us to make sure we get similar plots.

@beniz commented Jan 20, 2015

AFAIK there is no empirical or theoretical evidence that ~ N^-1/2 is a better or worse choice than ~ N^-1 other than what we have found here.

OK, well noted. I noticed the other day that it is the value already used in VD-CMA.

@nikohansen

Good point. The underlying reasoning is that in default CMA we have c_mu ~ N^-2, while in VD-CMA and sep-CMA we have c_mu ~ N^-1, which suggests also reducing the time horizon for the cumulation (cc and cs) to keep a gap. I use

cc = (1 + 1 / N + mueff / N) / (N**0.5 + 1 / N + 2 * mueff / N)

in the sepCMA case in the Python code. Unfortunately, I can't give a reasoning for the different mueff corrections for cc and cs off the top of my head.
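
As a quick sanity check of the intended time horizon (numbers are illustrative only):

```python
def cc_sep(N, mueff):
    # Cumulation constant for sep-CMA as quoted above.
    return (1.0 + 1.0 / N + mueff / N) / (N ** 0.5 + 1.0 / N + 2.0 * mueff / N)

# e.g. N=50000 with mueff ~ 12 gives cc ~ 4.5e-3, i.e. a time horizon
# of roughly 220 iterations, on the order of sqrt(N) as intended.
print(1.0 / cc_sep(5e4, 12.0))
```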

@beniz commented Feb 4, 2015

I'm trying to wrap this ticket up, and my understanding is that this issue is solved by TPA.

Taking the initial problem of the sphere in 1M-D with sep-aCMA-ES and gradient injection, and given the change to cmu made in #88 for sep without TPA, as well as the new default for TPA's dsigma (#97 (comment)), here are the current results:

  • CMA
    sepacmaes_1md
    Reproduced with:
./test_functions -fname fsphere -alg sepacmaes -ftarget 1e-8 -dim 1000000 -with_gradient -no_stagnation -max_hist 1000 -x0 1 -sigma0 0.1
  • TPA
    sepacmaes_1md_tpa
    Reproduced with:
./test_functions -fname fsphere -alg sepacmaes -ftarget 1e-8 -dim 1000000 -with_gradient -no_stagnation -max_hist 1000 -x0 1 -sigma0 0.1 -tpa 2

The oscillations in the latter (TPA) are observed here in the 1M-D case but not in the 5K-D case (I've just checked again). I'm not sure whether this qualifies as an issue.

EDIT: added mention of gradient injection

@beniz commented Feb 20, 2015

Solved with TPA, closing for now.
