sep-(a)CMA-ES escapes the solution basin on fsphere in high dimension with gradient injection #97
Then it must already be a bug in "pure" separable CMA-ES, as it is supposed to converge. I also suggest always plotting the y-axis in log-scale. For diagnosis, the standard output with step-size(s), eigenvalues, etc. is advisable. EDIT: I didn't consider that this is in 1M-D, in which case the behavior should be considered expected if the initial step-size is large.
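For reference, a minimal matplotlib sketch of the kind of diagnostic plot suggested here (log-scaled y-axis, with step-size and condition number overlaid); the arrays are dummy placeholders for the optimizer's per-iteration logs:

```python
import numpy as np
import matplotlib.pyplot as plt

# dummy per-iteration logs standing in for the real optimizer output;
# replace with the recorded f-values, step-sizes and condition numbers
iters = np.arange(200)
fvalues = np.logspace(2, -8, 200)
sigmas = np.logspace(0, -5, 200)
cond = np.logspace(0, 2, 200)

plt.semilogy(iters, fvalues, label="f-value")
plt.semilogy(iters, sigmas, label="step-size sigma")
plt.semilogy(iters, cond, label="condition number")
plt.xlabel("iterations")
plt.legend()
plt.show()
```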
Yes, sigma moves down by ~1e-5 and the condition number up by ~1e-7 in both cases. I can't see any difference before and after step 60, when GI is activated. Since this is in 1M-D, I will try to reproduce it in lower dimension, which would also simplify plotting of the eigen and parameter values. Sep was tested on BBOB and on neural nets of up to ~400-D without showing this behavior. I will report here as it goes...
Modification of the initial step-size and a longer run should help.
Playing with the initial step-size certainly makes it possible to trigger the problem:
immediately diverges, while
I'm investigating whether this could be due to a bug. I am worried that users might try in high dimensions and reject the whole thing when a change in sigma would do.
So, a few more experiments in lower dimensions show that on sufficiently long runs, sigma decreases and the search takes the right direction again. I am therefore thinking it could make sense to have a trick that tries a larger-than-usual change in sigma when the search goes in the wrong direction for a certain number of steps?
I see, then "initial elitism" might be a good try, and you are right, additionally decreasing sigma by a factor of 2 until the first success is observed is probably quite effective. One general problem is that cumulative step-size adaptation has a horizon that depends on the dimension (though it is possible to modify this dependency down to sqrt dimension by setting cs accordingly). The parameters or even the method might not be perfectly suited for 1M-D problems. Two-point adaptation is a promising alternative candidate, where it should also be simpler to adjust the parameters. To get the full picture: what is the population size? |
lambda = 45
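As a minimal sketch of the "decrease sigma by a factor of 2 until the first success" idea suggested above (plain NumPy with a hypothetical offspring sampler, not the library's API):

```python
import numpy as np

def fsphere(x):
    return float(np.dot(x, x))

def sigma_warmup(f, x0, sigma0, sample_offspring, max_halvings=20):
    """Halve sigma until at least one offspring improves on f(x0)."""
    f0 = f(x0)
    sigma = sigma0
    for _ in range(max_halvings):
        if any(f(y) < f0 for y in sample_offspring(x0, sigma)):
            return sigma        # first success observed: keep this step-size
        sigma *= 0.5            # no success: try half the step-size
    return sigma

# toy usage with lambda=45 isotropic offspring around x0
rng = np.random.default_rng(0)
sampler = lambda x, s: [x + s * rng.standard_normal(x.size) for _ in range(45)]
x0 = np.full(1000, 5.0)
print(sigma_warmup(fsphere, x0, sigma0=100.0, sample_offspring=sampler))
```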
The current implementation of initial elitism only kicks in after a run ends. In high dimension, an earlier trigger could be useful.
To me it seems generally useful even without restarts. Actually, it seems particularly useful for a user-supplied initial point to be "elitist" within a single run.
It remains to be understood why the gradient approach stagnates in the end.
Yes, I may need to run it longer. I will post results from a set of experiments here.
So, I am able to report on three small fronts. First, elitist x0 (scheme 2 below) does not always appear to work well: in some reproducible cases, the search immediately stalls at the f(x0) value unless lambda is increased. Could it be that the reinjection somehow prevents the covariance matrix from shaping itself so that a search direction appears? Second, I have tried scheme 3 on top of scheme 1, with mixed results, which may indicate either a bug in gradient injection and/or that scheme 3 needs more careful control. Due to the linearity of GitHub comments, I will split the results into separate comments.
1- gradient injection scheme stagnation
2- elitist x0: While the scheme sometimes works, it appears to stagnate indefinitely in certain cases. Typically, taking the diverging case from a few comments above and applying the new scheme:
which yields:
The scheme even appears to block convergence on an otherwise converging example:
Shall we think of a 'set_high' function (much like 'set_noisy') that would attempt to set a slightly improved set of parameters for high dimensions? If this appears feasible, I could open a dedicated ticket and prepare the required range of simulations for selecting the parameter values...
1- gradient injection scheme stagnation: Try different initial step-sizes. My prediction is that the y-location of the spike corresponds to the step-size. This would mean that the reason is that the step-size cannot converge to zero as fast as the gradient can point the algorithm to the optimum. Not a surprise, because the gradient effectively makes 5000 evaluations in one iteration. The best resolution is probably to apply TPA instead of CSA and to use a constant damping parameter instead of a dimension-dependent setting.
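For orientation, a minimal NumPy sketch of the kind of TPA step-size update meant here, as I understand it (two mirrored candidates placed along the previous mean shift are ranked within the population, and their rank difference drives sigma); names and constants are illustrative, not the libcmaes implementation:

```python
import numpy as np

def tpa_sigma_update(sigma, s, f_plus, f_minus, f_offspring,
                     c_sigma=0.3, d_sigma=10.0):
    """One TPA step-size update (sketch only, not libcmaes code).

    f_plus / f_minus are the f-values of the two mirrored candidates placed
    along +/- the previous mean shift; f_offspring are the f-values of the
    regular offspring. c_sigma=0.3 and the constant damping d_sigma mirror
    values discussed in this thread, not authoritative defaults.
    """
    f_all = np.sort(np.concatenate(([f_plus, f_minus], f_offspring)))
    rank_plus = int(np.searchsorted(f_all, f_plus))
    rank_minus = int(np.searchsorted(f_all, f_minus))
    lam = len(f_all)
    # > 0 when continuing the previous mean shift is the better direction
    alpha = (rank_minus - rank_plus) / (lam - 1)
    s = (1.0 - c_sigma) * s + c_sigma * alpha       # exponentially smoothed indicator
    sigma = sigma * np.exp(s / d_sigma)             # damped multiplicative change
    return sigma, s

# toy usage: the '+' point is better, so sigma should grow slightly
new_sigma, new_s = tpa_sigma_update(sigma=1.0, s=0.0,
                                    f_plus=0.5, f_minus=2.0,
                                    f_offspring=np.linspace(1.0, 3.0, 43))
print(new_sigma, new_s)
```

With a constant d_sigma, the step-size can change by a constant factor per iteration, independently of the dimension, which is the point of the suggestion above.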
Understood. Just for the record, I am using the function's derivative directly, to avoid that per-iteration evaluation cost.
OK, I'll move that ticket higher up my list.
3- sigma (crude) step-control: I don't see anything wrong with the test cases. It is just the same "problem" as before: the expected convergence rate (steepness of the f-graph) of the default strategy is roughly one order of magnitude per 10*dimension evaluations. In the second example, the step-size is IMHO far too large in the end, because the gradient quickly shifted the x-value much closer to the optimum, which renders the previously optimal step-size too large. Same as the spike effect seen in the other experiments.
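Taking the quoted rate at face value (one decade of f-improvement per roughly $10\,N$ evaluations):

$$
N = 5\,000:\ \sim 5\times 10^{4}\ \text{evaluations per decade},\qquad
N = 10^{6}:\ \sim 10^{7}\ \text{evaluations per decade},
$$

so closing, say, ten decades without gradient information costs on the order of $5\times 10^{5}$ resp. $10^{8}$ evaluations with the default strategy.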
instead of
we could use
as an alternative setting. It reduces the time horizon to sqrt(dimension), but I don't think it solves the problem of facilitating quicker step-size decrements.
2- elitist x0: I can't reproduce the "blocking" behavior.
It seems to be just about as slow as I would expect (I didn't check the rate precisely).
OK, thanks for pointing this out; my appreciation of convergence rates is still very blurry. At first I thought that elitist-x0 could be a default option, but then I observed that it was slowing down (or hampering) convergence on the same run with sigma=0.1:
compared to:
Initially I thought that increasing lambda in this particular case would limit the impact of the reinjected x0, but it does not appear to do so. I still have difficulty grasping this one. Note that elitist-x0 reinjects a single instance of the initial candidate.
FYI, this whole thing started because I had noted, from our chat, to double-check whether there was a bug in GI in high dimension, as I could not see any difference with and without GI when solving neural nets, for instance. In practice, my opinion is that if the gradient is available for a high-dimensional problem and can be injected, even only every x steps, it would make sense to mitigate the problem above. I can try to see what comes out of injecting the gradient less often, which would also be a computational gain in practice. There are other simple things I am willing to try, such as reinjecting the best solution so far (instead of waiting until the end of the run).
Right, exactly as predicted (the kink at the end of the right flank is supposed to correspond to the step-size). The most useful mitigation is to enable a quick decrease of the step-size. I think reinjection of the best will only help if mu=1, and probably not even then so much. Of course, in any case the graphs for the current best might look better (depending on what exactly is plotted), but this is rather of a cosmetic nature.
Interesting. It does appear to work better in general on the initial problem. In practice it seems to shift the low spike in y to the first couple of steps of the algorithm. With initial sigma0=1 this early spike is so acute that it triggers the stopping criteria (e.g. going < 1e-8). Same experiments as before: EDIT: fixed the log-scale
Yes, I expected that as well. The bumps in the curves might simply be replaced by flat (reinjected) values.
Very interesting and somewhat surprising; it looks almost as if the problem is virtually solved, given that we see about 1e4 evaluations in 1M-D. We could probably still improve quite a bit with TPA (I will try to find some time to cross-check this with my implementation, though in smaller dimension). EDIT: it was 5K-D, not 1M-D, so no conclusion can be drawn for the latter case.
Hum, if you are referring to the change of the csigma 'horizon', the last graph unfortunately reports on 5K-D. I've tried 1M-D with the updated csigma, and while the spike is closer to the optimum, the rebound appears to jump much higher, with sigma (green) and condition number (red): I will focus on the TPA scheme as a next step. Also, I ran into this yesterday; not that I believe it is of direct help here, but FTR: EDIT: graphs; sigma seems to move faster now.
Below is a run with TPA (see #88 for updates); it is to be compared with the set of runs on fsphere in 5K-D from two days ago. It is reproducible from branch 'tpa_88' with:
The convergence speed seems rather independent of the dimension. Do I interpret this correctly? Generally: could you add an x-axis annotation, so that it is clear whether it shows evaluations or iterations?
I suggest two experiments: (1) setting c_sigma=1 (instead of 0.3), to check whether this is the source of the oscillations, and (2) setting d_sigma=10 (instead of sqrt(dimension)), and (2a) maybe both at the same time.
Yes, I've committed a fix. These are all evaluations, except in the two initial plots that the ticket was created from and the multi-plots with various values of sigma. Sorry about that, I am not using the same tools everywhere. EDIT: fix
My understanding is that it is much better with TPA in both cases. For clarity, below are the two plots in 5K-D with sep-CMA-ES. EDIT: reproduce TPA in 5K-D from branch 'tpa_88' with:
Agreed. The oscillations are worrying; they might indicate a bug.
My code is young, so there could be a bug in the current TPA. I was not worried about the oscillations because they looked just like the mitigation of the original problem... I will continue to investigate. EDIT: also, I am already making sure that the gradient injection does not mix with the points used for TPA.
Cross-check whether they are related to the dimension, i.e. run in 20-D. Then see the experiment suggested above with c_sigma=1. I cannot remember ever having seen them, and I don't see how they could be related to a certain normalization.
FYI, I get something very similar to yours if I set d_sigma to 1.
Right to the point. I did fix the initialization of d_sigma; the bug was that it used the CMA value of d_sigma. I can now reproduce your code's results (branch 'tpa_88'). I investigated several small changes in the scaling of the z points and of d_sigma, but did not find anything very exciting, only that changing d_sigma sometimes strongly affects the convergence speed. I am wondering whether it would make sense to have a dedicated default value of d_sigma when gradient injection is active. Also, I believe I will be able to close this ticket. For now the default with CMA is to activate the updated value of csigma above 1000-D. Finally, I can run tests in high dimension for TPA to see how best to set d_sigma in that case. I will report about this in ticket #88.
I'm bringing the new csigma value (active above 1000-D) to the 'dev' branch (readying it for the next release). I will close once TPA makes it to the 'dev' branch as well.
I did a few experiments and
seems a valid choice for TPA, which (a) reflects the experiments in the original paper up to 200-D about as well as sqrt(dim), and (b) is a more sensible choice with popsize=dim or with gradient injections, because it reflects the possible convergence rate in this case.
For the record, I don't see anything wrong with making
the general default setting. AFAIK there is no empirical or theoretical evidence that ~ N^-1/2 is a better or worse choice than ~ N^-1, other than what we have found here.
Thanks. FYI, I have a set of experiments still running to determine the best d_sigma in high dimensions. I will post the plots in the TPA ticket, for cma/sepcma with/without gradient injection. This will also allow us to make sure we get similar plots.
OK, well noted. I noticed the other day that this is the value already used in VD-CMA.
Good point. The underlying reasoning is that in default CMA we have c_mu ~ N^-2, while in VD-CMA and sep-CMA we have c_mu ~ N^-1, which suggests also reducing the time horizon for the cumulation (cc and cs) to keep a gap. I use
in the sepCMA case in the Python code. Unfortunately, I can't give a reasoning for the different
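Putting the scaling argument above in symbols (orders of magnitude only, constants omitted):

$$
c_\mu^{\mathrm{CMA}} \sim N^{-2},\qquad c_\mu^{\mathrm{sep/VD}} \sim N^{-1}
\quad\Longrightarrow\quad
c_c,\,c_\sigma:\ \sim N^{-1}\ \to\ \sim N^{-1/2},
$$

i.e. the cumulation time horizons shrink from about $N$ to about $\sqrt{N}$ iterations so that they remain shorter than the covariance-matrix learning time scale.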
I'm trying to wrap this ticket up, and my understanding is that this issue is solved by TPA. Taking the initial problem of the sphere in 1M-D with sep-aCMA-ES and gradient injection, and given the change to
The oscillations in the latter (TPA) are observed here in the 1M-D case but not in the 5K-D case (I've checked again just now). Not sure whether this qualifies as an issue. EDIT: added mention of gradient injection
Solved with TPA, closing for now.
First, comparing sep-aCMA-ES with and without gradient injection (GI), GI allows fast convergence:
The difference in initial value is due to not showing iteration 0 (will be ticketed and fixed).
However, a closer look shows that while nearing the optimum, sep-aCMA-ES with GI escapes the solution basin and stagnates around 5000 or so in 1M-D. Below is a log-scale plot of the objective function value for the same run as above:
The same behavior is observed with sep-CMA-ES.
To reproduce:
This might be a bug; in any case, this is weird behavior.
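Since the reproduction command above did not survive the formatting, here is a deliberately simplified, self-contained NumPy sketch of the gradient-injection idea itself on fsphere; it is an illustration only, not libcmaes and not sep-aCMA-ES, and the clipping rule and all constants are assumptions:

```python
import numpy as np

def fsphere(x):
    return float(np.dot(x, x))

def grad_fsphere(x):
    return 2.0 * x

def toy_es_with_gradient_injection(dim=100, lam=45, sigma=1.0, iters=200,
                                   inject_every=1, seed=1):
    """Toy diagonal ES with gradient injection on fsphere (illustration only).

    One candidate per `inject_every` iterations is replaced by a step along
    the negative gradient, length-clipped to ~sqrt(dim) sigma-units.  There
    is no covariance or step-size adaptation here; this is not sep-aCMA-ES.
    """
    rng = np.random.default_rng(seed)
    mean = np.full(dim, 5.0)              # arbitrary starting point x0
    mu = lam // 2
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()
    for t in range(iters):
        x = mean + sigma * rng.standard_normal((lam, dim))
        if t % inject_every == 0:
            step = -grad_fsphere(mean)
            # clip the injected step so it stays comparable to sampled steps
            max_len = np.sqrt(dim) * sigma
            norm = np.linalg.norm(step)
            if norm > max_len:
                step *= max_len / norm
            x[0] = mean + step            # replace one offspring by the injection
        order = np.argsort([fsphere(xi) for xi in x])
        mean = weights @ x[order[:mu]]    # weighted recombination of the mu best
    return fsphere(mean)

print(toy_es_with_gradient_injection())
```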