Parallelization only part-time #204

Optiuse · 2020-02-19T00:37:27Z

Hello,

when I want multithreaded parallel function evaluations for 8 Threads i do this:

omp_set_dynamic(false); // omp setting
omp_set_num_threads(8); // omp setting
cmaparams.set_mt_feval(true); // libcmaes setting

This should aktivate in esostrategy.cc:
omp parallel for if (_parameters._mt_feval)

I have very expensive functions and some parameters take much longer than others. The problem is that often not all threads are used. Overall, this makes the optimization much longer because the CPU is often only used to 1/8 of its capacity. Therefore the effect of parallelization is only part-time.
It gives the impression that the parallelization is interrupted as soon as a "slow function execution" occurs in the overall process. As if the execution of new threads would have to wait.

What is the reason for this and is there a possibility that all 8 threads are always used and therefore the cpu is used at full capacity? This would significantly accelerate the total optimisation process.

Greetings

beniz · 2020-02-19T13:02:18Z

Hi, my guess is that some of your external function calls take much longer than others, and thus the parallelized for loop waits until all calls have finished.

Optiuse · 2020-02-19T13:14:55Z

Ok there always 8 runs to be completed before the next 8 are started? It is possible that it does not block and the next already start.

I have found: "OpenMP will automatically wait for all threads to finish before execution continues."
http://ppc.cs.aalto.fi/ch3/nowait/
Is it possible to use something like "nowait" so that it no longer blocks?
#pragma omp for nowait

beniz · 2020-02-19T13:33:31Z

If you'd like to reacquire the threads so that they can be used somehow by the still ongoing eval function calls, you may want to try using nowait, but my guess is that would only be of interest if your eval function is calling on new threads during its execution... Seems like a weird niche case I think. I guess the main question here is is your eval function multithreaded to begin with ?

Optiuse · 2020-02-19T18:11:23Z

Inside the FitFunc it is blocked until the return value is calculated. Inside the FitFunc a thread is called but FitFunc blocked inside until the thread is finished and the return value for FitFunc is calculated. So the FitFunc is not bypassed and gets its return.

But I think I found the reason. It's related to the value of lambda.
I have tracked the cpu load in the taskmanager with different lambda settings.
The effect that the cpu is not always working to capacity is probably related to the value of lambda.
Shortly before the number of runs that is equal to the value of lambda, is completed, the cpu load will decrease.
You can see in the picture that this effect of cpu-decrease always exists but the distances of cpu-decrease are different depending on the lambda.

So now i think the effect has to do with the algorithm of cmaes (value lambda) and not with openmp.
Do you think there is a way that all threads are always used? That would accelerate the total optimisation process.

yasamoka · 2020-03-06T18:25:38Z

Well, show us some code.

Optiuse · 2020-03-26T13:49:32Z

The code doesn't matter because everything that takes some time will have this effect. You can put in the function a random sleep (e.g. one second to 2 minutes) and the same thing will happen.

I think the problem is that the algorithm of LIBCMAES wants to finish the full number of lambda completely before going any further. It seems to block until a population is completely done. In the last runs of a population only a few or one thread is busy until a new population can start. This effect can be seen in the graphic above and prevents the CPU from being fully used. This slows down the total optimization time very much.

LIBCMAES ist great and can effectively find the global optimum in a small number of steps, but this advantage is nullified because the parallelization is only part-time.
Is it that genetic optimizers always have to block before starting a new population?

As an example there is another algorithm Simulated Annealing for optimization that also uses OpenMP but that doesn't block anything:
https://github.com/epwalsh/CSA
This needs much more runs to get the best result (not as effective as CMAES) but is faster overall because nothing is blocked.

Would it be possible to design cmaes so that the multithreading is not limited? LIBCMAES is more effective algorithm and can effectively find the global optimum in a small number of steps, but this advantage is nullified because the parallelization is only part-time.
It would be great if LIBCMAES could nonstop fully use the cpu.

nikohansen · 2020-03-26T14:16:19Z

Is it that genetic optimizers always have to block before starting a new population?

No.

Would it be possible to design cmaes so that the multithreading is not limited?

Without the knowledge about implementation details (that is without knowing how simple, complicated or impossible it may be to implement), I see out of my head two "simple" ways to go about this:

if one solution is finished, send a new solution to the idle CPU. Complete the iteration when enough (i.e. lambda) solutions have finished and kill the still running solutions.
set lambda to, say, 90% of the #CPUs and occupy all CPUs when starting the iteration. Then the iteration is completed when lambda solutions are finished and the remainders can be killed. This can still lead to a long iteration, if more than 10% of the solutions are stuck. This may however not be likely, because long running solutions have a selection disadvantage.

In principle, CMA-ES is robust to small changes in lambda or, equivalently, to assigning 15%-or-so of the solutions to arbitrary bad fitness values.

A note on the above figure: to me it suggests (if my interpretation is correct) that the loss in time is less than a factor of two: the reddish area is more than half of the white area.

Optiuse · 2020-03-26T17:02:34Z

Thanks for the feedback!

A note on the above figure: to me it suggests (if my interpretation is correct) that the loss in time is less than a factor of two: the reddish area is more than half of the white area.

this was just harmless example. here is a current example of the cpu load. you can see that almost more than half of the cpu is not used. if the solutions only take a slightly different time then less than 50% of the cpu is used.
32cores, lambda=50:

if one solution is finished, send a new solution to the idle CPU. Complete the iteration when enough (i.e. lambda) solutions have finished and kill the still running solutions.

currently it looks like there is a loop with the number of lambda. only when all in the loop are processed a new loop starts. this probably causes the always falling cpu curve.
if you kill calculations that take too long then they might be the successful ones. the function runs that take long might be the good ones.

set lambda to, say, 90% of the #CPUs and occupy all CPUs when starting the iteration. Then the iteration is completed when lambda solutions are finished and the remainders can be killed. This can still lead to a long iteration, if more than 10% of the solutions are stuck. This may however not be likely, because long running solutions have a selection disadvantage.

as said the long running solutions can be the good solutions. if these are killed, the optimum might not be found.
in order for the optimisation to work, lambda has to depend on the dimension of the problem. if lambda is too small, it will not find a solution. if it is too big, the optimisation takes far too long.
i think therefore lambda should be independent from the core number because it depends on the dimension.

if lambda is smaller than the cores then the cpu is never 100% utilized.
in the future there will be more and more cores in cpus. i am currently optimizing with 32 cores. if i set lambda to 16 it looks like this.

at highest 50% of the cpu is used. then directly it drops immediately to near 0%.
In total, only 25% of the cpu is used in such an optimization.

in the near future, there will be cpus with 128, 256, 512 or more cores. If you optimize with e.g. Lambda=50 you will never be able to fully use the cpu.
It would be perfect if such great lib like LIBCMAES are really fully scalable (independent from values like lambda) otherwise the effect of multithreading cannot be fully used.

beniz · 2020-03-26T17:27:29Z

I guess that theoretically (and @nikohansen can validate/invalidate), you should be able to hold on the values of points for which simulation takes longer, and reinject them back in when their f-value becomes available. If this is a valid option, then you may be able to hack this from libcmaes code without too much trouble, approximatively like this:

write a return signal for FitFunc to signify that the call hasn't finished in time
remove those points from the lambda set
keep track of points values under computation within a pool, reinject them whenever they are ready, and have FitFunc access the pool to get the value.

In all cases, we can only guide you here with direction on how to hack the libcmaes so that it best serves your purpose, you'll have to code it up.

Optiuse · 2020-03-26T17:43:17Z

I hope I got it. So if something takes too long, it'll continue independently and to continue the for-loop on libcmaes I have to return a pseudo-value, right?

when the run is finished in the background i will overwrite this pseudo value in the libcmaes pool right?

how can i access the pool? and is there an id to identify the correct entry?

beniz · 2020-03-26T18:35:59Z

You'd need to build a pool outside of libcmaes: pool <-> fitfunc <-> libcmaes . You'd have to maintain whatever id you need. It's sketchy but it may work.

@nikohansen hello! what would it mean for CMA-ES to deal with asynchronous f-values, meaning having CMA-ES sampling a bunch of points, but getting the point values back asynchronously ? The distribution at time t would use whatever points are available, possibly from past sampling rounds, but every sampled point would eventually be considered. Would that hamper convergence ?

Optiuse · 2020-03-26T19:14:45Z

ok I can start a seperate independent thread in FitFunc that get the parameters and then in FitFunc return a pseudo value that the libcmaes-loop not blocking.
But when the thread is finished, how do i get the return value back to libcmaes? Into which data structure from libcmaes must the value be placed?

FitFunc fsphere = [](const double *x, const int N)
{
// start seperate std::thread...
return val; // return pseudo value
};

thread()
{
// calc...
// how do i get the return value in cmaes?
}

nikohansen · 2020-03-26T23:16:53Z

you should be able to hold on the values of points for which simulation takes longer, and reinject them back in when their f-value becomes available.

what would it mean for CMA-ES to deal with asynchronous f-values, meaning having CMA-ES sampling a bunch of points, but getting the point values back asynchronously ? The distribution at time t would use whatever points are available, possibly from past sampling rounds, but every sampled point would eventually be considered. Would that hamper convergence ?

@beniz Hi! Right, I didn't think about this. It is possible. It is hard to predict how reliable it is (it should usually work, but what do I know). What needs to be available is something akin to check_points in the Python module which calls a repair function. This function basically shortens the difference step between the incoming solution and the mean, if it is longer than one should expect according to the current sample distribution. Or in other words, the incoming solution is then interpreted as a direction relative to the current mean, rather than as a solution. It is a generic technique which allows to inject arbitrary proposal solutions (and it works more often than not, if we don't inject too many of them).

If I find a simple way to mimic the situation in a test function, I may check out how robust the Python module is when faced with such delayed evaluations.

in the near future, there will be cpus with 128, 256, 512 or more cores. If you optimize with e.g. Lambda=50 you will never be able to fully use the cpu.

I don't see how one can get around this easily. If 500 CPUs are populated from the current distribution, the update will reflect the current distribution for 500 evaluations. It seems impossible to mimic 10 iterations without updating the samples in between. I usually consider to run several independent optimizations in parallel in this scenario. It's hard to see, if setting Lambda=500 is not a viable option, how to fully exploit the available CPUs otherwise.

nikohansen · 2020-03-27T19:09:32Z

A small update: if older solutions are fed back, it seems advisable to turn off active CMA (which updates the covariance matrix from bad solutions with negative weights) for understandable reasons. Also the TPA step-size adaptation is not likely to work out-of-the-box anymore (it relies on the first two solutions fed back).

This is the code I used to make a quick experiment:

import numpy as np
import cma
class LaggingFitness:
    """delivers X, F like the input argument to tell 
    """
    def __init__(self, f, fraction, X):
        """`fraction` is the fraction of solutions with delayed evaluation"""
        self.f = f
        self.fraction = fraction
        self.X = [np.array(x) for x in X]
        
    def __call__(self, X):
        """return X_out, F, where X_out != X"""
        Xout = []
        lenX = len(X)
        while len(Xout) < self.fraction * lenX:
            # TODO: we may want to choose the delay by fitness value rather than uniform
            Xout += [self.X.pop(np.random.randint(len(self.X)))]
        while len(Xout) < lenX:
            Xout += [X.pop(np.random.randint(len(X)))]
        self.X += X  # keep the rest for later
        return Xout, [self.f(x) for x in Xout]

es = cma.CMAEvolutionStrategy(11 * [1], 1, {'ftarget': 1e-9, 'CMA_active':False})
fit = LaggingFitness(cma.ff.elli, 0.8, es.ask() + es.ask()[:1])
while not es.stop():
    X = es.ask()
    X, F = fit(X)
    es.tell(X, F)
    es.disp()
    es.logger.add()
cma.plot()

yasamoka · 2020-03-28T02:55:29Z

Is your fitness function itself parallelizable? I went that route with OpenMP and turned multithreading off for CMAES. That way the level of parallelism would not be dependent on the number of samples taken every iteration.

Optiuse · 2020-03-28T22:51:43Z

I don't see how one can get around this easily. If 500 CPUs are populated from the current distribution, the update will reflect the current distribution for 500 evaluations. It seems impossible to mimic 10 iterations without updating the samples in between. I usually consider to run several independent optimizations in parallel in this scenario. It's hard to see, if setting Lambda=500 is not a viable option, how to fully exploit the available CPUs otherwise.

yes, that's true. if the number of cpu cores becomes too high in relation to lambda then it will not work. maybe from about factor 3 on it will not make sense anymore. if you increase lambda the cpu usage is better but then the total time will be longer.
yes maybe it makes sense to do independent optimizations in this case. a possibility would be to split the parameter-bounds.

Is your fitness function itself parallelizable? I went that route with OpenMP and turned multithreading off for CMAES. That way the level of parallelism would not be dependent on the number of samples taken every iteration.

yes that would be a good idea but in my case it is not so easy to parallelize the function itself because the data is processed serially.

nikohansen mentioned this issue Apr 2, 2020

Avoid C update with negative weights for (bad) injected solutions CMA-ES/pycma#124

Closed

nikohansen mentioned this issue Oct 2, 2020

Async tell and eval CMA-ES/pycma#147

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallelization only part-time #204

Parallelization only part-time #204

Optiuse commented Feb 19, 2020

beniz commented Feb 19, 2020

Optiuse commented Feb 19, 2020

beniz commented Feb 19, 2020

Optiuse commented Feb 19, 2020

yasamoka commented Mar 6, 2020

Optiuse commented Mar 26, 2020

nikohansen commented Mar 26, 2020

Optiuse commented Mar 26, 2020

beniz commented Mar 26, 2020

Optiuse commented Mar 26, 2020 •

edited

Loading

beniz commented Mar 26, 2020

Optiuse commented Mar 26, 2020 •

edited

Loading

nikohansen commented Mar 26, 2020

nikohansen commented Mar 27, 2020

yasamoka commented Mar 28, 2020

Optiuse commented Mar 28, 2020

Parallelization only part-time #204

Parallelization only part-time #204

Comments

Optiuse commented Feb 19, 2020

beniz commented Feb 19, 2020

Optiuse commented Feb 19, 2020

beniz commented Feb 19, 2020

Optiuse commented Feb 19, 2020

yasamoka commented Mar 6, 2020

Optiuse commented Mar 26, 2020

nikohansen commented Mar 26, 2020

Optiuse commented Mar 26, 2020

beniz commented Mar 26, 2020

Optiuse commented Mar 26, 2020 • edited Loading

beniz commented Mar 26, 2020

Optiuse commented Mar 26, 2020 • edited Loading

nikohansen commented Mar 26, 2020

nikohansen commented Mar 27, 2020

yasamoka commented Mar 28, 2020

Optiuse commented Mar 28, 2020

Optiuse commented Mar 26, 2020 •

edited

Loading

Optiuse commented Mar 26, 2020 •

edited

Loading