SE++ doesn't always use full CPU power #580
For the previous runs, I had …
The plot was done on single-epoch fitting of about 50k sources or so. I don't think you can have too much …, but the unit of …
Another thing we found out just recently: the processing is faster if the objects fed to the fitting are ordered. If you do the model fitting with detection, you get the ordering for free, since detection works sequentially over the image. If you do model fitting without detection, I would order the objects by ra/dec or x/y. This reduces the I/O, since consecutive objects sent to the fitting are usually close together and frequently on the same tile, which is already in RAM, avoiding frequent data loading.
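A minimal sketch of that ordering trick for an input catalog (the column names and the choice of sorting by dec, then ra, are assumptions on my side, not part of the SE++ API):

```python
import numpy as np

# Toy positions; in practice these would be read from your ASSOC input catalog.
ra = np.array([10.2, 10.0, 10.1])
dec = np.array([-5.0, -5.0, -5.0])

# np.lexsort uses its LAST key as the primary key: sort by dec,
# then by ra within equal dec, so consecutive sources are spatially close.
order = np.lexsort((ra, dec))
ra_sorted, dec_sorted = ra[order], dec[order]
```

Writing the reordered catalog back out and feeding that to SE++ should give the tile-locality benefit described above.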
I've figured out why 16 bands was so much faster than 4 or 8... Out of the 16 bands, 2 were completely blank (only zeros in the images and weight maps) for the cutouts I considered. And it would seem that SE++ doesn't perform model fitting when one of the measurement images is completely blank: all of the sources ended with …
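A quick way to catch this situation before a run is to flag bands whose image (or weight map) is entirely zero. The band names and arrays below are made up for illustration; in practice you would load each FITS cutout and apply the same check:

```python
import numpy as np

def is_blank(image: np.ndarray) -> bool:
    """True if every pixel is exactly zero."""
    return not np.any(image)

# Hypothetical cutouts standing in for loaded FITS data:
bands = {"band_a": np.zeros((4, 4)), "band_b": np.ones((4, 4))}
blank_bands = [name for name, img in bands.items() if is_blank(img)]
```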
First of all, do your CPUs go to 100% now?
Thanks @mkuemmel for this insight! I do manage to get a better allocation of the resources by implementing this advice! Also, on my side I am using the ASSOC mode without detection, and the ordering makes a lot of sense. I do see a slight speed-up from doing the ordering, so thanks for that!
You have to keep in mind that each CPU runs one source. If there is a large source, or a large source group, at the end of the processing that takes much longer than a typical source, it keeps running alone and there is nothing left to keep the other cores busy.
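This tail effect is easy to see with a toy scheduling model (this is just an illustration of the scheduling behaviour, not SE++ code): greedily assign per-source tasks to the next free core and look at the total runtime.

```python
import heapq

def makespan(task_times, n_cores):
    """Total runtime when tasks are greedily assigned to the next free core."""
    cores = [0.0] * n_cores           # finish time of each core
    heapq.heapify(cores)
    for t in task_times:
        earliest = heapq.heappop(cores)
        heapq.heappush(cores, earliest + t)
    return max(cores)
```

With 99 quick sources plus one source 100x slower on 8 cores, the slow source alone sets a hard floor on the total runtime, however many cores you add.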
Which pixels are discarded or not depends strongly on your settings for the weight threshold. So it could well be that your zero pixels are included in the fit and the minimization just does not work. You can check the stop reasons in the levmar documentation.
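A small sketch for estimating how many cutout pixels a given weight threshold would discard. Whether pixels below or above the threshold are dropped depends on your weight type, so treat the direction used here ("below = discarded") as an assumption to check against your configuration:

```python
import numpy as np

def discarded_fraction(weight_map: np.ndarray, threshold: float) -> float:
    """Fraction of pixels whose weight falls below the threshold."""
    return float(np.mean(weight_map < threshold))

# A weight map where the blank half has weight 0:
w = np.array([[0.0, 0.0], [1.0, 1.0]])
```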
Hi,
I'm working with @mShuntov on running SE++ on large (more than 10,000 × 10,000 px) images. To make the computations faster, we're using Amazon Web Services EC2, which offers scalable cloud computing¹.
I've benchmarked SE++ on different image sizes and different EC2 machines to find the machine on which SE++ runs fastest.
However, I see that the CPU usage rarely reaches 100%, even on big images that take many hours to run through SE++. I tried to modify the `thread_count` parameter, but beyond a certain point it didn't seem to help. Here are my (empirical) conclusions:

- If `thread_count` is set too low, SE++ can't use the full power of the machine and ends up being slowed down.
- If `thread_count` is set too high, the threads pile up and SE++ ends up being slowed down.
- There seems to be an optimal `thread_count` (but this doesn't make SE++ faster).

Here are some plots summarizing my benchmark, and a more detailed analysis:
This first one shows runs on small images (0.25 arcmin² = 450 sources and 1.0 arcmin² = 1570 sources). The metric I chose is the runtime in seconds per source per band (measurement image). One can see here the plateau in `thread_count`, with no improvement in runtime beyond some point. Another surprise is that 16 bands is much faster than 2, 4 or 8 bands; I don't understand where this big gap comes from!

This second plot shows unfinished runs of SE++ on bigger images (4.0 arcmin² = 6100 sources and 16.0 arcmin² = 27500 sources). The runtime is estimated from the time it took to process a given percentage of the sources. Again, we see the tendency towards an optimal `thread_count`. Here the c6a.4xlarge machine is the bottleneck for such a big task, and we can see that 8 bands is roughly twice as fast as 16 bands because the CPU is running at 100%. We also see the effect of disabling hyper-threading: it lowers the CPU usage without sacrificing run time.

Could we be enlightened on the `thread_count` parameter and why SE++ doesn't seem to always use the CPU at its full potential?

Footnotes
¹ I've written a tutorial with different bash scripts to make using AWS EC2 friendlier with VS Code and Jupyter notebooks: https://github.com/AstroAure/VSJupytEC2 ↩