Can not use all cpus for one walker ABF method #8

Open
sathishdasari opened this issue Dec 13, 2018 · 9 comments

@sathishdasari

sathishdasari commented Dec 13, 2018

Dear Sir,

  1. I am trying to run a one-walker ABF simulation on an ADP system. My machine has 1 socket, 4 cores per socket, and 2 threads per core (1x4x2=8 CPUs). But when I run the job using "ssages 1walker.json", it uses only 3 CPUs (judging from %CPU in the top command). How can I use all CPUs to get good performance?

  2. When I run the same job on a system consisting of 2 sockets, 6 cores per socket, 2 threads per core (2x6x2=24 CPUs) it gives the following error:

Fatal error:
Your choice of 1 MPI rank and the use of 24 total threads leads to the use of
24 OpenMP threads, whereas we expect the optimum to be with more MPI ranks with
1 to 6 OpenMP threads. If you want to run with this many OpenMP threads, specify
the -ntomp option. But we suggest to increase the number of MPI ranks.
  3. A 2-walker job runs perfectly fine, with full efficiency, using the command mpirun -np 24 ssages 2walker.json on this system.

  4. How can I tell when the ABF method has converged in this software? Does the simulation terminate automatically after it converges? If not, how can I extend the simulation?

  5. How can I extract the structures of the free energy minima from the trajectory? There is no file that prints the collective variable values along the simulation time, like the COLVAR file in PLUMED, which would help identify the frame numbers corresponding to a particular minimum.

@mquevill
Collaborator

  1. When you call SSAGES without mpirun/mpiexec, you are only spawning one process. GROMACS sets OpenMP threads internally, and it contains code that chooses the number of threads if unspecified. By default, it will try to use all available threads (8 in your case). However, each thread may not use 100% of its core, depending on GROMACS's optimizations. For example, on my workstation, each core is only at ~82%, which is why the percentage from top can make it look as if only 3 CPUs are in use. If you call top -H, each thread is shown separately, so you should see 8 lines for ssages.

  2. GROMACS attempts to optimize the threads and ranks for your simulation; this error comes from those attempts to optimize the run parameters. One rank with 24 threads is often less efficient with GROMACS, so I would suggest using multiple ranks. To do this, change ssages 1walker.json to mpirun -np 4 ssages 1walker.json, to use 4 ranks, for example. This will use the MPI capabilities of GROMACS natively. [If, however, you would actually like to use 24 OpenMP threads, you can specify "-ntomp","24" within the "args" member of the .json file.]

  3. This is good to hear. In this case, you are specifying 24 MPI ranks, so GROMACS only assigns 1 OpenMP thread per rank.

  4. Currently, there is no criterion or indicator of convergence built into SSAGES. The development team has discussed various ways to do this, and the work is currently in progress. To extend the simulation, add a JSON member to the method: "restart": true, which will read the files from the last run and continue from there. (If "restart" is false or unspecified, then the old files will be backed up once the new files are written.)

  5. You can set up a Logger that will print the CVs as the simulation proceeds. (Manual > Input Files > Simulation Properties > Logger) This can be helpful to track other CVs, while only sampling over a few. See below for the syntax:

"logger": {
        "frequency": 100,
        "output_file": "cvs.dat",
        "cvs": [0, 3]
}
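
To tie the logger output back to point 5 (finding the frames that sit in a particular free energy minimum), one option is to filter the logged CV values after the run. The sketch below is only an illustration. It assumes each data line of cvs.dat holds a step number followed by the logged CV values, separated by whitespace, and that comment lines start with #. That layout, the function name frames_in_window, and the sample values are assumptions rather than anything taken from the SSAGES documentation, so please check them against your actual file.

```python
def frames_in_window(lines, windows):
    """Return the step numbers whose CV values all fall inside the given windows.

    ASSUMED layout: 'step cv_a cv_b ...' separated by whitespace, '#' for comments.
    windows maps a 0-based column index (column 0 is the step) to a (low, high) range.
    """
    hits = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        values = [float(x) for x in fields]
        if all(lo <= values[col] <= hi for col, (lo, hi) in windows.items()):
            hits.append(int(values[0]))
    return hits


# Hypothetical logger output for two dihedral CVs (values in radians).
sample = [
    "# step cv_a cv_b   (hypothetical header)",
    "0 -1.40 1.20",
    "100 -1.45 1.10",
    "200 1.05 -0.90",
    "300 -1.50 1.30",
]

# Hypothetical basin: column 1 in [-1.6, -1.3] and column 2 in [1.0, 1.4].
basin = {1: (-1.6, -1.3), 2: (1.0, 1.4)}
print(frames_in_window(sample, basin))
```

For a real run you would pass open("cvs.dat") instead of sample, then use the returned step numbers to pull the matching frames out of the trajectory with your usual GROMACS tools.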

If you have any further questions, please let us know!
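
As a footnote to points 2 and 3: the GROMACS message asks for 1 to 6 OpenMP threads per rank, so on a 24-thread machine there are only a few rank counts that use every hardware thread. The sketch below lists them. The helper name rank_thread_layouts is made up for illustration, and it assumes that the OpenMP threads per rank times the number of MPI ranks should equal the total thread count; which layout is actually fastest has to be benchmarked on your system.

```python
def rank_thread_layouts(total_threads, max_omp=6):
    """List (mpi_ranks, omp_threads_per_rank) pairs that use every hardware thread.

    max_omp=6 mirrors the '1 to 6 OpenMP threads' range in the GROMACS error message.
    """
    layouts = []
    for omp in range(1, max_omp + 1):
        if total_threads % omp == 0:
            layouts.append((total_threads // omp, omp))
    return layouts


for ranks, omp in rank_thread_layouts(24):
    print(f"{ranks} MPI ranks x {omp} OpenMP thread(s) per rank")
```

The 4-rank case corresponds to the mpirun -np 4 example above, and the 24-rank case corresponds to the mpirun -np 24 run from point 3.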

@sathishdasari
Author

Thank you very much for your suggestions.

@sathishdasari
Author

sathishdasari commented Dec 14, 2018

Dear Sir,

  1. How do I specify the logger for a 2-walker simulation so that it prints the CVs along the simulation time?
  2. When I try to restart a 2-walker simulation that crashed partway through, it gives the following error:
[mm3:06753] *** Process received signal ***
[mm3:06753] Signal: Segmentation fault (11)
[mm3:06753] Signal code: Address not mapped (1)
[mm3:06753] Failing at address: 0x428
  3. I tested restarting a 1-walker job that crashed partway through, and it restarts perfectly.

@mquevill
Collaborator

  1. The JSON member "output_file" can take an array of strings. For two walkers, for example, you can use this:
"output_file": ["cvs_w0.dat", "cvs_w1.dat"]
  2. Do you get this error right at the beginning of the simulation? Or does the simulation start, with the error occurring somewhere in the middle? I have been able to restart the included 2-walker ADP example without a segmentation fault. Make sure that you are restarting a simulation with the same details (method parameters, number of walkers, etc.). If you have changed something about the method, then the software might have incorrect data when trying to read the files in.
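
Regarding point 1, if the number of walkers changes often, the per-walker file names can be generated rather than typed by hand. This is a minimal sketch: the helper logger_for_walkers and the cvs_w{i}.dat naming pattern are just an illustration that follows the example above, and the "frequency" and "cvs" values are copied from the earlier logger snippet rather than recommended settings.

```python
import json


def logger_for_walkers(n_walkers, frequency=100, cvs=(0, 3), stem="cvs"):
    """Build a "logger" member with one output file per walker."""
    return {
        "frequency": frequency,
        "output_file": [f"{stem}_w{i}.dat" for i in range(n_walkers)],
        "cvs": list(cvs),
    }


# Print a fragment that can be pasted into the SSAGES .json input.
print(json.dumps({"logger": logger_for_walkers(2)}, indent=4))
```

The printed "output_file" array then has exactly one entry per walker, which is what the logger expects for a multi-walker run.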

@sathishdasari
Author

sathishdasari commented Dec 17, 2018

Thank you.

@sathishdasari
Author

sathishdasari commented Dec 18, 2018

I was trying a 2 walker simulation of ADP in solvent. After some time the job was killed, displaying the following error.

*** Error in `ssages': free(): invalid pointer: 0x00000000012dc3e0 ***

@sathishdasari
Author

sathishdasari commented Dec 20, 2018

Dear Sir,
When I tried to extend a 2-walker simulation, it displayed the following error.

*** error in `ssages': corrupted size vs. prev_size: 0x00000000025c24d0 ***

@mquevill
Collaborator

I'm afraid that these error messages aren't enough to help diagnose your problem. If there is more output surrounding these error messages, please copy as much as is relevant.

Or if your issue is reproducible, you can attach the files needed to run your simulation so that the development team can try to reproduce your issue. This way, we can try to debug whatever is happening in this system.

@sathishdasari
Author

sathishdasari commented Dec 22, 2018

Dear Sir,
I could not share the files because they are larger than 10 MB. I just changed the args in the 2walker.json file from

"args" : ["-s","-deffnm","adp"],

to

"args" : ["-s","-deffnm","adp","-cpi", "adp", "-append"],

and added

"restart" : true,

to the .json file.
I used the following command to run the simulation on a system consisting of 2 sockets, 6 cores per socket, 2 threads per core (2x6x2=24 CPUs)

mpirun -np 24 ssages 2walker.json &

I am getting the following error:

*** Error in `ssages': corrupted size vs. prev_size: 0x0000000001e824a0 ***
[ccl2:22785] *** Process received signal ***
[ccl2:22785] Signal: Aborted (6)
[ccl2:22785] Signal code:  (-6)


2 participants