Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Amber parser should be improved #222

Closed
DrDomenicoMarson opened this issue Sep 6, 2022 · 28 comments
Closed

Amber parser should be improved #222

DrDomenicoMarson opened this issue Sep 6, 2022 · 28 comments

Comments

@DrDomenicoMarson
Copy link
Contributor

DrDomenicoMarson commented Sep 6, 2022

Hello everyone, I think the Amber_parser needs some adjustment, other than issue #221 which was causing misreading of time information for both u_nk and dHdl parsers. Here I report the fixes that I think should be made, and I think the best thing should be a complete rewrite of the parser, properly tested:

  1. Temperature is required in input for both parsers (extract_dHdl and extract_u_nk), but system temperature is already automatically read from the output file (reading temp0). We could make the T variable optional in the parsers, and I suggest not removing it, not to break existing scripts from users. We should check that the requested temperature T is the same as reported by the AMBER output file and report eventual inconsistencies to the user.

  2. related to point 1), currently if temp0 is not found we log that 'Non-constant temperature MD not currently supported.' I don't think that this is right, as temp0 is settled even in non-constant temperature MD, so I think that the check is meaningless. I agree that we could warn the users if the simulation was not performed at a constant T, but that's not so easy to check, and I don't think a similar check exists for other parsers (the user should know the simulation have to be run in NPT/NVT ensemble).

  3. AMBER output file is one only with all the information for TI and MBAR reported every ntpr timesteps. Currently, I see two problems here:
    3a. If the user wants to extract both dHdl and u_nk values, the output file is completely parsed twice, which can waste quite some time.
    3b. If the user only wants dHdl values, and the file doesn't have proper MBAR information, the user is needlessly warned that MBAR values are not read.
    I think we should refactor the code to have a single run through the AMBER output file, extracting just what's needed by the user in one go. I imagine we could have the main function called extract_dHdl_and_u_nk, which defaults to returns both dHdl and u_nk, while maintaining the current extract_dHdL and extract_u_nk which will simply call the main function with the optional switch to extract only what's needed.

  4. AMBER output can optionally report every ntave steps the average of dHdl and other parameters, right now parsing dHdl we are collecting these values (here we are also collecting different components reported by AMBER), but we are not using this information at all in the code... I think we could just skip the reading of the averages since we are collecting all the values (and then we'll perform better equilibration/subsampling analysis).

I'm working on implementing those changes, please let me know if you think anything else should be changed (or if I should just not touch anything :-) )!

@orbeckst
Copy link
Member

orbeckst commented Sep 7, 2022

These all sound like good suggestions. As you point out, remaining backwards-compatible and not breaking existing code is important.

I would encourage you to start with a PR that doesn't do everything but only minimal changes. That's a lot easier to review: less code and the purpose can be stated clearly. Presumably, the biggest change is (3).

The conservative approach would be to first do (4), (1) and (2) for the current parser. These are all API-relevant changes. We might be able to get this into 1.0.0, which would be important for defining the API. We could then tackle (3), which keeps the API but changes the code underneath (and adds the new extract_dHdl_and_u_nk() function).

@DrDomenicoMarson
Copy link
Contributor Author

Hi! In my fork I already implemented all the changes, I found it easier to do it in one go, as the cose was heavily duplicated before in the two functions, so unifying them before making the changes allowed me to change just one function then.

My modified fork currently pass all the present tests, I want to add more tests before making a PR, though, if I need to add more files to be tested, I should also make a PR in alchemtest?

@orbeckst
Copy link
Member

orbeckst commented Sep 7, 2022 via email

@xiki-tempula
Copy link
Collaborator

xiki-tempula commented Sep 8, 2022

Thanks for this suggestion. Sorry I was in a conference.

Temperature is required in input for both parsers (extract_dHdl and extract_u_nk), but system temperature is already automatically read from the output file (reading temp0). We could make the T variable optional in the parsers, and I suggest not removing it, not to break existing scripts from users. We should check that the requested temperature T is the same as reported by the AMBER output file and report eventual inconsistencies to the user.

I think this is a good idea.

related to point 1), currently if temp0 is not found we log that 'Non-constant temperature MD not currently supported.' I don't think that this is right, as temp0 is settled even in non-constant temperature MD, so I think that the check is meaningless. I agree that we could warn the users if the simulation was not performed at a constant T, but that's not so easy to check, and I don't think a similar check exists for other parsers (the user should know the simulation have to be run in NPT/NVT ensemble).

This sounds good to me as well but I'm not an amber expert.

AMBER output file is one only with all the information for TI and MBAR reported every ntpr timesteps. Currently, I see two problems here:
3a. If the user wants to extract both dHdl and u_nk values, the output file is completely parsed twice, which can waste quite some time.

Agreed.

3b. If the user only wants dHdl values, and the file doesn't have proper MBAR information, the user is needlessly warned that MBAR values are not read.

I think the user still needs to be warned. I mean people who know what they are doing will not care but novice users might need this.

I think we should refactor the code to have a single run through the AMBER output file, extracting just what's needed by the user in one go. I imagine we could have the main function called extract_dHdl_and_u_nk, which defaults to returns both dHdl and u_nk, while maintaining the current extract_dHdL and extract_u_nk which will simply call the main function with the optional switch to extract only what's needed.

I think the overall idea is good but we might need to discuss how to set the API up. The alchemlyb will want to read more than amber files and we want a uniformed interface across all MD engines. Given that some MD engine don't give u_nk. I don't think the API call of extract_dHdl_and_u_nk would be great. I think one way of calling it could be

parsing = alchemlyb.parsing.amber(files, T)
n_uk = parsing.extract_u_nk()
dHdl = parsing.extract_dHdl()

This would mean that the file don't need to be read twice.

This could also be expended to

n_uk = alchemlyb.parsing.amber(files, T).extract_u_nk()
dHdl = alchemlyb.parsing.amber(files, T).extract_dHdl()

To kind of preserve the original API call.

If we have this infrastructure in place, we could also easily preserve the current call with a static method.

n_uk = alchemlyb.parsing.amber.extract_u_nk(files, T)
dHdl = alchemlyb.parsing.amber.extract_dHdl(files, T)

AMBER output can optionally report every ntave steps the average of dHdl and other parameters, right now parsing dHdl we are collecting these values (here we are also collecting different components reported by AMBER), but we are not using this information at all in the code... I think we could just skip the reading of the averages since we are collecting all the values (and then we'll perform better equilibration/subsampling analysis).

I also think this is a good idea.

So I think 1/2/4 are both very good ideas and can be made into one PR, I would imagine that it could be reviewed and merged relatively quickly. The point 3 definitely deserve its own PR as we need to change the API call for all MD engines, document the changes, and fix all the offending lines and it will be a big PR needs several rounds of reviewing and would take a while.

@DrDomenicoMarson
Copy link
Contributor Author

Sorry for the late replay! Regarding warning the user when MBAR is not read when just dHdl extraction is requested, I'm not so sure this is needed. Amber is quite different from other software, as in the same output file we could have all the information about both dHdl and MBAR. But if the user is requesting just dHdl values, he should not be warned about something he doesn't care about, while the warning should be displayed when both dHdl and u_nk are requested, returning maybe None for u_nk, to be more flexible. An error should finally be returned when only u_nk values are requested, obviously.

@DrDomenicoMarson
Copy link
Contributor Author

If I understand correctly, I could do a PR addressing 1/2/4, while to make the interface change that you proposed (which I find really interesting) we could have a discussion and then make a change in all the parsers, right?

If my understanding is correct, could I please make the PR addressing 1/2/4 with the hybrid implementation that I have right now? I would just rename and hide the current extract_dHdl_and_u_nk function, leaving the current API unchanged, just both the current functions will call the same parser to read the AMBER output.

Currently all present tests are OK, but I would like to implement some more test to check some missing part of the code. Regarding this, I noticed that currently some testing functions run on a collection of output file, testing for different inconsistencies, but in this way when a test fails the report is not such clear, and one have to go read what was wrong in each file. Could I rename the files and make one separate test for each of them?

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2022

I have some comments on the specific points in #222 (comment)

We could make the T variable optional in the parsers.

We cannot make it optional for the other parsers. I'd rather keep the API the same across all parsers so that you can use the same code for analyzing, say, AMBER and GROMACS files.

We should check that the requested temperature T is the same as reported by the AMBER output file and report eventual inconsistencies to the user.

Yes, adding a check is good.

3b. If the user only wants dHdl values, and the file doesn't have proper MBAR information, the user is needlessly warned that MBAR values are not read.
I think the user still needs to be warned. I mean people who know what they are doing will not care but novice users might need this.

I agree with @DrDomenicoMarson : If the user's intent is clearly that they only reading TI data (by using extract_dHdl()) then there shouldn't be a warning about MBAR data.

I think we should refactor the code to have a single run through the AMBER output file, extracting just what's needed by the user in one go. I imagine we could have the main function called extract_dHdl_and_u_nk, which defaults to returns both dHdl and u_nk, while maintaining the current extract_dHdL and extract_u_nk which will simply call the main function with the optional switch to extract only what's needed.

I agree with @xiki-tempula that it still needs to be possible to parse AMBER files in the same way as other files and it should be transparent. Note that GROMACS can also save both TI and MBAR data in one file so we should come up with an approach that is general. I feel that this is a bigger question so a separate issue might be the better place. I just want to say that @xiki-tempula 's approach looks feasible but we would be effectively introducing a class parsing.amber.AmberReader that holds state. Then AmberReader.extract_un_nk() and AmberParser.extract_dHdl() would just pull data out of the instance. Holding state is something that we have been trying to avoid in the design (at least when possible). Therefore, separate discussion would be useful.

So I think 1/2/4 are both very good ideas and can be made into one PR, I would imagine that it could be reviewed and merged relatively quickly. The point 3 definitely deserve its own PR as we need to change the API call for all MD engines, document the changes, and fix all the offending lines and it will be a big PR needs several rounds of reviewing and would take a while.

I agree with @xiki-tempula 's assessment.

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2022

If my understanding is correct, could I please make the PR addressing 1/2/4 with the hybrid implementation that I have right now? I would just rename and hide the current extract_dHdl_and_u_nk function, leaving the current API unchanged, just both the current functions will call the same parser to read the AMBER output.

I don't think that doing a PR with (3) already included is a good idea. This will be too complicated to review. It is really important to keep development of alchemlyb focused. Reviewing PRs takes a lot of our time — time that we have to take somehow from our more than full schedules (we don't get paid for working on alchemlyb!), we need you to help us to make the process as easy as possible. Breaking down changes into independent parts is really important. It also makes for a much cleaner history (which is important when debugging).

Additionally, we would like to get 1.0 out soon. It seems feasible to get (1), (2), (4) in in time, but (3) opens up a much bigger discussion.

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2022

Currently all present tests are OK, but I would like to implement some more test to check some missing part of the code. Regarding this, I noticed that currently some testing functions run on a collection of output file, testing for different inconsistencies, but in this way when a test fails the report is not such clear, and one have to go read what was wrong in each file.

Could you give an example — ideally in a new issue?

Better tests are always welcome. The best thing is to start with a small PR that fixes one thing.

Could I rename the files and make one separate test for each of them?

I am not sure what files you're referring to. We really try to avoid renaming data files and data sets (i.e., files that are already in alchemtest) as this can break code elsewhere. If you want to change the test_<something>.py files in alchemlyb, then raise an issue describing the problem and show a solution in a PR. But opening the issue first can save you time coding as we can explain why certain choices were made or brainstorm an improvement.

@DrDomenicoMarson
Copy link
Contributor Author

I agree that (3) is not feasible so quickly, as it implies changing the behavior of other parsers as well. I was suggesting that I'll do a PR with the other issues solved, but from a code standpoint I will merge the two functions into one, exposing to the user the same functions that there are now, so for the user, nothing will change. As an example

def _common_extractor(extract_dHdl, extract_u_nk):
    while [parsing the file]:
         if extract_u_nk: accumulate u_nk data
         if extract_dHdl: accumulate dHdl data
   if extract_dHdl: return dHdl
   if extract_u_nk: return u_nk

def extract_dHdl():
   return _common_extractor(extract_dHdl)

def extract_u_nk():
   return _common_extractor(extract_u_nk)

I think this is a good idea because currently the code is mostly repeated twice, so the adjustment has to be made in different portions of the code, so the reviewing should not be much harder in this way, but even easier. Also, I'm trying to implement more tests.

@DrDomenicoMarson
Copy link
Contributor Author

er tests are always welcome. The best thing is to start with a small PR that fixes one thing.

Could I rename the files and make one separate test for each of them?

I am not sure what files you're referring to. We really try to avoid

Regarding my question about tests, I raised an issue in alchemtest, as to keep the conversations more focused!

@DrDomenicoMarson
Copy link
Contributor Author

DrDomenicoMarson commented Sep 12, 2022

We cannot make it optional for the other parsers. I'd rather keep the API the same across all parsers so that you can use the same code for analyzing, say, AMBER and GROMACS files.

If we make it optional the API will not change compared to the other parsers, the user can still call extract_u_nk(file, T) explicitly with T, but the difference for AMBER files will be that the user could also NOT specify T. If the user specifies a T that is different than the temperature at which the simulation has been run, the new parser will give a WARNING (I noticed for example that in the current tests, we extracted with T=300, while the simulations were run at 298.15 K, not a big issue obviously XD)

EDIT: working on it a little bit further, I found that making T optional breaks the functionality of @_init_attrs, so I changed it back to non-optional and just kept the WARNING if the T is different from what was read from the file!

@DrDomenicoMarson
Copy link
Contributor Author

I just want to say that @xiki-tempula 's approach looks feasible but we would be effectively introducing a class parsing.amber.AmberReader that holds state. Then AmberReader.extract_un_nk() and AmberParser.extract_dHdl() would just pull data out of the instance. Holding state is something that we have been trying to avoid in the design (at least when possible). Therefore, separate discussion would be useful.

I didn't know you were trying to avoid holding state, and I think in this case it shouldn't be needed, for any parsers. We could just have a common hidden function that parses the file with some switch to select which information is requested, the function then will return only what was needed. Than there will be the two already existing parsing functions that will call the hidden function with the correct switch to read the proper values. Also, we could introduce a third function that will extract both dHdl and u_nk, calling the hidden function and requesting both values.

The user in this way has the flexibility to read what he needs from the files, calling the proper function.

@orbeckst
Copy link
Member

If possible, we would want to avoid switches that are specific for one parser. The API should be as general as possible.

Maybe the better choice is indeed as you propose to have a specific "read_u_nk_and_ti=True" switch only for the AMBER parser and then have other parsers ignore the kwarg. That's feasible. We could have a top level extractor that gets dHdl and u_nk in one function call. By default it uses the separate extract_dHdl() and extract_u_nk() but for AMBER (and GROMACS) it can default to an optimized reader.

The basic approach is that the API should be as uniform as possible. The goal is that you should be able to write code with alchemlyb that is agnostic to where the data comes from.

@orbeckst
Copy link
Member

orbeckst commented Sep 12, 2022

(I noticed for example that in the current tests, we extracted with T=300, while the simulations were run at 298.15 K, not a big issue obviously XD)

In all fairness, that should be changed eventually. Good catch. (EDIT: Make a separate issue, please. Don't fix as part of another PR.)

@DrDomenicoMarson
Copy link
Contributor Author

If possible, we would want to avoid switches that are specific for one parser. The API should be as general as possible.

I don't get where the problem is here, as with what I proposed the user is not exposed to any new switch, and the API stays the same as before with (I edited the pseudo-code a little bit)

def _common_extractor(file, T, extract_dHdl:bool, extract_u_nk:bool): # this is not used by the user
    while [parsing the file]:
         if extract_u_nk: accumulate u_nk data
         if extract_dHdl: accumulate dHdl data
   if extract_dHdl and extract_u_nk: return dHdl, u_nk 
   if extract_dHdl: return dHdl
   if extract_u_nk: return u_nk


# from the API point of view these two are the same functions that are exposed to the user right now, but they internally call the new common function

def extract_dHdl(file , T): 
   return _common_extractor(file, T, extract_dHdl=True, extract_u_nk=False)

def extract_u_nk(file, T):
   return _common_extractor(file, T, extract_dHdl=False, extract_u_nk=True)


# optionally for AMBER (and gromacs maybe) we can implement this other function

def read_dHdl_and_u_nk(file, T):
   return _common_extractor(file, T, extract_dHdl=True, extract_u_nk=True)

@DrDomenicoMarson
Copy link
Contributor Author

DrDomenicoMarson commented Sep 13, 2022

I understand that everything has to be made in very small steps, I'm sorry I didn't understand this earlier, but I thought the AMBER file parsing was a little bit "left behind" here as there are many small different issues to be addressed. That's the reason 1 year ago or so I gave up on using alchemlyb to analyze my AMBER simulations as I hadn't the time to check what was wrong (I couldn't concatenate properly different files, ntpr problem. among other things).

I thought I could just fix (or try to :-) ) different things in one go, I'm sorry.

I think it would be better to close this issue as "not planned", and I can re-open different issues for every simple thing I can address in small fixes.

For now, I leave this open, but I opened two issues addressing points 1)/2), and 4), leaving this issue just for point 3). let me know if it's better to close it and open another dedicated issue!

@orbeckst
Copy link
Member

Thank you @DrDomenicoMarson for coming back to alchemlyb and starting to fix things — it's incredibly valuable to have contributors like you who work with the specific code.

Reading your API proposal I agree with you that it can be done without breaking the existing API and instead introducing a new function read_dHdl_and_u_nk(file, T). @xiki-tempula 's idea (which I expanded upon) was that we do not introduce a new function but instead make it so that that if you either read the u_nk or the dHdl, then the result of the other were already read and effectively cached. On reflection, my feeling is that the current philosophy of alchemlyb is better served with your approach to introduce an explicit function like read_dHdl_and_u_nk(file, T).

Thank you for opening the other issues. That makes it a lot easier to break down the development into manageable chunks.

@xiki-tempula
Copy link
Collaborator

xiki-tempula commented Sep 13, 2022

I'm concerned with NAMD parser not having a dHdl reader so we cannot have read_dHdl_and_u_nk for NAMD.
I think a better approach might be to have extract(file, T) which returns a dictionary, which mimics the pymbar4 interface where things are stored as a dictionary. For not NAMD parsers, extract(file, T) returns {'dHdl': dHdl, 'u_nk': u_nk}, for NAMD, it will just be {'u_nk': u_nk}. Then we could have a common interface and it could also cope with the case when dHdl parser is not present.

@orbeckst
Copy link
Member

I like the idea of an extract() function that returns the data in a dict.

@DrDomenicoMarson
Copy link
Contributor Author

I like the idea of @xiki-tempula as well, it seems that it serves everything you wanted in a very elegant way!
We'll have to change all the parsers, and I don't know If this change will break some other functions/tests, I don't know the codebase so well, I used the API and just dug into the AMBER parsers for now!

@DrDomenicoMarson
Copy link
Contributor Author

@orbeckst @xiki-tempula what are you thought about the introduction of the extract(file, T) returning a dict?
As I understood you would like to have 1.0 out in the near future, I think an API change like that should be made before the release, so we don't break anything doing it after 1.0 release, right?

@orbeckst
Copy link
Member

Yes, let's get extract() with dict return in before 1.0.

@DrDomenicoMarson
Copy link
Contributor Author

Great!
This issue needs work on all the parsers, but I'm familiar just with the AMBER parser, should I try to make the change to all the parsers in a single PR or should we work each of us on a different parser?

@xiki-tempula
Copy link
Collaborator

For other parsers just do

def extract(file, T):
    return {'dHdl': extract_dHdl(file, T), 'u_nk': extract_u_nk(file, T)}

Then add a note saying that this is just a wrapper around extract_dHdl and extract_u_nk.

@DrDomenicoMarson
Copy link
Contributor Author

Ok, I think I can do it without breaking anything :-)

A bigger issue would be to create a proper test for each of the other parsers for this function I think!

@orbeckst
Copy link
Member

@DrDomenicoMarson just wanted to say a big thank you for all your recent work — well done. Please continue to contribute!!!!

xiki-tempula pushed a commit that referenced this issue Sep 27, 2022
…ests (#240)

Ok, I hope not to have broken anything with this PR :-)
I added the extract(file, T) function to all the existing parsers, with a relative test for each parser.

The only parser that was heavily hit by this PR is the AMBER parser, as was mentioned in issue #222 .
Here I had to make different adjustments, on my machine all tests are passing through.
I took the opportunity to make also some small, cosmetic, changes to the code in the amber parser (just a couple of them, like removing the unnecessary 'class SomeThing(Object').
@DrDomenicoMarson
Copy link
Contributor Author

Addressed with PR #240

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants