OpenFAST chkp corrupted on frontier #984

Closed
marchdf opened this issue Feb 28, 2024 · 10 comments

@marchdf
Contributor

marchdf commented Feb 28, 2024

OpenFAST checkpoint files aren't closed properly. This is an issue on any machine we use but only results in corrupted chkp files on Frontier (that we've seen so far).

The reason for this is that OpenFAST assumes Fortran numbering for turbine IDs (starting at 1), but we interface to the library using C numbering (starting at 0). So the following check (here):

IF (Turbine%TurbID == NumTurbines .OR. .NOT. PRESENT(Unit)) THEN
   CLOSE(unOut)
   unOut = -1
END IF

never gets hit (with 0-based IDs the largest TurbID is NumTurbines - 1) and the chkp file never gets closed.
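
The off-by-one can be reproduced in isolation. The following is a minimal C++ sketch, not OpenFAST code: `should_close_checkpoint` and `count_closes` are hypothetical names, the first standing in for the `Turbine%TurbID == NumTurbines` part of the Fortran condition (the `PRESENT(Unit)` branch is ignored here). It counts how often the close would fire for a given starting ID.

```cpp
#include <cassert>

// Hypothetical stand-in for the Fortran close condition
// `Turbine%TurbID == NumTurbines`; the PRESENT(Unit) branch is ignored.
bool should_close_checkpoint(int turb_id, int num_turbines) {
    return turb_id == num_turbines;
}

// Count how many turbines trigger the close when IDs start at `first_id`.
int count_closes(int first_id, int num_turbines) {
    int closes = 0;
    for (int i = 0; i < num_turbines; ++i) {
        if (should_close_checkpoint(first_id + i, num_turbines)) {
            ++closes;
        }
    }
    return closes;
}
```

With three turbines, `count_closes(0, 3)` models C numbering (IDs 0, 1, 2) and never reaches the close, while `count_closes(1, 3)` models Fortran numbering (IDs 1, 2, 3) and closes the file once, on the last turbine.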

The fix I am thinking about right now is to send a "Fortran ID" (starting at 1) to OpenFAST from amr-wind. But this causes a segfault somewhere else. Still need to track that down.

@psakievich I am not sure how/if this affects the way Nalu-Wind interfaces with OpenFAST as well.

@psakievich
Contributor

In nalu-wind we rely on the openfast-cpp interface, which also loops over indices starting at 0:
https://github.com/OpenFAST/openfast/blob/4b6337fcffe859c5eeb5445deeef2046439e5152/glue-codes/openfast-cpp/src/OpenFAST.cpp#L58-L78

We'll have to dig deeper into OpenFAST to see if this offset is handled inside the OpenFAST data structures.
@gantech do you know offhand?

@marchdf
Contributor Author

marchdf commented Feb 28, 2024

Yeah that's interesting. Can you throw a print statement in that if condition I mentioned and see if it ever hits that close call? I am worried this isn't being handled right with the nalu-wind code path as well.

@psakievich
Contributor

I'm not in a position to test this at the moment. I can put it on my backlog though. This would be more of an openfast core issue than a nalu-wind issue.

@lawrenceccheung
Contributor

I tried a naive approach to get rid of the off-by-1 error. Basically I replaced this line

fast_func(FAST_CreateCheckpoint, &fi.tid_local, rst_file);

with

auto my_tid_local = fi.tid_local + 1;
fast_func(FAST_CreateCheckpoint, &my_tid_local, rst_file);

However, that also resulted in segfaults when writing out the chkp files. So in the end I just commented out the IF check in
https://github.com/OpenFAST/openfast/blob/4b6337fcffe859c5eeb5445deeef2046439e5152/modules/openfast-library/src/FAST_Subs.f90#L7090 that @marchdf mentioned above.

Lawrence

@marchdf
Contributor Author

marchdf commented Mar 1, 2024

Yeah, that's not the right fix. Turns out this is a bit messed up. You need to increment global_id by one before passing it in. That sets the TurbId, which would then fix the chkp close check. The problem is that, on initialize, that argument doesn't get passed around correctly: the call site of FAST_InitializeAll_T and its argument list don't match:

The definition of the function: https://github.com/OpenFAST/openfast/blob/main/modules/openfast-library/src/FAST_Subs.f90#L37 has TurbId as its second parameter. So that's fine, except that:

The call site: https://github.com/OpenFAST/openfast/blob/main/modules/openfast-library/src/FAST_Library.f90#L144 passes in iTurb to that argument and not TurbId.
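
To make the mismatch concrete, here is a hedged C++ sketch of the pattern described above. The names `TurbineState`, `init_turbine`, and `init_all` are hypothetical and only mimic the shape of the call, not the real OpenFAST signatures. The callee stores whatever arrives in its ID parameter, so forwarding the array index instead of the intended ID silently discards the value the driver asked for.

```cpp
#include <cassert>
#include <vector>

struct TurbineState {
    int turb_id = -1;  // ID later used by checks such as the chkp close
};

// Hypothetical callee: the second parameter is meant to be the turbine ID.
void init_turbine(TurbineState& t, int turb_id) {
    t.turb_id = turb_id;
}

// Hypothetical caller. requested_ids[i] is the ID the driver wants slot i
// to carry. With pass_index == true the slot index is forwarded instead,
// which mirrors passing iTurb where TurbId is expected.
std::vector<TurbineState> init_all(const std::vector<int>& requested_ids,
                                   bool pass_index) {
    std::vector<TurbineState> turbines(requested_ids.size());
    for (int i = 0; i < static_cast<int>(requested_ids.size()); ++i) {
        init_turbine(turbines[i], pass_index ? i : requested_ids[i]);
    }
    return turbines;
}
```

In this sketch, a driver that requests IDs 1, 2, 3 ends up with 0, 1, 2 when the index is forwarded, so incrementing the ID on the caller side has no effect on what the turbine stores.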

I am in conversation with @andrew-platt for other fixes.

@marchdf
Contributor Author

marchdf commented Mar 1, 2024

Tracking this here: OpenFAST/openfast#2064

@andrew-platt

I did a little digging into this. There is a discrepancy between how we handle the turbine id with FAST.Farm and with the cpp interface. With FAST.Farm, we index using the Fortran index start of 1 for the turbine array between 1:NumTurbines. However, in the FAST_AllocateTurbines routine in FAST_Library.f90, a start index of 0 is expected.

As noted above, since the closing of the checkpoint file assumes the start index of 1, it is never closed with the cpp interface. So to fix this, I think we have three options.

  1. Change FAST.Farm to index turbines starting at 0.
  2. Change FAST_AllocateTurbines to start with a Fortran index of 1, and change amr-wind and other codes to match.
  3. If FAST_AllocateTurbines is called, set an internal flag to correctly handle this offset of the turbine number.

I'm inclined to pursue option 3 as this will preserve the existing numbering systems for cpp and FAST.Farm. I don't think it will be all that difficult to do in OpenFAST. I'll post here when I have a proposed solution in place.
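
Option 3 could look roughly like the sketch below. This is an illustration only and is not the code in OpenFAST: `TurbineRegistry`, `allocate_fortran_style`, `allocate_from_library`, and `is_last_turbine` are hypothetical names. The idea is that the library entry point records that its callers use 0-based IDs, and any check that compares against the turbine count applies the offset.

```cpp
#include <cassert>

// Hypothetical registry illustrating an internal flag for the ID offset.
class TurbineRegistry {
public:
    // FAST.Farm-style setup: IDs run 1..num_turbines.
    void allocate_fortran_style(int num_turbines) {
        num_turbines_ = num_turbines;
        zero_based_ = false;
    }

    // Library/cpp-style setup: IDs run 0..num_turbines-1, so set the flag.
    void allocate_from_library(int num_turbines) {
        num_turbines_ = num_turbines;
        zero_based_ = true;
    }

    // True for the last turbine under either numbering.
    bool is_last_turbine(int turb_id) const {
        const int one_based_id = zero_based_ ? turb_id + 1 : turb_id;
        return one_based_id == num_turbines_;
    }

private:
    int num_turbines_ = 0;
    bool zero_based_ = false;
};
```

Both callers keep their existing numbering in this design: a 0-based caller with three turbines gets `is_last_turbine(2) == true`, and a 1-based caller gets `is_last_turbine(3) == true`.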

@marchdf
Contributor Author

marchdf commented Mar 14, 2024

Hi @andrew-platt, thank you for looking into this! I appreciate you digging through this. Option 3 sounds good. I do still think (as I noted in the openfast issue) that there is an inconsistency: iTurb is being passed to FAST_InitializeAll_T instead of the TurbId that the argument list in the function definition expects. But I probably don't understand the reasoning behind that and the intricacies of the coupling to all the Fortran and cpp codes. Anyway, happy to help and test potential solutions. Thanks!

@andrew-platt

Proposed solution: OpenFAST/openfast#2097

@marchdf
Contributor Author

marchdf commented Mar 26, 2024

Confirmed that this is fixed with OpenFAST/openfast#2097

marchdf closed this as completed Mar 26, 2024