btl/vader: move memory barrier to where it belongs #5536
The write memory barrier was intended to precede setting a fast-box header but instead follows it. This commit moves the memory barrier to the intended location. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
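To illustrate the ordering described above, here is a minimal sketch in portable C11, not the actual vader code. The `fbox_slot_t` layout, `fbox_send`, `fbox_recv`, and `FBOX_PAYLOAD_MAX` are hypothetical names invented for this example. The point is only that the write barrier sits between the payload write and the header store, so a reader that sees a non-zero header can rely on the payload being complete.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fast-box slot for illustration; NOT the real vader layout. */
#define FBOX_PAYLOAD_MAX 64

typedef struct {
    _Atomic uint32_t header;                 /* 0 = empty, otherwise payload size */
    unsigned char payload[FBOX_PAYLOAD_MAX];
} fbox_slot_t;

/* Sender: write the payload, issue the write barrier, then publish the header. */
static void fbox_send(fbox_slot_t *slot, const void *data, uint32_t len)
{
    assert(len > 0 && len <= FBOX_PAYLOAD_MAX);
    memcpy(slot->payload, data, len);

    /* The barrier must come BEFORE the header store. Placing it after the
     * store would let the header become visible ahead of the payload. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&slot->header, len, memory_order_relaxed);
}

/* Receiver: returns the number of bytes copied, or 0 if the slot is empty. */
static uint32_t fbox_recv(fbox_slot_t *slot, void *out)
{
    uint32_t len = atomic_load_explicit(&slot->header, memory_order_relaxed);
    if (0 == len) {
        return 0;
    }

    /* Pairs with the sender's release fence: payload reads stay after the header load. */
    atomic_thread_fence(memory_order_acquire);
    memcpy(out, slot->payload, len);

    /* Mark the slot empty again once the payload has been consumed. */
    atomic_store_explicit(&slot->header, 0, memory_order_release);
    return len;
}
```

With the fence and the header store swapped, this code would still pass single-threaded tests, which is consistent with the intermittent failures reported later in this thread.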
@shamisp Here is the PR. |
👍 |
@hjelmn How bad is this bug? Do we need to back port it to other branches / do immediate releases? |
@jsquyres it is pretty bad. The issue was introduced by this commit: |
Really surprised this wasn't noticed earlier or during testing. @shamisp Can we get a reproducer we can add to MTT? |
It shows up with Graph500. @nSircombe may have more details. |
Per 2018-08-13 discussion:
|
I don't have anything that reproduces the issue 100% of the time. This bug has manifested as intermittent failures (~1/3 of runs) on the reference version of Graph500 and on Tealeaf: usually segfaults, occasionally harder-to-spot issues such as the core solver failing to converge. So far I have focused on mini-apps and benchmarks, although I have also experienced erratic behaviour with full-scale production applications, notably OpenFOAM. However, 2.x and 3.0.x have not displayed the same issues, |
@hjelmn is out of range today, so I'm merging for him. |
Just to confirm, this patch has fixed the issues I'd seen with Graph500 and Tealeaf: tens of test runs, all passed, no failures. |
Following up on github just to preserve the knowledge... Much discussion about this on the 2018-08-14 webex:
|
Thanks, I remember checking the other branches when I made #4955, but maybe I checked out master wrong when I did that (e.g., checking it out from my fork without getting the real top of master). When I look at master now I'm seeing a82f761 (Dec 5, 2017) as the point where the issue appears in master. So it sure sounds like my initial claim from around Mar 22, 2018, that master didn't have the issue, was wrong. |
Btw, did I hear this was affecting x86? If so did it appear with some new compiler version that was doing more compiler-level reordering? My understanding of the x86 CPU-level guarantees/non-guarantees is the only reordering it can do is "Loads may be reordered with older stores to different locations" which doesn't come up much. So I wouldn't have expected x86 to be able to hit this unless the reordering happened at the compiler level. |
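The distinction raised here, between compiler-level and CPU-level reordering, can be sketched as follows. This is an illustration only and does not claim to show what actually happened on x86. `demo_slot_t` and `demo_publish` are hypothetical names. The sketch uses the standard C11 `atomic_signal_fence`, which constrains the compiler but emits no hardware fence instruction. GCC and Clang treat it as a compiler barrier in practice, but treat that as an assumption for other toolchains.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical slot for illustration only. */
typedef struct {
    uint32_t payload;
    volatile uint32_t header;   /* 0 = empty, 1 = payload is valid */
} demo_slot_t;

static void demo_publish(demo_slot_t *slot, uint32_t value)
{
    assert(slot);
    slot->payload = value;

    /* Compiler-only fence: generates no instruction, but keeps the compiler
     * from moving the payload store past this point. x86 hardware does not
     * reorder stores with other stores, so the remaining risk there is the
     * compiler. On weakly ordered CPUs (e.g. ARM, POWER) this is not enough
     * and a hardware fence such as atomic_thread_fence() is needed. */
    atomic_signal_fence(memory_order_seq_cst);

    slot->header = 1;
}
```

Without any fence, the compiler is free to emit the `header` store before the `payload` store, since nothing in single-threaded semantics forbids it.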
@markalle See the MTT link, above. |