btl/vader: move memory barrier to where it belongs #5536

Merged 1 commit into open-mpi:master on Aug 14, 2018

Conversation

@hjelmn (Member) commented Aug 13, 2018

The write memory barrier was intended to precede setting a fast-box
header but instead follows it. This commit moves the memory barrier to
the intended location.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
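
For context on the fix: the fast-box path publishes a message by writing the payload into shared memory and then writing a header word that the receiver polls on. The sketch below is not the actual vader BTL code; the slot layout, the field names, and the `wmb()` helper (standing in for Open MPI's `opal_atomic_wmb()`) are hypothetical simplifications. It only illustrates the ordering this commit restores: the write barrier must sit between the payload stores and the header store.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a shared-memory fast-box slot.  The real
 * vader layout differs; this only illustrates the store ordering.    */
typedef struct fbox_slot {
    _Atomic uint64_t header;    /* the word the receiver polls on     */
    unsigned char    data[64];  /* payload, written before the header */
} fbox_slot_t;

/* Stand-in for a write ("store-store") barrier such as opal_atomic_wmb(). */
static inline void wmb(void)
{
    atomic_thread_fence(memory_order_release);
}

static void fbox_send(fbox_slot_t *slot, const void *buf, size_t len,
                      uint64_t hdr_value)
{
    memcpy(slot->data, buf, len);                /* 1. write the payload    */
    wmb();                                       /* 2. barrier BEFORE ...   */
    atomic_store_explicit(&slot->header,         /* 3. ... publishing the   */
                          hdr_value,             /*    header the receiver  */
                          memory_order_relaxed); /*    is polling on        */
}
```

Before this commit the barrier came after step 3, so on a weakly ordered CPU the header store could become visible to the receiver before the payload stores.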
@hjelmn (Member, Author) commented Aug 13, 2018

@shamisp Here is the PR.

@shamisp (Contributor) commented Aug 13, 2018

👍

@jsquyres (Member) commented

@hjelmn How bad is this bug? Do we need to backport it to other branches / do immediate releases?

@shamisp (Contributor) commented Aug 13, 2018

@jsquyres It is pretty bad: it causes random data corruption on ARM and POWER. We actually observed data corruption with a real application.

The issue was introduced by this commit:
a82f761#diff-821da35c17ae856f23e11bbedbc14976
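
To illustrate why the misplaced barrier surfaces as corruption on ARM and POWER but not on x86: the receiver polls the header and then reads the payload, as in the hypothetical sketch below (same simplified slot layout and naming as the sender sketch above, not the actual BTL code; `rmb()` stands in for a read barrier such as `opal_atomic_rmb()`).

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Same hypothetical slot layout as in the sender sketch above. */
typedef struct fbox_slot {
    _Atomic uint64_t header;
    unsigned char    data[64];
} fbox_slot_t;

/* Stand-in for a read ("load-load") barrier such as opal_atomic_rmb(). */
static inline void rmb(void)
{
    atomic_thread_fence(memory_order_acquire);
}

/* Returns 1 and copies out the payload once the expected header is visible. */
static int fbox_poll(fbox_slot_t *slot, uint64_t expected_hdr,
                     void *buf, size_t len)
{
    if (atomic_load_explicit(&slot->header, memory_order_relaxed)
        != expected_hdr) {
        return 0;   /* nothing published yet */
    }
    rmb();          /* order the header load before the payload loads */
    /* If the sender's write barrier follows (rather than precedes) the
     * header store, ARM/POWER may make the header visible first and this
     * memcpy reads stale bytes: the random corruption described above.  */
    memcpy(buf, slot->data, len);
    return 1;
}
```

x86's strong store ordering keeps the payload stores ahead of the header store at the hardware level, which is why the symptom is specific to weakly ordered architectures.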

@hjelmn (Member, Author) commented Aug 13, 2018

Really surprised this wasn't noticed earlier or during testing. @shamisp Can we get a reproducer we can add to MTT?

@shamisp (Contributor) commented Aug 13, 2018

It shows up with Graph500. @nSircombe may have more details.

@jsquyres (Member) commented

Per 2018-08-13 discussion:

  • Appears to be ARM/POWER specific
  • Is in v2.1.x, v3.0.x, and v3.1.x (and recently-branched-but-not-yet-released v4.0.x)

@nSircombe commented

I don't have anything that reproduced the issues 100% of the time. The bug has manifested as intermittent failures (roughly 1 in 3 runs) on the reference version of Graph500 and on Tealeaf, usually segfaults but occasionally harder-to-spot problems such as the core solver failing to converge. So far I have focused on mini-apps and benchmarks, although I have also experienced erratic behaviour with full-scale production applications, notably OpenFOAM.

However, 2.x and 3.0.x have not displayed the same issues.

@jsquyres (Member) commented

@hjelmn is out of range today, so I'm merging for him.

@jsquyres jsquyres merged commit 1b96be5 into open-mpi:master Aug 14, 2018
@nSircombe commented

Just to confirm, this patch has fixed the issues I'd seen with Graph500 and Tealeaf: tens of test runs, all passed, no failures.

@jsquyres (Member) commented

Following up on GitHub just to preserve the knowledge...

Much discussion about this on the 2018-08-14 webex:

@markalle (Contributor) commented

Thanks. I remember checking the other branches when I made #4955, but maybe I checked out master wrong when I did that (e.g., checking it out from my fork without getting the real top of master). When I look at master now, I'm seeing a82f761 (Dec 5, 2017) as the point where the issue appears in master.

So that sure sounds like my initial claim from around Mar 22, 2018 that master didn't have the issue was wrong.

@markalle (Contributor) commented

Btw, did I hear this was affecting x86? If so, did it appear with some new compiler version that was doing more compiler-level reordering? My understanding of the x86 CPU-level guarantees/non-guarantees is that the only reordering it can do is "loads may be reordered with older stores to different locations", which doesn't come up much.

So I wouldn't have expected x86 to be able to hit this unless the reordering happened at the compiler level.
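
For the distinction being drawn here, a hypothetical GCC/Clang-style example (not taken from the BTL; `publish`, `flag`, `payload`, and `compiler_barrier` are made-up names): the x86 CPU keeps stores in program order, but if the payload store is a plain non-volatile, non-atomic write, the compiler is still free to reorder it past the flag store unless something stops it, so a compiler-level barrier matters even where no hardware fence is emitted.

```c
#include <stdint.h>

/* Plain globals standing in for payload and flag words in shared memory. */
volatile uint64_t flag;   /* what the other side polls on                 */
uint64_t payload;         /* plain store: the compiler may move it        */

/* GCC/Clang compiler-only barrier: forbids the compiler from reordering
 * memory accesses across this point; emits no fence instruction on x86.  */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

void publish(uint64_t value)
{
    payload = value;      /* 1. plain store                                */
    compiler_barrier();   /* 2. without this, the compiler could sink the  */
                          /*    payload store past the flag store, even    */
                          /*    though the x86 CPU keeps stores in order   */
    flag = 1;             /* 3. publish                                    */
}
```

In the case discussed in this thread, the reordering was at the CPU level on ARM/POWER, which is why the fix moves a real write memory barrier rather than adding only a compiler barrier.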

@jsquyres (Member) commented

@markalle See the MTT link above.
