
[flang][OpenMP] Privatize locally destroyed values in do concurrent #112

Merged (6 commits) on Jul 14, 2024

Conversation

@ergawy commented Jul 9, 2024

Locally destroyed values are those values for which the Fortran runtime calls `@_FortranADestroy` inside the loop's body. If these values are allocated outside the loop, and the loop is mapped to OpenMP, then a runtime error occurs because multiple teams try to access the same allocation. In such cases, a privatized local value is created in the OpenMP region to prevent multiple teams of threads from accessing and destroying the same memory block, which causes runtime issues.

This is one of the issues uncovered by LBL's inference engine.
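For illustration, a minimal reproducer of the affected pattern might look like the following sketch; the module and type names are assumptions chosen to match the HLFIR type `_QMstruct_modTtest_struct` shown later in the thread, and the allocatable component is what makes the runtime emit a destroy call for the per-iteration temporary:

    module struct_mod
      type :: test_struct
        integer, allocatable :: x_
      end type test_struct
    end module struct_mod

    program main
      use struct_mod
      implicit none
      type(test_struct) :: a(10)
      integer :: i

      do concurrent (i = 1:10)
        ! Each iteration constructs a temporary test_struct value; the runtime
        ! calls _FortranADestroy on that temporary at the end of the assignment.
        a(i) = test_struct(i)
      end do
    end program main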

@ergawy force-pushed the locally_alloc_loop_destroyed_values branch from 2485e85 to 27c02ed on July 9, 2024 at 07:53
@ergawy (author) commented Jul 10, 2024

Ping 🔔! Please take a look and let me know if you have any concerns or comments.

@mjklemm left a comment

I think this looks good, but I'm not a real expert here.

@skatrak left a comment

It looks to me like the issue here is more fundamental than just being about temporaries deleted in a loop. This is a race condition that will happen every time the same shared variable is read from and written to within a do concurrent loop, and it is the result of running in parallel the iterations of a loop that is only guaranteed to be correct if each iteration runs as an atomic unit, even if the iterations can run in any order.
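For concreteness, a hypothetical loop of the kind described, where every iteration reads and writes the same shared variable, might look like the following (such a loop also violates the DO CONCURRENT requirements to begin with):

    subroutine racy_sum(a, n, s)
      integer, intent(in) :: n
      real, intent(in) :: a(n)
      real, intent(out) :: s
      integer :: i

      s = 0
      do concurrent (i = 1:n)
        ! Every iteration reads and writes the shared variable s; executing
        ! the iterations in parallel is therefore a data race.
        s = s + a(i)
      end do
    end subroutine racy_sum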

I think the proper solution should involve checking that a given loop can be parallelized before trying to do so. However, we're currently giving compiler users the responsibility of checking their do-concurrent loops for such potential race conditions and blindly transforming loops if they tell us to do so.

So, assuming the original loop is actually something the user has verified can run in parallel (otherwise they can't expect a correct result given the current restrictions), the only way we can run into these race conditions is through our own transformations. My guess is that the problem here is that we're inserting allocas for temporaries during lowering, before the omp.parallel operation is created by this pass, which would otherwise be the target location for these.

I think that in this pass we should identify every alloca that is only used within the loop being transformed and then sink it into the `omp.parallel` operation once it's created (both for host and device), regardless of the presence of "destroy" function calls. Something similar was implemented already by `getSinkableAllocas()` in `OpenMPToLLVMIRTranslation.cpp`, so maybe that helps as a starting point (it's intended for a different case, so I don't think you can just copy-paste it and use it here).

Review thread on flang/docs/DoConcurrentConversionToOpenMP.md (outdated, resolved)
@mjklemm commented Jul 11, 2024

> I think the proper solution should involve checking that a given loop can be parallelized before trying to do so. However, we're currently giving compiler users the responsibility of checking their do-concurrent loops for such potential race conditions and blindly transforming loops if they tell us to do so.

This is, frankly, a bit of a philosophical question. There are basically two camps (I'm in Camp A):

  • Camp A: The programmer has written DO CONCURRENT so they assert that the code is race-free and satisfies the requirements stated in ISO Fortran.
  • Camp B: The programmer has written DO CONCURRENT and the compiler has to prove that it can not only execute the loop in any order, but also in parallel.

The way we are currently dealing with this is that we will have different levels of parallelization: none (serial execution, maybe the default), auto (Camp B behavior), and parallel (for CPU and GPU; Camp A behavior).

Since we are focusing on getting things to work as fast as we can, we are ignoring auto for the moment, but we have it on our radar to make sure that this mode will eventually also be supported.

@ergawy (author) commented Jul 11, 2024

Thanks @skatrak and @mjklemm for the discussion.

In addition to what Michael said, something is worth pointing out:

> This is a race condition that will happen every time the same shared variable is read from and written to within a do concurrent loop, and it is the result of running in parallel the iterations of a loop that is only guaranteed to be correct if each iteration runs as an atomic unit, even if the iterations can run in any order.

This is not entirely true. For example, take the following snippet:

    do concurrent (i=1:10)
        a(i) = test_struct(i)
    end do

where `test_struct` is a user-defined type. Here we are creating a temporary object in each loop iteration and assigning it to `a(i)`. In other words, there is no sharing and no race condition going on here.

If you convert this loop to its OpenMP equivalent:

    !$omp parallel do
    do i = 1,10
        a(i) = test_struct(i)
    end do
    !$omp end parallel do

the allocation for the temporary `test_struct` object will happen inside the `omp.parallel` region even at the HLFIR level:

    omp.parallel {
      %22 = fir.alloca !fir.type<_QMstruct_modTtest_struct{x_:!fir.box<!fir.heap<i32>>}> {bindc_name = ".result", pinned}
      ...
    }

which means no alloca sinking is taking place in such cases: flang knows from the get-go that the temporary allocation should happen within the boundaries of the parallel region.

With do concurrent we don't have this demarcation of serial/parallel regions, hence the need to detect code patterns that create temporaries and to sink those temporaries inside the region.
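Conceptually, what this localization achieves is the equivalent of privatizing the temporary in the hand-written OpenMP form. A sketch, where `tmp` is a hypothetical stand-in for the compiler-generated temporary and `struct_mod`/`test_struct` are assumed to be declared as in the reproducer sketch above:

    subroutine assign_all(a)
      use struct_mod, only: test_struct
      type(test_struct) :: a(10)
      type(test_struct) :: tmp
      integer :: i

      ! Making tmp private gives each thread its own copy to construct and
      ! destroy, which is what sinking/privatizing the alloca achieves at
      ! the IR level.
      !$omp parallel do private(tmp)
      do i = 1, 10
        tmp = test_struct(i)
        a(i) = tmp
      end do
      !$omp end parallel do
    end subroutine assign_all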

But you raise a valid concern; also, I did not know about the alloca sinking during LLVM translation.

What I will do now is try to understand how flang is smart enough to emit the alloca for the temporary inside the parallel region. Maybe we can learn something. My first guess was that maybe flang takes isolated-from-above regions into account, but that is not the case for omp parallel do.

@skatrak commented Jul 11, 2024

Perhaps I didn't explain myself very well initially. What I tried to get to is that there are two ways in which we will find these kinds of issues:

  • If parallelizing the do concurrent loop introduces data races. With our current approach, this is a user error and not something we need to fix in the compiler, but if we want to eventually enable some level of automatic do-concurrent transformations, we'll have to implement these checks.
  • If we introduce these data races ourselves during lowering. I think this is the case that this PR tries to solve, but I suggest the solution needs to be more general because the problem is not restricted to deallocations.

Flang decides where to place temporary allocations during lowering based on the MLIR OutlineableOpenMPOpInterface (which omp.parallel has), and if a parent with that trait is not present it will use the parent function instead. Since we're introducing the omp.parallel operation after lowering, these allocations are located in the function but they should be moved inside of omp.parallel if they're only used inside of the loop. And I think doing that should take care of this issue and it should hopefully prevent any data races from being introduced by the compiler in this pass.

@ergawy (author) commented Jul 11, 2024

> My first guess was that maybe flang takes isolated-from-above regions into account, but that is not the case for omp parallel do.

My first guess was almost correct 😛. The FIR builder finds the proper alloca block when it creates a temporary, and `getAllocaBlock` uses the `mlir::omp::OutlineableOpenMPOpInterface` interface to help find that block.

@mjklemm commented Jul 11, 2024

> • If parallelizing the do concurrent loop introduces data races. With our current approach, this is a user error and not something we need to fix in the compiler, but if we want to eventually enable some level of automatic do-concurrent transformations, we'll have to implement these checks.

Yes, that's exactly the statement I was making :-)

@ergawy (author) commented Jul 11, 2024

> Since we're introducing the omp.parallel operation after lowering, these allocations are located in the function but they should be moved inside of omp.parallel if they're only used inside of the loop.

That makes sense. I will change the logic in `collectLocallyDestroyedValuesInLoop` (and its name) to detect values that are allocated outside the loop and used only inside it.
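At the source level, an example of such a value (allocated outside the loop but only ever used inside it) could look like the following hypothetical snippet, where `tmp` is the value that needs to be localized:

    subroutine scale(a, n)
      integer, intent(in) :: n
      real, intent(inout) :: a(n)
      real :: tmp   ! allocated outside the loop ...
      integer :: i

      do concurrent (i = 1:n)
        tmp = 2.0 * a(i)   ! ... but only used inside it: a loop-local value
        a(i) = tmp
      end do
    end subroutine scale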

@ergawy (author) commented Jul 11, 2024

Thanks for the review @mjklemm @skatrak. I think I have handled the outstanding issue now.

@skatrak left a comment

Thank you for the changes, Kareem; it's looking good. I just have some smaller comments at this point.

Review threads (outdated, resolved) on:
  • flang/docs/DoConcurrentConversionToOpenMP.md
  • flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp (3 threads)
@skatrak left a comment

LGTM, thanks! I just have minimal nits at this point, no need for another review by me after addressing them.

@ergawy merged commit 24980a6 into ROCm:amd-trunk-dev on Jul 14, 2024
3 of 5 checks passed
ergawy added a commit to ergawy/llvm-project that referenced this pull request Aug 19, 2024
…vice

Extends ROCm#112.

This PR extends support for `do concurrent` mapping to the device a bit
more. In particular, it handles localization of loop-local values on the
device. Previously, this was only supported and tested on the host.

See docs for `looputils::collectLoopLocalValues` for the definition of
"loop-local" values.
ergawy added a commit that referenced this pull request Aug 19, 2024
…vice (#146)

Extends #112.

This PR extends support for `do concurrent` mapping to the device a bit
more. In particular, it handles localization of loop-local values on the
device. Previously, this was only supported and tested on the host.

See docs for `looputils::collectLoopLocalValues` for the definition of
"loop-local" values.