Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC][flang] Trampolines for internal procedures. #66157

Merged
merged 4 commits into from
Oct 23, 2023

Conversation

vzakhari
Copy link
Contributor

I would like to start a discussion about the ways for modifying
the current trampolines approach for Fortran internal procedures
used as actual arguments or pointer targets.

As Peter Klausler noted before the current approach implies security
risks due to writeable and executable stack requirement. We may need
to agree on a new scheme that does not have this issue.

I would like to start a discussion about the ways for modifying
the current trampolines approach for Fortran internal procedures
used as actual arguments or pointer targets.

As Peter Klausler noted before the current approach implies security
risks due to writeable and executable stack requirement. We may need
to agree on a new scheme that does not have this issue.
@llvmbot llvmbot added the flang Flang issues not falling into any other category label Sep 12, 2023
Copy link
Contributor

@klausler klausler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good write-up. One suggestion: trampolines could be allocated in a ring buffer, and instead of being immediately freed for reuse at the end of a host procedure, they could be redirected to point to a runtime library error routine that complains about a dead internal procedure being called.

@vzakhari
Copy link
Contributor Author

Good point! I will add it to an enhancements section later.

Copy link
Contributor

@jeanPerier jeanPerier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together, I like the flexibility of this plan.

@Leporacanthicus
Copy link
Contributor

I'm not opposed to either of those solutions. For AArch64, currently LLVM Trampolines won't work, they are unimplemented, see bug referenced below. Clang uses some other method to make Lambda's and similar nested function calls work.

I don't think it's a huge amount of work to fix the trampoilines for AArch64 - it just needs a good enough use-case and someone sufficiently motivated to actually implement it. :)

JuliaLang/julia#27174

@kiranchandramohan
Copy link
Contributor

Thanks @vzakhari for proposing this solution for pointers to internal procedures. Form a quick read,
-> This primarily solves the security issues due to a writeable stack.
-> The runtime solution will work on platforms that currently do not support the llvm trampoline intrinsics.
-> The details of filling the trampoline area for each platform are yet to be worked out.

Clang uses some other method to make Lambda's and similar nested function calls work
@Leporacanthicus would you know what that is? Would analysing the code generated for the following c++ code help?

#include <iostream>
#include <functional>
using std::cout; 

int do_some_process(int x, std::function<int(int)> process)
{
    return process(x); 
}

int main() 
{
    int y;
    y = 20;
    std::function<int(int)> lambda = [&y](int x)->int{ return x*y;};
    cout << do_some_process(2, lambda) << "\n"; 
    return 0;
}

@Leporacanthicus
Copy link
Contributor

I'm looking into how clang does this, but the example as given produces some really complicated indirection (to do with how std::function is implemented). If I try to simplify it so it's possible to understand what goes on, it decides that "well, I can just call the function inside the function anyway"... :(

@klausler
Copy link
Contributor

Lambdas are function objects; their operator() is a non-static member function, so this is set by the caller. The address of a lambda is a data address, not code. C/C++ don't have internal functions in the sense that Fortran does.

@vzakhari
Copy link
Contributor Author

Hello @madvenka786 and @atgreen, I wonder if you have something to share on this topic.

@atgreen
Copy link

atgreen commented Sep 20, 2023 via email

@yus3710-fj
Copy link
Contributor

Hello, I'll comment about the implementation of Fujitsu compiler, which is referred in the last technical call.
Unfortunately, it is the same way as GNU compiler. That means Fujitsu compiler allocates trampolines on the stack and requires executable stack.
So, I don't have any useful idea at the moment.

Copy link
Contributor

@kiranchandramohan kiranchandramohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The libffi proposal looks good to me. I guess we can workout subsequently on how to integrate this into Flang.

Let us know if you need any help. Approving to signal there are not concerns from my side. Please feel free to submit or continue to enhance the proposal here.

end subroutine host
```

Procedure code generated for subprogram `inernal()` must have access to the scope of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: inernal -> internal

Comment on lines 363 to 367
### Option #2: LLVM/compiler-rt support

It may be beneficial for projects besides Flang to use the alternative trampolines
implementation, so does it sound reasonable to actually put the support
into LLVM/compiler-rt?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds reasonable to me. You might want to sample interest in discourse. If speed of this feature being available is important you can consider implementing in Flang Runtime and then moving it to compiler-rt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a language that needs uplevel references and has the requirement that procedure pointers/arguments be simple code addresses would benefit from a general facility.

used via host association inside `internal()`. We will call this extra argument
a static chain link.

Fortran standard 2008 allowed using internal procedures as actual arguments or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: or -> for

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Fortran's "static chains" only ever have one link in them.


```fortran
subroutine host()
integer :: local = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether this is the best example, since local becomes a global since it is saved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch! Will change the code.


Procedure code generated for subprogram `inernal()` must have access to the scope of
its host procedure, e.g. to access `local` variable. Flang achieves this by passing
an extra argument to `internal()` that is a tuple of references to all variables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the tuple include procedure pointers as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. In particular, it is possible inside an internal procedure to use a procedure pointer via a host association, where the procedure pointer resolves to an address of another (or the same) internal procedure. For example:

module other
  abstract interface
     function callback()
       integer :: callback
     end function callback
  end interface
contains
  subroutine foo(fptr)
    procedure(callback), pointer :: fptr
    ! `fptr` is pointing to `callee1`, which needs the static chain link.
    print *, fptr()
  end subroutine foo
end module other

subroutine host(local)
  use other
  integer :: local
  procedure(callback), pointer :: fptr1
  procedure(callback), pointer :: fptr2
  fptr1 => callee1
  fptr2 => callee2
  call foo(fptr1)
  return
contains
  function callee1()
    integer :: callee1
    callee1 = fptr2()
  end function callee1
  function callee2()
    integer :: callee2
    callee2 = local
  end function callee2
end subroutine host
program main
  call host(10)
end program main

Comment on lines 371 to 373
* The trampoline area initialization implies writing target specific binary code
for the trampolines. Are there utils that the runtime implementation
can reuse?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is the advantage of the llvm intrinsic approach. Where the lowering of the intrinsic can be handled by the backend for the target. The runtime approach will need this to be handled for all targets in the runtime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is the main advantage of reusing existing LLVM support for the trampolines.

At the same time, it is not uncommon to have custom trampolines implementations for different purposes. For example, there are Orc trampolines in https://github.com/llvm/llvm-project/blob/main/llvm/lib/ExecutionEngine/Orc/OrcABISupport.cpp that (I think) have a requirement to preserve the call context completely, so they save/restore all the registers.

The libffi implementation also goes an extra half-mile to save original values of the scratch registers used in the trampoline code (see Scratch register section in https://sourceware.org/pipermail/libffi-discuss/2021/002579.html).

I think if we add a reasonable requirement that the trampoline code used for Fortran pointers to internal procedures must be as fast as possible, then we may be better off sticking to our own implementation of the trampolines. I think we should always be able to find scratch registers to clobber in the trampoline code without saving/restoring them. It does imply that we will have to be able to emit the trampoline code for all targets that Flang needs to support, and reproduce correct handling of all the caveats for different targets (e.g. Intel CET endbranch at the beginning of the trampoline code, PowerPC __clear_cache for the trampoline area, etc.). It should not be a big deal, though.

So, currently, I am more inclined to having Fortran runtime implementation for trampolines because it gives us predictable performance behavior regardless of the requirements of external components like libffi.


* Each trampoline invocation implies two indirect accesses with this approach.

### Option #1: Fortran runtime support
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a small description here about this option. i.e this a runtime based implementation for supporting trampolines and will not use llvm trampoline intrinsics (atleast in the default flow).

Could you add the equivalent for the following?

In MLIR LLVM dialect the replacement looks like this:

    llvm.call @llvm.init.trampoline(%8, %9, %7) : (!llvm.ptr<i8>, !llvm.ptr<i8>, !llvm.ptr<i8>) -> ()
    %10 = llvm.call @llvm.adjust.trampoline(%8) : (!llvm.ptr<i8>) -> !llvm.ptr<i8>
    %11 = llvm.bitcast %10 : !llvm.ptr<i8> to !llvm.ptr<func<void ()>>
    llvm.call @_QMotherPfoo(%11) {fastmathFlags = #llvm.fastmath<fast>} : (!llvm.ptr<func<void ()>>) -> ()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

* Added IR example.
* Added a note about performance overhead of other implementations.
@ceseo
Copy link
Contributor

ceseo commented Oct 11, 2023

I'm not opposed to either of those solutions. For AArch64, currently LLVM Trampolines won't work, they are unimplemented, see bug referenced below. Clang uses some other method to make Lambda's and similar nested function calls work.

I don't think it's a huge amount of work to fix the trampoilines for AArch64 - it just needs a good enough use-case and someone sufficiently motivated to actually implement it. :)

JuliaLang/julia#27174

I'm not opposed to either of those solutions. For AArch64, currently LLVM Trampolines won't work, they are unimplemented, see bug referenced below. Clang uses some other method to make Lambda's and similar nested function calls work.

I don't think it's a huge amount of work to fix the trampoilines for AArch64 - it just needs a good enough use-case and someone sufficiently motivated to actually implement it. :)

JuliaLang/julia#27174

I started taking a look at this for AArch64. I may have a patch soon implementing the trampolines.

@vchuravy
Copy link
Contributor

I started taking a look at this for AArch64. I may have a patch soon implementing the trampolines.

One word of caution, I think the calling convention on MacOS is using the register that you could use in Linux to implement it on Aarch64.
But my memory is fuzzy on that.

@ceseo
Copy link
Contributor

ceseo commented Oct 11, 2023

I started taking a look at this for AArch64. I may have a patch soon implementing the trampolines.

One word of caution, I think the calling convention on MacOS is using the register that you could use in Linux to implement it on Aarch64. But my memory is fuzzy on that.

Yes, macOS uses x18. I'll have that in mind, thanks!

@vzakhari
Copy link
Contributor Author

FWIW, you may look up the ABI rule used by LLVM for the static chain handling in llvm/lib/Target/*/*.td. The register used for the static chain is defined by CCIfNest, e.g. in AArch64CallingConvention.td:

def CC_AArch64_AAPCS : CallingConv<!listconcat(
  // The 'nest' parameter, if any, is passed in X18.
  // Darwin and Windows use X18 as the platform register and hence 'nest' isn't
  // currently supported there.
  [CCIfNest<CCAssignToReg<[X18]>>],
  AArch64_Common
)>;

@vzakhari vzakhari merged commit e5944c9 into llvm:main Oct 23, 2023
2 of 3 checks passed
ceseo added a commit to ceseo/llvm-project that referenced this pull request Oct 30, 2023
Add support for llvm.init.trampoline and llvm.adjust.trampoline intrinsics for
AArch64.

Fixes llvm#65573
Updates llvm#27174
Updates llvm#66157
ceseo added a commit to ceseo/llvm-project that referenced this pull request Jul 15, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline intrinsics for
AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157
ceseo added a commit to ceseo/llvm-project that referenced this pull request Jul 16, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline intrinsics for
AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157
ceseo added a commit to ceseo/llvm-project that referenced this pull request Jul 16, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline intrinsics for
AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157
ceseo added a commit to ceseo/llvm-project that referenced this pull request Jul 23, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline intrinsics for
AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157
ceseo added a commit to ceseo/llvm-project that referenced this pull request Jul 23, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline intrinsics for
AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157
ceseo added a commit to ceseo/llvm-project that referenced this pull request Jul 24, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline intrinsics for
AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157
ceseo added a commit that referenced this pull request Jul 24, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline
intrinsics for AArch64.

Fixes #65573
Fixes #76927
Fixes #83555
Updates #66157
llvmbot pushed a commit to llvmbot/llvm-project that referenced this pull request Jul 24, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline
intrinsics for AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157

(cherry picked from commit c4b66bf)
llvmbot pushed a commit to llvmbot/llvm-project that referenced this pull request Jul 24, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline
intrinsics for AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157

(cherry picked from commit c4b66bf)
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary:
Add support for llvm.init.trampoline and llvm.adjust.trampoline
intrinsics for AArch64.

Fixes #65573
Fixes #76927
Fixes #83555
Updates #66157

Test Plan: 

Reviewers: 

Subscribers: 

Tasks: 

Tags: 


Differential Revision: https://phabricator.intern.facebook.com/D60250624
tru pushed a commit to llvmbot/llvm-project that referenced this pull request Jul 26, 2024
Add support for llvm.init.trampoline and llvm.adjust.trampoline
intrinsics for AArch64.

Fixes llvm#65573
Fixes llvm#76927
Fixes llvm#83555
Updates llvm#66157

(cherry picked from commit c4b66bf)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flang Flang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants