Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[subinterpreters] Get the current Python interpreter state from Thread Local Storage (autoTSSkey) #84702

Closed
vstinner opened this issue May 5, 2020 · 12 comments
Labels
3.9 only security fixes topic-subinterpreters

Comments

@vstinner
Copy link
Member

vstinner commented May 5, 2020

BPO 40522
Nosy @vstinner, @seberg, @corona10, @pablogsal, @shihai1991, @h-vetinari
PRs
  • bpo-40522: _PyThreadState_Swap() sets autoTSSkey #19939
  • bpo-40522: Store tstate in a Thread Local Storage #23976
  • bpo-40522: Replace PyThreadState_GET() with PyThreadState_Get() #24575
  • bpo-40522: Use PyThreadState_Get() in _lsprof.c #24596
  • bpo-40522: Use PyThreadState_Get() in decimal #24597
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2020-05-05.17:12:14.636>
    labels = ['expert-subinterpreters', '3.9']
    title = '[subinterpreters] Get the current Python interpreter state from Thread Local Storage (autoTSSkey)'
    updated_at = <Date 2021-06-29.17:34:08.113>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2021-06-29.17:34:08.113>
    actor = 'h-vetinari'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Subinterpreters']
    creation = <Date 2020-05-05.17:12:14.636>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40522
    keywords = ['patch']
    message_count = 11.0
    messages = ['368182', '368188', '380324', '383894', '383895', '383899', '383900', '383940', '383967', '384057', '387312']
    nosy_count = 6.0
    nosy_names = ['vstinner', 'seberg', 'corona10', 'pablogsal', 'shihai1991', 'h-vetinari']
    pr_nums = ['19939', '23976', '24575', '24596', '24597']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue40522'
    versions = ['Python 3.9']

    @vstinner
    Copy link
    Member Author

    vstinner commented May 5, 2020

    _PyThreadState_GET() gets the current Python thread state from _PyRuntime.gilstate.tstate_current atomic variable.

    When I experimented per-interpreter GIL (bpo-40512), I got issues with _PyThreadState_GET() which didn't return the expected Python thread state.

    I propose to modify _PyThreadState_GET() in the exprimental isolated subinterpreters mode to get and set the current Python thread state using a Thread Local Storage: autoTSSkey.

    @vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.9 only security fixes labels May 5, 2020
    @vstinner
    Copy link
    Member Author

    vstinner commented May 5, 2020

    New changeset e838a93 by Victor Stinner in branch 'master':
    bpo-40522: _PyThreadState_Swap() sets autoTSSkey (GH-19939)
    e838a93

    @vstinner vstinner added topic-subinterpreters and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels May 15, 2020
    @vstinner vstinner changed the title Subinterpreters: get the current Python interpreter state from Thread Local Storage (autoTSSkey) [subinterpreters] Get the current Python interpreter state from Thread Local Storage (autoTSSkey) May 15, 2020
    @vstinner vstinner added topic-subinterpreters and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels May 15, 2020
    @vstinner vstinner changed the title Subinterpreters: get the current Python interpreter state from Thread Local Storage (autoTSSkey) [subinterpreters] Get the current Python interpreter state from Thread Local Storage (autoTSSkey) May 15, 2020
    @vstinner
    Copy link
    Member Author

    vstinner commented Nov 4, 2020

    See also bpo-15751: "Make the PyGILState API compatible with subinterpreters".

    @vstinner
    Copy link
    Member Author

    There are many ways to get the current interpreter (interp) and the current Python thread state (tstate).

    Public C API, opaque function call:

    • PyInterpreterState_Get() (new in Python 3.9)
    • PyThreadState_Get()

    Internal C API, static inline functions:

    • _PyInterpreterState_GET()
    • _PyThreadState_GET()

    There are so many variants that I wrote notes for myself:
    https://pythondev.readthedocs.io/pystate.html

    This issue is about optimizing _PyInterpreterState_GET() and _PyThreadState_GET() which are supposed to be the most efficient implementations.

    Currently, _PyInterpreterState_GET() is implemented as _PyThreadState_GET()->interp, and _PyThreadState_GET() is implemented as:

        _Py_atomic_load_relaxed(_PyRuntime.gilstate.tstate_current)

    --

    To find the _PyInterpreterState_GET() machine code, I read the PyInterpreterState_Get() assembly code (not optimized, it adds tstate==NULL test) and PyTuple_New() assembly code, since PyTuple_New() now needs to get the current interpreter:

    static struct _Py_tuple_state *
    get_tuple_state(void)
    {
        PyInterpreterState *interp = _PyInterpreterState_GET();
        return &interp->tuple;
    }

    To find the _PyThreadState_GET() machine code, I read the PyThreadState_Get() assembly code.

    I looked at the x86-64 machine code generated by GCC -O3 (no LTO, no PGO, it should not be relevant here), using GCC 10.2.1 on Fedora 33.

    _PyThreadState_GET():

    mov rax,QWORD PTR [rip+0x2292b1] # 0x743118 <_PyRuntime+568>

    _PyInterpreterState_GET():

    mov rax,QWORD PTR [rip+0x22a7dd] # 0x743118 <_PyRuntime+568>
    mov rax,QWORD PTR [rax+0x10]

    By default, Python is built with -fPIC: _PyRuntime variable does not have a fixed address.

    $ objdump -t ./python|grep '\<_PyRuntime\>'
    0000000000742ee0 g     O .bss	00000000000002a0              _PyRuntime

    The "[rip+0x2292b1] # 0x743118 <_PyRuntime+568>" indirection is needed by PIC.

    @vstinner
    Copy link
    Member Author

    _PyInterpreterState_GET():
    mov rax,QWORD PTR [rip+0x22a7dd] # 0x743118 <_PyRuntime+568>
    mov rax,QWORD PTR [rax+0x10]

    While working on bpo-39465, I wrote PR 20767 to optimize _PyInterpreterState_GET(): single instruction instead of two:

    • Add _PyRuntimeState.interp_current member: atomic variable
    • _PyThreadState_Swap() sets _PyRuntimeState.interp_current

    But I failed to measure any performance difference.

    @vstinner
    Copy link
    Member Author

    PR 23976 stores the currrent interpreter and the current Python thread state into a Thread Local Storage (TLS) using GCC/clang __thread keyword.

    On x86-64 using LTO and GCC -O3, _PyThreadState_GET() and _PyInterpreterState_GET() become a single MOV.

    Assembly code using LTO and gcc -O3.

    _PyThreadState_GET() in _PySys_GetObjectId():

    0x00000000004adabe <+14>: mov rbx,QWORD PTR fs:0xfffffffffffffff8

    _PyThreadState_GET() in PyThreadState_Get():

    0x000000000046b660 <+0>: mov rax,QWORD PTR fs:0xfffffffffffffff8

    _PyInterpreterState_GET() in PyTuple_New():

    0x000000000048dfcc <+12>: mov rax,QWORD PTR fs:0xfffffffffffffff0

    _PyInterpreterState_GET() in PyState_FindModule():

    0x000000000044bf20 <+16>: mov rax,QWORD PTR fs:0xfffffffffffffff0

    ---

    Note: Without LTO, sometimes there is an indirection:

    _PyThreadState_GET() in _PySys_GetObjectId(), 2 MOV (PIC indirection):

    mov rax,QWORD PTR [rip+0x1eb270] # 0x713fe0
    # rax = 0xfffffffffffffff0 (-16)
    mov r13,QWORD PTR fs:[rax]

    _PyInterpreterState_GET() in PyTuple_New(), 2 MOV (PIC indirection):

    mov rax,QWORD PTR [rip+0x294d95] # 0x713ff8
    mov rax,QWORD PTR fs:[rax]

    An optimized Python should always be built with LTO.

    @vstinner
    Copy link
    Member Author

    PR 23976 should make _PyInterpreterState_GET() more efficient, and it prepares the code base for one GIL per interpreter (which is purpose of this issue ;-)).

    @vstinner
    Copy link
    Member Author

    Mark Shannon experiment using __thread:

    He added " extern __thread struct _ts *_Py_tls_tstate;" to Include/cpython/pystate.h.

    @pablogsal
    Copy link
    Member

    An optimized Python should always be built with LTO.

    In MacOS is quite challenging to activate LTO, so normally optimized builds are only done with PGO. Also in Windows I am not sure is possible to use LTO. Same for many other platforms

    @vstinner
    Copy link
    Member Author

    One GIL per interpreter requires to store the tstate per thread. I don't see any other option. We need to replace the global _PyRuntime atomic variable with a TLS variable. I'm trying to reduce the overhead, but it's heard to beat the performance of an atomic variable.

    That's also we I modified many functions to pass explicitly tstate to subfunctions in internal C functions, to avoid any possible overhead of getting tstate.

    https://vstinner.github.io/cpython-pass-tstate.html

    Pablo:

    In MacOS is quite challenging to activate LTO, so normally optimized builds are only done with PGO.

    Oh right, I forgot macOS. I should check how TLS is compiled on macOS. IMO wwo MOV instead of MOV is not a major performance bottleneck.

    The best would be to be able to avoid pthread_getspecific() function which is less efficient than a TLS variable. The glibc implementation uses an array for a few variables (first 32 variables?) and then a slower hash table.

    Pablo:

    Also in Windows I am not sure is possible to use LTO. Same for many other platforms.

    I will check how it's implemented on Windows.

    We cannot use TLS on all platforms, since it requires C11 features which are not available on all platforms. Also, the implementation depends on the architecture.

    @vstinner
    Copy link
    Member Author

    New changeset 6207810 by Victor Stinner in branch 'master':
    bpo-40522: Replace PyThreadState_GET() with PyThreadState_Get() (GH-24575)
    6207810

    @vstinner
    Copy link
    Member Author

    vstinner commented Nov 3, 2022

    Sadly, I don't have the bandwidth to work on this issue, I just close it.

    @vstinner vstinner closed this as completed Nov 3, 2022
    Repository owner moved this from Todo to Done in Subinterpreters Nov 3, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes topic-subinterpreters
    Projects
    Status: Done
    Development

    No branches or pull requests

    2 participants