Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-38748: gh-82929: Add test for stack corruption. #26204

Merged
merged 7 commits into from
Sep 26, 2022

Conversation

michaelDCurran
Copy link
Contributor

@michaelDCurran michaelDCurran commented May 18, 2021

bpo-38748 aka #82929: add expected fail test for stack corruption when executing x86 stdcall ctypes callback

On x86, When a ctypes callback is __stdcall (WINFUNCTYPE) and it takes an argument larger than 4 bytes (E.g. a long long or a VARIANT), with one or more arguments preceeding it such that this argument is not aligned on a multiple of 8 bytes, the stack seems to become corrupted, and the function does not return to the correct location. This causes either a crash or at very least an OSError is raised.
For example arguments can be:

  • long, long long
  • long, long, long, long long
    But the corruption does not occur with something like:
  • long, long, long long

The test in this pr declairs a Python ctypes WINFUNCTYPE callback, with a return type of long, and arguments of long and long long.
The callback adds the two arguments together, prints the arguments and the result, and returns the result.
It then executes a cdecl c function which takes the callback, and two numbers as arguments. In turn, the c function executes the callback passing it the two numbers as arguments, returning the result of the callback.
The reason for making the outer c function cdecl and not stdcall was because cdecl seems to better handle inner stack corruption, raising OSError and seeming to leav the stack in a suitable state for future tests, rather than stdcall which just outright crashes.

When running the test on main, you can see via the printed message in the test, that the callback is executed once with the expected arguments, but then as it returns, it seems to be executed a second time with garbage arguments. Eventually then Python raises OSError. No doubt the callback is executed a second time as the return location just happens to line up with the start of the function.

this test is currently marked as an expected failiar as OSError is raised.

Manually backporting this patch to Python 3.7, the test passes.

https://bugs.python.org/issue38748

@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@michaelDCurran

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@tirkarthi tirkarthi changed the title Add test for stack corruption in issue 38748. bpo-38748: Add test for stack corruption. May 21, 2021
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Jun 21, 2021
@feerrenrut
Copy link

The failures in this test prevent our project from being able to upgrade Python. We are stuck on an unsupported version (3.7). @michaelDCurran has investigated and contributed the tests to reproduce the issue. Could someone (@zooba or @tiran) on the cPython project please merge this or provide feedback. Unfortunately we are quite blocked by this.

@ambv
Copy link
Contributor

ambv commented Sep 24, 2021

Closing and re-opening the PR to re-trigger CI.

@ambv ambv closed this Sep 24, 2021
@ambv ambv reopened this Sep 24, 2021
@ambv
Copy link
Contributor

ambv commented Sep 25, 2021

On win64 the test doesn't actually fail:

test_i38748_stackCorruption (ctypes.test.test_callbacks.Callbacks) ... unexpected success

The status is misreported as success because re-running ctypes tests is currently broken somehow:

0:06:20 load avg: 4.04 Re-running test_ctypes in verbose mode

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK

I'll deal with fixing regrtest to properly re-run test_ctypes but we will need to only mark the test as "expected failure" on win32. On win32 the test fails, as Michael is pointing out:

Current thread 0x00001140 (most recent call first):
  File "D:\a\1\s\lib\ctypes\test\test_callbacks.py", line 160 in test_i38748_stackCorruption
  File "D:\a\1\s\lib\unittest\case.py", line 547 in _callTestMethod
  File "D:\a\1\s\lib\unittest\case.py", line 591 in run
  File "D:\a\1\s\lib\unittest\case.py", line 646 in __call__
  File "D:\a\1\s\lib\unittest\suite.py", line 122 in run
  File "D:\a\1\s\lib\unittest\suite.py", line 84 in __call__
  File "D:\a\1\s\lib\unittest\suite.py", line 122 in run
  File "D:\a\1\s\lib\unittest\suite.py", line 84 in __call__
  File "D:\a\1\s\lib\unittest\suite.py", line 122 in run
  File "D:\a\1\s\lib\unittest\suite.py", line 84 in __call__
  File "D:\a\1\s\lib\unittest\suite.py", line 122 in run
  File "D:\a\1\s\lib\unittest\suite.py", line 84 in __call__
  File "D:\a\1\s\lib\test\regrtest.py", line 43 in _main
  File "D:\a\1\s\lib\test\regrtest.py", line 47 in <module>
  File "D:\a\1\s\lib\runpy.py", line 86 in _run_code
  File "D:\a\1\s\lib\runpy.py", line 196 in _run_module_as_main

@michaelDCurran
Copy link
Contributor Author

@ambv I have now marked the test as expected failure only on 32 bit (x86).
Therefore, on win64 the test is now expected to run successfully, which it does.

Copy link
Contributor

@MaxwellDupre MaxwellDupre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me.

@LeonarddeR
Copy link

Any chance this can be processed in the near future?

@dpy013
Copy link

dpy013 commented Jun 4, 2022

Will this pull request be merged this year?

@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@zooba
Copy link
Member

zooba commented Jun 6, 2022

I don't understand the urgency to get a failing test added, but I don't see why not.

Hopefully someone is submitting a fix to libffi so that we can adopt it in their next release?

@zooba zooba added the skip news label Jun 6, 2022
@codeofdusk
Copy link

I don't understand the urgency to get a failing test added

The associated bug is blocking a major screen reader project from upgrading its Python interpreter. The inclusion of this test would help facilitate tracking and hopefully start progress toward resolution.

@zooba
Copy link
Member

zooba commented Jun 7, 2022

Okay, so someone is working on the libffi fix then, and this is just for tracking. Good.

Incidentally, the test now crashes in CI. It might need to be moved into the "crashers" tests, or run in a separate process so that the crash can be handled gracefully.

@feerrenrut
Copy link

Okay, so someone is working on the libffi fix then, and this is just for tracking. Good.

@zooba I don't think we know that. We figure the first step is to get this acknowledged and accepted as a bug affecting Python, hence this PR.

As stated already, this bug blocks the screen reader project (NVDA on GitHub) from being able to update past Python 3.7. This is a high priority for the project.

@AA-Turner AA-Turner removed CLA signed stale Stale PR or inactive for long period of time. labels Jun 8, 2022
@dpy013
Copy link

dpy013 commented Jul 20, 2022

The way to get rid of python once and for all is to just drop the x86 version of python 3.x and go straight to the x64 version of python.
From what I've seen for so long, the python team doesn't have the idea of raising this pr to a higher review level.
So that nvaccess should turn passive to dimensional active.

@atgreen
Copy link

atgreen commented Sep 14, 2022

I'm the libffi author and just learned about this. I tried to replicate this in the libffi testsuite, here: https://github.com/libffi/libffi/blob/master/testsuite/libffi.call/bpo-38748.c
The test is passing on 32-bit x86 STDCALL ABI. Did I capture the test case properly? Which version of libffi are you using? And which compiler are you building with?

@zooba
Copy link
Member

zooba commented Sep 14, 2022

Assuming it's still current, it's libffi 3.4.2 (built from this mirror) and used whichever MSVC was the latest about a year ago. You can get our binaries from https://github.com/python/cpython-bin-deps/tree/libffi if you want to poke at them directly (no debug symbols, sorry).

@zooba
Copy link
Member

zooba commented Sep 14, 2022

I had a look at your test, and while I'm not 100% familiar with how you'd write this, I don't see anything there about stdcall? Maybe thats what ABI_ATTRandABI_NUM` mean, but I'm just not sure. The calling convention is an important piece here.

@atgreen
Copy link

atgreen commented Sep 14, 2022

Yes, that's right. See https://rl.gl/doc-text?id=RLGL-AKWR73CS
Search for bpo.
I was using gcc. Our CI testing only does 64-bit testing with msvc, so I guess we need to test with 32-bit as well.

@zooba
Copy link
Member

zooba commented Sep 14, 2022

Our CI testing only does 64-bit testing with msvc, so I guess we need to test with 32-bit as well.

FYI, though you probably know, MSVC only has a single calling convention on 64-bit and ignores the declarations. They only have an impact on 32-bit builds. So anything involving specific conventions won't show up unless you test 32-bit. (I believe ARM64 also only has one convention, but don't quote me on that... I also don't have the slightest idea how ARM64EC works on Windows...)

@michaelDCurran
Copy link
Contributor Author

I've done some further investigation on this as well.
@atgreen hope this extra info may assist you:
It looks like that when libffi makes a function closure for the x86 stdcall calling convention, it incorrectly assumes that 8 byte parameters (E.g. int64, uint64, double) will be aligned on the stack to 8 bytes. This is not at all the case. Rather on x86 with msvc, these parameters are pushed onto the stack by the caller as two separate 4 byte values, and can happen on any 4 byte alignment. This assumption then means that libffi calculates that the caller will allocate more stack space than it actually would, and as the x86 stdcall calling convention requires the callee (in this case libffi) to clean up the stack on return, libffi therefore cleans up too much stack space, and therefore the stack is not in the state the caller left it.
In libffi 3.4.2, refer to line 185 of ffi_prep_cif_machdep in src/x86/ffi.c.

bytes = FFI_ALIGN (bytes, t->alignment);

If compiled on MSVC and it is x86 and stdcall, it probably should be:

bytes = FFI_ALIGN (bytes, FFI_SIZE_ARG);

Although the code has moved quite a bit since Python 3.7's fork of libffi, it is still clear why Python 3.7's fork of libffi did not have the bug. Refer to Python 3.7.9's ffi_prep_cif.c starting at line 157. It has an ifdef only allowing the alignment based on the argument's type if the compiler is not msvc or mingw32. It also has a very explicit comment:

/* Don't know if this is a libffi bug or not.  At least on 
   Windows with MSVC, function call parameters are *not*
   aligned in the same way as structure fields are, they are
   only aligned in integer boundaries.
   This doesn't do any harm for cdecl functions and closures,
   since the caller cleans up the stack, but it is wrong for
   stdcall functions where the callee cleans.
*/

A specific patch could be:

diff --git a/src/x86/ffi.c b/src/x86/ffi.c
index 24431c17..987e4eaf 100644
--- a/src/x86/ffi.c
+++ b/src/x86/ffi.c
@@ -181,8 +181,16 @@ ffi_prep_cif_machdep(ffi_cif *cif)
   for (i = 0, n = cif->nargs; i < n; i++)
     {
       ffi_type *t = cif->arg_types[i];
-
+      #if defined(_MSC_VER) && defined(_M_IX86)
+      if(cabi == FFI_STDCALL) 
+      {
+        bytes = FFI_ALIGN (bytes, FFI_SIZEOF_ARG);
+      } else {
+        bytes = FFI_ALIGN (bytes, t->alignment);
+      }
+      #else
       bytes = FFI_ALIGN (bytes, t->alignment);
+      #endif
       bytes += FFI_ALIGN (t->size, FFI_SIZEOF_ARG);
     }
   cif->bytes = bytes;

It is questionable as to the exact ifdef check, but this patch currently removes argument alignment on x86 (32 bit) only when compiled with msvc and only when the calling convention is stdcall.

Running the ctypes tests, including my testcase in #26204 (no longer marked as expected failure), all tests run without errors, including my test.

Running libffi's own testsuit also runs with no further errors than what were already occurring without my patch.

@atgreen
Copy link

atgreen commented Sep 19, 2022

I just released libffi 3.4.3, which should include a fix for this. Please try it out and let me know.

@zooba zooba mentioned this pull request Sep 20, 2022
@michaelDCurran
Copy link
Contributor Author

@atgreen Firstly thanks for looking into this, it is much appreciated.
However, I am having trouble building libffi 3.4.3 in order to verify tests in cpython on Windows.
If I checkout tag v3.4.3 from the libffi upstream repo, and follow the Windows/cygwin build instructions, I get a compilation error. Yet, v3.4.2 compiles fine.
The error is as follows:

libtool: compile:  /cygdrive/c/Users/mick/programming/git/cpython-source-deps/msvcc.sh -DHAVE_CONFIG_H -I. -I.. -I. -I../include -Iinclude -I../src -DFFI_BUILDING_DLL -I. -I../include -Iinclude -I../src -c ../src/x86/sysv_intel.S  -DDLL_EXPORT -DPIC -o src/x86/.libs/sysv_intel.obj
sysv_intel.S
 Assembling: src/x86/.libs/sysv_intel.asm
src/x86/.libs/sysv_intel.asm(1807) : error A2008:syntax error : *
src/x86/.libs/sysv_intel.asm(1808) : error A2008:syntax error : *
make[3]: *** [Makefile:1279: src/x86/sysv_intel.lo] Error 1

Seems like some kind of syntax error in sysv_intel.S?

I also get the exact same error if I create a local branch of libffi-3.4.3 in cpython-source-deps, and run cpython's pcbuild\prepair_libffi.bat -x86 in order to build libffi for cpython.
Again, v3.4.2 compiles okay.
Let me know if you require any extra info from me in order to debug this further.

@zooba
Copy link
Member

zooba commented Sep 23, 2022

Seems like some kind of syntax error in sysv_intel.S?

Yes, I see the same issue.

Let's take it to #96965 rather than this issue.

@zooba
Copy link
Member

zooba commented Sep 23, 2022

libffi has been updated in main (and soon, 3.11), so once this branch is updated and tests are run it should pass.

@zooba
Copy link
Member

zooba commented Sep 26, 2022

Perfect! Thanks for being persistent on this one.

@zooba zooba merged commit d79dd92 into python:main Sep 26, 2022
@gpshead gpshead changed the title bpo-38748: Add test for stack corruption. bpo-38748: gh-82929: Add test for stack corruption. Jan 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.