Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test suite crashes when run single process --with-pydebug enabled #121832

Closed
freakboy3742 opened this issue Jul 16, 2024 · 24 comments · Fixed by #122340
Closed

Test suite crashes when run single process --with-pydebug enabled #121832

freakboy3742 opened this issue Jul 16, 2024 · 24 comments · Fixed by #122340
Assignees
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@freakboy3742
Copy link
Contributor

freakboy3742 commented Jul 16, 2024

Crash report

What happened?

The make testios testing target crashes when compiled --with-pydebug enabled (as is done in CI).

The error is a SIGABRT, raised during test_types; however, running test_types by itself isn't enough to reproduce the problem.

The error is raised on L8476 of typeobject.c:

    assert(self->tp_version_tag != 0);

The same failure can be manufactured on macOS with M1 hardware, as long as the test suite is executed as a single process (i.e. python -m test, not the multi-process option enabled by make test).

Full error trace on iOS:
stacktrace.txt

The same problem doesn't appear to occur when debug is not enabled.

CPython versions tested on:

CPython main branch

Operating systems tested on:

macOS, iOS

Output from running 'python -VV' on the command line:

3.14.0a0; verified on 2bac2b86, but possibly present back to dc03ce7

Linked PRs

@freakboy3742 freakboy3742 added type-crash A hard crash of the interpreter, possibly with a core dump 3.13 bugs and security fixes OS-ios 3.14 new features, bugs and security fixes labels Jul 16, 2024
@freakboy3742
Copy link
Contributor Author

I've narrowed the cause of the crash to #121602 (which was backported to 3.13 as #121632, so it explains the failures on both branches).

@freakboy3742
Copy link
Contributor Author

I have confirmed the crash is occurring in the SubInterpreterTest added as part of #121602. If that test is commented out, the test suite passes without error.

@ericsnowcurrently
Copy link
Member

FYI, I'm pretty sure gh-121636 will fix this.

The gist is: the change in gh-121602 runtime initialization makes a copy of the (mostly) unmodified PyTypeObject struct for each builtin static type. During finalization we restore that copy, including the tp_version_tag. Clearly a part of that is causing things to get in an inconsistent state on iOS. The fix in gh-121636 restricts the copy/restore to only the "tp" slots (function pointers), restoring the pre-121602 status quo for the rest, including tp_version_tag.

Is there a way you could verify gh-121636 fixes the iOS build?

All that said, I'm definitely curious why this would be manifesting only on iOS?

@freakboy3742
Copy link
Contributor Author

FYI, I'm pretty sure gh-121636 will fix this.
...
Is there a way you could verify gh-121636 fixes the iOS build?

The iOS buildbot should be sufficient to verify any fix; from the look of it, the current state of #121636 doesn't fix it. The test failure isn't very helpful - it's a hard crash of the emulator, which the test report doesn't show very nicely - but with the test_types SubinterpreterTest commented out, the suite completes and passes on iOS. I've verified the same tests locally, and I get the same failure on #121636.

All that said, I'm definitely curious why this would be manifesting only on iOS?

You and me both :-)

My best guess right now it's not iOS specific, but a function of how the test suite is executed on iOS. iOS is forced to run the entire test suite sequentially in a single process. This isn't (AFAIK) the default mode of operation on any other runner; so it's possible the specific sequence of test execution is causing an odd state to emerge. One of the debugging tasks I'm going to look into today is to try and reproduce this on macOS (the other is to try and find a subset of tests that causes the failure, rather than needing to get to test_types alphabetically... after 20 minutes)

@ericsnowcurrently
Copy link
Member

Looks like gh-121636 isn't sufficient. I'm going to try a different approach.

@ericsnowcurrently
Copy link
Member

See gh-121882.

@freakboy3742
Copy link
Contributor Author

Dang it... looks like gh-121882 doesn't fix the problem either... 😢

@freakboy3742
Copy link
Contributor Author

Confirming: the test_types test doesn't fail on macOS when the test suite is executed in single process mode, so the problem isn't purely about execution order. Now trying to bisect the test suite to find a smaller subset of tests that fail.

@freakboy3742
Copy link
Contributor Author

freakboy3742 commented Jul 17, 2024

Scratch that - it does crash on macOS - just you just have to remember to enable --with-pydebug.

@ericsnowcurrently To reproduce:

./configure --enable-universalsdk="`xcrun --show-sdk-path`" --with-universal-archs=universal2 --with-pydebug
make -j8
./python.exe -m test

(Although I also had to patch test_buildin.py::PtyTests.test_input_tty to avoid a lockup due to #116402).

Key details: Clean macOS build, with pydebug enabled; running as a single process (not multiprocess as make test will do), on an M1 MacBook Pro.

Test failure:

...
0:17:13 load avg: 14.30 [424/478/1] test_types
Assertion failed: (self->tp_version_tag != 0), function init_static_type, file typeobject.c, line 8476.
Fatal Python error: Aborted

Current thread 0x00000001fb35fac0 (most recent call first):
  <no Python frame>

Extension modules: _testinternalcapi, _testcapi, _testlimitedcapi, _testbuffer, _testclinic, _testclinic_limited, _ctypes_test, xxsubtype, _testexternalinspection, _testsinglephase_basic_wrapper, _testsinglephase_basic_copy, _testsinglephase_with_reinit, _testsinglephase_with_state, _test_non_isolated, _test_shared_gil_only (total: 15)
zsh: abort      ./python.exe -m test

Now trying to narrow down a subset of these tests that will fail.

@freakboy3742 freakboy3742 changed the title iOS testbed crashes when --with-pydebug enabled Test suite crashes when run single process --with-pydebug enabled Jul 17, 2024
@freakboy3742
Copy link
Contributor Author

@ericsnowcurrently I've got it narrowed down to just 2 tests:

rkm@eunectes build % ./python.exe -m test test_type_cache test_types 
Using random seed: 226415265
0:00:00 load avg: 9.54 Run 2 tests sequentially in a single process
0:00:00 load avg: 9.54 [1/2] test_type_cache
0:00:00 load avg: 9.54 [2/2] test_types
Assertion failed: (self->tp_version_tag != 0), function init_static_type, file typeobject.c, line 8476.
Fatal Python error: Aborted

Current thread 0x00000001fb35fac0 (most recent call first):
  <no Python frame>

Extension modules: _testinternalcapi, _testcapi (total: 2)
zsh: abort      ./python.exe -m test test_type_cache test_types

This is running on macOS M1 hardware, using a checkout of main at f036a46.

I've also been able to reproduce it on macOS x86_64 hardware, and under Ubuntu 22.04 on x86_64. The Ubuntu machine is a VM on my macOS x86_64 machine, if that matters; but it doesn't appear to be CPU or OS dependent.

@freakboy3742
Copy link
Contributor Author

Based on a comment from @Fidget-Spinner on Discord: the problem might be that test_type_cache is leaving something in a bad state. Apparently some of those tests exercise situations that should never happen, so maybe there's some problematic leftover state from those tests that has only been exposed as a result of #121602 landing.

I've also been able to narrow the problem even further:
test_type_cache.TypeCacheWithSpecializationTests.test_class_load_attr_specialization_static_type is the only test case that needs to run before test_types to cause the crash.

@freakboy3742
Copy link
Contributor Author

Oh! Could it be this obvious?

# Permanently overflow the static type version counter, and force str and bytes
# to have tp_version_tag == 0
for _ in range(2**16):
type_modified(str)
type_assign_version(str)
type_modified(bytes)
type_assign_version(bytes)

If I'm reading this right, the test is deliberately setting the tp_version_tag to 0... and then not resetting the version tag at the end of the test, which causes the assertion in test_types to break?

If I comment out L179 and L181, the test_type_cache test fails, but test_types doesn't crash (in fact, it passes!)

I'm not sure how to go about resetting tp_version_tag... but that definitely looks like what is needed here.

@Fidget-Spinner
Copy link
Member

The thing is. Setting the type version tag to 0 should be fine on its own. If the rest of the tests fail, that means the type cache itself is buggy when type versions overflow. So there's probably an underlying bug somewhere else.

@Fidget-Spinner
Copy link
Member

Interesting! This failure happens on the default build, but not on the free-threaded build.

@Fidget-Spinner
Copy link
Member

I wonder if the assert is even right? I don't quite understand the init_static_type code, but it's not guaranteed that the type version tag is not zero just because it's init. That is only guaranteed if there's a valid entry in the type cache.

@ericsnowcurrently
Copy link
Member

FYI, @markshannon added that assert last month in 00257c7

commit 00257c746c447a2e026b5a2a618f0e033fb90111
Author: Mark Shannon <mark@hotpy.org>
Date:   Wed Jun 19 17:38:45 2024 +0100

    GH-119462: Enforce invariants of type versioning (GH-120731)

    * Remove uses of Py_TPFLAGS_VALID_VERSION_TAG

diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 0dcf1d399d9..1cc6ca79298 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -8516,12 +8481,11 @@ init_static_type(PyInterpreterState *interp, PyTypeObject *self,

         assert(NEXT_GLOBAL_VERSION_TAG <= _Py_MAX_GLOBAL_TYPE_VERSION_TAG);
         _PyType_SetVersion(self, NEXT_GLOBAL_VERSION_TAG++);
-        self->tp_flags |= Py_TPFLAGS_VALID_VERSION_TAG;
     }
     else {
         assert(!initial);
         assert(self->tp_flags & _Py_TPFLAGS_STATIC_BUILTIN);
-        assert(self->tp_flags & Py_TPFLAGS_VALID_VERSION_TAG);
+        assert(self->tp_version_tag != 0);
     }

     managed_static_type_state_init(interp, self, isbuiltin, initial);

That implies the assert condition is meant to be guaranteed.

init_static_type() runs once for each builtin static type when the runtime is initialized and again when each subinterpreter is initialized. The branch where the problematic assert happens is taken only when the static type has not been readied yet, which is basically with each subinterpreter.

@ericsnowcurrently
Copy link
Member

Do note that the crash isn't unique to test_types. It can be triggered by any test that creates a subinterpreter. For example, test_import:

$ ./python -m test test_type_cache test_import -v -m test_basic_multiple_interpreters_main_no_reset -m test_class_load_attr_specialization_static_type
== CPython 3.14.0a0 (tags/main--refreshed-20240711-7-g5250a031332:5250a031332, Jul 17 2024, 10:52:02) [GCC 7.5.0]
== Linux-5.4.0-150-generic-x86_64-with-glibc2.27 little-endian
== Python build: debug
== cwd: /home/esnow/projects/work/cpython-perf/cpython/build/test_python_worker_11771æ
== CPU count: 8
== encodings: locale=UTF-8 FS=utf-8
== resources: all test resources are disabled, use -u option to unskip tests

Using random seed: 2891891991
0:00:00 load avg: 0.08 Run 2 tests sequentially in a single process
0:00:00 load avg: 0.08 [1/2] test_type_cache
test_class_load_attr_specialization_static_type (test.test_type_cache.TypeCacheWithSpecializationTests.test_class_load_attr_specialization_static_type) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.097s

OK
0:00:00 load avg: 0.08 [2/2] test_import
test_basic_multiple_interpreters_main_no_reset (test.test_import.SinglephaseInitTests.test_basic_multiple_interpreters_main_no_reset) ... python: Objects/typeobject.c:8476: init_static_type: Assertion `self->tp_version_tag != 0' failed.
Fatal Python error: Aborted

Current thread 0x00007fd132d09100 (most recent call first):
  <no Python frame>

Extension modules: _testinternalcapi, _testcapi, _testmultiphase, _testsinglephase (total: 4)
Aborted (core dumped)

@ericsnowcurrently
Copy link
Member

That crash using test_import (or others) pre-dates 5250a03, so gh-121602 did not introduce the problem. It only revealed it a bit more clearly.

@ericsnowcurrently
Copy link
Member

@markshannon, @Fidget-Spinner, at this point we need to decide if that assert in init_static_type() is actually correct.

@freakboy3742
Copy link
Contributor Author

@markshannon Gentle nudge on this one; this bug is currently breaking the iOS buildbots.

@markshannon
Copy link
Member

The assertion is correct. Static types must be immutable, so their version number is fixed.

The problem is that we aren't enforcing that invariant in PyType_Modified.
Unfortunately, PyType_Modified returns void and has done forever, so cannot signal failure.

I think we can assume that well behaved C extension don't modify static builtin types, so adding a new variant of PyType_Modified that checks isn't worth the disruption.

I think the best thing is to add an assertion in PyType_Modified so that the error shows up when the invariant is broken, and fix the tests to not break the invariant.
I'll do that now.

@markshannon markshannon self-assigned this Jul 22, 2024
@markshannon
Copy link
Member

markshannon commented Jul 22, 2024

Unfortunately, the datetime module modifies the static builtin types when reloading, which doesn't appear to be safe.
If two subinterpreters are using the datetime classes, and one of them unloads the module, the other will see changes.

I think the fix for datetime would be to make the builtin classes fully immutable, or to use heap types for the classes.

@freakboy3742
Copy link
Contributor Author

As a temporary workaround, I've added #122150; this disables the test that is causing the crash on iOS so that iOS CI can continue. Any fix for this bug should be able to remove the test skip added by that PR.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 23, 2024
…ore test suite. (pythonGH-122150)

(cherry picked from commit 1bcc9eb)

Co-authored-by: Russell Keith-Magee <russell@keith-magee.com>
@markshannon
Copy link
Member

@ericsnowcurrently can you make sure to revert #122150 when this is fixed?

freakboy3742 added a commit that referenced this issue Jul 23, 2024
…tore test suite. (GH-122150) (#122159)

gh-121832: Skip subinterpreter static type check on iOS to restore test suite. (GH-122150)
(cherry picked from commit 1bcc9eb)

Co-authored-by: Russell Keith-Magee <russell@keith-magee.com>
markshannon added a commit that referenced this issue Jul 24, 2024
…not changed by PyType_Modified. (GH-122182)

Update datetime module and test_type_cache.py to not call PyType_Modified.
nohlson pushed a commit to nohlson/cpython that referenced this issue Jul 24, 2024
nohlson pushed a commit to nohlson/cpython that referenced this issue Jul 24, 2024
…es is not changed by PyType_Modified. (pythonGH-122182)

Update datetime module and test_type_cache.py to not call PyType_Modified.
nohlson pushed a commit to nohlson/cpython that referenced this issue Jul 24, 2024
nohlson pushed a commit to nohlson/cpython that referenced this issue Jul 24, 2024
…es is not changed by PyType_Modified. (pythonGH-122182)

Update datetime module and test_type_cache.py to not call PyType_Modified.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 25, 2024
…es is not changed by PyType_Modified. (pythonGH-122182)

Update datetime module and test_type_cache.py to not call PyType_Modified.
(cherry picked from commit e55b05f)

Co-authored-by: Mark Shannon <mark@hotpy.org>
ericsnowcurrently pushed a commit that referenced this issue Jul 25, 2024
…pes is not changed by PyType_Modified (gh-122290)

Update datetime module and test_type_cache.py to not call PyType_Modified.

(cherry picked from commit e55b05f, AKA gh--122182)

Co-authored-by: Mark Shannon <mark@hotpy.org>
freakboy3742 added a commit that referenced this issue Jul 27, 2024
freakboy3742 added a commit to freakboy3742/cpython that referenced this issue Jul 27, 2024
… (pythonGH-122340)

Revert test skip introduced by pythonGH-122150.
(cherry picked from commit 863a92f)

Co-authored-by: Russell Keith-Magee <russell@keith-magee.com>
freakboy3742 added a commit that referenced this issue Jul 27, 2024
… (#122342)

Revert test skip introduced by GH-122150.
(cherry picked from commit 863a92f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes type-crash A hard crash of the interpreter, possibly with a core dump
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants