hang instead of pthread_exit during interpreter shutdown #4874


Merged
merged 13 commits into PyO3:main from arielb1:safe-exit
Mar 9, 2025

Conversation

arielb1
Contributor

@arielb1 arielb1 commented Jan 26, 2025

This mimics the Python 3.14 behavior and avoids crashes in Rust 1.84.

See python/cpython#87135 and rust-lang/rust#135929 and the related pyo3-log issue vorner/pyo3-log#30

By submitting these contributions you agree for them to be dual-licensed under PyO3's MIT OR Apache-2.0 license.
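For context, the mechanism can be sketched roughly as follows. This is a hypothetical, simplified illustration, not the merged implementation (the real guard is the `HangThread` type in pyo3-ffi/src/pystate.rs, which parks via `libc::pause`, as the backtrace later in this thread shows); here `std::thread::park` stands in for `pause`, and `drop_hangs` is a made-up helper used only to demonstrate the behavior:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Guard whose Drop never returns: instead of letting the thread
/// unwind its Rust frames (or calling pthread_exit), it parks forever.
struct HangThread;

impl Drop for HangThread {
    fn drop(&mut self) {
        loop {
            // park() can wake spuriously; looping makes the wait permanent.
            thread::park();
        }
    }
}

/// Returns true if a thread dropping a HangThread is still parked
/// after `timeout_ms`, i.e. Drop never returned.
fn drop_hangs(timeout_ms: u64) -> bool {
    let (tx, rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        drop(HangThread); // never returns
        let _ = tx.send(()); // unreachable
    });
    rx.recv_timeout(Duration::from_millis(timeout_ms)).is_err()
}

fn main() {
    assert!(drop_hangs(200));
    println!("thread dropping HangThread stays parked instead of exiting");
}
```

The point of hanging rather than exiting is that pthread_exit unwinds through Rust frames, which is undefined behavior and started crashing with Rust 1.84 per the linked issues; a permanently parked thread is ugly but well-defined.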


codspeed-hq bot commented Jan 27, 2025

CodSpeed Performance Report

Merging #4874 will not alter performance

Comparing arielb1:safe-exit (b86c1df) with main (295e67a)

Summary

✅ 87 untouched benchmarks

@arielb1 arielb1 force-pushed the safe-exit branch 4 times, most recently from 80f138a to 704ef0b on January 27, 2025 23:24
@arielb1
Contributor Author

arielb1 commented Jan 27, 2025

Yay it passes tests.

@@ -3,7 +3,7 @@

from pyo3_pytests import othermod

-INTEGER32_ST = st.integers(min_value=(-(2**31)), max_value=(2**31 - 1))
+INTEGER32_ST = st.integers(min_value=(-(2**30)), max_value=(2**30 - 1))
Contributor Author

@arielb1 arielb1 Jan 28, 2025
This change avoids the test failing due to a high rate of assume() rejections (hypothesis's filter_too_much health check).
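To illustrate the failure mode, here is a hypothetical sketch (not the actual test): suppose the property calls hypothesis's assume() to require that x + y still fits in a signed 32-bit integer. Drawing both operands from the full i32 range rejects roughly a quarter of all generated examples, enough to risk tripping the filter_too_much health check, while the 2**30 bound never rejects:

```python
# Estimate how often a hypothetical "x + y fits in an i32" assumption
# would reject examples, for full-range vs halved integer bounds.
import random

def rejection_rate(bound, trials=100_000):
    """Fraction of (x, y) draws whose sum overflows a signed 32-bit int."""
    rejected = 0
    for _ in range(trials):
        x = random.randint(-bound, bound - 1)
        y = random.randint(-bound, bound - 1)
        if not -(2**31) <= x + y <= 2**31 - 1:
            rejected += 1
    return rejected / trials

print(f"2**31 bound rejects ~{rejection_rate(2**31):.0%} of examples")
print(f"2**30 bound rejects  {rejection_rate(2**30):.0%} of examples")
```

With the 2**30 bound the sum always lands in [-(2**31), 2**31 - 2], so the assumption can never fail.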

@arielb1
Contributor Author

arielb1 commented Jan 29, 2025

split the test_double fix to #4879

@arielb1
Contributor Author

arielb1 commented Feb 17, 2025

any update?

Contributor

@Icxolu Icxolu left a comment

This looks fine to me, but I would like at least one other maintainer's opinion (maybe @davidhewitt?) before proceeding here.

@davidhewitt
Member

Please forgive the delay, I will seek to review this on Friday, and ping me repeatedly from then if I do not achieve that!

Member

@davidhewitt davidhewitt left a comment

Thanks, I think let's move forward with this. The cfg guards hack around the various edge cases and, as I read it, the interaction with pthread_exit and rust destructors was (and still is) UB, so this just creates a practical solution for now. Ultimately 3.14 will make this problem go away.

@davidhewitt davidhewitt added this pull request to the merge queue Feb 19, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 19, 2025
@mgorny mgorny mentioned this pull request Feb 19, 2025
@arielb1
Contributor Author

arielb1 commented Feb 19, 2025

So the scary failure here is the CodSpeed regression?

@arielb1
Contributor Author

arielb1 commented Feb 19, 2025

@davidhewitt
Member

The failed stage is test-debug in https://github.com/PyO3/pyo3/actions/runs/13411900757/job/37463833819#step:9:8211; it looks like the test might be triggering a debug-build assertion?

I assume it's probably fine and we should just skip the new test on debug builds? After all, we're trying to test a bad edge case.

@davidhewitt
Member

Agreed re: the flaky benchmark; I will set that one to be ignored.

@arielb1
Contributor Author

arielb1 commented Feb 19, 2025

python: Python/pystate.c:345: void unbind_gilstate_tstate(PyThreadState *): Assertion `tstate == tstate_tss_get(&(tstate->interp->runtime)->autoTSSkey)' failed.

This looks bad, though I'm not sure it's our fault. Trying to investigate: if it's PyO3's fault, we should fix it. I want to know whether it's a CPython or a PyO3 problem.

@arielb1
Contributor Author

arielb1 commented Feb 19, 2025

I didn't manage to debug it easily, will try more after Rust Nation.

@arielb1
Contributor Author

arielb1 commented Mar 9, 2025

Looks like the failure does not reproduce on the builder either. I say merge this unless someone can reproduce the failure.

@arielb1 arielb1 mentioned this pull request Mar 9, 2025
@arielb1
Contributor Author

arielb1 commented Mar 9, 2025

Edit: I can reproduce the assertion locally.

@arielb1
Contributor Author

arielb1 commented Mar 9, 2025

Backtrace:

Thread 2 (Thread 0x7fe67a1846c0 (LWP 190623) "pytest" (Exiting)):
#0  0x00007fe6c03b53f6 in __libc_pause () at ../sysdeps/unix/sysv/linux/pause.c:29
#1  0x00007fe6bed91d3b in pyo3_ffi::pystate::{impl#0}::drop (self=0x7fe67a1839cd) at pyo3-ffi/src/pystate.rs:83
#2  0x00007fe6bed91d2b in core::ptr::drop_in_place<pyo3_ffi::pystate::HangThread> () at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ptr/mod.rs:523
#3  0x00007fe6bed91913 in pyo3_ffi::pystate::PyGILState_Ensure () at pyo3-ffi/src/pystate.rs:137
#4  0x00007fe6bed5d2c6 in pyo3::gil::GILGuard::acquire_unchecked () at src/gil.rs:207
#5  0x00007fe6bed5d27f in pyo3::gil::GILGuard::acquire () at src/gil.rs:194
#6  0x00007fe6becd52ac in pyo3::marker::Python::with_gil<pyo3_pytests::misc::hammer_gil_in_thread::{closure#0}::{closure_env#0}, ()> (f=...) at src/marker.rs:409
#7  0x00007fe6bed1ee72 in pyo3_pytests::misc::hammer_gil_in_thread::{closure#0} () at pytests/src/misc.rs:28
#8  0x00007fe6bed50913 in std::sys::backtrace::__rust_begin_short_backtrace<pyo3_pytests::misc::hammer_gil_in_thread::{closure_env#0}, ()> (f=...) at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/sys/backtrace.rs:152
#9  0x00007fe6bed3c15b in std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure#0}<pyo3_pytests::misc::hammer_gil_in_thread::{closure_env#0}, ()> () at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/thread/mod.rs:564
#10 0x00007fe6becd7fe0 in core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<pyo3_pytests::misc::hammer_gil_in_thread::{closure_env#0}, ()>> (self=...) at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panic/unwind_safe.rs:272
#11 0x00007fe6becef08f in std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<pyo3_pytests::misc::hammer_gil_in_thread::{closure_env#0}, ()>>, ()> (data=0x7fe67a183d50) at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:584
#12 0x00007fe6bed3c26b in __rust_try () from /pyo3/pytests/.nox/test/lib/python3.13/site-packages/pyo3_pytests/pyo3_pytests.cpython-313-x86_64-linux-gnu.so
#13 0x00007fe6bed3bbd3 in std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<pyo3_pytests::misc::hammer_gil_in_thread::{closure_env#0}, ()>>> (f=...) at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:547
#14 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<pyo3_pytests::misc::hammer_gil_in_thread::{closure_env#0}, ()>>, ()> (f=...) at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panic.rs:358
#15 std::thread::{impl#0}::spawn_unchecked_::{closure#1}<pyo3_pytests::misc::hammer_gil_in_thread::{closure_env#0}, ()> () at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/thread/mod.rs:562
#16 0x00007fe6becaf28e in core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<pyo3_pytests::misc::hammer_gil_in_thread::{closure_env#0}, ()>, ()> () at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250
#17 0x00007fe6bedb6d5b in alloc::boxed::{impl#28}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1993
#18 alloc::boxed::{impl#28}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1993
#19 std::sys::pal::unix::thread::{impl#2}::new::thread_start () at library/std/src/sys/pal/unix/thread.rs:106
#20 0x00007fe6c0357aa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#21 0x00007fe6c03e4a34 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

Thread 1 (Thread 0x7fe6c02b8740 (LWP 190590) "pytest"):
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fe6c030027e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fe6c02e38ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fe6c02e381b in __assert_fail_base (fmt=0x7fe6c048b1e8 "%s%s%s:%u: %s%sAssertion `%s' failed.
%n", assertion=assertion@entry=0x7fe6c15d1e2e "tstate == tstate_tss_get(&(tstate->interp->runtime)->autoTSSkey)", file=file@entry=0x7fe6c15d05b1 "Python/pystate.c", line=line@entry=345, function=function@entry=0x7fe6c15d1e01 "void unbind_gilstate_tstate(PyThreadState *)") at ./assert/assert.c:96
#6  0x00007fe6c02f6517 in __assert_fail (assertion=0x7fe6c15d1e2e "tstate == tstate_tss_get(&(tstate->interp->runtime)->autoTSSkey)", file=0x7fe6c15d05b1 "Python/pystate.c", line=line@entry=345, function=0x7fe6c15d1e01 "void unbind_gilstate_tstate(PyThreadState *)") at ./assert/assert.c:105
#7  0x00007fe6c0a248ef in unbind_gilstate_tstate (tstate=0x7fe674000d40) at Python/pystate.c:345
#8  tstate_delete_common (tstate=tstate@entry=0x7fe674000d40, release_gil=release_gil@entry=0) at Python/pystate.c:1811
#9  0x00007fe6c0a22a38 in zapthreads (interp=0x7fe6c1ceda90 <_PyRuntime+104352>) at Python/pystate.c:1837
#10 PyInterpreterState_Delete (interp=0x7fe6c1ceda90 <_PyRuntime+104352>) at Python/pystate.c:952
#11 0x00007fe6c0a016c9 in finalize_interp_delete (interp=0x7fe6c1ceda90 <_PyRuntime+104352>) at Python/pylifecycle.c:1908
#12 _Py_Finalize (runtime=0x7fe6c1cd42f0 <_PyRuntime>) at Python/pylifecycle.c:2187
#13 0x00007fe6c0a0359b in Py_Exit (sts=0) at Python/pylifecycle.c:3392
#14 0x00007fe6c0a28776 in handle_system_exit () at Python/pythonrun.c:635
#15 _PyErr_PrintEx (tstate=0x7fe6c1d1d2e0 <_PyRuntime+298992>, set_sys_last_vars=set_sys_last_vars@entry=1) at Python/pythonrun.c:644
#16 0x00007fe6c0a271ea in PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:721
#17 PyErr_Print () at Python/pythonrun.c:727
#18 _PyRun_SimpleFileObject (fp=fp@entry=0x55b5e01072f0, filename=filename@entry=0x7fe6c020af10, closeit=closeit@entry=1, flags=flags@entry=0x7ffc74113f58) at Python/pythonrun.c:496
#19 0x00007fe6c0a26ab2 in _PyRun_AnyFileObject (fp=fp@entry=0x55b5e01072f0, filename=filename@entry=0x7fe6c020af10, closeit=closeit@entry=1, flags=flags@entry=0x7ffc74113f58) at Python/pythonrun.c:77
#20 0x00007fe6c0a52df2 in pymain_run_file_obj (program_name=0x7fe6c00c6ff0, filename=0x7fe6c020af10, skip_source_first_line=0) at Modules/main.c:409
#21 pymain_run_file (config=config@entry=0x7fe6c1cef9c8 <_PyRuntime+112344>) at Modules/main.c:428
#22 0x00007fe6c0a52419 in pymain_run_python (exitcode=0x7ffc74113fd4) at Modules/main.c:696
#23 Py_RunMain () at Modules/main.c:775
#24 0x00007fe6c0a52731 in pymain_main (args=args@entry=0x7ffc74114280) at Modules/main.c:805
#25 0x00007fe6c0a5278b in Py_BytesMain (argc=<optimized out>, argv=0x2e87e) at Modules/main.c:829
#26 0x00007fe6c02e51ca in __libc_start_call_main (main=main@entry=0x55b5b3805150 <main>, argc=argc@entry=2, argv=argv@entry=0x7ffc741143c8) at ../sysdeps/nptl/libc_start_call_main.h:58
#27 0x00007fe6c02e528b in __libc_start_main_impl (main=0x55b5b3805150 <main>, argc=2, argv=0x7ffc741143c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc741143b8) at ../csu/libc-start.c:360
#28 0x000055b5b3805089 in _start ()                                                                                                                                                                

@arielb1
Contributor Author

arielb1 commented Mar 9, 2025

It looks like unbind_gilstate_tstate expects to unbind only the thread state belonging to the current thread, but here it unbinds the thread state of the hanging thread. Not sure whether that is harmful; it definitely does not look like our bug:

--- this is Thread 2 [see native_thread_id]
(gdb) p *tstate
$7 = {prev = 0x0, next = 0x7fe6c1d1d2e0 <_PyRuntime+298992>, interp = 0x7fe6c1ceda90 <_PyRuntime+104352>, eval_breaker = 0, _status = {initialized = 1, bound = 1, unbound = 0,
    bound_gilstate = 1, active = 0, holds_gil = 0, finalizing = 1, cleared = 1, finalized = 0}, _whence = 4, state = 0, py_recursion_remaining = 1000, py_recursion_limit = 1000,
  c_recursion_remaining = 500, recursion_headroom = 0, tracing = 0, what_event = -1, current_frame = 0x0, c_profilefunc = 0x0, c_tracefunc = 0x0, c_profileobj = 0x0, c_traceobj = 0x0,
  current_exception = 0x0, exc_info = 0x7fe674000e40, dict = 0x0, gilstate_counter = 0, async_exc = 0x0, thread_id = 140627867616960, native_thread_id = 190623, delete_later = 0x0,
  critical_section = 0, coroutine_origin_tracking_depth = 0, async_gen_firstiter = 0x0, async_gen_finalizer = 0x0, context = 0x0, context_ver = 1, id = 22, datastack_chunk = 0x0,
  datastack_top = 0x0, datastack_limit = 0x0, exc_state = {exc_value = 0x0, previous_item = 0x0}, previous_executor = 0x0, dict_global_version = 0, threading_local_key = 0x0,
  threading_local_sentinel = 0x0}
--- this is the thread that is currently exiting [thread 1]
(gdb) p *(PyThreadState*)pthread_getspecific(1)
$8 = {prev = 0x0, next = 0x0, interp = 0x7fe6c1ceda90 <_PyRuntime+104352>, eval_breaker = 0, _status = {initialized = 1, bound = 1, unbound = 0, bound_gilstate = 1, active = 0, holds_gil = 0,
    finalizing = 1, cleared = 1, finalized = 0}, _whence = 1, state = 0, py_recursion_remaining = 1000, py_recursion_limit = 1000, c_recursion_remaining = 500, recursion_headroom = 0,
  tracing = 0, what_event = -1, current_frame = 0x0, c_profilefunc = 0x0, c_tracefunc = 0x0, c_profileobj = 0x0, c_traceobj = 0x0, current_exception = 0x0,
  exc_info = 0x7fe6c1d1d3e0 <_PyRuntime+299248>, dict = 0x0, gilstate_counter = 2, async_exc = 0x0, thread_id = 140629043283776, native_thread_id = 190590, delete_later = 0x0,
  critical_section = 0, coroutine_origin_tracking_depth = 0, async_gen_firstiter = 0x0, async_gen_finalizer = 0x0, context = 0x0, context_ver = 25, id = 1, datastack_chunk = 0x7fe6c05cc000,
  datastack_top = 0x7fe6c05cc020, datastack_limit = 0x7fe6c05d0000, exc_state = {exc_value = 0x0, previous_item = 0x0}, previous_executor = 0x0, dict_global_version = 0,
  threading_local_key = 0x0, threading_local_sentinel = 0x0}

@arielb1
Contributor Author

arielb1 commented Mar 9, 2025

My feeling is that my change doesn't make the problem worse, and we should skip this test on debug builds. What's the way to do that?
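For reference, one standard way to do that (a sketch, not necessarily what ended up being pushed): CPython debug builds expose the Py_DEBUG configuration flag via sysconfig, which a skip condition can gate on. Shown here with stdlib unittest; pytest's `@pytest.mark.skipif(IS_DEBUG_BUILD, reason=...)` works the same way. The test name and class are placeholders:

```python
# Skip a test on debug (assert-enabled, --with-pydebug) CPython builds,
# where C-level assertions such as the one in Python/pystate.c fire.
import sysconfig
import unittest

# Py_DEBUG is 1 on debug builds, 0 (or unset) otherwise.
IS_DEBUG_BUILD = bool(sysconfig.get_config_var("Py_DEBUG"))

class MiscTests(unittest.TestCase):
    @unittest.skipIf(
        IS_DEBUG_BUILD,
        "hammering the GIL during shutdown trips a pystate.c assertion",
    )
    def test_hammer_gil(self):
        pass  # placeholder for the real stress test

print("Py_DEBUG build:", IS_DEBUG_BUILD)
```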

@davidhewitt
Member

Understood, thanks for checking that. I've pushed something and will click auto-merge now 👍

@davidhewitt davidhewitt enabled auto-merge March 9, 2025 19:15
@davidhewitt davidhewitt added this pull request to the merge queue Mar 9, 2025
Merged via the queue into PyO3:main with commit b726d6a Mar 9, 2025
86 of 89 checks passed
@arielb1
Contributor Author

arielb1 commented Mar 9, 2025

As you can see in python/cpython#131012, the assertion failure is currently expected on the CPython side.

@ngoldbaum
Contributor

It looks like test_hammer_gil is breaking during interpreter shutdown when we try to run it on 3.14: #4811 (comment)

Any idea what the appropriate way to handle this is? Should we be using HangThread on 3.14 as well?

@davidhewitt
Member

Replied in #4811 (review)
