
shell command does not terminate on cancel #3237

zoranbosnjak opened this issue Apr 1, 2025 · 10 comments

Comments

@zoranbosnjak

As a simplified example, I would like to:

  • start a shell command
  • wait some time
  • stop it

The problem is that child subprocesses spawned by the shell won't terminate on cancel and keep running even after the Python process has finished.

The problem appears only when the command consists of several processes, like ping localhost | cat, but not when the command is a single process, like sleep 20.

Tested with: ubuntu-24.04, python 3.12.8, trio 0.29.0

Steps to reproduce:

  • check for currently running ping processes: ps aux | grep ping (confirm no ping process is running)
  • run the test script (see below) python ./test.py
  • check list of processes again, while the test script is running and after the test script terminates

Expected result: the shell, ping, and cat processes all run for only 5 seconds.

Actual result: the shell process runs for 5 seconds, but the ping and cat processes keep running not just beyond 5 seconds, but even after the script terminates. It looks like the subprocesses get detached from the shell and are no longer considered its children. I am not sure if this is one problem or two separate ones. I am not even sure if this is the proper way to run a shell script from within trio.

In any case, is there a workaround or a better way to run shell commands?

test.py script:

import trio
from subprocess import DEVNULL

# cmd = 'sleep 20'  # this is properly stopped
cmd = 'ping localhost | cat'  # this command won't stop properly

async def test():
    async with trio.open_nursery() as ns:
        async def service_runner(task_status = trio.TASK_STATUS_IGNORED):
            with trio.CancelScope() as scope:
                task_status.started(scope)
                await trio.run_process(cmd, shell=True, check=False,
                                        stdout=DEVNULL, stderr=DEVNULL)
                ns.cancel_scope.cancel()

        service_scope = await ns.start(service_runner)
        await trio.sleep(5.0)
        service_scope.cancel()
        await trio.sleep(5.0)

trio.run(test)
@CoolCat467
Member

Reading the docs page about run_process, it is implied that on Unix systems processes are sent the SIGTERM signal and, after 5 seconds, SIGKILL. If you want to kill subprocesses immediately, you might want to change deliver_cancel, though I haven't worked with subprocesses extensively before and I might be missing something.

@zoranbosnjak
Author

@CoolCat467 thanks for the suggestion.
I have tried to use deliver_cancel to force-stop the process, but it does not help.

This is a new test scenario. It shows that deliver_cancel actually gets executed and the (shell) process actually stopped. But it looks like the children of the shell did not receive the termination signal. The ping process still gets orphaned and remains running.

Do you have any other idea?

import trio
from subprocess import DEVNULL

# cmd = 'ping localhost'
cmd = 'ping localhost | cat'

async def test():
    async def stop_process(process):
        print('stopping...')
        process.kill()
        print('waiting...')
        await process.wait()
        print('done...')

    async with trio.open_nursery() as ns:
        async def service_runner(task_status = trio.TASK_STATUS_IGNORED):
            with trio.CancelScope() as scope:
                task_status.started(scope)
                await trio.run_process(cmd, shell=True, check=False,
                                        stdout=DEVNULL, stderr=DEVNULL,
                                        deliver_cancel=stop_process)
                ns.cancel_scope.cancel()

        service_scope = await ns.start(service_runner)
        await trio.sleep(5.0)
        service_scope.cancel()
        await trio.sleep(5.0)

trio.run(test)

@A5rocks
Contributor

A5rocks commented Apr 1, 2025

This isn't trio-specific. I haven't tried looking more but I suspect /bin/sh doesn't propagate signals through pipes.

alice@halimede /t/tmp.9WFtkHEbrd> cat ex2.py
import subprocess
import time

cmd = 'while true; do\necho "h"\nsleep 1\ndone | cat'

proc = subprocess.Popen(cmd, shell=True)
time.sleep(1)
proc.kill()
alice@halimede /t/tmp.9WFtkHEbrd> python ex2.py
h
h
alice@halimede /t/tmp.9WFtkHEbrd> h
h
h

@A5rocks
Contributor

A5rocks commented Apr 1, 2025

On Python 3.11+ (on Unix at least) you can do this:

alice@halimede /t/tmp.9WFtkHEbrd> cat ex2.py
import subprocess
import time
import os
import signal

cmd = 'while true; do\necho "h"\nsleep 10\ndone | cat'

proc = subprocess.Popen(cmd, shell=True, process_group=0)

time.sleep(1)
os.killpg(proc.pid, signal.SIGTERM)
proc.wait()
alice@halimede /t/tmp.9WFtkHEbrd> python ex2.py
h
alice@halimede /t/tmp.9WFtkHEbrd>                                                                                       

and for earlier versions I think you need to use preexec_fn with something like os.setpgid(0, 0). Maybe trio should do this by default if shell=True (on Unix).
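The pre-3.11 fallback mentioned above can be sketched with plain subprocess (an untested sketch, Unix only; os.setpgrp is equivalent to os.setpgid(0, 0) here):

```python
import os
import signal
import subprocess
import time

# Sketch for Python < 3.11 (Unix only): preexec_fn=os.setpgrp makes the
# shell the leader of a new process group, so os.killpg can signal the
# whole pipeline (shell, sleep, cat) at once.
cmd = 'sleep 30 | cat'
proc = subprocess.Popen(cmd, shell=True, preexec_fn=os.setpgrp)
time.sleep(0.5)
os.killpg(proc.pid, signal.SIGTERM)  # pgid == shell pid after setpgrp
print(proc.wait())  # negative signal number on Unix, e.g. -15
```

Note that Popen only returns after preexec_fn has run in the child, so proc.pid is already a valid process group id when killpg is called.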

@zoranbosnjak
Author

@A5rocks I am not exactly sure how to translate your suggestion to current trio defaults. Am I supposed to use trio.run_process or some other low-level method? I would appreciate a short trio example demonstrating start/stop of a shell pipeline.

@A5rocks
Contributor

A5rocks commented Apr 2, 2025

This should work, though I haven't tested it:

import trio
import os  # changed
import signal  # changed
from subprocess import DEVNULL

# cmd = 'ping localhost'
cmd = 'ping localhost | cat'

async def test():
    async def stop_process(process):
        print('stopping...')
        os.killpg(process.pid, signal.SIGTERM)  # changed
        print('waiting...')
        await process.wait()
        print('done...')

    async with trio.open_nursery() as ns:
        async def service_runner(task_status = trio.TASK_STATUS_IGNORED):
            with trio.CancelScope() as scope:
                task_status.started(scope)
                await trio.run_process(cmd, shell=True, check=False,
                                        stdout=DEVNULL, stderr=DEVNULL,
                                        deliver_cancel=stop_process, process_group=0)  # changed
                ns.cancel_scope.cancel()

        service_scope = await ns.start(service_runner)
        await trio.sleep(5.0)
        service_scope.cancel()
        await trio.sleep(5.0)

trio.run(test)

@zoranbosnjak
Author

@A5rocks thanks. This approach works.

It is sufficient for my application. Let me know if you want me to close the issue or keep it open as a reminder to improve trio, either with better defaults or at least a documentation update to highlight the problem and a solution.

There is another observation. trio correctly stops all subprocesses on CTRL-C in all cases, whether during the first or the second 5-second interval. The behaviour is correct even with the original example (that is, even after the subprocess was already improperly cancelled). This makes me believe that there is another difference, namely that CTRL-C does better (expected) cleanup than a regular application exit. I am not sure if this observation is related to the original issue.

@A5rocks
Contributor

A5rocks commented Apr 2, 2025

I guess I should explain what I think the problem is.

The issue is that process.kill() kills a single process -- and processes don't have to propagate signals to their subprocesses. Instead, we can put them all in a group, which means we can kill the group, sending a signal to every member.

The reason why Ctrl+C works by default is that subprocesses by default share a process group with the process that spawned them. So, the shell has the same process group as the python program. When you press Ctrl+C, every process in the process group receives a SIGINT. We can't use this for process.kill() because we don't want to kill the current process; just the subprocesses. Hence the process_group=0 to create a new process group (0 = "use my pid for pgid").
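The group-sharing behaviour described above can be checked directly with the standard library (a small demonstration, Unix only; start_new_session=True is used as a stand-in that also creates a new group and works on Pythons older than 3.11):

```python
import os
import subprocess

# Demonstration (Unix): a child shares its parent's process group by
# default; start_new_session=True (which, like process_group=0 on 3.11+,
# makes the child lead a new group) gives it its own.
shared = subprocess.Popen(['sleep', '30'])
grouped = subprocess.Popen(['sleep', '30'], start_new_session=True)
print(os.getpgid(shared.pid) == os.getpgid(os.getpid()))  # True: same group
print(os.getpgid(grouped.pid) == grouped.pid)             # True: own group
shared.kill()
grouped.kill()
shared.wait()
grouped.wait()
```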

(I'm not sure where "it worked without pipes" fits into this. I assume the shell execs directly if there are no pipes and spawns subprocesses if there are? That's the main bit that makes me couch this explanation as "what I think the problem is".)


I think trio should improve its defaults here:

  • make shell=True subprocesses pass process_group=0 by default unless that overrides something
  • if process_group is in the kwargs, then the default deliver_cancel should use os.killpg(os.getpgid(process.pid), signal)
  • probably also backport process_group by updating preexec_fn on pre-3.11

Though we should make sure we don't make it worse in response to ctrl+c (since the subprocess is no longer in the process group I don't think they automatically received SIGINT. This is probably the correct behavior though? We want them to go through typical shutdown behavior as deliver_cancel specifies.)
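The backport idea in the last bullet could look roughly like this (a hypothetical sketch, Unix only; popen_in_new_group is an invented helper name, not anything trio provides):

```python
import os
import subprocess
import sys

# Hypothetical backport sketch (Unix only): use process_group=0 on
# Python 3.11+, and fall back to preexec_fn=os.setpgrp on older
# versions, so the child always leads its own process group.
def popen_in_new_group(cmd, **kwargs):
    if sys.version_info >= (3, 11):
        kwargs.setdefault('process_group', 0)
    else:
        kwargs.setdefault('preexec_fn', os.setpgrp)
    return subprocess.Popen(cmd, **kwargs)

proc = popen_in_new_group('sleep 30', shell=True)
print(os.getpgid(proc.pid) == proc.pid)  # True: shell leads its own group
proc.kill()
proc.wait()
```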

A5rocks added a commit to A5rocks/trio that referenced this issue Apr 8, 2025
- `shell=True` implies `process_group=0`
- make default `deliver_cancel` aware of process groups
- backport `process_group` to Pythons prior to 3.11
@jab
Member

jab commented Apr 12, 2025

https://biriukov.dev/docs/fd-pipe-session-terminal/3-process-groups-jobs-and-sessions/ explains what’s going on here.

Not an expert, but I understand enough to say this is not a Trio-specific issue, nor a shell=True / pipe-specific issue (though it does often come up in that use case).

Trio can’t in general enforce structured concurrency across the subprocess boundary: It can’t stop an arbitrary subprocess from e.g. spawning its own subprocess in a new session that Trio doesn’t know about, which outlives the ancestor Trio program, and which won’t receive kill signals sent to ancestor process group(s), etc.

But in the common enough case that you control all the descendent subprocesses you’re spawning with Trio, you can use the process group trick mentioned above along with os.killpg to honor structured concurrency. It’s a subtle enough yet common enough use case that it probably deserves at least a recipe in the docs, if not also some higher-level API support.

@A5rocks
Contributor

A5rocks commented Apr 13, 2025

Would you happen to know why this only works on 3.12 and 3.14, not 3.13 or anything below 3.12? (I'll spend some time debugging this at some point, but this specific matrix of success makes no sense, and yet it persists across reruns): A5rocks@c7991d4

Here's the CI run: https://github.com/A5rocks/trio/actions/runs/14330484295
