I tried to write a “simple” Python function. All it had to do was run a command line in a subprocess and enforce a timeout on that subprocess.
Turns out that this “simple” goal is not so simple. I considered multiple approaches to solve this, and ran into interesting concurrency issues.
In this post I describe a solution based on Python subprocess and various threading constructs. I demonstrate a concurrency inconsistency I discovered when testing my solution on Linux and OS X.
The conclusion, if that’s all you’re looking for, is that timer.is_alive() is not a safe way to test whether a timer has expired!
Note: My experience is based on Python 2.7 with the default subprocess module from the standard library. If you’re on Python 3.3+, a timeout argument was added to subprocess. You can also install python-subprocess32, which brings this joy to Python 2.4-2.7.
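For completeness, here is roughly what the backport looks like in use. This is a minimal sketch that assumes the subprocess32 package is installed (on Python 3.3+ the standard subprocess module accepts the same timeout argument):

import subprocess32

try:
    # call() enforces the time limit and raises TimeoutExpired when it elapses
    exit_code = subprocess32.call(['sleep', '10'], timeout=1.0)
except subprocess32.TimeoutExpired:
    print 'command timed out'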
Problem statement
To be clear, the function I want to implement is this:
def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes.
    """
    # Pseudo-code:
    # run cmd in subprocess
    # if subprocess completed naturally - return its exit code
    # otherwise, if timeout expired - raise an exception
To be even clearer, here’s a short unittest program that tests for the expected behavior:
import unittest


class SubprocessTimeoutError(RuntimeError):
    """Error class for subprocess timeout."""
    pass


class CommandWithTimeoutTest(unittest.TestCase):

    def test_natural_success(self):
        """Try one that should complete naturally and successfully"""
        self.assertEqual(0, run_command_with_timeout(['sleep', '0.1'], 1.0))

    def test_natural_failure(self):
        """Try one that should complete naturally with an error code"""
        self.assertEqual(1, run_command_with_timeout(['sleep', '-w00t'], 1.0))

    def test_timeout(self):
        """Try one that should be killed after the timeout"""
        with self.assertRaises(SubprocessTimeoutError):
            run_command_with_timeout(['sleep', '10'], 1.0)


if '__main__' == __name__:
    unittest.main()
First attempt
An intuitive approach to solve this may be:
- Use Python subprocess to run the command in a subprocess.
- Use Python threading.Timer to enforce the timeout.
Here’s an implementation based on this approach:
import subprocess
from threading import Timer


def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes.
    """
    proc = subprocess.Popen(cmd)
    timer = Timer(timeout_sec, proc.kill)
    timer.start()
    proc.communicate()
    if timer.is_alive():
        # Process completed naturally - cancel timer and return exit code
        timer.cancel()
        return proc.returncode
    # Process killed by timer - raise exception
    raise SubprocessTimeoutError('Process #%d killed after %f seconds' %
                                 (proc.pid, timeout_sec))
Seems straightforward, doesn’t it?
I start the timer before calling the communicate method. The call to communicate() returns only after the process is finished (for whatever reason). If the timer expires before the process finishes naturally, proc.kill() is called, killing the process and making the communicate() call return.
Once communicate() returns, I need to check why. To do that, I use the timer.is_alive() method to check if the timer is still running. If it is, it means proc.kill() wasn’t called, and the process exited naturally. In that case, I cancel the timer (so it doesn’t kill a non-existing process) and return the returncode. If the timer is not alive, it means proc.kill() was called, so I raise the timeout error.
Good enough?
Running the tests on my Mac:
itamar@legolas misc $ ./timercaveat.py
usage: sleep seconds
...
----------------------------------------------------------------------
Ran 3 tests in 1.122s

OK
Looks promising.
Not good enough!
Running on my Linux VM:
itamar@ubuntu-vm misc>./timercaveat.py
sleep: invalid option -- 'w'
Try 'sleep --help' for more information.
..F
======================================================================
FAIL: test_timeout (__main__.CommandWithTimeoutTest)
Try one that should be killed after the timeout
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./timercaveat.py", line 39, in test_timeout
    run_command_with_timeout(['sleep', '10'], 1.0)
AssertionError: SubprocessTimeoutError not raised

----------------------------------------------------------------------
Ran 3 tests in 1.112s

FAILED (failures=1)
Damn. That’s not good.
What happened?
Writing concurrent code is a bitch. When multithreaded code breaks, it usually happens in non-deterministic ways. Oddly enough, this issue seems to be deterministic but inconsistent across the two systems I’m using. Running the tests hundreds of times, I always get the same behavior.
Looking closer at the timer.is_alive() method:
This method returns True just before the `run()` method starts until just after the `run()` method terminates.
What do you mean “just before” and “just after”?! Can I get more specific guarantees, please?
Apparently I can’t, so I’ll test it myself. To do that, I add some debug prints:
from time import sleep  # for the debug delay below


def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes.
    """
    proc = subprocess.Popen(cmd)
    timer = Timer(timeout_sec, proc.kill)
    timer.start()
    proc.communicate()
    print 'timer alive?', timer.is_alive()
    sleep(0.001)
    print 'and now?', timer.is_alive()
    if timer.is_alive():
        # Process completed naturally - cancel timer and return exit code
        timer.cancel()
        return proc.returncode
    # Process killed by timer - raise exception
    raise SubprocessTimeoutError('Process #%d killed after %f seconds' %
                                 (proc.pid, timeout_sec))
Running just the problematic test:
On my Mac:
itamar@legolas misc $ ./timercaveat.py CommandWithTimeoutTest.test_timeout
timer alive? False
and now? False
.
----------------------------------------------------------------------
Ran 1 test in 1.009s

OK
On my Linux VM:
itamar@ubuntu-vm misc>./timercaveat.py CommandWithTimeoutTest.test_timeout
timer alive? True
and now? False
.
----------------------------------------------------------------------
Ran 1 test in 1.005s

OK
Aha! The is_alive() method changes to False “just after” the timer terminates, with “just after” being different across the two systems!
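The same window can be demonstrated without any subprocess involved. Here is a tiny standalone sketch of my own (the 0.5-second delay is arbitrary, and the first print may show either value depending on your machine and scheduler):

from threading import Event, Timer
from time import sleep

fired = Event()
timer = Timer(0.5, fired.set)
timer.start()
fired.wait()  # the timer callback has definitely run by this point...
print 'is_alive right after the callback:', timer.is_alive()  # ...but run() may not have terminated yet
sleep(0.01)
print 'and a moment later:', timer.is_alive()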
Pinpointing the problem?
It seems my code has a race condition between proc.communicate() and timer.is_alive(), when the process is killed by the timer.
On my Mac, the timer kills the process and sets is_alive to False before returning control to the main thread. This makes the test behave correctly.
On my Linux VM, control returns to the main thread immediately after the timer kills the process, but before is_alive is set to False. This makes the test fail, unless the added sleep gives the timer a chance to change is_alive.
I found it hard to believe that the “consistent inconsistency” is a result of the OS (Mac vs. Linux). Thinking about it further, I realized it might be a result of the number of CPU cores.
On the Mac, with 4 cores, the timer can kill the process and change is_alive on one core while the main thread returns from proc.communicate() on another core. This means there’s a race condition between the two threads running on two cores, and for some reason the thread that changes the value of is_alive consistently wins the race.
On my Linux VM I allocated just 1 core. This means that when the timer expires, it calls proc.kill(), which causes a context switch (because kill is a syscall). Once the kernel finishes handling the syscall, it sees two threads in the ready state: the timer thread and the main thread. For some reason, it consistently schedules the main thread (maybe because the timer thread was the last to run). Since everything runs on a single CPU core, the timer thread is still waiting (not yet changing is_alive) while the main thread calls is_alive() and gets the wrong value. This is the same race condition, but running on 1 CPU core makes the timer thread consistently lose the race.
To test this theory, I created a new Linux VM with 2 CPU cores, and ran the same test:
itamar@ubuntu-vm misc>grep -c ^processor /proc/cpuinfo
2
itamar@ubuntu-vm misc>./timercaveat.py CommandWithTimeoutTest.test_timeout
timer alive? False
and now? False
.
----------------------------------------------------------------------
Ran 1 test in 1.005s

OK
Theory confirmed! On the 2-core Linux VM the behavior is consistent with the Mac!
OK, but you still have a race condition!
You may claim that this whole thing is pointless, because I had a real race condition in my original solution. You’d be right, of course. But it’s still interesting to learn from the mistakes of others, isn’t it? 🙂
To keep this post from getting too long, I’ll just show you two different fixes – one that introduces other race conditions, and one that solves it (hopefully).
A fix that still has race conditions
def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes.
    """
    proc = subprocess.Popen(cmd)

    def kill_proc():
        timer.expired = True
        proc.kill()

    timer = Timer(timeout_sec, kill_proc)
    timer.expired = False
    timer.start()
    proc.communicate()
    if timer.expired:
        # Process killed by timer - raise exception
        raise SubprocessTimeoutError('Process #%d killed after %f seconds' %
                                     (proc.pid, timeout_sec))
    # Process completed naturally - cancel timer and return exit code
    timer.cancel()
    return proc.returncode
The threading.Timer object doesn’t really have an expired attribute. I artificially inject this attribute, to use it as a flag for myself. In effect, this replaces is_alive() with an expired flag that I explicitly set before killing the process.
Can you spot all the race conditions in this so-called solution? Try to, in the comments! 🙂
A fix that actually fixes (I hope)
def run_command_with_timeout(cmd, timeout_sec):
    """Execute `cmd` in a subprocess and enforce timeout `timeout_sec` seconds.

    Return subprocess exit code on natural completion of the subprocess.
    Raise an exception if timeout expires before subprocess completes.
    """
    proc = subprocess.Popen(cmd)
    proc_thread = Thread(target=proc.communicate)
    proc_thread.start()
    proc_thread.join(timeout_sec)
    if proc_thread.is_alive():
        # Process still running - kill it and raise timeout error
        try:
            proc.kill()
        except OSError, e:
            # The process finished between the `is_alive()` and `kill()`
            return proc.returncode
        # OK, the process was definitely killed
        raise SubprocessTimeoutError('Process #%d killed after %f seconds' %
                                     (proc.pid, timeout_sec))
    # Process completed naturally - return exit code
    return proc.returncode
You might want to refactor this a bit to allow for proper use of communicate(), with support for read/write buffers.
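Here is a minimal sketch of one way to do that refactoring. The helper name, the PIPE redirection, and the dict used to pass communicate()'s result out of the worker thread are my own assumptions, not something from the original function (SubprocessTimeoutError is the exception class defined in the unittest program above):

import subprocess
from threading import Thread


def run_command_with_timeout_capture(cmd, timeout_sec):
    """Like run_command_with_timeout, but also returns captured stdout/stderr."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    result = {}

    def communicate():
        # Drain the pipes and wait for the process to exit
        result['out'], result['err'] = proc.communicate()

    proc_thread = Thread(target=communicate)
    proc_thread.start()
    proc_thread.join(timeout_sec)
    if proc_thread.is_alive():
        # Process still running - kill it and raise timeout error
        try:
            proc.kill()
        except OSError:
            # The process finished between the `is_alive()` and `kill()`
            proc_thread.join()
            return proc.returncode, result.get('out'), result.get('err')
        raise SubprocessTimeoutError('Process #%d killed after %f seconds' %
                                     (proc.pid, timeout_sec))
    # Process completed naturally - return exit code and captured output
    return proc.returncode, result['out'], result['err']

The trade-off is that callers now get a (returncode, stdout, stderr) tuple instead of just the exit code.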
This solution still has a race condition, but I deal with it using a try-except block around the proc.kill().
If you catch more corner cases, I’d love to hear about it!
(tip: try adding a test of the form self.assertEqual(0, run_command_with_timeout(['sleep', '1'], X)), where X is 1.0, 1.1, 0.9, 0.99, 0.999, etc.)
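If you want to try that tip as-is, here is a sketch of such a test case for the class above (the values are just the ones from the tip; for timeouts below one second the call is expected to raise SubprocessTimeoutError, which is exactly the corner it probes):

def test_borderline_timing(self):
    """Probe timeouts right around the command's natural runtime."""
    for timeout_sec in (1.0, 1.1, 0.9, 0.99, 0.999):
        self.assertEqual(0, run_command_with_timeout(['sleep', '1'], timeout_sec))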
Another possible approach
One could argue that all this concurrency-fu is overkill for this scenario, and suggest doing this instead:
TIMEOUT_RETCODE = 124  # the exit code GNU coreutils `timeout` returns when the time limit is hit


def run_command_with_timeout(cmd, timeout_sec):
    proc = subprocess.Popen(['timeout', str(timeout_sec)] + cmd)
    proc.communicate()
    if proc.returncode == TIMEOUT_RETCODE:
        raise SubprocessTimeoutError('Process #%d killed after %f seconds' %
                                     (proc.pid, timeout_sec))
    return proc.returncode
This approach uses the OS timeout command to run the requested command with a time limit.
I opted not to pursue this approach for two main reasons:
- It assumes that there is a timeout command in the underlying OS, which is not always true.
- It assumes that TIMEOUT_RETCODE is not a valid return code of the requested command, which is not generally true.
Summary
I just wanted to run a command from Python in a subprocess with a timeout.
Sounds like something that should be simple, but I found out it isn’t.
I demonstrated a straightforward solution based on Python’s threading.Timer, and showed how it is prone to concurrency issues. The improved solution, based on a Python thread, should do better.
I’d love to hear about even better solutions, or more problems with the solution I presented! If you have something to say, go ahead and say it in the comments 🙂 .
November 5, 2015
Excellent post. This is a tough problem that I only figured out with Stack Overflow.
http://stackoverflow.com/questions/1191374/using-module-subprocess-with-timeout
November 5, 2015
Thanks!
Indeed, a tough problem to tackle.
This is a nice solution. It’s one of the variants in which the timer kills the subprocess, instead of a separate thread communicating with the subprocess.
I personally didn’t like these variants, because they create another race condition between the natural completion of the subprocess and the cancellation of the timer, in which the timer may expire and try to kill a non-existing process.
In practice, I’ve been using the subprocess32 package to take advantage of the back-ported timeout feature from Python 3.2, instead of all these custom solutions 🙂
November 5, 2015
Actually, I have the same issue in my solution. I addressed it by wrapping the kill in a try-except block, so if the process is already gone when the kill is called, nothing bad happens. This can be applied here too!