I recently learned about a subtle difference between lists and generators in Python.
The gist: when passing iterables, always prefer generators over lists!
Also: when returning iterables, always prefer generators over lists!
Most of the time, the intuitive way does the Right Thing. I noticed the subtlety in some cases where my intuition didn’t do the Right Thing.
In this post I explain the difference using several examples. The last example is an important one, demonstrating how fnmatch.filter() can misbehave!
Simple examples with builtin all and any functions
Consider these two similar ways to use the builtin all() function:
def foo(num):
    print 'hi from foo', num
    return num < 5

print 'Using all([ ... ])'
print all([foo(n) for n in range(10)])

print 'Using all( ... )'
print all(foo(n) for n in range(10))
The only difference is whether the iterable passed is a list or a generator. When using [ ... ], a list is constructed in place. Without the [ ... ], a generator is passed.
Running this snippet:
Using all([ ... ])
hi from foo 0
hi from foo 1
hi from foo 2
hi from foo 3
hi from foo 4
hi from foo 5
hi from foo 6
hi from foo 7
hi from foo 8
hi from foo 9
False
Using all( ... )
hi from foo 0
hi from foo 1
hi from foo 2
hi from foo 3
hi from foo 4
hi from foo 5
False
Both ways returned the correct answer, False, but the behaviors are inherently different! In the list-based example, Python constructed the entire list before passing it into all(), so all 10 foo(n) calls were executed. In the generator-based example, Python passed the generator into all(), allowing it to generate only as few elements as it needed to return the correct answer!
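The call counts can be verified directly by recording invocations instead of printing (a sketch in Python 3 syntax; the calls list is my addition, not part of the snippets above):

```python
calls = []

def foo(num):
    calls.append(num)  # record every invocation
    return num < 5

# list form: all ten calls happen before all() ever sees the list
all([foo(n) for n in range(10)])
assert calls == list(range(10))

calls.clear()
# generator form: all() stops at the first False result (num == 5)
all(foo(n) for n in range(10))
assert calls == [0, 1, 2, 3, 4, 5]
```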
This may seem like a minor nuance, but it is easy to see how this nuance may have significant performance and correctness implications.
The performance part is straightforward. If foo() is expensive, or there are many elements, and foo() returns False for one of the early elements, then the list-based form wastes cycles on computations that could be spared.
The correctness aspect is less straightforward. If foo() changes the state of the system in some way (e.g. creating or deleting files), then the list-based example can cause unexpected state changes.
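To make the state-change hazard concrete, here is a small sketch (Python 3 syntax; cleanup() and the deleted list are hypothetical names I made up, where the append stands in for a real side effect like deleting a file):

```python
deleted = []

def cleanup(path):
    deleted.append(path)  # pretend this deletes a file on disk
    return path.endswith('.tmp')

files = ['a.tmp', 'b.log', 'c.log']

# list form: every file is "deleted" before any() even looks at the results
any([cleanup(f) for f in files])
assert deleted == ['a.tmp', 'b.log', 'c.log']

deleted.clear()
# generator form: stops after the first True, so only one state change happens
any(cleanup(f) for f in files)
assert deleted == ['a.tmp']
```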
The example applies to the any() function as well:
def foo(num):
    print 'hi from foo', num
    return num >= 5

print 'Using any([ ... ])'
print any([foo(n) for n in range(10)])

print 'Using any( ... )'
print any(foo(n) for n in range(10))
Running it:
Using any([ ... ])
hi from foo 0
hi from foo 1
hi from foo 2
hi from foo 3
hi from foo 4
hi from foo 5
hi from foo 6
hi from foo 7
hi from foo 8
hi from foo 9
True
Using any( ... )
hi from foo 0
hi from foo 1
hi from foo 2
hi from foo 3
hi from foo 4
hi from foo 5
True
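Laziness also lets any() work with inputs that could never be built as a list at all. A sketch (Python 3 syntax) using the stdlib's infinite itertools.count():

```python
from itertools import count

# count() is an infinite iterator; a list comprehension here would never
# finish, but any() consumes the generator lazily and stops at the
# first True result
result = any(n * n > 1000 for n in count())
assert result is True
```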
More examples with for loops over comprehensions
Consider these two less obvious examples of a for-loop over some dynamic expression.
def foo_gen(num):
    i = 0
    while i < num:
        print 'hi from foo', i, num
        yield i
        i += 1

print 'Using for x in [ ... ]'
for x in [n/2 for n in foo_gen(20) if n%2==0]:
    print 'hi from for loop, x:', x
    if x > 5:
        print 'breaking out from for loop'
        break
print ''

print 'Using for x in ( ... )'
for x in (n/2 for n in foo_gen(20) if n%2==0):
    print 'hi from for loop, x:', x
    if x > 5:
        print 'Breaking out of for loop'
        break
Both loops use my made-up foo_gen(), which is just a chatty version of the builtin xrange().
The difference between the for-loops is quite subtle. The first one puts the comprehension inside [ ... ], and the second one puts it in ( ... ).
What do you expect will happen when I run this? Let’s see:
Using for x in [ ... ]
hi from foo 0 20
hi from foo 1 20
hi from foo 2 20
hi from foo 3 20
hi from foo 4 20
hi from foo 5 20
hi from foo 6 20
hi from foo 7 20
hi from foo 8 20
hi from foo 9 20
hi from foo 10 20
hi from foo 11 20
hi from foo 12 20
hi from foo 13 20
hi from foo 14 20
hi from foo 15 20
hi from foo 16 20
hi from foo 17 20
hi from foo 18 20
hi from foo 19 20
hi from for loop, x: 0
hi from for loop, x: 1
hi from for loop, x: 2
hi from for loop, x: 3
hi from for loop, x: 4
hi from for loop, x: 5
hi from for loop, x: 6
breaking out from for loop

Using for x in ( ... )
hi from foo 0 20
hi from for loop, x: 0
hi from foo 1 20
hi from foo 2 20
hi from for loop, x: 1
hi from foo 3 20
hi from foo 4 20
hi from for loop, x: 2
hi from foo 5 20
hi from foo 6 20
hi from for loop, x: 3
hi from foo 7 20
hi from foo 8 20
hi from for loop, x: 4
hi from foo 9 20
hi from foo 10 20
hi from for loop, x: 5
hi from foo 11 20
hi from foo 12 20
hi from for loop, x: 6
Breaking out of for loop
Kudos to you if you expected this! When I started playing around with this subject, I could not foresee it!
With a fully-constructed list (the [ ... ] case), Python runs foo_gen() to exhaustion to finish constructing the list, and only then does the for-loop begin, breaking out after 7 iterations. The for-loop never reaches beyond the 13th value that foo_gen() generated, but nonetheless, all 20 values were generated.
With a “naked” generator (the ( ... ) case, known in Python as a generator expression), the for-loop runs while values are being generated, also breaking out after 7 iterations. The clear advantage in this case is that foo_gen() yields exactly 13 values, no more than actually needed to complete the for-loop.
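The yield counts can be confirmed with a counter instead of prints (a sketch in Python 3 syntax, where // replaces Python 2's integer /; the yields list is my instrumentation):

```python
yields = []

def foo_gen(num):
    for i in range(num):
        yields.append(i)  # record each value the generator produces
        yield i

# list comprehension: the generator is fully drained before the loop starts
for x in [n // 2 for n in foo_gen(20) if n % 2 == 0]:
    if x > 5:
        break
assert len(yields) == 20

yields.clear()
# generator expression: only 13 values are pulled before the break
for x in (n // 2 for n in foo_gen(20) if n % 2 == 0):
    if x > 5:
        break
assert len(yields) == 13
```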
Just like the simple all() / any() examples, the potential impact here is huge.
A similar example with string splitting
If the previous example is too made-up for you, consider this example:
def my_split(my_str):
    start_pos = 0
    str_len = len(my_str)
    if str_len == 0:
        yield ''
        return
    end_pos = 0
    while end_pos < str_len:
        if my_str[end_pos] == '\n':
            print 'yielding:', my_str[start_pos:end_pos]
            yield my_str[start_pos:end_pos]
            start_pos = end_pos + 1
        end_pos += 1
    print 'yielding:', my_str[start_pos:end_pos]
    yield my_str[start_pos:end_pos]

my_test_str = 'foo\nbar\nbaz\n'

print 'Iterating over a list'
for part in list(my_split(my_test_str)):
    print 'got', part
    if part == 'bar':
        print 'breaking from loop'
        break
print ''

print 'Iterating over the raw generator'
for part in my_split(my_test_str):
    print 'got', part
    if part == 'bar':
        print 'breaking from loop'
        break
I implemented my own custom string-splitting generator so I can embed some prints inside it. The builtin str.split() returns a fully-constructed list, not a generator as I did here.
I hope you agree that this is a realistic example: iterating over a split string, potentially breaking out midway through.
Running it:
Iterating over a list
yielding: foo
yielding: bar
yielding: baz
yielding:
got foo
got bar
breaking from loop

Iterating over the raw generator
yielding: foo
got foo
yielding: bar
got bar
breaking from loop
By now, I guess you could have predicted this behavior 🙂 .
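The stdlib can give the same lazy line-splitting without a hand-rolled generator: iterating an io.StringIO yields one line at a time. The whole string is already in memory, of course, but the splitting itself happens on demand. A sketch in Python 3 syntax:

```python
import io

consumed = []
# iterating io.StringIO yields lines one at a time, much like my_split
for line in io.StringIO('foo\nbar\nbaz\n'):
    part = line.rstrip('\n')
    consumed.append(part)
    if part == 'bar':
        break

# only the lines up to the break were ever split off
assert consumed == ['foo', 'bar']
```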
fnmatch.filter returns a list, not a generator!
Consider this final example a Public Service Announcement, warning against the problematic behavior of fnmatch.filter() (it reuses my_split() from the previous example):
import fnmatch

my_files = 'a.foo\nb.foo\nc.bar\nd.foo'

print 'using fnmatch.filter'
for filename in fnmatch.filter(my_split(my_files), '*.foo'):
    print 'working on', filename
print ''

print 'using fnmatch.fnmatch with list comprehension'
for filename in [fn for fn in my_split(my_files) if fnmatch.fnmatch(fn, '*.foo')]:
    print 'working on', filename
print ''

print 'using fnmatch.fnmatch with generator comprehension'
for filename in (fn for fn in my_split(my_files) if fnmatch.fnmatch(fn, '*.foo')):
    print 'working on', filename
The output:
using fnmatch.filter
yielding: a.foo
yielding: b.foo
yielding: c.bar
yielding: d.foo
working on a.foo
working on b.foo
working on d.foo

using fnmatch.fnmatch with list comprehension
yielding: a.foo
yielding: b.foo
yielding: c.bar
yielding: d.foo
working on a.foo
working on b.foo
working on d.foo

using fnmatch.fnmatch with generator comprehension
yielding: a.foo
working on a.foo
yielding: b.foo
working on b.foo
yielding: c.bar
yielding: d.foo
working on d.foo
It appears that fnmatch.filter() consumes the entire iterable internally, and returns a fully-constructed filtered list!
This can be a terribly Bad Thing to do in some cases, like:
- As in previous examples, if the for-loop breaks midway through, the filter still processed all items.
- This actually happened to me in production code: the iterable I passed to fnmatch.filter() made paged API calls internally, so this behavior significantly hurt performance! It artificially increased the load and rate of API calls, and counteracted my attempts to parallelize background I/O with the CPU-bound work in the loop body.
My recommendation here is to always prefer the generator-expression form with fnmatch.fnmatch() over fnmatch.filter().
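If the per-item overhead of fnmatch.fnmatch() is a concern, the stdlib's fnmatch.translate() lets you compile the glob pattern to a regex once and keep the generator expression lazy. A sketch in Python 3 syntax (note this variant is case-sensitive, closer to fnmatch.fnmatchcase(), since it skips the case normalization fnmatch.fnmatch() applies):

```python
import fnmatch
import re

names = ['a.foo', 'b.foo', 'c.bar', 'd.foo']

# fnmatch.translate() converts the glob to a regex string; compiling it
# once avoids per-item pattern handling inside the lazy genexp
pattern = re.compile(fnmatch.translate('*.foo'))
matches = (name for name in names if pattern.match(name))

# values come out one at a time, on demand
first = next(matches)
rest = list(matches)
assert first == 'a.foo'
assert rest == ['b.foo', 'd.foo']
```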
Summary
Got more examples of lists vs. generators? Share them in the comments!
Additional cases of bad behavior similar to fnmatch.filter are especially welcome.
May 1, 2015