Python's os.walk() is a powerful function. It generates the file names and sub-directory names in a directory tree by walking the tree. For each directory in the tree, it yields a 3-tuple (dirpath, dirnames, filenames).
It is not well known that you can modify dirnames in the body of the os.walk() loop to manipulate the recursion!
I’ve seen programmers avoid os.walk() and hack together their own version of it using recursive calls to os.listdir(), with various path manipulations along the way. Rarely was the programmer unfamiliar with os.walk(). More often than not, the reason was that the programmer wanted more control over the recursion. Had she known that this can be done with os.walk(), she would probably have used it and saved time and sweat!
This specific feature is well documented in the Python os.walk docs. Seeing how under-used it is, I wanted to highlight it here, hoping it will serve someone out there 🙂
The case for manipulating directory tree recursion
Why would anyone want to manipulate the dir-tree recursion, anyway?
In fact, there are multiple valid reasons to do that (they are also mentioned in the Python docs, by the way):
- Prune the directory tree being traversed, skipping specific sub-trees.
- Impose a specific order of visiting sub-directories.
- Tell os.walk() about directories that you created during the iteration.
- Tell os.walk() about directories that you renamed during the iteration.
Cool! How do I do it?
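Just edit the dirnames list in place (for example with slice assignment or del), in the body of the loop! Rebinding the name dirnames to a new list has no effect on the walk.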
For example, if you walk the tree starting at the current directory like this:
import os

for dirpath, dirnames, filenames in os.walk('.'):
    print(dirpath, dirnames, filenames)
You can do any of these manipulations to change the behavior of the walk:
import os

# Walk sub-directories in reverse order
for dirpath, dirnames, filenames in os.walk('.', topdown=True):
    dirnames.reverse()
    print(dirpath, dirnames, filenames)

# Prune the ".git" directory
for dirpath, dirnames, filenames in os.walk('.', topdown=True):
    dirnames[:] = [dirname for dirname in dirnames if dirname != '.git']
    print(dirpath, dirnames, filenames)

# Prune directories that contain a file named "foo"
for dirpath, dirnames, filenames in os.walk('.', topdown=True):
    if 'foo' in filenames:
        # Empty the list in place; a bare "del dirnames" would only
        # unbind the local name and leave the walk unchanged.
        del dirnames[:]
        continue
    print(dirpath, dirnames, filenames)
You get the idea.
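The examples above cover pruning and ordering. The remaining case from the list of reasons, telling os.walk() about a directory created during the walk, might look roughly like the sketch below. The function name walk_creating and the directory name "generated" are made up for illustration; the only behavior relied on is that a top-down walk recurses into whatever names are in dirnames once the loop body finishes.

```python
import os
import tempfile


def walk_creating(top, new_name="generated"):
    """Walk `top` top-down, create a sub-directory of the root while
    walking, and tell os.walk() about it by appending to dirnames.

    `walk_creating` and `new_name` are hypothetical names for this sketch.
    Returns the visited directories as paths relative to `top`.
    """
    visited = []
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        visited.append(os.path.relpath(dirpath, top))
        if dirpath == top and new_name not in dirnames:
            os.mkdir(os.path.join(dirpath, new_name))
            # Without this append, the new directory would be skipped:
            # dirnames was computed before the directory existed.
            dirnames.append(new_name)
    return visited


def demo():
    # Build a throwaway tree with a single sub-directory "a".
    with tempfile.TemporaryDirectory() as top:
        os.mkdir(os.path.join(top, "a"))
        return walk_creating(top)
```

Calling demo() should report the root, "a", and the freshly created "generated" directory. Dropping the append line leaves "generated" on disk but unvisited.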
It should be emphasized that this is effective only when topdown=True! Think about it for a moment to become convinced… 🙂
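For readers who would rather check than take it on faith, here is a small sketch. With topdown=False, the sub-directories are walked before the tuple for their parent is yielded, so by the time the loop body can touch dirnames it is too late. The helper names visited_dirs and demo are made up for this illustration.

```python
import os
import tempfile


def visited_dirs(top, topdown):
    """Walk `top` while trying to prune every sub-directory.

    Returns the visited directories as paths relative to `top`.
    """
    visited = []
    for dirpath, dirnames, filenames in os.walk(top, topdown=topdown):
        visited.append(os.path.relpath(dirpath, top))
        del dirnames[:]  # attempt to prune everything below dirpath
    return visited


def demo():
    # Throwaway tree: <top>/a/b
    with tempfile.TemporaryDirectory() as top:
        os.makedirs(os.path.join(top, "a", "b"))
        return visited_dirs(top, True), visited_dirs(top, False)
```

In the top-down walk only the root is visited, because the pruning takes effect. In the bottom-up walk every directory is still visited, deepest first, and the del has no effect on the traversal.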
September 3, 2015
Thanks for this piece. However, I’m having trouble understanding how the pruning actually works. For example, I tried:
for root, dirs, files in os.walk(TOPDIR, topdown=True):
    if len(dirs) > 1:
        dirs.sort()
        dirs = [dirs[-1]]  # or dirs = dirs[-1], doesn't seem to matter
    print('root =', root)
This is for a directory tree with 4 levels of subdirectories below TOPDIR. I thought this would lead to printing the directories for the greatest sorted value directory name at each level (which correspond to year/month/day/hour) but instead it appears that the entire directory tree is printed. Can you explain why?
Thanks,
Jon
September 3, 2015
Thank you for the feedback, Jon!
I think I can explain what’s going on with your example. I think that the problem is in dirs = [dirs[-1]]. This doesn’t modify the actual list used in the external loop, it only creates a new list with one element and binds it to the name “dirs”. In order to actually modify the list, you need to use dirs[:] = [dirs[-1]]. Can you try it and update here if this indeed was the issue?
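To make the difference concrete without involving the file system, here is a minimal sketch using an ordinary list. The function names rebind and assign_in_place and the sample year names are invented for illustration. os.walk() keeps a reference to the very list it yields, so it sees only changes made to that object.

```python
def rebind(dirs):
    # Binds the local name "dirs" to a brand-new list.
    # The caller's list object is left untouched.
    dirs = [dirs[-1]]
    return dirs


def assign_in_place(dirs):
    # Slice assignment replaces the contents of the same list object,
    # so every holder of a reference to it sees the change.
    dirs[:] = [dirs[-1]]
    return dirs


original = ["2013", "2014", "2015"]

rebind(original)
after_rebind = list(original)   # snapshot: still all three names

assign_in_place(original)
after_slice = list(original)    # snapshot: only the last name remains
```

After rebind() the original list still holds all three entries. After assign_in_place() it holds only "2015", which is the effect the pruning needs.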
September 3, 2015
Yes! That did the trick. Thanks
September 3, 2015
Great! Glad to hear 🙂