Say you have two directories with a bunch of text files. How can you create a third directory that will contain all the files that are common to both input directories, such that every output file is a concatenation of both input files?
Shell-Foo is a series of fun ways to take advantage of the powers of the shell. In the series, I highlight shell one-liners that I found useful or interesting. Most of the entries should work on bash on Linux, OS X and other UNIX-variants. Some probably work with other shells as well. Your mileage may vary.
Feel free to suggest your own Shell-Foo one-liners!
My solution
    comm -12 <(ls input1/) <(ls input2/) | \
        xargs -n 1 -P 8 -I fname sh -c \
        'cat input1/fname input2/fname >combined/fname'
Explanation
- comm takes two sorted files and prints their lines in 3 columns:
  - The first column contains lines that appear in the first file, but not in the second.
  - The second column contains lines that appear in the second file, but not in the first.
  - The third column contains lines that appear in both files.
- comm -12 suppresses the first two columns and prints just the third, resulting in a list of common lines. comm expects sorted input, which the output of ls already is.
- <(command) takes the output of command and “wraps” it as a file. It’s a neat bash shortcut to using temporary files like this:
      ls input1/ >input1.ls
      ls input2/ >input2.ls
      comm -12 input1.ls input2.ls
      rm input1.ls input2.ls
- I wrote about xargs (with the -n and -P flags) before. The -I replstr flag tells xargs to replace occurrences of “replstr” in the argument list with the line read from stdin. Since -I already runs one command per input line, -n 1 is redundant here. On BSD xargs (including OS X), replacement happens in up to 5 arguments by default; this can be controlled using the -R number flag. GNU xargs has no -R flag.
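To see these pieces in isolation, here is a small sketch that runs comm on two throwaway directories. The directory and file names (a.txt, b.txt, c.txt) are made up for the demo, and it assumes bash, since it uses <(...).

```bash
#!/usr/bin/env bash
# Sketch: comm's columns, process substitution and xargs -I on made-up files.
set -eu
demo=$(mktemp -d)
mkdir "$demo/input1" "$demo/input2"
touch "$demo/input1/a.txt" "$demo/input1/b.txt"
touch "$demo/input2/b.txt" "$demo/input2/c.txt"

# All three columns; columns 2 and 3 are indented with one and two tabs.
all_columns=$(comm <(ls "$demo/input1") <(ls "$demo/input2"))

# -1 and -2 suppress the first two columns, leaving only the common names.
common=$(comm -12 <(ls "$demo/input1") <(ls "$demo/input2"))

# -I replaces every "fname" inside the argument with the input line.
replaced=$(printf '%s\n' "$common" | xargs -I fname echo "input1/fname + input2/fname")

printf '%s\n' "$all_columns"
printf 'common: %s\n' "$common"
printf 'replaced: %s\n' "$replaced"
rm -r "$demo"
```

Only b.txt exists in both directories, so it is the only name that survives comm -12.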
In case you’re wondering why I wrapped the entire command in sh -c '...', it’s because I want to redirect the output of every command separately, as opposed to redirecting outputs from all commands together to one file. To make this clearer, consider the “intuitive alternative”: xargs -n 1 -P 8 -I fname cat input1/fname input2/fname >combined/fname. This will run cat as expected, but the redirection is handled by the outer shell and applies to xargs itself, before any replacement takes place. The result is that the output of all the cat commands ends up in a single file, literally named combined/fname.
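The difference is easy to check on a scratch directory. The following is a minimal sketch with made-up file contents and two hypothetical output directories, naive and wrapped. I left out -P so that the order of the output is predictable.

```bash
#!/usr/bin/env bash
# Sketch: where the redirection happens, with and without sh -c.
set -eu
demo=$(mktemp -d)
cd "$demo"
mkdir input1 input2 naive wrapped
echo one-a >input1/a.txt; echo two-a >input2/a.txt
echo one-b >input1/b.txt; echo two-b >input2/b.txt

# Naive: the outer shell opens naive/fname once, before xargs even starts.
ls input1 | xargs -I fname cat input1/fname input2/fname >naive/fname

# Wrapped: each sh -c receives the substituted name and opens its own file.
ls input1 | xargs -I fname sh -c 'cat input1/fname input2/fname >wrapped/fname'

naive_files=$(ls naive)            # a single file, literally named "fname"
wrapped_files=$(ls wrapped)        # one file per input name
naive_lines=$(wc -l <naive/fname)  # every cat wrote into the same file

printf 'naive: %s\nwrapped: %s\n' "$naive_files" "$wrapped_files"
cd /
rm -r "$demo"
```

The naive directory ends up with one file holding all four lines, while wrapped gets a.txt and b.txt with two lines each.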
Extra tips
This can be easily generalized to any “combining function” (cat in this case). For example, to get a sorted combined file:
    comm -12 <(ls input1/) <(ls input2/) | \
        xargs -n 1 -P 8 -I fname sh -c \
        'cat input1/fname input2/fname | sort >combined/fname'
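Here is the sorted variant run end to end, as a sketch on a made-up fruit.txt. It assumes bash, and creates the combined directory first, since the redirection does not create it. I dropped -n and -P to keep the demo minimal.

```bash
#!/usr/bin/env bash
# Sketch: the sorted "combining function" on a scratch directory.
set -eu
demo=$(mktemp -d)
cd "$demo"
mkdir input1 input2 combined
printf 'pear\napple\n' >input1/fruit.txt
printf 'fig\nbanana\n' >input2/fruit.txt
echo 'not shared' >input1/only-here.txt

comm -12 <(ls input1/) <(ls input2/) |
    xargs -I fname sh -c 'cat input1/fname input2/fname | sort >combined/fname'

combined_files=$(ls combined)       # only names present in both inputs
sorted=$(cat combined/fruit.txt)    # lines of both inputs, sorted together
printf '%s\n' "$sorted"
cd /
rm -r "$demo"
```

Since sort accepts several files, sort input1/fname input2/fname would do the same without the cat. The pipe form is just easier to adapt to other combining commands.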
The one-liner is not portable to plain POSIX sh, because of the process substitution (<(...)) I used to feed the output of two ls commands into comm. It works in bash, and also in zsh and ksh.
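If you need something that runs under plain sh, a loop can stand in for comm and xargs. This is a rough sketch, not a drop-in replacement: combine_common is a hypothetical helper name, it processes files one at a time (no -P style parallelism), and it only considers regular files.

```sh
#!/bin/sh
# combine_common DIR1 DIR2 OUTDIR (hypothetical helper):
# concatenate every file whose name exists in both DIR1 and DIR2 into OUTDIR.
combine_common() {
    dir1=$1
    dir2=$2
    out=$3
    mkdir -p "$out"
    for path in "$dir1"/*; do
        name=${path##*/}    # strip the directory part
        # Only names present as regular files in both directories qualify.
        if [ -f "$path" ] && [ -f "$dir2/$name" ]; then
            cat "$path" "$dir2/$name" >"$out/$name"
        fi
    done
}

# Usage on made-up files; quoting keeps names with spaces intact.
demo=$(mktemp -d)
mkdir "$demo/input1" "$demo/input2"
echo first >"$demo/input1/shared file.txt"
echo second >"$demo/input2/shared file.txt"
echo lonely >"$demo/input1/only1.txt"
combine_common "$demo/input1" "$demo/input2" "$demo/combined"
listing=$(ls "$demo/combined")
result=$(cat "$demo/combined/shared file.txt")
rm -r "$demo"
```

Unlike the one-liner, the loop copes with file names that contain spaces, because the name is never re-parsed by a shell started with sh -c.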