Managed concurrency in the bash shell
Concurrency is all the rage these days. I started to wonder about using it in my command shell (which is regrettably still bash), since so much of the infrastructure for job control is already in place. I’ve been using xargs -P for a while, but I wanted to see if I could do it all in the shell.
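For reference, the kind of xargs -P invocation I mean looks something like this (just a sketch; -P caps the number of parallel processes and -n limits the arguments per invocation):
$ find . -name '*.tar' -print0 | xargs -0 -P 2 -n 1 bzip2 -9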
“conc”
I came up with a super-simple function to do some experiments:
export CONC_MAX=2
conc() {
    local procs=($(jobs -p))
    local proc_count=${#procs[*]}
    # Block until there is an open slot
    if ((proc_count >= CONC_MAX)); then
        wait
    fi
    # Start our task
    (eval "$@") &
}
export -f conc
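One caveat: a plain wait blocks until every background job finishes, not just one, so a slot only opens up once the whole current batch drains. If your bash is 4.3 or newer, wait -n returns as soon as any single job exits; a tighter variant might look like this (just a sketch, assuming wait -n is available):
conc() {
    # Wait for any single job to finish whenever we're at the limit
    while (( $(jobs -pr | wc -l) >= CONC_MAX )); do
        wait -n
    done
    # Start our task
    (eval "$@") &
}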
To use it, just throw the word ‘conc’ in front of your commands. It will only use up to CONC_MAX processes. I matched that to the number of cores I have. It is especially useful in loops. For example, while this does some things:
for file in *.tar; do bzip2 -9 $file; done
This does some things a little more quickly:
for file in *.tar; do conc bzip2 -9 $file; done
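Rather than matching CONC_MAX to the core count by hand, you can also set it automatically (a sketch; nproc is GNU coreutils, and sysctl -n hw.ncpu is the usual fallback on macOS/BSD):
$ export CONC_MAX=$(nproc 2>/dev/null || sysctl -n hw.ncpu)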
You can control the amount of concurrency pretty easily:
$ (CONC_MAX=10; for x in 1 2 3 4 5 6 7 8 9; do conc "ping -c 1 $x.$x.$x.$x | grep 'bytes from'"; done; wait)
64 bytes from 8.8.8.8: icmp_seq=0 ttl=246 time=37.010 ms
$
You don’t have to place it in a loop though:
$ conc 'sleep 5; echo there'
[3] 68012
[2] Done ( eval "$@" )
$ conc 'sleep 3; echo hi'
[4] 68017
$ hi
there
[3]- Done ( eval "$@" )
[4]+ Done ( eval "$@" )
$
So you can silence all the nasty job-control stuff in the normal ways, or by doing the concs all in a subshell:
$ (conc 'sleep 2; echo there'; conc 'sleep 1; echo hi'; conc 'echo handsome'; wait)
hi
there
handsome
$
Some Trials
So time for some tests.
gzip a few big files
This comparison involves compressing three identical 840 MB tar files containing repetitive character data (CSV files with LIDAR point data).
without conc
$ time gzip -1 *.tar
real 2m0.844s
user 1m41.010s
sys 0m4.071s
with conc
$ time (for file in *.tar; do conc gzip -1 $file; done; wait)
real 1m34.912s
user 1m41.267s
sys 0m4.371s
About 26 seconds saved there. Not bad on a 2-minute job. Now the decompression:
without conc
$ time gunzip *.gz
real 1m2.366s
user 0m42.361s
sys 0m3.898s
with conc
$ time (for file in *.gz; do conc gunzip $file; done; wait)
real 0m51.911s
user 0m42.286s
sys 0m3.972s
This time I only saved about ten seconds on a one minute job.
bzip2 many small files
Now I’ll work with bzip2 and 2,101 small CSV files (412K in size).
without conc
$ time bzip2 -9 *.csv
real 2m36.933s
user 2m23.795s
sys 0m5.110s
with conc
$ time (for file in *.csv; do conc bzip2 -9 $file; done; wait)
real 2m18.669s
user 2m31.386s
sys 0m17.942s
I saved about 18 seconds on a 2.5 minute job. Now the decompression:
without conc
$ time bunzip2 *.bz2
real 1m3.719s
user 0m55.775s
sys 0m4.752s
with conc
$ time (for file in *.bz2; do conc bunzip2 $file; done; wait)
real 1m20.025s
user 1m6.066s
sys 0m17.732s
Now I lost about 17 seconds on a 1.33-minute job. What’s the problem? It’s the minimum of 2,101 processes that get started when using conc, one per file. Plain bunzip2 has the advantage of starting only a single process in this case. What can be done?
xargs vs conc
Re-visiting the bunzip2 example with xargs: in any normal situation, with 2,101 files on a dual-core machine, I’d probably use this type of command line:
$ time (find . -type f | xargs -P 2 -n 1050 bunzip2)
real 0m42.016s
user 0m56.797s
sys 0m4.971s
That is about 20 seconds saved on a one-minute job, which really shows the overhead of all the shell processes conc spawns. But if we do a little arg magic with conc, we can minimize the process count:
$ time (conc bunzip2 0*; conc bunzip2 1* 2*; wait)
real 0m42.428s
user 0m56.821s
sys 0m4.938s
That fixes the time problem, but it is more than a bit ugly. It would be nice to have conc handle big argument lists automatically…
“xconc”
It isn’t too hard to get automatic concurrency. It’s just a matter of dividing the total argument list into groups sized for CONC_MAX and running the groups concurrently. This is basically just emulating xargs -P $CONC_MAX -n $(( $arg_count / $CONC_MAX )). Here’s how it looks with conc:
xconc() {
    local command=$1
    shift
    local arg_count=$#
    local group_size=$(( arg_count / CONC_MAX ))
    # Guard against a zero group size when there are fewer args than CONC_MAX
    (( group_size == 0 )) && group_size=1
    local group_count=$(( (arg_count / group_size) + (arg_count % group_size ? 1 : 0) ))
    (
        local i start
        for (( i = 0; i < group_count; i++ )); do
            start=$(( (i * group_size) + 1 ))
            conc "$command ${@:$start:$group_size}"
        done
        wait
    )
}
export -f xconc
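To make the grouping concrete: with the 2,101 files from the bzip2 test and CONC_MAX=2, group_size works out to 1050 and group_count to 3, so the command is invoked three times (with 1050, 1050, and 1 arguments) instead of 2,101 times.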
Running it is very easy: just xconc <somecommand> <the arguments>. If the command itself needs options, just quote the command portion, e.g. xconc 'bzip2 -9' *.csv. Let’s take a quick look at the earlier bunzip2 example using xconc:
$ time xconc bunzip2 *
real 0m42.296s
user 0m56.814s
sys 0m4.954s
Look at that run time!
Conclusion
I think these could be useful tools to have for casual concurrency. I’ll have to do some more experimenting.
conc.sh, a conc/xconc include for your shell profile, is available on GitHub.