Passing runtime data to AWK

Shell script and AWK are very complementary languages. AWK was designed from its very beginnings at Bell Labs as a pattern-action language for short programs, ideally one or two lines long. It was intended to be used on the Unix shell interactive command line, or in shell scripts. Its feature set filled out some functionality that shell script at the time lacked, and often still lacks, as is the case with floating point numbers; it thereby (indirectly) brings much of the C language’s expressive power to the shell.

It’s therefore both common and reasonable to see AWK one-liners in shell scripts for data processing where doing the same in shell is unwieldy or impossible, especially when floating point operations or data delimiting are involved. While AWK’s full power is in general tragically underused, most shell script users and developers know about one of its most useful properties: selecting a single column from whitespace-delimited data. Sometimes, cut(1) doesn’t, uh, cut it.

In order for one language to cooperate with another usefully via embedded programs in this way, data of some sort needs to be passed between them at runtime, and here there are a few traps with syntax that may catch out unwary shell programmers. We’ll go through a simple example showing the problems, and demonstrate a few potential solutions.

Easy: fixed data

Embedded AWK programs in shell scripts work great when you already know before runtime what you want your patterns for the pattern-action pairs to be. Suppose our company has a vendor-supplied program that returns temperature sensor data for the server room, and we want to run some commands for any and all rows registering over a certain threshold temperature. The output for the existing server-room-temps command might look like this:

$ server-room-temps
ID  Location      Temperature_C
1   hot_aisle_1   27.9
2   hot_aisle_2   30.3
3   cold_aisle_1  26.0
4   cold_aisle_2  25.2
5   outer         23.9

The task for the monitoring script is simple: get a list of all the locations where the temperature is above 28°C. If there are any such locations, we need to email the administrator the full list. Easy! It looks like every introductory AWK example you’ve ever seen—it could be straight out of the book. Let’s type it up on the shell to test it:

$ server-room-temps | awk 'NR > 1 && $3 > 28 {print $2}'
hot_aisle_2

That looks good. The script might end up looking something like this:

#!/bin/sh
alerts=/var/cache/temps/alerts
server-room-temps |
    awk 'NR > 1 && $3 > 28 {print $2}' > "$alerts" || exit
if [ -s "$alerts" ] ; then
    mail -s 'Temperature alert' sysadmin < "$alerts"
fi

So, after writing the alerts data file, we test it with [ -s ... ] to see whether it’s got any data in it. If it does, we send it all to the administrator with mail(1). Done!

We set that running every few minutes with cron(8) or systemd.timer(5), and we have a nice stop-gap solution until the lazy systems administrator gets around to fixing the Nagios server. He’s probably just off playing ADOM again…

Hard: runtime data

A few weeks later, our sysadmin still hasn’t got the Nagios server running, because his high elf wizard is about to hit level 50, and there’s a new request from the boss: can we adjust the script so that it accepts the cutoff temperature data as an argument, and other departments can use it? Sure, why not. Let’s mock that up, with a threshold of, let’s say, 25.5°C.

$ server-room-temps > test-data
$ threshold=25.5
$ awk 'NR > 1 && $3 > $threshold {print $2}' test-data
hot_aisle_1
hot_aisle_2

Wait, that’s not right. There are three lines with temperatures over 25.5°C, not two. Where’s cold_aisle_1?

Looking at the code more carefully, you realize that you assumed your shell variable would be accessible from within the AWK program, when of course, it isn’t; AWK’s variables are independent of shell variables. You don’t know why the hell it’s showing those two rows, though…
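For what it’s worth, the odd result does have a mechanical explanation. Inside single quotes the shell leaves $threshold alone, so AWK reads it as the field operator $ applied to an AWK variable named threshold. That variable was never set, so it evaluates to 0, and $0 is the whole input line. The test therefore becomes $3 > $0, which falls back to a string comparison. Here is a small sketch of that, using a copy of the sample data from above held in a shell variable rather than the real server-room-temps command:

```shell
# Sample rows copied from the server-room-temps output above.
data='ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9'

# "threshold" is an unset AWK variable here, so $threshold means $0,
# the whole line. Print both sides of the comparison for the first row.
sides=$(printf '%s\n' "$data" |
    awk 'NR == 2 {print "left=" $3; print "right=" $threshold}')
printf '%s\n' "$sides"

# The original command, unchanged: a string comparison of $3 and $0.
result=$(printf '%s\n' "$data" |
    awk 'NR > 1 && $3 > $threshold {print $2}')
printf '%s\n' "$result"
```

The first two rows pass only because "27.9" and "30.3" happen to sort after lines beginning with "1" and "2"; the temperature is never compared with 25.5 at all.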

Maybe we need double quotes?

$ awk "NR > 1 && $3 > $threshold {print $2}" test-data
awk: cmd. line:1: NR > 1 &&  > 25.5 {print }
awk: cmd. line:1:            ^ syntax error

Hmm. Nope. Maybe we need to double-quote the variable inside the single quotes?

$ awk 'NR > 1 && $3 > "$threshold" {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1
cold_aisle_2
outer

That’s not right, either. It seems to have printed all the locations, as if it didn’t test the threshold at all.
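The reason, as far as I can tell, is that the double quotes here belong to AWK, not to the shell: the whole program is still inside single quotes, so AWK sees a string constant containing a literal dollar sign. A minimal sketch, with LC_ALL=C set only to make the string ordering predictable:

```shell
# Inside the single-quoted program, "$threshold" is an AWK string
# constant containing a dollar sign, not a shell expansion.
literal=$(awk 'BEGIN {print "$threshold"}')
printf '%s\n' "$literal"

# Comparing a field with a string constant is a string comparison,
# and digits sort after "$" in ASCII, so every temperature "wins".
verdict=$(echo '23.9' |
    LC_ALL=C awk '{ if ($1 > "$threshold") print "greater"; else print "not greater" }')
printf '%s\n' "$verdict"
```

Even the coolest reading in the sample data compares as greater, which is why every location was printed.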

Maybe it should be outside the single quotes?

$ awk 'NR > 1 && $3 > '$threshold' {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

The results look right, now … ah, but wait, we still need to quote the expansion so the shell doesn’t split the value on whitespace:

$ awk 'NR > 1 && $3 > '"$threshold"' {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

Cool, that works. Let’s submit it to the security team and go to lunch.

Caught out

To your surprise, the script is rejected. The security officer says you have an unescaped variable that allows arbitrary code execution. What? Where? It’s just AWK, not SQL…!

To your horror, the security officer demonstrates:

$ threshold='0;{system("echo rm -fr /*");exit}'
$ echo 'NR > 1 && $3 > '"$threshold"' {print $2}'
NR > 1 && $3 > 0;{system("echo rm -fr /*");exit} {print $2}
$ awk 'NR > 1 && $3 > '"$threshold"' {print $2}' test-data
rm -fr /bin /boot /dev /etc /home /initrd.img ...

Oh, hell… if that were installed, and someone were able to set threshold to an arbitrary value, they could execute any AWK code, and thereby shell script, that they wanted to. It’s AWK injection! How embarrassing—good thing that was never going to run as root (…right?) Back to the drawing board …

Validating the data

One approach that might come readily to mind is to ensure that no unexpected characters appear in the value. We could use a case statement before interpolating the variable into the AWK program to check it contains no characters other than digits and a decimal point:

case $threshold in
    *[!0-9.]*) exit 2 ;;
esac

That works just fine, and it’s appropriate to do some data validation at the opening of the script, anyway. It’s certainly better than leaving it as it was. But we learned this lesson with PHP in the 90s; you don’t just filter on characters, or slap in some backslashes—that’s missing the point. Ideally, we need to safely pass the data into the AWK process without ever parsing it as AWK code, sanitized or nay, so the situation doesn’t arise in the first place.
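If you do keep a validation step at the top of the script, it can be made a little stricter than the case statement above, which would still let through an empty value or something like 1.2.3. The following is only a sketch: is_decimal is a hypothetical helper name, and it deliberately rejects negative numbers and exponents, which may or may not suit your sensors.

```shell
# is_decimal is a hypothetical helper, not part of the original script.
# It accepts digits with at most one decimal point, e.g. 25, 25.5, .5
is_decimal() {
    case $1 in
        ''|.)       return 1 ;;  # empty, or a lone point
        *[!0-9.]*)  return 1 ;;  # anything but digits and points
        *.*.*)      return 1 ;;  # more than one point
    esac
    return 0
}

# Try a few candidate values, including an injection attempt.
for candidate in 25.5 28 '' 1.2.3 '0;{system("id")}'; do
    if is_decimal "$candidate"; then
        printf 'ok:       %s\n' "$candidate"
    else
        printf 'rejected: %s\n' "$candidate"
    fi
done
```

In the monitoring script itself, this would be used as something like is_decimal "$threshold" || exit 2 before any AWK is run.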

Environment variables

The shell and your embedded AWK program may not share the shell’s local variables, but they do share environment variables, accessible in AWK’s ENVIRON array. So, passing the threshold in as an environment variable works:

$ THRESHOLD=25.5
$ export THRESHOLD
$ awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

Or, to be a little cleaner:

$ THRESHOLD=25.5 \
    awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

This is already much better. AWK will parse our data only as a variable, and won’t try to execute anything within it. The only snag with this method is picking a name; make sure that you don’t overwrite another, more important environment variable, like PATH or LANG.
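To convince yourself that the value really is treated as data, you can feed the environment-variable version a payload like the security officer’s. This is a sketch with a harmless echo standing in for anything destructive, and a shortened copy of the sample data:

```shell
data='ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0'

# A payload in the style of the security officer's demonstration.
hostile='0;{system("echo INJECTED");exit}'

# The value reaches AWK as the contents of ENVIRON["THRESHOLD"]; it is
# never part of the program text, so system() is not called.
output=$(printf '%s\n' "$data" |
    THRESHOLD=$hostile awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}')
printf '%s\n' "$output"
```

The word INJECTED never appears. Note, though, that with a non-numeric value the test quietly becomes a string comparison, so the list of locations it prints is not meaningful; checking that the threshold looks like a number is still worth doing.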

Another argument

Passing the data as another argument and then reading it out of the ARGV array works, too:

$ awk 'BEGIN{ARGC--} NR > 1 && $3 > ARGV[2] {print $2}' test-data 25.5

This method is also safe from arbitrary code execution, but it’s still somewhat awkward because it requires us to decrease the argument count ARGC by one so that AWK doesn’t try to process a file named “25.5” and end up upset when it’s not there. AWK arguments can mean whatever you need them to mean, but unless told otherwise, AWK generally assumes they are filenames, and will attempt to iterate through them for lines of data to chew on.

Here’s another way that’s very similar; we read the threshold from the second argument, and then blank it out in the ARGV array:

$ awk 'BEGIN{threshold=ARGV[2];ARGV[2]=""}
    NR > 1 && $3 > threshold {print $2}' test-data 25.5

AWK won’t treat the second argument as a filename, because it’s blank by the time it processes it.

Pre-assigned variables

There are two lesser-known syntaxes for passing data into AWK that allow you safely to assign variables at runtime. The first is to use the -v option:

$ awk -v threshold="$threshold" \
    'NR > 1 && $3 > threshold {print $2}' \
    test-data

Another, perhaps even more obscure, is to set them as arguments before the filename data, using the var=value syntax:

$ awk 'NR > 1 && $3 > threshold {print $2}' \
    threshold="$threshold" test-data

Note that in both cases, we still quote the $threshold expansion; this is because the shell is expanding the value before we pass it in.

The difference between these two syntaxes is when the variable assignment occurs. With -v, the assignment happens straight away, before the program starts running, so the variable is already set in any BEGIN block. With the argument form, it happens when the program’s data processing reaches that argument, which is after any BEGIN block has run. The upshot of that is that you could test several files with several different temperatures in one hit, if you wanted to. Note the switch from NR to FNR here: NR keeps counting across files, so only FNR > 1 skips the header line of every file:

$ awk 'FNR > 1 && $3 > threshold {print $2}' \
    threshold=25.5 test-data-1 threshold=26.0 test-data-2

Both of these assignment syntaxes are standardized in POSIX awk.
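Here is a self-contained sketch of the per-file form, in case you want to watch the assignments take effect between files. The directory and the room-a and room-b file names are made up for the illustration, and the rows are a cut-down version of the sample data:

```shell
# Two small sample files in a temporary directory (names are illustrative).
dir=$(mktemp -d) || exit
printf '%s\n' 'ID Location Temperature_C' \
    '1 hot_aisle_1 27.9' '2 cold_aisle_1 26.0' > "$dir/room-a"
printf '%s\n' 'ID Location Temperature_C' \
    '1 hot_aisle_1 27.9' '2 cold_aisle_1 26.0' > "$dir/room-b"

# Each threshold=... argument is applied when AWK reaches it, just
# before the file that follows it is opened. FNR restarts at 1 for
# each file, so both header lines are skipped.
result=$(awk 'FNR > 1 && $3 > threshold {
        name = FILENAME; sub(/.*\//, "", name)  # strip the directory
        print name, $2, "over", threshold
    }' threshold=25.5 "$dir/room-a" threshold=27.5 "$dir/room-b")
printf '%s\n' "$result"
rm -r "$dir"
```

Both rows of room-a are over 25.5, but only hot_aisle_1 in room-b is over 27.5, even though the two files hold identical data.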

These are my preferred methods for passing runtime data; they require no argument count munging, avoid the possibility of trampling on existing environment variables, use AWK’s own variable and expression syntax, and most importantly, the chances of anyone reading the script being able to grasp what’s going on are higher. You can thereby avoid a mess of quoting and back-ticking that often plagues these sorts of embedded programs.
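Putting that together, the boss’s request might be handled along the following lines. This is a sketch rather than a finished script: hot_locations and sample_temps are hypothetical names, and sample_temps only stands in for the vendor’s server-room-temps command so that the example can be run anywhere.

```shell
# hot_locations is a hypothetical helper: it reads sensor rows on
# standard input and prints locations hotter than its first argument.
hot_locations() {
    awk -v threshold="$1" 'NR > 1 && $3 > threshold {print $2}'
}

# Stand-in for the vendor's server-room-temps command, for testing only.
sample_temps() {
    printf '%s\n' \
        'ID  Location    Temperature_C' \
        '1   hot_aisle_1 27.9' \
        '2   hot_aisle_2 30.3' \
        '3   cold_aisle_1    26.0' \
        '4   cold_aisle_2    25.2' \
        '5   outer       23.9'
}

alerts=$(sample_temps | hot_locations 25.5)
printf '%s\n' "$alerts"
```

In the real script you would pipe server-room-temps into hot_locations "$1", passing along the script’s own first argument, and keep the [ -s ... ] test and the mail(1) call as they were.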

Safety not guaranteed

If you take away only one thing from this post, it might be: don’t interpolate shell variables in AWK programs, because it has the same fundamental problems as interpolating data into query strings in PHP. Pass the data in safely instead, using either environment variables, arguments, or AWK variable assignments. Keeping this principle in mind will serve you well for other embedded programs, too; stop thinking in terms of escaping and character whitelists, and start thinking in terms of passing the data safely in the first place.

Bash history expansion

Setting the Bash option histexpand allows some convenient typing shortcuts using Bash history expansion. The option can be set with either of these:

$ set -H
$ set -o histexpand

It’s likely that this option is already set for all interactive shells, as it’s on by default. The manual, man bash, describes these features as follows:

-H  Enable ! style history substitution. This option is on
    by default when the shell is interactive.

You may have come across this before, perhaps to your annoyance, in the following error message, which comes up when an unescaped ! appears in a double-quoted string or in unquoted text:

$ echo "Hi, this is Tom!"
bash: !": event not found

If you don’t want the feature, and would rather have ! behave as a normal character, it can be disabled with either of these:

$ set +H
$ set +o histexpand

History expansion is actually a very old feature of shells, having been available in csh before Bash usage became common.

This article is a good followup to Better Bash history, which among other things explains how to include dates and times in history output, as these examples do.

Basic history expansion

Perhaps the best known and most useful of these expansions is using !! to refer to the previous command. This allows repeating commands quickly, perhaps to monitor the progress of a long process, such as disk space being freed while deleting a large file:

$ rm big_file &
[1] 23608
$ du -sh .
3.9G    .
$ !!
du -sh .
3.3G    .

It can also be useful to specify the full filesystem path to programs that aren’t in your $PATH:

$ hdparm
-bash: hdparm: command not found
$ /sbin/!!
/sbin/hdparm

In each case, note that the command itself is printed as expanded, and then run to print the output on the following line.

History by absolute index

However, !! is actually a specific example of a more general form of history expansion. For example, you can supply the history item number of a specific command to repeat it, after looking it up with history:

$ history | grep expand
 3951  2012-08-16 15:58:53  set -o histexpand
$ !3951
set -o histexpand

You needn’t enter the !3951 on a line by itself; it can be included as any part of the command, for example to add a prefix like sudo:

$ sudo !3951

If you include the escape string \! as part of your Bash prompt, you can include the current command number in the prompt before the command, making repeating commands by index a lot easier as long as they’re still visible on the screen.
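As a sketch, a prompt along these lines in ~/.bashrc would do it; the layout around the \! is just one possible choice, not anything Bash requires:

```shell
# In ~/.bashrc: show the history number of the command about to be
# typed, followed by user, host and working directory.
PS1='[\!] \u@\h:\w\$ '
```

The single quotes matter here, since they keep the backslash escapes intact for Bash to interpret each time it draws the prompt.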

History by relative index

It’s also possible to refer to commands relative to the current command. To substitute the second-to-last command, we can type !-2. For example, to check whether truncating a file with ed worked correctly:

$ wc -l bigfile.txt
267 bigfile.txt
$ printf '%s\n' '11,$d' w | ed -s bigfile.txt
$ !-2
wc -l bigfile.txt
10 bigfile.txt

This works further back into history, with !-3, !-4, and so on.

Expanding for historical arguments

In each of the above cases, we’re substituting for the whole command line. There are also ways to get specific tokens, or words, from the command if we want that. To get the first argument of a particular command in the history, use the !^ token:

$ touch a.txt b.txt c.txt
$ ls !^
ls a.txt
a.txt

To get the last argument, use !$:

$ touch a.txt b.txt c.txt
$ ls !$
ls c.txt
c.txt

To get all arguments (but not the command itself), use !*:

$ touch a.txt b.txt c.txt
$ ls !*
ls a.txt b.txt c.txt
a.txt  b.txt  c.txt

This last one is particularly handy when performing several operations on a group of files; we could run du and wc over them to get their size and character count, and then perhaps decide to delete them based on the output:

$ du a.txt b.txt c.txt
4164    a.txt
5184    b.txt
8356    c.txt
$ wc !*
wc a.txt b.txt c.txt
16689    94038  4250112 a.txt
20749   117100  5294592 b.txt
33190   188557  8539136 c.txt
70628   399695 18083840 total
$ rm !*
rm a.txt b.txt c.txt

These work not just for the preceding command in history, but also absolute and relative command numbers:

$ history 3
 3989  2012-08-16 16:30:59  wc -l b.txt
 3990  2012-08-16 16:31:05  du -sh c.txt
 3991  2012-08-16 16:31:12  history 3
$ echo !3989^
echo -l
-l
$ echo !3990$
echo c.txt
c.txt
$ echo !-1*
echo c.txt
c.txt

More generally, you can use the syntax !n:w to refer to any specific argument in a history item by number. In this case, the first word, usually a command or builtin, is word 0:

$ history | grep bash
 4073  2012-08-16 20:24:53  man bash
$ !4073:0
man
What manual page do you want?
$ !4073:1
bash

You can even select ranges of words by separating their indices with a hyphen:

$ history | grep apt-get
 3663  2012-08-15 17:01:30  sudo apt-get install gnome
$ !3663:0-1 purge !3663:3
sudo apt-get purge gnome

You can include ^ and $ as start and endpoints for these ranges, too. 3* is a shorthand for 3-$, meaning “all arguments from the third to the last.”

Expanding history by string

You can also refer to a previous command in the history that starts with a specific string with the syntax !string:

$ !echo
echo c.txt
c.txt
$ !history
history 3
 4011  2012-08-16 16:38:28  rm a.txt b.txt c.txt
 4012  2012-08-16 16:42:48  echo c.txt
 4013  2012-08-16 16:42:51  history 3

If you want to match any part of the command line, not just the start, you can use !?string?:

$ !?bash?
man bash

Be careful when using these, if you use them at all. By default, Bash will run the most recent command matching the string immediately, with no prompting, so it might be a problem if it doesn’t match the command you expect.

Checking history expansions before running

If you’re paranoid about this, Bash allows you to audit the command as expanded before you enter it, with the histverify option:

$ shopt -s histverify
$ !rm
$ rm a.txt b.txt c.txt

This option works for any history expansion, and may be a good choice for more cautious administrators. It’s a good thing to add to one’s .bashrc if so.
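A minimal sketch of what that .bashrc entry might look like follows. The BASH_VERSION guard is my own addition, there only so the fragment stays harmless if some other shell happens to read the file, since shopt is a Bash builtin:

```shell
# In ~/.bashrc: load history expansions into the editing buffer for
# review instead of running them immediately.
if [ -n "${BASH_VERSION-}" ]; then
    shopt -s histverify
fi
```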

If you don’t need this set all the time, but you do have reservations at some point about running a history command, you can arrange to print the command without running it by adding a :p suffix:

$ !rm:p
rm important-file

In this instance, the command was expanded, but thankfully not actually run.

Substituting strings in history expansions

To get really in-depth, you can also perform substitutions on arbitrary commands from the history with !!:gs/pattern/replacement/. This is getting pretty baroque even for Bash, but it’s possible you may find it useful at some point:

$ !!:gs/txt/mp3/
rm a.mp3 b.mp3 c.mp3

If you only want to replace the first occurrence, you can omit the g:

$ !!:s/txt/mp3/
rm a.mp3 b.txt c.txt

Stripping leading directories or trailing files

If you want to chop a filename off a long argument to work with the directory, you can do this by adding an :h suffix, kind of like a dirname call in Perl:

$ du -sh /home/tom/work/doc.txt
$ cd !$:h
cd /home/tom/work

To do the opposite, like a basename call in Perl, use :t:

$ ls /home/tom/work/doc.txt
$ document=!$:t
document=doc.txt

Stripping extensions or base names

A bit more esoteric, but still possibly useful; to strip a file’s extension, use :r:

$ vi /home/tom/work/doc.txt
$ stripext=!$:r
stripext=/home/tom/work/doc

To do the opposite, to get only the extension, use :e:

$ vi /home/tom/work/doc.txt
$ extonly=!$:e
extonly=.txt
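History expansion is normally off in scripts, so if you find yourself wanting these four modifiers there, shell parameter expansion gives roughly the same results. The following is a sketch of the correspondence rather than an exact equivalence; edge cases such as paths with no slash or no dot behave differently, and the variable names are only illustrative:

```shell
path=/home/tom/work/doc.txt

head=${path%/*}    # like :h, the directory part
tail=${path##*/}   # like :t, the final path component
root=${path%.*}    # like :r, everything before the last dot
ext=${path##*.}    # like :e, but without the leading dot

printf '%s\n' "$head" "$tail" "$root" "$ext"
```

This prints /home/tom/work, doc.txt, /home/tom/work/doc and txt, one per line.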

Quoting history

If you’re performing substitution not to execute a command or fragment but to use it as a string, it’s likely you’ll want to quote it. For example, if you’ve just found through experiment and trial and error an ideal ffmpeg command line to accomplish some task, you might want to save it for later use by writing it to a script:

$ ffmpeg -f alsa -ac 2 -i hw:0,0 -f x11grab -r 30 -s 1600x900 \
> -i :0.0+1600,0 -acodec pcm_s16le -vcodec libx264 -preset ultrafast \
> -crf 0 -threads 0 "$(date +%Y%m%d%H%M%S)".mkv 

To make sure all the escaping is done correctly, you can write the command into the file with the :q modifier:

$ echo '#!/usr/bin/env bash' >ffmpeg.sh
$ echo !ffmpeg:q >>ffmpeg.sh

In this case, this will prevent Bash from performing the command substitution "$(date ... )", instead writing it literally to the file as desired. If you build a lot of complex commands interactively that you later write to scripts once completed, this feature is really helpful and saves a lot of cutting and pasting.

Thanks to commenter Mihai Maruseac for pointing out a bug in the examples.