Passing runtime data to AWK

Shell script and AWK are very complementary languages. AWK was designed from its very beginnings at Bell Labs as a pattern-action language for short programs, ideally one or two lines long. It was intended to be used on the Unix shell interactive command line, or in shell scripts. Its feature set filled out some functionality that shell script at the time lacked, and often still lacks, as is the case with floating point numbers; it thereby (indirectly) brings much of the C language’s expressive power to the shell.

It’s therefore both common and reasonable to see AWK one-liners in shell scripts for data processing where doing the same in shell is unwieldy or impossible, especially when floating point operations or data delimiting are involved. While AWK’s full power is in general tragically underused, most shell script users and developers know about one of its most useful properties: selecting a single column from whitespace-delimited data. Sometimes, cut(1) doesn’t, uh, cut it.

In order for one language to cooperate with another usefully via embedded programs in this way, data of some sort needs to be passed between them at runtime, and here there are a few traps with syntax that may catch out unwary shell programmers. We’ll go through a simple example showing the problems, and demonstrate a few potential solutions.

Easy: fixed data

Embedded AWK programs in shell scripts work great when you already know before runtime what you want your patterns for the pattern-action pairs to be. Suppose our company has a vendor-supplied program that returns temperature sensor data for the server room, and we want to run some commands for any and all rows registering over a certain threshold temperature. The output for the existing server-room-temps command might look like this:

$ server-room-temps
ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9

The task for the monitoring script is simple: get a list of all the locations where the temperature is above 28°C. If there are any such locations, we need to email the administrator the full list. Easy! It looks like every introductory AWK example you’ve ever seen—it could be straight out of the book. Let’s type it up on the shell to test it:

$ server-room-temps | awk 'NR > 1 && $3 > 28 {print $2}'
hot_aisle_2

That looks good. The script might end up looking something like this:

#!/bin/sh
alerts=/var/cache/temps/alerts
server-room-temps |
    awk 'NR > 1 && $3 > 28 {print $2}' > "$alerts" || exit
if [ -s "$alerts" ] ; then
    mail -s 'Temperature alert' sysadmin < "$alerts"
fi

So, after writing the alerts data file, we test it with [ -s ... ] to see whether it’s got any data in it. If it does, we send it all to the administrator with mail(1). Done!

We set that running every few minutes with cron(8) or systemd.timer(5), and we have a nice stop-gap solution until the lazy systems administrator gets around to fixing the Nagios server. He’s probably just off playing ADOM again…

Hard: runtime data

A few weeks later, our sysadmin still hasn’t got the Nagios server running, because his high elf wizard is about to hit level 50, and there’s a new request from the boss: can we adjust the script so that it accepts the cutoff temperature data as an argument, and other departments can use it? Sure, why not. Let’s mock that up, with a threshold of, let’s say, 25.5°C.

$ server-room-temps > test-data
$ threshold=25.5
$ awk 'NR > 1 && $3 > $threshold {print $2}' test-data
hot_aisle_1
hot_aisle_2

Wait, that’s not right. There are three lines with temperatures over 25.5°C, not two. Where’s cold_aisle_1?

Looking at the code more carefully, you realize that you assumed your shell variable would be accessible from within the AWK program, when of course, it isn’t; AWK’s variables are independent of shell variables. You don’t know why the hell it’s showing those two rows, though…
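For the curious, those two rows have a likely explanation. Inside single quotes the shell never expands $threshold, so AWK sees an uninitialized variable named threshold. That counts as 0 in a numeric context, which makes $threshold the same as $0, the whole line. Comparing $3 against a non-numeric line falls back to string comparison, so a row passes only when its temperature happens to sort after its own ID. A minimal sketch, using three of the sample rows from above:

```shell
# Three rows copied from the server-room-temps sample output.
rows='1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0'

# threshold is never set inside AWK, so $threshold means $0 (the whole
# line). Print each location with the comparison result: 1 true, 0 false.
result=$(printf '%s\n' "$rows" | awk '{ print $2, ($3 > $threshold) }')
printf '%s\n' "$result"
```

The two hot aisle rows report 1 and cold_aisle_1 reports 0, because "26.0" sorts before a line that starts with "3".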

Maybe we need double quotes?

$ awk "NR > 1 && $3 > $threshold {print $2}" test-data
awk: cmd. line:1: NR > 1 &&  > 25.5 {print }
awk: cmd. line:1:            ^ syntax error

Hmm. Nope. Maybe we need to expand the variable inside the quotes?

$ awk 'NR > 1 && $3 > "$threshold" {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1
cold_aisle_2
outer

That’s not right, either. It seems to have printed all the locations, as if it didn’t test the threshold at all.
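What probably happened: within the single-quoted program, "$threshold" is just an AWK string literal containing a dollar sign and a word. AWK compares the temperature with that text as strings, and anything starting with a digit sorts after "$" in byte order. A small sketch with made-up temperatures (LC_ALL=C is set here only to pin the comparison to byte order):

```shell
# Inside single quotes, "$threshold" reaches AWK as a literal string.
# LC_ALL=C pins string comparison to plain byte order for this sketch.
result=$(printf '%s\n' 27.9 23.9 0.1 |
    LC_ALL=C awk '{ print $1, ($1 > "$threshold") }')
printf '%s\n' "$result"
```

Every value reports 1, even 0.1, which is why all the locations were printed.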

Maybe it should be outside the single quotes?

$ awk 'NR > 1 && $3 > '$threshold' {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

The results look right now … ah, but wait, we still need to double-quote the expansion so that the shell doesn’t split the value on whitespace:

$ awk 'NR > 1 && $3 > '"$threshold"' {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

Cool, that works. Let’s submit it to the security team and go to lunch.

Caught out

To your surprise, the script is rejected. The security officer says you have an unescaped variable that allows arbitrary code execution. What? Where? It’s just AWK, not SQL…!

To your horror, the security officer demonstrates:

$ threshold='0;{system("echo rm -fr /*");exit}'
$ echo 'NR > 1 && $3 > '"$threshold"' {print $2}'
NR > 1 && $3 > 0;{system("echo rm -fr /*");exit} {print $2}
$ awk 'NR > 1 && $3 > '"$threshold"' {print $2}' test-data
rm -fr /bin /boot /dev /etc /home /initrd.img ...

Oh, hell… if that were installed, and someone were able to set threshold to an arbitrary value, they could execute any AWK code, and thereby shell script, that they wanted to. It’s AWK injection! How embarrassing—good thing that was never going to run as root (…right?) Back to the drawing board …

Validating the data

One approach that might come readily to mind is to ensure that no unexpected characters appear in the value. We could use a case statement before interpolating the variable into the AWK program to check that it contains nothing but digits and decimal points:

case $threshold in
    *[!0-9.]*) exit 2 ;;
esac

That works just fine, and it’s appropriate to do some data validation at the opening of the script, anyway. It’s certainly better than leaving it as it was. But we learned this lesson with PHP in the 90s; you don’t just filter on characters, or slap in some backslashes—that’s missing the point. Ideally, we need to safely pass the data into the AWK process without ever parsing it as AWK code, sanitized or nay, so the situation doesn’t arise in the first place.
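Note that the pattern above still lets through an empty string, a lone ".", or a value like 1.2.3. If the check is kept as a first line of defence, a slightly stricter version might look like the following sketch. The name is_decimal is a hypothetical helper, not part of the original script, and it assumes thresholds are unsigned decimals.

```shell
# is_decimal: hypothetical helper. Succeeds only if $1 is made of digits
# with at most one decimal point and at least one digit.
is_decimal() {
    case $1 in
        ''|*[!0-9.]*|*.*.*) return 1 ;;   # empty, bad character, two points
    esac
    case $1 in
        *[0-9]*) return 0 ;;              # contains at least one digit
    esac
    return 1                              # only a lone "."
}

# Usage: report which candidate values would be accepted.
for candidate in 25.5 30 . 1.2.3 '0;{exit}'; do
    if is_decimal "$candidate"; then
        printf 'accepted: %s\n' "$candidate"
    else
        printf 'rejected: %s\n' "$candidate"
    fi
done
```

In a script, the call would replace the case statement, for example: is_decimal "$threshold" || exit 2.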

Environment variables

The shell and your embedded AWK program may not share the shell’s local variables, but they do share environment variables, accessible in AWK’s ENVIRON array. So, passing the threshold in as an environment variable works:

$ THRESHOLD=25.5
$ export THRESHOLD
$ awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

Or, to be a little cleaner:

$ THRESHOLD=25.5 \
    awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

This is already much better. AWK will treat our data only as a value, and won’t try to execute anything within it. The only snag with this method is picking a name; make sure that you don’t overwrite another, more important environment variable, like PATH or LANG.
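To see the difference, here is a sketch that feeds a harmless stand-in for the security officer’s payload through the environment. The word INJECTED is made up for this illustration, and the sample rows are copied from the server-room-temps output earlier.

```shell
# Sample rows standing in for the server-room-temps output.
data='ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9'

# A harmless stand-in for the security officer's payload.
payload='0;{system("echo INJECTED");exit}'

# The payload arrives as the value of ENVIRON["THRESHOLD"], never as code.
result=$(printf '%s\n' "$data" |
    THRESHOLD="$payload" awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}')
printf '%s\n' "$result"
```

Nothing is executed; the payload is only compared with each temperature as a string, so all five locations are printed. The answer is meaningless for a non-numeric threshold, which is one reason to keep the validation step as well.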

Another argument

Passing the data as another argument and then reading it out of the ARGV array works, too:

$ awk 'BEGIN{ARGC--} NR > 1 && $3 > ARGV[2] {print $2}' test-data 25.5

This method is also safe from arbitrary code execution, but it’s still somewhat awkward because it requires us to decrease the argument count ARGC by one so that AWK doesn’t try to process a file named “25.5” and end up upset when it’s not there. AWK arguments can mean whatever you need them to mean, but unless told otherwise, AWK generally assumes they are filenames, and will attempt to iterate through them for lines of data to chew on.

Here’s another way that’s very similar; we read the threshold from the second argument, and then blank it out in the ARGV array:

$ awk 'BEGIN{threshold=ARGV[2];ARGV[2]=""}
    NR > 1 && $3 > threshold {print $2}' test-data 25.5

AWK won’t treat the second argument as a filename, because it’s blank by the time it processes it.
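The same trick should also work when the data arrives on a pipe rather than in a file. In this sketch the threshold is the only argument, so once it has been blanked out AWK has no file operands left and falls back to reading standard input. The sample rows are copied from the server-room-temps output earlier.

```shell
# Sample rows standing in for the server-room-temps output.
data='ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9'

# Take the threshold from ARGV[1], then blank it so that AWK does not
# try to open a file named "25.5" and reads standard input instead.
result=$(printf '%s\n' "$data" |
    awk 'BEGIN { threshold = ARGV[1]; ARGV[1] = "" }
        NR > 1 && $3 > threshold {print $2}' 25.5)
printf '%s\n' "$result"
```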

Pre-assigned variables

There are two lesser-known syntaxes for passing data into AWK that allow you to assign variables safely at runtime. The first is to use the -v option:

$ awk -v threshold="$threshold" \
    'NR > 1 && $3 > threshold {print $2}' \
    test-data

Another, perhaps even more obscure, is to set them as arguments before the filename data, using the var=value syntax:

$ awk 'NR > 1 && $3 > threshold {print $2}' \
    threshold="$threshold" test-data

Note that in both cases, we still quote the $threshold expansion; this is because the shell is expanding the value before we pass it in.
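One caveat worth knowing: POSIX specifies that values assigned with -v or var=value have backslash escape sequences processed, as if they appeared in an AWK string, while ENVIRON values are left untouched. For a numeric threshold that makes no difference, but for paths or free text it can. A small sketch, with a made-up value, comparing the length AWK sees each way:

```shell
value='a\nb'    # four characters: a, backslash, n, b

# With -v, the \n is processed as an escape sequence (a newline).
len_v=$(awk -v s="$value" 'BEGIN { print length(s) }')

# Through the environment, the backslash and n stay as they are.
len_env=$(VALUE="$value" awk 'BEGIN { print length(ENVIRON["VALUE"]) }')

printf '%s %s\n' "$len_v" "$len_env"
```

If the value may contain backslashes that must survive intact, the environment or ARGV route is the safer choice.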

The difference between these two syntaxes is when the variable assignment occurs. With -v, the assignment happens straight away, before the program starts running, so the variable is already set in any BEGIN block. With the argument form, it happens when the program’s data processing reaches that argument, which means the variable is not yet set in BEGIN. The upshot of that is that you could test several files with several different temperatures in one hit, if you wanted to. Because NR keeps counting across files, we switch to the per-file counter FNR so that each file’s header line is skipped:

$ awk 'FNR > 1 && $3 > threshold {print $2}' \
    threshold=25.5 test-data-1 threshold=26.0 test-data-2

Both of these assignment syntaxes are standardized in POSIX awk.
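Here is a self-contained sketch of the per-file threshold idea. It assumes mktemp is available and, purely for illustration, uses the same sample rows for both files. FNR is used rather than NR so that the header of the second file is skipped as well.

```shell
# Create two identical sample files in a scratch directory.
dir=$(mktemp -d) || exit
cat > "$dir/test-data-1" <<'EOF'
ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9
EOF
cp "$dir/test-data-1" "$dir/test-data-2"

# Each assignment takes effect when AWK reaches it in the argument list,
# so the first file is tested against 25.5 and the second against 26.0.
result=$(cd "$dir" && awk 'FNR > 1 && $3 > threshold {print $2}' \
    threshold=25.5 test-data-1 threshold=26.0 test-data-2)
printf '%s\n' "$result"

rm -rf "$dir"
```

The first file yields three locations; the second yields only the two hot aisles, since 26.0 is not greater than 26.0.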

These are my preferred methods for passing runtime data; they require no argument count munging, avoid the possibility of trampling on existing environment variables, use AWK’s own variable and expression syntax, and most importantly, the chances of anyone reading the script being able to grasp what’s going on are higher. You can thereby avoid a mess of quoting and back-ticking that often plagues these sorts of embedded programs.

Safety not guaranteed

If you take away only one thing from this post, it might be: don’t interpolate shell variables in AWK programs, because it has the same fundamental problems as interpolating data into query strings in PHP. Pass the data in safely instead, using either environment variables, arguments, or AWK variable assignments. Keeping this principle in mind will serve you well for other embedded programs, too; stop thinking in terms of escaping and character whitelists, and start thinking in terms of passing the data safely in the first place.
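Putting that advice to work, the filtering step of the monitoring script could be wrapped up along these lines. This is only a sketch: hot_locations is a hypothetical function name, and the sample rows stand in for the real server-room-temps output so that the example runs on its own.

```shell
# hot_locations: hypothetical helper. Reads sensor rows on standard input
# and prints the locations whose temperature exceeds the threshold in $1.
hot_locations() {
    awk -v threshold="$1" 'NR > 1 && $3 > threshold {print $2}'
}

# Sample rows standing in for the server-room-temps output.
data='ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9'

alerts=$(printf '%s\n' "$data" | hot_locations 28)
if [ -n "$alerts" ]; then
    printf 'Temperature alert:\n%s\n' "$alerts"
fi
```

In the real script, the threshold would come from "$1" and the data from server-room-temps, with the alert text going to mail(1) as before.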

Elegant Awk usage

For many system administrators, Awk is used only as a way to print specific columns of data from programs that generate columnar output, such as netstat or ps. For example, to get a list of all the IP addresses and ports with open TCP connections on a machine, one might run the following:

# netstat -ant | awk '{print $5}'

This works pretty well, but along with the data you actually wanted, it also includes the fifth word of the opening explanatory note and the fifth word of the column heading line:

and
Address
0.0.0.0:*
205.188.17.70:443
172.20.0.236:5222
72.14.203.125:5222

There are varying ways to deal with this.

Matching patterns

One common way is to pipe the output further through a call to grep, perhaps to only include results with at least one number:

# netstat -ant | awk '{print $5}' | grep '[0-9]'

In this case, it’s instructive to use the awk call a bit more intelligently by setting a regular expression which the applicable line must match in order for that field to be printed, with the standard / characters as delimiters. This eliminates the need for the call to grep:

# netstat -ant | awk '/[0-9]/ {print $5}'

We can further refine this by ensuring that the regular expression should only match data in the fifth column of the output, using the ~ operator:

# netstat -ant | awk '$5 ~ /[0-9]/ {print $5}'

Skipping lines

Another approach you could take to strip the headers out might be to use sed to skip the first two lines of the output:

# netstat -ant | awk '{print $5}' | sed 1,2d

However, this can also be incorporated into the awk call, using the NR variable in a pattern that checks that the line number is greater than two:

# netstat -ant | awk 'NR>2 {print $5}'
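To try the pattern and line-number approaches without a live netstat, here is a sketch that runs both against canned text. The sample lines are an assumption about the usual netstat -ant layout, and the local addresses in them are invented for illustration.

```shell
# Canned text imitating netstat -ant output; local addresses are made up.
sample='Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 192.0.2.10:51234        205.188.17.70:443       ESTABLISHED
tcp        0      0 192.0.2.10:51240        172.20.0.236:5222       ESTABLISHED'

# Keep rows whose fifth field contains a digit.
by_pattern=$(printf '%s\n' "$sample" | awk '$5 ~ /[0-9]/ {print $5}')

# Skip the first two lines by line number instead.
by_line=$(printf '%s\n' "$sample" | awk 'NR>2 {print $5}')

printf '%s\n' "$by_pattern"
```

Both variables end up holding the same three foreign addresses, with neither header line included.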

Combining and excluding patterns

Another common idiom on systems that don’t have the special pgrep command is to filter ps output for a string, but exclude the grep process itself from the output with grep -v grep:

# ps -ef | grep apache | grep -v grep | awk '{print $2}'

If you’re using Awk to get columnar data from the output, in this case the second column containing the process ID, both calls to grep can instead be incorporated into the awk call:

# ps -ef | awk '/apache/ && !/awk/ {print $2}'

Again, this can be further refined if necessary to ensure you’re only matching the expressions against the command name by specifying the field number for each comparison:

# ps -ef | awk '$8 ~ /apache/ && $8 !~ /awk/ {print $2}'
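The difference between matching the whole line and matching only the command field shows up when "apache" appears in an argument rather than in the command name. The sketch below uses invented ps -ef style rows, assuming the common layout where the command is the eighth field.

```shell
# Invented rows in the usual ps -ef layout (command in field 8).
sample='UID        PID  PPID  C STIME TTY          TIME CMD
root      1201     1  0 09:00 ?        00:00:01 /usr/sbin/apache2 -k start
www-data  1210  1201  0 09:00 ?        00:00:00 /usr/sbin/apache2 -k start
alice     2314  2200  0 09:05 pts/0    00:00:00 vi /etc/apache2/apache2.conf
alice     2320  2200  0 09:06 pts/1    00:00:00 awk /apache/ && !/awk/ {print $2}'

# Whole-line match: also catches the editor session on the config file.
whole_line=$(printf '%s\n' "$sample" | awk '/apache/ && !/awk/ {print $2}')

# Field match: only processes whose command name contains "apache".
by_field=$(printf '%s\n' "$sample" |
    awk '$8 ~ /apache/ && $8 !~ /awk/ {print $2}')

printf 'whole line: %s\n' $whole_line
printf 'by field:   %s\n' $by_field
```

The whole-line version reports PID 2314 as well, while the field version reports only the two apache2 processes.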

If you’re used to using Awk purely as a column filter, the above might help to increase its utility for you and allow you to write shorter and more efficient command lines. The Awk Primer on Wikibooks is a really good reference for using Awk to its fullest for the sorts of tasks for which it’s especially well-suited.