Passing runtime data to AWK

Shell script and AWK are very complementary languages. AWK was designed from its very beginnings at Bell Labs as a pattern-action language for short programs, ideally one or two lines long. It was intended to be used on the Unix shell interactive command line, or in shell scripts. Its feature set filled out some functionality that shell script at the time lacked, and often still lacks, as is the case with floating point numbers; it thereby (indirectly) brings much of the C language’s expressive power to the shell.

It’s therefore both common and reasonable to see AWK one-liners in shell scripts for data processing where doing the same in shell is unwieldy or impossible, especially when floating point operations or data delimiting are involved. While AWK’s full power is in general tragically underused, most shell script users and developers know about one of its most useful properties: selecting a single column from whitespace-delimited data. Sometimes, cut(1) doesn’t, uh, cut it.

In order for one language to cooperate with another usefully via embedded programs in this way, data of some sort needs to be passed between them at runtime, and here there are a few traps with syntax that may catch out unwary shell programmers. We’ll go through a simple example showing the problems, and demonstrate a few potential solutions.

Easy: Fixed data

Embedded AWK programs in shell scripts work great when you already know before runtime what you want your patterns for the pattern-action pairs to be. Suppose our company has a vendor-supplied program that returns temperature sensor data for the server room, and we want to run some commands for any and all rows registering over a certain threshold temperature. The output for the existing server-room-temps command might look like this:

$ server-room-temps
ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9

The task for the monitoring script is simple: get a list of all the locations where the temperature is above 28°C. If there are any such locations, we need to email the administrator the full list. Easy! It looks like every introductory AWK example you’ve ever seen—it could be straight out of the book. Let’s type it up on the shell to test it:

$ server-room-temps | awk 'NR > 1 && $3 > 28 {print $2}'
hot_aisle_2

That looks good. The script might end up looking something like this:

#!/bin/sh
alerts=/var/cache/temps/alerts
server-room-temps |
    awk 'NR > 1 && $3 > 28 {print $2}' > "$alerts" || exit
if [ -s "$alerts" ] ; then
    mail -s 'Temperature alert' sysadmin < "$alerts"
fi

So, after writing the alerts data file, we test it with [ -s ... ] to see whether it’s got any data in it. If it does, we send it all to the administrator with mail(1). Done!

We set that running every few minutes with cron(8) or systemd.timer(5), and we have a nice stop-gap solution until the lazy systems administrator gets around to fixing the Nagios server. He’s probably just off playing ADOM again…

Hard: Runtime data

A few weeks later, our sysadmin still hasn’t got the Nagios server running, because his high elf wizard is about to hit level 50, and there’s a new request from the boss: can we adjust the script so that it accepts the cutoff temperature data as an argument, and other departments can use it? Sure, why not. Let’s mock that up, with a threshold of, let’s say, 25.5°C.

$ server-room-temps > test-data
$ threshold=25.5
$ awk 'NR > 1 && $3 > $threshold {print $2}' test-data
hot_aisle_1
hot_aisle_2

Wait, that’s not right. There are three lines with temperatures over 25.5°C, not two. Where’s cold_aisle_1?

Looking at the code more carefully, you realize that you assumed your shell variable would be accessible from within the AWK program, when of course, it isn’t; AWK’s variables are independent of shell variables. You don’t know why the hell it’s showing those two rows, though…
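
For the record, the odd result has a mundane explanation, which the sketch below illustrates. Inside single quotes the shell leaves $threshold alone, so AWK sees it literally; threshold is an unset AWK variable, which counts as 0 when used as a field number, so $threshold means $0, the whole input line. The comparison $3 > $0 can’t be done numerically, so AWK falls back to comparing strings. The sample line is just the first data row from the output above.

```shell
# threshold is unset inside awk, so $threshold is $(0): the whole line
line='1   hot_aisle_1 27.9'
same=$(printf '%s\n' "$line" | awk '{ print ($threshold == $0) }')

# As strings, "27.9" sorts after "1   hot_aisle_1 27.9" ('2' > '1'),
# so this row passes the test for reasons unrelated to temperature
match=$(printf '%s\n' "$line" | awk '$3 > $threshold { print $2 }')

printf 'same=%s match=%s\n' "$same" "$match"
```

The same string ordering explains why cold_aisle_1 went missing: its line starts with 3, which sorts after the 2 of 26.0.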

Maybe we need double quotes?

$ awk "NR > 1 && $3 > $threshold {print $2}" test-data
awk: cmd. line:1: NR > 1 &&  > 25.5 {print}
awk: cmd. line:1:            ^ syntax error

Hmm. Nope. Maybe we need to expand the variable inside the quotes?

$ awk 'NR > 1 && $3 > "$threshold" {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1
cold_aisle_2
outer

That’s not right, either. It seems to have printed all the locations, as if it didn’t test the threshold at all.

Maybe it should be outside the single quotes?

$ awk 'NR > 1 && $3 > '$threshold' {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

The results look right, now … ah, but wait, we still need to double-quote the expansion, to stop the shell splitting the value on whitespace:

$ awk 'NR > 1 && $3 > '"$threshold"' {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

Cool, that works. Let’s submit it to the security team and go to lunch.

Caught out

To your surprise, the script is rejected. The security officer says you have an unescaped variable that allows arbitrary code execution. What? Where? It’s just AWK, not SQL…!

To your horror, the security officer demonstrates:

$ threshold='0;{system("echo rm -fr /*");exit}'
$ echo 'NR > 1 && $3 > '"$threshold"' {print $2}'
NR > 1 && $3 > 0;{system("echo rm -fr /*");exit} {print $2}
$ awk 'NR > 1 && $3 > '"$threshold"' {print $2}' test-data
rm -fr /bin /boot /dev /etc /home /initrd.img ...

Oh, hell… if that were installed, and someone were able to set threshold to an arbitrary value, they could execute any AWK code, and thereby shell script, that they wanted to. It’s AWK injection! How embarrassing—good thing that was never going to run as root (…right?) Back to the drawing board …

Validating the data

One approach that might come readily to mind is to ensure that no unexpected characters appear in the value. We could use a case statement before interpolating the variable into the AWK program to check that it contains nothing but digits and decimal points:

case $threshold in
    *[!0-9.]*) exit 2 ;;
esac

That works just fine, and it’s appropriate to do some data validation at the opening of the script, anyway. It’s certainly better than leaving it as it was. But we learned this lesson with PHP in the 90s; you don’t just filter on characters, or slap in some backslashes—that’s missing the point. Ideally, we need to safely pass the data into the AWK process without ever parsing it as AWK code, sanitized or nay, so the situation doesn’t arise in the first place.
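
If you do keep a validation step at the top of the script, note that the pattern above still lets through an empty string, and values such as 1.2.3. A slightly stricter check might look like the sketch below. The function name is_decimal is made up for illustration, and it deliberately accepts only unsigned values such as 28 or 25.5.

```shell
# is_decimal is a hypothetical helper: succeed only for unsigned decimal
# numbers such as 28, 25.5 or .5
is_decimal() {
    case $1 in
        # empty, a lone dot, any non-digit/non-dot character, or two dots
        '' | . | *[!0-9.]* | *.*.*) return 1 ;;
        *) return 0 ;;
    esac
}

for value in 25.5 28 '' 1.2.3 '0;{system("id")}'; do
    if is_decimal "$value"; then
        printf 'ok:       %s\n' "$value"
    else
        printf 'rejected: %s\n' "$value"
    fi
done
```

This is still only a sanity check on the input; it doesn’t change the point that the value should never be parsed as AWK code in the first place.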

Environment variables

The shell and your embedded AWK program may not share the shell’s local variables, but they do share environment variables, accessible in AWK’s ENVIRON array. So, passing the threshold in as an environment variable works:

$ THRESHOLD=25.5
$ export THRESHOLD
$ awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

Or, to be a little cleaner:

$ THRESHOLD=25.5 \
    awk 'NR > 1 && $3 > ENVIRON["THRESHOLD"] {print $2}' test-data
hot_aisle_1
hot_aisle_2
cold_aisle_1

This is already much better. AWK will treat our data only as the value of a variable, and won’t try to execute anything within it. The only snag with this method is picking a name; make sure that you don’t overwrite another, more important environment variable, like PATH or LANG.
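
To see that the hostile value from earlier is inert when passed this way, here is a small sketch. The variable name TEMPS_THRESHOLD is an arbitrary choice, prefixed to make a clash with an existing variable less likely; the helper function over_threshold is hypothetical, and the temporary file just recreates the sample data from above.

```shell
# Recreate the sample data in a temporary file
data=$(mktemp) || exit
cat > "$data" <<'EOF'
ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9
EOF

# over_threshold is a hypothetical helper; the value reaches awk only
# through its environment, never as part of the program text
over_threshold() {
    TEMPS_THRESHOLD=$1 \
        awk 'NR > 1 && $3 > ENVIRON["TEMPS_THRESHOLD"] {print $2}' "$data"
}

safe=$(over_threshold 28)
# The payload is compared against $3 as a plain string; nothing in it runs
hostile=$(over_threshold '0;{system("echo rm -fr /*");exit}')

printf '%s\n' "$safe"
rm -f -- "$data"
```

The hostile value still produces a meaningless comparison, so checking the input remains worthwhile, but the worst outcome is now a wrong list rather than someone else’s code running.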

Another argument

Passing the data as another argument and then reading it out of the ARGV array works, too:

$ awk 'BEGIN{ARGC--} NR > 1 && $3 > ARGV[2] {print $2}' test-data 25.5

This method is also safe from arbitrary code execution, but it’s still somewhat awkward because it requires us to decrease the argument count ARGC by one so that AWK doesn’t try to process a file named “25.5” and end up upset when it’s not there. AWK arguments can mean whatever you need them to mean, but unless told otherwise, AWK generally assumes they are filenames, and will attempt to iterate through them for lines of data to chew on.

Here’s another way that’s very similar; we read the threshold from the second argument, and then blank it out in the ARGV array:

$ awk 'BEGIN{threshold=ARGV[2];ARGV[2]=""}
    NR > 1 && $3 > threshold {print $2}' test-data 25.5

AWK won’t treat the second argument as a filename, because it’s blank by the time it processes it.
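
A variation on the same idea, sketched below, puts the threshold first on the command line, so that any number of data files can follow it. The cut-down sample data and the choice of FNR (so that each file’s header line is skipped) are assumptions made for the example.

```shell
# A cut-down copy of the sample data in a temporary file
data=$(mktemp) || exit
printf '%s\n' 'ID Location Temperature_C' \
    '1 hot_aisle_1 27.9' \
    '2 hot_aisle_2 30.3' \
    '3 cold_aisle_1 26.0' > "$data"

# ARGV[0] is the awk command name, ARGV[1] is the threshold, and the
# remaining arguments are input files; blank ARGV[1] so it isn't opened
result=$(awk 'BEGIN { threshold = ARGV[1]; ARGV[1] = "" }
    FNR > 1 && $3 > threshold { print $2 }' 28 "$data")

printf '%s\n' "$result"
rm -f -- "$data"
```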

Pre-assigned variables

There are two lesser-known syntaxes for passing data into AWK that allow you to assign variables safely at runtime. The first is to use the -v option:

$ awk -v threshold="$threshold" \
    'NR > 1 && $3 > threshold {print $2}' \
    test-data

Another, perhaps even more obscure, is to set them as arguments before the filename data, using the var=value syntax:

$ awk 'NR > 1 && $3 > threshold {print $2}' \
    threshold="$threshold" test-data

Note that in both cases, we still quote the $threshold expansion; this is because the shell is expanding the value before we pass it in.

The difference between these two syntaxes is when the variable assignment occurs. With -v, the assignment happens straight away, before the program starts running at all, so the variable is already set even in a BEGIN block. With the argument form, it happens only when the program’s data processing reaches that argument, so the variable is not yet set in BEGIN. The upshot of that is that you could test several files with several different temperatures in one hit, if you wanted to. Note the switch from NR to FNR here: NR keeps counting across files, so only the per-file counter FNR skips the header line of every file:

$ awk 'FNR > 1 && $3 > threshold {print $2}' \
    threshold=25.5 test-data-1 threshold=26.0 test-data-2

Both of these assignment syntaxes are standardized in POSIX awk.
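
The timing difference is easiest to see from a BEGIN block, as in this small sketch; the variable name x is arbitrary.

```shell
# -v assignments are made before the program starts, so BEGIN sees them
with_v=$(awk -v x=1 'BEGIN { print "[" x "]" }')

# Operand assignments are made only when awk reaches that operand while
# working through its inputs, which is after BEGIN has already run
as_operand=$(awk 'BEGIN { print "[" x "]" }' x=1)

printf '%s %s\n' "$with_v" "$as_operand"   # prints: [1] []
```

So if the value is needed in BEGIN, -v is the form to reach for.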

These are my preferred methods for passing runtime data; they require no argument count munging, avoid the possibility of trampling on existing environment variables, use AWK’s own variable and expression syntax, and most importantly, the chances of anyone reading the script being able to grasp what’s going on are higher. You can thereby avoid a mess of quoting and back-ticking that often plagues these sorts of embedded programs.
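
Putting that together, the script the boss asked for might end up looking something like the sketch below. It’s written as a shell function, hot_locations, purely so that it can be tried out in one piece; the function name and the sample data in a temporary file are stand-ins for the real script and for the vendor’s server-room-temps command.

```shell
# hot_locations is a hypothetical stand-in for the monitoring script:
# the first argument is the threshold, the second a file of readings
hot_locations() {
    awk -v threshold="$1" 'NR > 1 && $3 > threshold {print $2}' "$2"
}

# Recreate the sample data in a temporary file
data=$(mktemp) || exit
cat > "$data" <<'EOF'
ID  Location    Temperature_C
1   hot_aisle_1 27.9
2   hot_aisle_2 30.3
3   cold_aisle_1    26.0
4   cold_aisle_2    25.2
5   outer       23.9
EOF

alerts=$(hot_locations 25.5 "$data")
if [ -n "$alerts" ]; then
    printf '%s\n' "$alerts"
fi
rm -f -- "$data"
```

In the real script, the printf would be replaced by the mail(1) call from earlier, and the data would come from a pipe rather than a file.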

Safety not guaranteed

If you take away only one thing from this post, it might be: don’t interpolate shell variables in AWK programs, because it has the same fundamental problems as interpolating data into query strings in PHP. Pass the data in safely instead, using either environment variables, arguments, or AWK variable assignments. Keeping this principle in mind will serve you well for other embedded programs, too; stop thinking in terms of escaping and character whitelists, and start thinking in terms of passing the data safely in the first place.

Default grep options

When you’re searching a set of version-controlled files for a string with grep, particularly if it’s a recursive search, it can get very annoying to be presented with swathes of results from the internals of hidden version control directories like .svn or .git, or with metadata you’re unlikely to have wanted from files like .gitmodules.

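Older versions of GNU grep read an environment variable named GREP_OPTIONS to define a set of options that are always applied to every call to grep. This comes in handy when exported in your .bashrc file to set a “standard” grep environment for your interactive shell. Be aware that GNU grep 2.21 declared the variable obsolescent and started warning about its use, and more recent releases ignore it entirely, so on a current system the alias approach described further down is the one to use. Here’s an example of a definition of GREP_OPTIONS that excludes a lot of patterns which you’d very rarely if ever want to search with grep: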

GREP_OPTIONS=
for pattern in .cvs .git .hg .svn; do
    GREP_OPTIONS="$GREP_OPTIONS --exclude-dir=$pattern"
done
export GREP_OPTIONS

Note that --exclude-dir is a relatively recent addition to the options for GNU grep, but it should only be missing on very old GNU/Linux machines by now. If you want to keep your .bashrc file compatible, you could apply a little extra hackery to make sure the option is available before you set it up to be used:

GREP_OPTIONS=
if grep --help | grep -- --exclude-dir &>/dev/null; then
    for pattern in .cvs .git .hg .svn; do
        GREP_OPTIONS="$GREP_OPTIONS --exclude-dir=$pattern"
    done
fi
export GREP_OPTIONS

Similarly, you can ignore single files with --exclude. There’s also --exclude-from=FILE if your list of excluded patterns starts getting too long.
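
To check what --exclude-dir actually does before committing it to your configuration, you can try it on a throwaway directory tree, as in this sketch. It assumes a grep that supports -r and --exclude-dir, as GNU grep does; the file names and the search string needle are made up for the test.

```shell
# Build a tiny fake checkout: one real source file, one VCS internal file
tree=$(mktemp -d) || exit
mkdir -p "$tree/src" "$tree/.git"
echo 'needle' > "$tree/src/main.c"
echo 'needle' > "$tree/.git/packed-refs"

# Without the option both files match; with it, grep skips the .git
# directory during the recursive search
all=$(cd "$tree" && grep -rl needle . | sort)
filtered=$(cd "$tree" && grep -rl --exclude-dir=.git needle .)

printf '%s\n' "$filtered"
rm -rf -- "$tree"
```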

Other useful options available in GNU grep that you might wish to add to this environment variable include:

  • --color: On appropriate terminal types, highlights the pattern matches in output, among other color changes that make results more readable.
  • -s: Suppresses error messages about files not existing or being unreadable; helps if you find this behaviour more annoying than useful.
  • -E, -F, or -P: Picks a favourite “mode” for grep; devotees of PCRE may find that adding -P makes grep behave in a much more pleasing way, even though the manual describes its PCRE support as experimental and incomplete.

If you don’t want to use GREP_OPTIONS, you could instead simply set up an alias:

alias grep='grep --exclude-dir=.git'

You may actually prefer this method as it’s essentially functionally equivalent, but if you do it this way, when you want to call grep without your standard set of options, you only have to prepend a backslash to its call:

$ \grep pattern file

Commenter Andy Pearce also points out that using this method can avoid some build problems where GREP_OPTIONS would interfere.

Of course, you could solve a lot of these problems simply by using ack … but that’s another post.

Tmux environment variables

The user configuration file for the tmux terminal multiplexer, .tmux.conf, supports defining and using environment variables in the configuration, with the same syntax as most shell script languages:

TERM=screen-256color
set-option -g default-terminal $TERM

This can be useful for any case in which it may be desirable to customise the shell environment when inside tmux, beyond setting variables like default-terminal. However, if you repeat yourself in places in your configuration file, it can also be handy to use them as named constants. An example could be establishing colour schemes:

TMUX_COLOUR_BORDER="colour237"
TMUX_COLOUR_ACTIVE="colour231"
TMUX_COLOUR_INACTIVE="colour16"

set-window-option -g window-status-activity-bg $TMUX_COLOUR_BORDER
set-window-option -g window-status-activity-fg $TMUX_COLOUR_ACTIVE
set-window-option -g window-status-current-format "#[fg=$TMUX_COLOUR_ACTIVE]#I:#W#F"
set-window-option -g window-status-format "#[fg=$TMUX_COLOUR_INACTIVE]#I:#W#F"

The explicit commands to work with environment variables in .tmux.conf are update-environment, set-environment, and show-environment, all of which are documented in the manual.

Unix as IDE: Introduction

This entry is part 1 of 7 in the series Unix as IDE.

This series has been independently translated into Chinese, Russian, Turkish, and Korean, and formatted as an ebook.

Newbies and experienced professional programmers alike appreciate the concept of the IDE, or integrated development environment. Having the primary tools necessary for organising, writing, maintaining, testing, and debugging code in an integrated application with common interfaces for all the different tools is certainly a very valuable asset. Additionally, an environment expressly designed for programming in various languages affords advantages such as autocompletion, and syntax checking and highlighting.

With such tools available to developers on all major desktop operating systems including GNU/Linux and BSD, and with many of the best free of charge, there’s not really a good reason to write your code in Windows Notepad, or with nano or cat.

However, there’s a minor meme among devotees of Unix and its modern-day derivatives that “Unix is an IDE”, meaning that the tools available to developers on the terminal cover the major features in cutting-edge desktop IDEs with some ease. Opinion is quite divided on this, but whether or not you feel it’s fair to call Unix an IDE in the same sense as Eclipse or Microsoft Visual Studio, it may surprise you just how comprehensive a development environment the humble Bash shell can be.

How is UNIX an IDE?

The primary rationale for using an IDE is that it gathers all your tools in the same place, and you can use them in concert with roughly the same user interface paradigm, and without having to exert too much effort to make separate applications cooperate. This becomes especially desirable with GUI applications, because it’s very difficult to make windowed applications speak a common language or work well with each other; aside from cutting and pasting text, they don’t share a common interface.

The interesting thing about this problem for shell users is that well-designed and enduring Unix tools already share a common user interface in streams of text and files as persistent objects, otherwise expressed in the axiom “everything’s a file”. Pretty much everything in Unix is built around these two concepts, and it’s this common user interface, coupled with a forty-year history of high-powered tools whose users and developers have especially prized interoperability, that goes a long way to making Unix as powerful as a full-blown IDE.

The right idea

This attitude isn’t the preserve of battle-hardened Unix greybeards; you can see it in another form in the way the modern incarnations of the two grand old text editors Emacs and Vi (GNU Emacs and Vim) have such active communities developing plugins to make them support pretty much any kind of editing task. There are plugins in both editors for almost anything you could want to do in programming, and any Vim junkie could spout off at least three or four that they feel are “essential”.

However, it often becomes apparent to me when reading about these efforts that the developers concerned are trying to make these text editors into IDEs in their own right. There are posts about never needing to leave Vim, or never needing to leave Emacs. But I think that trying to shoehorn Vim or Emacs into becoming something that it’s not isn’t quite thinking about the problem in the right way. Bram Moolenaar, the author of Vim, appears to agree to some extent, as you can see by reading :help design-not. The shell is only ever a Ctrl+Z away, and its mature, highly composable toolset will afford you more power than either editor ever could.

EDIT October 2017: New versions of Vim 8.x now include an embedded terminal accessible with the :terminal command. It works a lot better than previous plugin-based attempts to do this. Even with this new feature, I still strongly recommend the approach discussed in these posts instead.

About this series

In this series of posts, I will be going through six major features of an IDE, and giving examples showing how common tools available in GNU/Linux allow you to use them together with ease. This will by no means be a comprehensive survey, nor are the tools I will demonstrate the only options.

What I’m not trying to say

I don’t think IDEs are bad; I think they’re brilliant, which is why I’m trying to convince you that Unix can be used as one, or at least thought of as one. I’m also not going to say that Unix is always the best tool for any programming task; it is arguably much better suited for C, C++, Python, Perl, or Shell development than it is for more “industry” languages like Java or C#, especially if writing GUI-heavy applications. In particular, I’m not going to try to convince you to scrap your hard-won Eclipse or Microsoft Visual Studio knowledge for the sometimes esoteric world of the command line. All I want to do is show you what we’re doing on the other side of the fence.