Stream processing with sed and awk

Learning outcome

After this chapter, the students can perform basic pattern search and replacement with sed and are aware of its more advanced features. They can construct simple awk scripts that incorporate conditions, counter variables and the END block. Finally, they can design and build analysis pipelines that consist of separate steps for data formatting, merging and analysis.

sed and awk are the central tools for the manipulation of table-formatted data. I once got an email from a student thanking me for my evolutionary genomics course. She didn’t mention anything about population genetics or sequence analysis (which are the topics of the course) but said that she had learned about awk for the first time in my course, and that her skills with it had since got her a job in data analysis. I took it as a compliment.

sed, the stream editor

sed is a surprisingly complex program. For a beginner, it’s reasonable to aim at simple pattern replacements, possibly reusing the matched patterns. We’ll learn these first. However, it is useful to be aware of what else sed can do: one day you may have a task that you recognise as suitable for sed and can then go and study how to solve it.

The name of the program comes from stream editor and one can see it being used in pipes, possibly multiple times, like this:

  flowchart LR
    A[cat file] --> B[sed 'cmd']
    B --> C[sed 'cmd']
    C --> D(output)

However, sed can read in files directly:

  flowchart LR
    A[sed 'cmd' file] --> B[sed 'cmd']
    B --> C(output)

and even modify the input file:

  flowchart LR
    A[sed -i 'cmd' file]

In scientific projects, one should not edit the raw data but create modified copies of the data. If a file is edited in place, it’s a good practice to test the command with sed 'cmd' file | less and, once certain about the outcome, edit the input file with sed -i 'cmd' file.
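
As a concrete sketch of that workflow (file.tsv is just a placeholder name; with GNU sed, a suffix given to -i, here .bak, keeps a backup copy of the original file):

> sed 's/old/new/' file.tsv | less
> sed -i.bak 's/old/new/' file.tsv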

sed in pattern replacement

Earlier we worked with temperature measurements from four locations in Helsinki. The data files looked like this:

> cd ~/IntSciCom/Helsinki/
> head -6 Kaisaniemi.tsv 
Helsinki_Kaisaniemi 2024    1   1   00:00   -13.9   -13.7   -14
Helsinki_Kaisaniemi 2024    1   1   01:00   -14.1   -14     -14.3
Helsinki_Kaisaniemi 2024    1   1   02:00   -14.4   -14.1   -14.6
Helsinki_Kaisaniemi 2024    1   1   03:00   -14.7   -14.5   -14.9
Helsinki_Kaisaniemi 2024    1   1   04:00   -14.6   -14.4   -14.8
Helsinki_Kaisaniemi 2024    1   1   05:00   -14.3   -14.1   -14.6

One annoying thing is that the location takes so much space. We could reduce that by replacing “Helsinki” with “HKI”:

> cat Kaisaniemi.tsv | sed 's/Helsinki/HKI/' | head -6
HKI_Kaisaniemi  2024    1   1   00:00   -13.9   -13.7   -14
HKI_Kaisaniemi  2024    1   1   01:00   -14.1   -14     -14.3
HKI_Kaisaniemi  2024    1   1   02:00   -14.4   -14.1   -14.6
HKI_Kaisaniemi  2024    1   1   03:00   -14.7   -14.5   -14.9
HKI_Kaisaniemi  2024    1   1   04:00   -14.6   -14.4   -14.8
HKI_Kaisaniemi  2024    1   1   05:00   -14.3   -14.1   -14.6

The command we used was sed 's/Helsinki/HKI/': here, the command is within single quotes and consists of s (for substitute) and /<old pattern>/<new pattern>/. As soon as one learns to read the forward slash signs as separators, the structure of the command is very clear. The replacement pattern can be empty:

> cat Kaisaniemi.tsv | sed 's/Helsinki_//' | head -2
Kaisaniemi  2024    1   1   00:00   -13.9   -13.7   -14
Kaisaniemi  2024    1   1   01:00   -14.1   -14     -14.3

That kind of replacement could also be done with a text processing program. The replacement command becomes more powerful if we utilise the patterns that are found in the data. We can keep the found pattern in memory by wrapping the search pattern in parentheses: in /(Helsinki)_(Kaisaniemi)/, the text “Helsinki” will be assigned to variable 1 and “Kaisaniemi” to variable 2; we can reuse them with \1 and \2 (the unescaped parentheses require the extended syntax enabled with the argument -r) such that the command:

> cat Kaisaniemi.tsv | sed -r 's/(Helsinki)_(Kaisaniemi)/\2_\1/' | head -2
Kaisaniemi_Helsinki 2024    1   1   00:00   -13.9   -13.7   -14
Kaisaniemi_Helsinki 2024    1   1   01:00   -14.1   -14     -14.3

swaps the positions of “Helsinki” and “Kaisaniemi”, writing \2_\1 as the replacement pattern.

This could also be done with text processing programs. We can improve our search pattern by writing it using wildcards: [A-Z] means one capital letter and [a-z]+ means one or more lower-case letters (the + quantifier, like the parentheses, belongs to the extended regular expressions enabled with the argument -r). Now the same command can be used to edit the Kaisaniemi records:

> cat Kaisaniemi.tsv | sed -r 's/(Helsinki)_([A-Z][a-z]+)/\2_\1/' | head -2
Kaisaniemi_Helsinki 2024    1   1   00:00   -13.9   -13.7   -14
Kaisaniemi_Helsinki 2024    1   1   01:00   -14.1   -14     -14.3

as well as the Kumpula records:

> cat Kumpula.tsv | sed -r 's/(Helsinki)_([A-Z][a-z]+)/\2_\1/' | head -2
Kumpula_Helsinki    2024    1   1   00:00   -15     -14.8   -15.1
Kumpula_Helsinki    2024    1   1   01:00   -14.9   -14.8   -15

However, it still has to match “Helsinki” as the first pattern. We can make that more generic and simplify the command elsewhere:

> cat Kumpula.tsv | sed -r 's/([a-z]+)_([a-z]+)/\2_\1/I' | head -2
Kumpula_Helsinki    2024    1   1   00:00   -15     -14.8   -15.1
Kumpula_Helsinki    2024    1   1   01:00   -14.9   -14.8   -15

Here, the pattern is simply [a-z]+ and it matches both upper- and lowercase letters because we have added I (for “case Insensitive”) at the end of the command. Another common argument is g for global matching, i.e. making the replacement for every match on a line rather than only the first one.
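
The difference is easy to demonstrate with echo on a line that contains the pattern twice; without g only the first occurrence is replaced:

> echo "Helsinki and Helsinki" | sed 's/Helsinki/HKI/'
HKI and Helsinki
> echo "Helsinki and Helsinki" | sed 's/Helsinki/HKI/g'
HKI and HKI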

The pattern replacement introduced above is the most common use of sed, and a beginner can be happy with mastering the basics of it, starting with the replacement of one word with another. The regular expressions to match different patterns can be scary-looking but they are also extremely powerful. If one needs to find and edit specific patterns in large text files, it may be useful to learn the basics of regular expressions and try using sed and other bash tools for the task. A relatively easy-to-read tutorial for sed can be found at https://www.grymoire.com/Unix/Sed.html.

Perl is a comprehensive programming language and was very popular in the early 2000s before going out of fashion with the arrival of Python and other more modern alternatives. Despite that, the language is installed by default on most Linux systems and can thus be easily used.

Among other things, Perl has functionality highly similar to that of sed and its syntax for regular expression matching and replacement is both wider and easier to understand. Of the examples above, one could be replicated with Perl like this:

> cat Kumpula.tsv | perl -pe 's/(\w+)_(\w+)/$2_$1/' | head -2
Kumpula_Helsinki    2024    1   1   00:00   -15     -14.8   -15.1
Kumpula_Helsinki    2024    1   1   01:00   -14.9   -14.8   -15

Here, -pe combines -p (“process the input line by line and print every line”) and -e (“execute the following command”); \w matches the word characters [a-zA-Z0-9_] (Perl would also allow using [a-z] and the flag i to ignore the case) and $1 and $2 are the pattern fields similar to \1 and \2. I personally prefer perl for all more complex regular-expression cases and don’t even know how to do some specific tasks with sed.
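
For instance, the i flag mentioned above makes the match case-insensitive while the captured text keeps its original case (a small illustrative example with echo):

> echo "Helsinki_Kumpula" | perl -pe 's/(helsinki)_(\w+)/$2_$1/i'
Kumpula_Helsinki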

If one finds sed useful but limiting in the more complex cases, it may be worth looking for a tutorial on Perl regular expressions. Switching from sed to perl is trivial; the only challenge is not to get confused by the programming side of Perl. As shown above, it can be used for one-liners similar to sed.


Exercise: Pattern replacement with ‘sed’

The command sed can be used to edit the input stream (sed 'cmd') or a file (sed 'cmd' file) and then write the result to the output stream – that can be either piped to another command or written to a file. sed can also edit a file directly (sed -i 'cmd' file) but that should be used with caution as the operation cannot be reversed.

Exercise 1 in Moodle.


sed in line editing

Many of the other functionalities of sed can be performed with combinations of other commands, but some are unique to sed. The next two subsections introduce some more advanced features of sed which can be useful in specific tasks and which an ambitious data analyst would do well to be aware of.

The power of sed is that the pattern replacement can be restricted to specific parts of the file or to certain positions on a line. Here we look at the print and delete operations within sed. The command sed 1,12d deletes lines 1-12, as shown by the hour column:

> sed 1,12d Kaisaniemi.tsv | head -6
Helsinki_Kaisaniemi 2024    1   1   12:00   -15.9   -15.8   -16.1
Helsinki_Kaisaniemi 2024    1   1   13:00   -15.7   -15.7   -15.8
Helsinki_Kaisaniemi 2024    1   1   14:00   -15.7   -15.7   -15.8
Helsinki_Kaisaniemi 2024    1   1   15:00   -15.9   -15.7   -16
Helsinki_Kaisaniemi 2024    1   1   16:00   -16.3   -16     -16.7
Helsinki_Kaisaniemi 2024    1   1   17:00   -16.9   -16.7   -17.1

We can similarly delete everything from a specific line till the end of the file. The end of the file is marked with character “$” which has a special meaning in bash (it indicates that the word is a variable). Because of that, the dollar sign has to be escaped with a backslash:

> sed 7,\$d Kumpula.tsv 
Helsinki_Kumpula    2024    1   1   00:00   -15     -14.8   -15.1
Helsinki_Kumpula    2024    1   1   01:00   -14.9   -14.8   -15
Helsinki_Kumpula    2024    1   1   02:00   -15.4   -15     -15.6
Helsinki_Kumpula    2024    1   1   03:00   -15.8   -15.6   -16.1
Helsinki_Kumpula    2024    1   1   04:00   -16.1   -15.9   -16.4
Helsinki_Kumpula    2024    1   1   05:00   -15.8   -15.7   -16.1

Alternatively, one could write the range within single quotes and thus avoid interpreting the dollar sign in the bash context: sed '7,$d'.

Instead of deleting (d), we can print (p) but then the default behaviour of printing every line has to be turned off. That is done with the argument -n and thus the command:

> sed -n 25,30p Kaisaniemi.tsv 
Helsinki_Kaisaniemi 2024    1   2   00:00   -16.5   -16.4   -16.5
Helsinki_Kaisaniemi 2024    1   2   01:00   -16.6   -16.4   -16.7
Helsinki_Kaisaniemi 2024    1   2   02:00   -16.7   -16.5   -16.8
Helsinki_Kaisaniemi 2024    1   2   03:00   -16.4   -16.3   -16.7
Helsinki_Kaisaniemi 2024    1   2   04:00   -15.9   -15.5   -16.3
Helsinki_Kaisaniemi 2024    1   2   05:00   -15.3   -15.1   -15.5

prints the observations for the beginning of the second day, or the lines 25-30.
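
As an illustration of the earlier point that many of sed’s features can be replicated with combinations of other commands, the same six lines could also be extracted with head and tail; the output is identical to that of the sed command above:

> head -30 Kaisaniemi.tsv | tail -6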

Alternatively, we could print, starting from the 1st line, every 6th line:

> sed -n '1~6p' Kaisaniemi.tsv | head -8
Helsinki_Kaisaniemi 2024    1   1   00:00   -13.9   -13.7   -14
Helsinki_Kaisaniemi 2024    1   1   06:00   -14.2   -14.1   -14.3
Helsinki_Kaisaniemi 2024    1   1   12:00   -15.9   -15.8   -16.1
Helsinki_Kaisaniemi 2024    1   1   18:00   -17     -16.9   -17.2
Helsinki_Kaisaniemi 2024    1   2   00:00   -16.5   -16.4   -16.5
Helsinki_Kaisaniemi 2024    1   2   06:00   -15.1   -15     -15.2
Helsinki_Kaisaniemi 2024    1   2   12:00   -15.5   -15.2   -15.8
Helsinki_Kaisaniemi 2024    1   2   18:00   -17.7   -17     -18.3

sed in line editing with patterns

Instead of line numbers, the same operations can be controlled with patterns. In another directory, we have a copy of Hamlet:

> cd ~/IntSciCom/Hamlet/
> less Shakespeare_Hamlet.txt

We can check that it contains the phrase:

> cat Shakespeare_Hamlet.txt | grep -C2 'To be, or not to be'

HAMLET.
To be, or not to be, that is the question:
Whether ’tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,

grep finds pattern matches, in this case for the pattern “To be, or not to be”, and prints the lines with a hit. Often it is useful to get some context for the hit: -A 2 prints the line with a hit and two lines After it, and -B 2 prints the line with a hit and two lines Before it; quite naturally, -C 2 then means two lines before plus two lines after. Above, one could also write grep -A 2 -B 2 'To be, or not to be'. Other useful arguments are e.g. -v to invert the match (print the lines that do not match), -i to ignore upper/lowercase, -n to print the line numbers of hits, -e to include multiple patterns, and -f to read the patterns from a file. In bash scripts, it is often useful to know (silently) whether a file contains a match or not. This is achieved with the argument -q.
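
For example, a script could branch on the silent result of grep -q; a minimal sketch of that usage:

> if grep -q 'To be, or not to be' Shakespeare_Hamlet.txt; then echo "found"; fi
found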

I’m so used to the command that I had started to think of “grep” as a proper English word, meaning something like “catch” (the closest real word is “grip”). When studying the material for this course, I was surprised to learn that “grep” is actually not a proper word but comes from the command g/re/p in the old line editor ed, an ancestor of sed. Here, g means “global” matching (find all hits), re is the abbreviation for “regular expression” and p is the “print” command familiar from sed. This command was so commonly used that it was packaged into an easier-to-use program, but there’s still a similar functionality in sed:

> cat Shakespeare_Hamlet.txt | sed -n '/To be, or not to be/p'
To be, or not to be, that is the question:

In fact, pattern matching is so central in stream processing that the same task can be done in many different ways. These commands are equivalent:

> cat Shakespeare_Hamlet.txt | grep 'To be, or not to be'
> cat Shakespeare_Hamlet.txt | sed -n '/To be, or not to be/p'
> cat Shakespeare_Hamlet.txt | awk '/To be, or not to be/'

Above, we printed two lines of context for the hit, but we probably would like to read all of Hamlet’s words in that scene, i.e. all the way until the next break or empty line. For that, we use sed -nr, meaning the extended version and not printing every line, and then give the start pattern /To be, or not to be/ and the end pattern /^\r$/ and p for the command “print”:

> cat Shakespeare_Hamlet.txt | sed -nr '/To be, or not to be/,/^\r$/p' | less

Note that we had to use ^\r$ to match the empty line because the text file was written on Windows: here, ^ means the beginning of the line, \r is the carriage return (CR) and $ means the end of the line. If the text were in Unix format, an empty line would match /^$/, i.e. nothing between the start and the end.
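
If you are unsure which format a file is in, one way to check is to make the invisible characters visible with cat -A (a GNU cat option): Windows-style lines then end in ^M$, whereas Unix-style lines end in a plain $.

> head -2 Shakespeare_Hamlet.txt | cat -A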

“To be” is the start of the famous sentence in Hamlet but how often do those words individually appear in the text? We can resolve that with bash commands, though we need quite a few of them. In the box below, each step of the pipeline is written on its own line (the steps are numbered in the explanation that follows) and you can copy-paste the command from cat all the way to head into the terminal:

> cat Shakespeare_Hamlet.txt \
 | sed 's/\r$//' \
 | sed -n '/THE TRAGEDY OF HAMLET, PRINCE OF DENMARK/,$p' \
 | sed '/^\*/,$d' \
 | grep -v '*' \
 | tr [A-Z] [a-z] \
 | tr ' ' '\n' \
 | grep -v ^$ \
 | sort \
 | uniq -c \
 | sort -k1,1nr \
 | grep -n . \
 | head
1:   1107 the
2:    968 and
3:    719 to
4:    673 of
5:    545 a
6:    516 my
7:    514 i
8:    438 in
9:    429 you
10:    379 hamlet.

Unsurprisingly, we find that the words “the” and “and” are the most frequent in a piece of English text! But how did we do the analysis? Here’s the description of each line:

  1. print the text
  2. remove the Windows end-of-line (CR)
  3. delete everything before the title
  4. delete everything from the first line starting with “*” onwards (the Project Gutenberg licence text)
  5. remove any remaining lines containing asterisks
  6. convert everything to lowercase
  7. convert spaces to newlines, thus printing each word on a line of its own
  8. ignore empty lines (Unix)
  9. sort alphabetically
  10. count occurrences of each unique word
  11. sort numerically in reverse order by 1st column
  12. match every line and prefix it with its line number (i.e. its rank in the list)
  13. show the first 10 lines

We could replace the last head with grep -w -e to -e be -e or -e not and thus capture the famous words specifically:

> cat Shakespeare_Hamlet.txt | sed 's/\r$//' | sed -n '/THE TRAGEDY OF HAMLET, PRINCE OF DENMARK/,$p' | sed '/^\*/,$d' | grep -v '*' | tr [A-Z] [a-z] | tr ' ' '\n' | grep -v ^$ | sort | uniq -c | sort -k1,1nr | grep -n . | grep -w -e to -e be -e or -e not 
3:    719 to
15:    272 not
23:    204 be
38:    112 or
197:     19 not,
271:     13 not.
367:      9 be,
605:      5 be.
678:      5 not?
1188:      3 not;
1297:      3 to.
1301:      3 to’t
1302:      3 to’t.
2159:      2 to,
2587:      1 be.’
2588:      1 be:
2589:      1 be;
3391:      1 die—to
3592:      1 else,—be
4275:      1 highly—not
5076:      1 ‘not
5168:      1 one.—to
5200:      1 or,—not
6486:      1 think,—or
6546:      1 to:
6570:      1 to’t,
6571:      1 to’t;

Here, the number on the left is the position of the word in the ordered list of most common words, the number in the middle is the count of how many times it was found in the text, and the word on the right is the actual word. The results reveal that our code has problems and we should remove the punctuation before doing the word counting. Nevertheless, the word “or” seems to be the least frequent word of the famous sentence.
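
A minimal sketch of such a cleaning step is to add a tr -d '[:punct:]' stage after the words have been split onto their own lines. Note that this removes ASCII punctuation such as commas and full stops, but the typographic quotes and dashes used in the Gutenberg text (e.g. in “to’t”) would still need separate handling:

> cat Shakespeare_Hamlet.txt \
 | sed 's/\r$//' \
 | sed -n '/THE TRAGEDY OF HAMLET, PRINCE OF DENMARK/,$p' \
 | sed '/^\*/,$d' \
 | grep -v '*' \
 | tr [A-Z] [a-z] \
 | tr ' ' '\n' \
 | tr -d '[:punct:]' \
 | grep -v ^$ \
 | sort \
 | uniq -c \
 | sort -k1,1nr \
 | grep -n . \
 | grep -w -e to -e be -e or -e not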

The example above is a bit artificial and the role of sed is secondary. In real language analyses, one would probably clean the input data as the first step (which is what sed is used for here) and then analyse the cleaned data with more advanced tools, e.g. in the Python programming language. Nevertheless, this semi-artificial exercise is a great example of the Unix philosophy and of the use of pipes to perform complex combinations of operations on data streaming from left to right. The final combination looks like this:

flowchart LR
  A[cat] --> B[sed]
  B --> C[sed]
  C --> D[sed]
  D --> E[grep]
  E --> F[tr]
  F --> G[tr]
  G --> H[grep]
  H --> I[sort]
  I --> J[uniq]
  J --> K[sort]
  K --> L[grep]
  L --> M[grep]

This could probably be simplified quite a bit, and the task could certainly be performed with numerous different combinations of bash tools. There is rarely – if ever – a single correct way of doing a task in bash!


awk for pattern scanning and processing

awk is defined as a pattern scanning and processing language. It is a true programming language and one can write meaningful programs with awk alone. However, its functionality is pretty limited and it is best suited for the processing of table-formatted data. For that, awk is exceptionally powerful and, with its small number of features, relatively easy to learn.

awk in column processing

We look at the use of awk with the familiar temperature data but first convert it to Unix format by removing the CR characters:

> cd ~/IntSciCom/Helsinki/
> tr -d '\r' < Kaisaniemi.tsv > Kaisaniemi_LF.tsv
> head -2 Kaisaniemi_LF.tsv 
Helsinki_Kaisaniemi 2024    1   1   00:00   -13.9   -13.7   -14
Helsinki_Kaisaniemi 2024    1   1   01:00   -14.1   -14     -14.3

With these simple measurements, there is not very much computation that we can do row-wise. As an example, we recompute the mean temperature as the average of the minimum and maximum temperatures (the 7th and 8th columns); in the data, the mean is already given in the 6th column.

By default, awk splits each line into columns at whitespace (runs of spaces or TABs), which suits our TAB-separated data; awk then reads the data one row at a time and assigns the value of column 1 to variable $1, column 2 to variable $2 and so on. With that, we print columns 6-8 and the average of columns 7 and 8:

> cat Kaisaniemi_LF.tsv | awk '{print $6,$7,$8,($7+$8)/2}' | head -5
-13.9 -13.7 -14 -13.85
-14.1 -14 -14.3 -14.15
-14.4 -14.1 -14.6 -14.35
-14.7 -14.5 -14.9 -14.7
-14.6 -14.4 -14.8 -14.6

The awk command has the form:

awk 'pattern { action }'

or if there is no pattern to match, simply:

awk '{ action }'

An example of pattern matching would be to replicate the previous computation of the average temperature for 12 noon each day:

> cat Kaisaniemi_LF.tsv | awk '$5=="12:00"{print $2"-"$3"-"$4,($7+$8)/2}' | head -5
2024-1-1 -15.95
2024-1-2 -15.5
2024-1-3 -15.3
2024-1-4 -17.85
2024-1-5 -15.9

Above, we have replaced some of the comma separators with "-": a quoted string placed between fields “glues” them together with that string (an empty string "" glues two columns directly together). Fields separated by a comma are printed with the output field separator, which by default is a space but can be redefined, e.g. as OFS="\t" to make it a TAB.
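
For example, setting OFS before the print statement turns the earlier space-separated output into TAB-separated output:

> cat Kaisaniemi_LF.tsv | awk '{OFS="\t"; print $6,$7,$8,($7+$8)/2}' | head -2
-13.9   -13.7   -14     -13.85
-14.1   -14     -14.3   -14.15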

In addition to the action for each line, the awk command may include a block that is performed before anything else and a block that is performed after everything else:

awk '
BEGIN { do first }
      { action   }
END   { do last  }
'

As an example of the BEGIN and END blocks, we can add a header line, sum the noon averages (avg added to sum) and then, in the END block, divide the sum by the built-in record counter NR:

> cat Kaisaniemi_LF.tsv | awk '
      BEGIN{ OFS="\t"; print "Date\t\tAvg.temp" }
$5=="12:00"{ avg=($7+$8)/2; sum+=avg; print $2"-"$3"-"$4,avg }
      END  { print "average:",sum/NR }
' | head -5
Date        Avg.temp
2024-1-1    -15.95
2024-1-2    -15.5
2024-1-3    -15.3
2024-1-4    -17.85

The output ends with:

2024-1-30   0.7
2024-1-31   1.55
average:    -0.250538

Above, the spacing is just for better readability and everything could be written on one line:

> cat Kaisaniemi_LF.tsv | awk 'BEGIN{OFS="\t";print "Date\t\tAvg.temp"}$5=="12:00"{ avg=($7+$8)/2; sum+=avg;print $2"-"$3"-"$4,avg}END{print "average:",sum/NR}' | less
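
One caveat: NR counts every input line, including all the hourly records that do not match the 12:00 pattern, so the number printed after “average:” above is the sum of the daily noon averages divided by the total line count rather than their mean. If the goal is the mean of the noon averages, it is better to count the matching lines in a separate variable; a minimal sketch (the variable name n is arbitrary):

> cat Kaisaniemi_LF.tsv | awk '$5=="12:00"{ sum+=($7+$8)/2; n+=1 } END{ print "noon average:",sum/n }'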

awk with condition statements

In awk, the column values are stored in $1, $2 etc. One can also define and use other variables and, somewhat confusingly, these have no symbol, such as $ in bash or \ in sed, to indicate what they are. (In fact, this is the “normal” practice and the languages indicating variables with a specific character are a minority.) They are nevertheless variables and can be used as such. Earlier we used variables to store the average temperature and the sum of averages. Below, we use sum for the sum of the hourly mean temperatures; n_obs for the count of observations; and prev_day for the day of the month of the previous measurement:

> cat Kaisaniemi_LF.tsv | \
awk '{ 
 if( $4!=prev_day && n_obs>0 ) {
  print $2"-"$3"-"$4,sum/n_obs;
  sum=0; n_obs=0; 
 }
 sum+=$6; n_obs+=1; prev_day=$4;  
}' | head -15
2024-1-2 -15.6417
2024-1-3 -16.4083
2024-1-4 -16.5583
2024-1-5 -17.1583
2024-1-6 -15.975
2024-1-7 -14.8833
2024-1-8 -19.325
2024-1-9 -7.13333
2024-1-10 0.0666667
2024-1-11 1.325
2024-1-12 -3.24583
2024-1-13 -7.79167
2024-1-14 -9.02083
2024-1-15 -5.84583
2024-1-16 -8.46667

We have an if-condition on lines 3-6: if the condition on line 3 is true, we perform the action within the curly brackets on lines 4-5. Line 4 prints the output (shown above) and line 5 resets the sum and the counter to zero. Note that the condition on line 3 has two parts: $4!=prev_day tests whether the day in field 4 is different from the day on the previous line (if it is, the day has changed and we need to calculate and print the average of the previous day); however, we also need the test n_obs>0 so that the calculation is not attempted on the very first line (where the first test would be true but n_obs would still be zero). In condition statements, && means “AND” and || means “OR”. Line 7 is performed for every row of the data file: sum+=$6 sums the temperature measurements, n_obs+=1 counts the observations (we could have missing data and not exactly 24 observations per day) and prev_day=$4 stores the current day (and allows us to detect when the day changes). Note that these three commands have to be separated by semicolons; if they were each written on a line of their own, no semicolons would be needed.

We had earlier a brief look at the command printf. One could replace the fourth line like this:

> cat Kaisaniemi_LF.tsv | \
awk '{ 
 if( $4!=prev_day && n_obs>0 ) {
  printf $2"-"$3"-"$4" %.1f\n",sum/n_obs;
  sum=0; n_obs=0; 
 }
 sum+=$6; n_obs+=1; prev_day=$4;  
}' | head -5

Here, the formatting "%.1f\n" tells printf to print the result of sum/n_obs as a floating point number with one decimal place. As printf doesn’t automatically print a newline, we have to add \n to the format string. The output is now:

2024-1-2 -15.6
2024-1-3 -16.4
2024-1-4 -16.6
2024-1-5 -17.2
2024-1-6 -16.0


Exercise: Stream processing with ‘awk’

The program awk can be used to process the input stream (awk 'cmd') or a file (awk 'cmd' file) and then write the result to the output stream – that can be either piped to another command or written to a file. awk is useful both in conditional selection (select only the lines where a value in a specific column matches a rule) and in manipulation of the columns of a row – and in combinations of these. awk can also compute summary statistics that are then output after finishing the input (using END{cmd}).

Exercise 2 in Moodle.


awk vs. R

Those familiar with R may wonder why they should bother learning another language when the same analyses can be done in R. That is a fair point, and even an experienced user sometimes starts with awk and then realises that the analysis is getting so advanced that it would be better done in R. However, there are cases where awk is far superior and it is good to be aware of that alternative.

The main difference is that awk is a stream processing language and takes the input one line at a time. Above, we used variables to store information from the previous line, as the language itself doesn’t remember it. Because of that, awk can easily handle data files consisting of millions of lines. R is different: it normally reads the full data into memory and then manipulates the whole data table. Reading millions of lines of data into R takes a long time and a lot of computer memory.

If one needs the full data for the analysis, then R is the only choice. However, often one doesn’t need the full data and it may be helpful to preprocess the data with awk and then import the reduced data set into R. For example, if only the temperature at 12 noon were used in the analysis, one could reduce the data with the command:

> cat Kaisaniemi_LF.tsv | awk '$5=="12:00"' > Kaisaniemi_12noon.tsv

Similarly, one could drop columns that are not needed in the analysis before importing the data to R.
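
For example, one could keep only the date, time and mean-temperature columns (the output file name below is just an example):

> cat Kaisaniemi_LF.tsv | awk 'BEGIN{OFS="\t"}{print $2,$3,$4,$5,$6}' > Kaisaniemi_small.tsv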

Hamlet’s word count with awk

awk can also be used for the analysis of non-table-formatted data. Then, the variables storing the column data are less useful and awk becomes more like any other programming language. We can implement much of the word counting of Hamlet within one awk program. This time, we write the program code into a file called count_words_in_hamlet.awk:

> cat > count_words_in_hamlet.awk << 'EOF'
{
  if(/THE TRAGEDY OF HAMLET, PRINCE OF DENMARK/) { use = 1}
  if($1~/^\*/) { use = 0}
  if(use == 1) {
    sub("\r", "", $NF)
    for(i = 1; i <= NF; i+=1) {
      counts[tolower($i)] += 1
    }
  }
}
END {
  for (word in counts) {
    if(word != "") {
      print counts[word],word
    }
  }
}
EOF

We can then use the argument awk -f to read in the program from the file. The data can be read from the stream (given by cat) or from the data file directly.

> cat Shakespeare_Hamlet.txt | awk -f count_words_in_hamlet.awk | sort -k1,1nr | head
> awk -f count_words_in_hamlet.awk Shakespeare_Hamlet.txt | sort -k1,1nr | head
1107 the
968 and
719 to
673 of
545 a
516 my
514 i
438 in
429 you
379 hamlet.

In the program code, lines 3 and 4 catch the beginning and the end of the text to be processed and set the variable use to either 1 or 0, meaning TRUE or FALSE. That variable is evaluated on line 5 and, if TRUE, the code on lines 6-9 is executed. In awk, the variable NF is the number of fields on a line. We first substitute \r with "" (i.e. nothing) in the last word (line 6), and then go through the words (or fields) on the line from 1 to NF (line 7). We convert each word to lowercase and add 1 to its counter (line 8). In the END block, executed after finishing with the file, we go through all the words in the array counts (line 13): if the word is not empty (line 14), we print the count value and the actual word. We could have written the functionality of sort and head in the awk code but that would have made it unnecessarily complicated.

Those with sharp eyes noticed that we used the heredoc functionality to write the code directly into a file, and that the command cat > count_words_in_hamlet.awk << 'EOF' has single quotes around EOF. The quotes prevent bash from expanding variables when the text is pasted into the terminal. If that were not prevented, bash wouldn’t write $1 literally but would look up the variable “$1” and use its value (which is probably "", i.e. nothing).
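
The difference is easy to see with a small experiment (name is just a throwaway variable): with an unquoted EOF the variable is expanded, with a quoted 'EOF' it is written out as-is.

> name=world
> cat << EOF
Hello $name
EOF
Hello world
> cat << 'EOF'
Hello $name
EOF
Hello $name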

The official documentation for the GNU awk and sed programs is available on the GNU website.

Many tutorials are available on the net. One easy-to-read tutorial for awk can be found at https://www.grymoire.com/Unix/Awk.html.


join for joining data by shared fields

We looked earlier at how to paste two files next to each other with the command paste. The paste command is rather simple and takes the files as they are given, without any matching of the rows. If we have identical files (as we originally did), that’s fine and we can compare temperatures in different locations. I meant to describe another program, called join, in the same context but then learned that it doesn’t work well with FIFOs and also benefits from preliminary data processing with awk. So now we return to the comparison of two files.

You can confirm that paste fails when we have missing data (browse towards the end to see the differences):

> cd ~/IntSciCom/Helsinki/
> paste Kaisaniemi_new.tsv Kumpula.tsv | less -S

join needs a unique identifier column to identify the common lines. We don’t have one, but we can create it with awk:

> awk '{OFS="-";print $1"\t"$2,$3,$4,$5"\t"$6}' Kaisaniemi_new.tsv > Kaisaniemi_dates.tsv
> awk '{OFS="-";print $1"\t"$2,$3,$4,$5"\t"$6}' Kumpula.tsv > Kumpula_dates.tsv
> head -2 Kumpula_dates.tsv 
Helsinki_Kumpula    2024-1-1-00:00  -15
Helsinki_Kumpula    2024-1-1-01:00  -14.9

Now, we can join the files using the second field (-j2) with the command:

> join -j2 Kaisaniemi_dates.tsv Kumpula_dates.tsv | less

Although everything looks fine, we get an error: join: input is not in sorted order. The reason for that is the format of the date field: in an alphabetical comparison, 2024-1-10-00:00 comes before 2024-1-2-01:00 and, for the computer, the correct order of the dates would be something completely different. We could suppress the warning with the argument --nocheck-order:

> join --nocheck-order -j2 Kaisaniemi_dates.tsv Kumpula_dates.tsv | less

but it is generally a bad practice to silence the warnings.

It would be better to fix our input files and we can do that with an improved awk command:

> awk '{OFS="-";printf "%s\t%d-%02d-%02d-%s\t%.1f\n",$1,$2,$3,$4,$5,$6}' Kaisaniemi_new.tsv > Kaisaniemi_dates2.tsv
> awk '{OFS="-";printf "%s\t%d-%02d-%02d-%s\t%.1f\n",$1,$2,$3,$4,$5,$6}' Kumpula.tsv > Kumpula_dates2.tsv

Here, we use the printf command and specify the formats of the columns with %s (string), %02d (number of two digits with leading zeros) and %.1f (floating point number with one decimal). Then, the join command works perfectly:

> join -j2 Kaisaniemi_dates2.tsv Kumpula_dates2.tsv | less

Now, we can do something useful with the joined data like piping it to an awk script that compares the temperatures in the two locations:

> join -j2 Kaisaniemi_dates2.tsv Kumpula_dates2.tsv | \
awk '{
  if($3>$5){warmer++}
  if($3==$5){same++}
  if($3<$5){colder++}
}
END{
    print "Kaisaniemi is warmer "warmer" times"
    print "Kaisaniemi and Kumpula are the same "same" times"
    print "Kaisaniemi is colder "colder" times"
}'
Kaisaniemi is warmer 649 times
Kaisaniemi and Kumpula are the same 30 times
Kaisaniemi is colder 34 times

The results are different from our earlier analysis because the file for Kaisaniemi is corrupted (on purpose, to justify the use of the join command) and is missing some measurements. If it were not corrupted, we could have used paste instead of join.

(For the R-purists: Yes, I know that this could have been done very easily in R. However, join and awk could do this with files consisting of millions of lines. We play with small files just to learn the usage of bash, and I’m not claiming that these would be the ideal tools for this specific task.)

Joining metadata

Above, we expected the lines in the two files to match such that the measurements are from exactly the same moment. The command join can also join files of vastly different sizes and join the same line from one of the files with multiple lines from the other. One usage for this is e.g. to add metadata to the primary data.

We can create metadata for our Helsinki temperature data in the form of the location coordinates:

> cat > Helsinki_coordinates.tsv << 'EOF'
Helsinki_Kaisaniemi 60.18 24.94
Helsinki_Kumpula 60.2 24.96
Helsinki_Malmi_lentokenttä 60.25 25.05
Helsinki_Vuosaari_satama 60.21 25.2
EOF

We could join this with a large file but, for better readability, we collect the measurements for 12 noon on the 10th, 20th and 30th day of the month:

> grep "12:00" Helsinki.tsv | grep -P "\t[123]0\t" | cut -f1,4,5,6 > Helsinki_tenth_noon.tsv
> head -5 Helsinki_tenth_noon.tsv
Helsinki_Kaisaniemi 10  12:00   1.7
Helsinki_Kaisaniemi 20  12:00   -10.8
Helsinki_Kaisaniemi 30  12:00   0.6
Helsinki_Kumpula    10  12:00   1.4
Helsinki_Kumpula    20  12:00   -11.4

The command join can now find the matching locations in the first column and add the coordinates on each line:

> join -j1 Helsinki_tenth_noon.tsv Helsinki_coordinates.tsv | column -t
Helsinki_Kaisaniemi         10  12:00  1.7    60.18  24.94
Helsinki_Kaisaniemi         20  12:00  -10.8  60.18  24.94
Helsinki_Kaisaniemi         30  12:00  0.6    60.18  24.94
Helsinki_Kumpula            10  12:00  1.4    60.2   24.96
Helsinki_Kumpula            20  12:00  -11.4  60.2   24.96
Helsinki_Kumpula            30  12:00  0.8    60.2   24.96
Helsinki_Malmi_lentokenttä  10  12:00  1.4    60.25  25.05
Helsinki_Malmi_lentokenttä  20  12:00  -15.5  60.25  25.05
Helsinki_Malmi_lentokenttä  30  12:00  -2.9   60.25  25.05
Helsinki_Vuosaari_satama    10  12:00  1.5    60.21  25.2
Helsinki_Vuosaari_satama    20  12:00  -11.8  60.21  25.2
Helsinki_Vuosaari_satama    30  12:00  -0.3   60.21  25.2

Note that both files have to be ordered consistently by the join field: if we want to sort the data e.g. by the mean temperature, we should do that only after joining the files.
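
If the files were not in a consistent order, one could sort both of them by the join field first (the *_sorted.tsv file names are just examples):

> sort -k1,1 Helsinki_tenth_noon.tsv > Helsinki_tenth_noon_sorted.tsv
> sort -k1,1 Helsinki_coordinates.tsv > Helsinki_coordinates_sorted.tsv
> join -j1 Helsinki_tenth_noon_sorted.tsv Helsinki_coordinates_sorted.tsv | column -t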


Exercise: Field joining with ‘join’

Earlier, the command paste was used to show two files next to each other. However, paste doesn’t guarantee that the two files are in sync. The command join does that and allows matching two files according to a shared field. It’s important that the input files are sorted according to the field used for the join.

Exercise 3 in Moodle.


Take-home message

sed and awk are the power tools of Unix systems. sed can be used to make simple word replacements but its true power is in regular expressions, which allow identifying patterns of characters with combinations of “wildcards”. awk is a simple but surprisingly powerful programming language that is especially useful in the processing of table-formatted data.