Basics of bash

Learning outcome

After this chapter, the students can construct chains of commands where the data stream is piped from one command to another. They can utilise and redirect the input, output and error streams and create meaningful combinations of commands for processing table-formatted text data. The students can utilise Heredoc-format to create reproducible parameter files and apply the basic commands for file comparison.

The previous chapter already introduced some bash commands and programs. Here, the most central ones are first listed and then extended with some practical commands. After that, we have a closer look at how to combine these commands in loops and with conditions.

Table of commonly-used commands

Central commands for command-line working

Command   Function
cd        change directory
pwd       print working (current) directory
ls        list content
tree      list content as a tree structure
ln        create a link to a file
echo      display a line of text
cat       catenate files, print file or stream content
less      browse file or stream content
head      print head of file or stream
tail      print tail of file or stream
cut       select columns
wc        character, word and line count
grep      get rows with matching patterns
sort      sort by different criteria
uniq      select and count unique lines
column    format data into columns
jobs      list jobs
kill      kill jobs
diff      show differences between files
paste     merge lines of files
join      join lines on a common field
tr        translate (convert) characters
sed       edit stream (or file)
awk       scripting language

There is a built-in manual for the commands that can be opened with the command man <command_name>. The manual mostly lists the arguments and, for a beginner with bash, more informative help can probably be found on the net.
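For a quick overview of the arguments, most of the commands listed above also print a short usage summary with the --help argument (a common convention, though not every program supports it):

> man sort       # the full manual page; quit with q
> sort --help    # a shorter usage summary printed to the terminal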

The Unix world is full of insider jokes, puns and plays on words.

The first ever shell language was called Thompson shell (sh) after its author Ken Thompson, the developer of the first Unix operating system, released in 1971. Eight years later, an improved shell language was released and called Bourne shell (sh) after its developer Stephen Bourne. The shell was constantly improved and remained in use for a long time.

The Bourne shell was part of the commercial (and very expensive) Unix OS, and the free software project GNU, aiming to create a free Unix-like operating system, needed to develop an open-source alternative to it. As a reference to the original shell and a pun on 'born-again' Christians, the new shell was named the Bourne-Again Shell, or bash.



Pipes and streams: input, output and error

Pipes and streams

Unix programs work with three streams, more specifically with standard in (known as STDIN), standard out (STDOUT) and standard error (STDERR). The first two are easy to understand: a program takes information in through standard in and writes the processed information out through standard out. We can look at this with an example:

> cd ~/IntSciCom/Helsinki/
> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | less

Here, the input file contains temperature measurements for Kaisaniemi, Helsinki, in January 2024. The cat command writes the temperature data to STDOUT and less then reads this from STDIN and displays it on the screen. The concept of streams becomes clearer with more complex combinations of programs. We can add tr -d to delete all the confusing double quotes:

> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | tr -d '"' | less

or

> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | tr -d \" | less

Note that the double quote has to be written inside single quotes ('"') or escaped with a backslash sign (\"). That is because the double quote character has a special meaning in bash and would otherwise be interpreted as the start of a string.
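A quick way to see how bash interprets the quoting is to echo the character on its own (a small demonstration, unrelated to the temperature data):

> echo '"'
"
> echo \"
"

If one instead types an unescaped echo " and presses Enter, bash assumes a string was started and waits for the closing double quote on the following lines; pressing Ctrl-c cancels the command.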

Now, it is more obvious that tr takes the data from STDIN and writes it to STDOUT.

flowchart LR
  A[cat data.csv ] --> B[  tr -d \'']
  B --> C[ less ]

We can add three more commands, each reading from STDIN and writing to STDOUT:

> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 | less

Here, the second tr converts all commas to TAB signs (\t) and the third one all spaces to underscores; the command tail shows the end of the file starting from line two.

flowchart LR
  A[cat data.csv ] --> B[  tr -d \'']
  B --> C[ tr ',' '\t' ]
  C --> D[ tr ' ' '_' ]
  D --> E[ tail -n+2 ]
  E --> F[ less ]

Showing “the end of the file starting from line two” may sound strange for a command named tail. However, if one leaves out the plus sign, the command only shows the last two lines. That is the “default” behaviour of tail and the exact opposite of the behaviour of the command head.

The individual commands are separated by | signs, known as pipes. These pipes take the STDOUT from the left side and give it as STDIN on the right side. We can have multiple commands piped together as shown above. If we want to direct the STDOUT to a file, we have to use another symbol, > filename, for that. Here, we are happy with the result and can direct the modified data to the file Kaisaniemi.tsv:

> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > Kaisaniemi.tsv

We can do the same manipulation for the other files:

> cat Helsinki_Kumpula_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > Kumpula.tsv
> cat Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > Malmi.tsv
> cat Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > Vuosaari.tsv

Note that we don’t edit the original files but create new files that contain the changes that we want. As all the changes are made with one combination of bash commands, they would be fairly easy to replicate on new source data, e.g. if we decided to study a longer time period or the temperature data for another month. We could simplify the re-use of these commands by writing them into a file that could then be used like a regular program. We’ll look at that later.
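As a small preview, such a file could be a script like the sketch below (the file name convert_csv.sh and the positional arguments $1 and $2 are illustrative choices, not anything fixed by the data):

# convert_csv.sh: convert one of the temperature csv files into tab-separated format
# usage: bash convert_csv.sh <input.csv> <output.tsv>
cat "$1" | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > "$2"

With this, the Kumpula conversion above would become bash convert_csv.sh Helsinki_Kumpula_1.1.2024-31.1.2024.csv Kumpula.tsv.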

One can use the opposite of > outfile, namely < infile, to read data from a file. The command:

> cat < Helsinki.tsv > cat_Helsinki.tsv

reads the data from Helsinki.tsv and writes it to cat_Helsinki.tsv. This is equivalent to just copying the file:

> cp Helsinki.tsv cp_Helsinki.tsv
> wc cat_Helsinki.tsv cp_Helsinki.tsv 
  2976  23808 160440 cat_Helsinki.tsv
  2976  23808 160440 cp_Helsinki.tsv
  5952  47616 320880 total
> diff cat_Helsinki.tsv cp_Helsinki.tsv

In practice, the Unix pipe format with all the data flowing from left to right is simpler to read and therefore strongly recommended.


Exercise: Commands ‘cat’, ‘head’ and ‘tail’

The simplest commands to investigate the contents of a file are cat (show everything), head (show the beginning) and tail (show the end).

Commands head and tail are typically accompanied by a number that indicates how many lines to show, the default being -n 10 as in head -n 10 infile.txt or tail -n 10 infile.txt. Often, the argument n can be left out as in head -10 infile.txt and tail -10 infile.txt. The command tail has a few more useful arguments: for example, tail -n+10 infile.txt shows the end of the file starting from line 10, and tail -f infile.txt keeps showing the end of a file that is continuously extended by another program.

head and tail can be combined to show a part of a file, e.g. head -12 infile.txt | tail -2 shows lines 11 and 12 only.
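More generally, lines N to M of a file can be picked by combining the two commands. For example, using the Kaisaniemi.tsv file created earlier:

> head -12 Kaisaniemi.tsv | tail -2      # lines 11 and 12
> head -15 Kaisaniemi.tsv | tail -n+11   # lines 11 to 15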

Exercise 1 in Moodle.


Stream manipulation with cut and sort

From the original csv-file we know that the place is in the 1st column and the minimum temperature in the 8th column.

> head -1 Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv | tr ',' '\n' | grep -n '"'
1:"Havaintoasema"
2:"Vuosi"
3:"Kuukausi"
4:"Päivä"
5:"Aika [Paikallinen aika]"
6:"Lämpötilan keskiarvo [°C]"
7:"Ylin lämpötila [°C]"
8:"Alin lämpötila [°C]"

We can select specific columns with cut and then sort the data with sort:

> cat Kaisaniemi.tsv | cut -f1,8 | sort -k2n | less

Here, cut -f1,8 selects the columns 1 and 8, and sort -k2n sorts the data based on the second column (-k2) numerically (n). By default, sorting is done alphabetically; the sorting can be reversed with -r and all sort parameters can be written together such as sort -k2nr.

Note that we wouldn’t have needed to convert the data to tab-separated format and could also have processed the comma-separated data with spaces in the location names. However, the default field separator for cut is TAB and one has to use the argument -d, to use commas; the default field separator for sort is whitespace and one has to use the argument -t, to use commas. The double quotes have to be removed for sort to recognise the values as numbers. If that is done, we find the minimum temperature for Kaisaniemi using the original data:

> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | cut -d, -f1,8 | tr -d \" | sort -t, -k2n | head -1

Sorting can be done using multiple columns. If we combine the four locations and then sort the data, we see that Malmi seems to be the coldest place:

> cat Kaisaniemi.tsv Kumpula.tsv Malmi.tsv Vuosaari.tsv > Helsinki.tsv
> cat Helsinki.tsv | cut -f1,8 | sort -k2n | less

We can confirm this by using uniq to count how many times each location appears in a row in the sorted list of coldest measurements:

> cat Helsinki.tsv | cut -f1,8 | sort -k2n | cut -f1 | uniq -c | head
     50 Helsinki_Malmi_lentokenttä
      2 Helsinki_Kumpula
      2 Helsinki_Malmi_lentokenttä
      1 Helsinki_Kumpula
      3 Helsinki_Malmi_lentokenttä
      2 Helsinki_Kumpula
      1 Helsinki_Malmi_lentokenttä
      1 Helsinki_Kumpula
      1 Helsinki_Malmi_lentokenttä
      1 Helsinki_Kaisaniemi

Based on that, Malmi takes the first 50 places! This is not caused by Malmi having more measurements, as each location appears 744 times in the combined file:

> cat Helsinki.tsv | cut -f1 | uniq -c 
    744 Helsinki_Kaisaniemi
    744 Helsinki_Kumpula
    744 Helsinki_Malmi_lentokenttä
    744 Helsinki_Vuosaari_satama

What happened in the above command? One should be familiar with cat by now and can peek into the manual (with commands man cut and man uniq) to see the meaning of the arguments in the two other commands. If that doesn’t help, one can “break” the command into bits and see what each step does. Normally it’s helpful to inspect the intermediate output using less:

> cat Helsinki.tsv | less

> cat Helsinki.tsv | cut -f1 | less

After seeing the output of the last command and learning that the argument -c for uniq means “prefix lines by the number of occurrences”, the meaning of the whole command should be clear.
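Note that uniq only counts consecutive identical lines; the total counts above work without sorting only because Helsinki.tsv was built by concatenating the files one location at a time. In the general case, the data should be sorted just before counting (an extra check that gives the same totals as shown above):

> cat Helsinki.tsv | cut -f1 | sort | uniq -c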

Our combined file Helsinki.tsv has the problem that the location names in the first column are of different lengths, complicating the reading of the output:

> grep "12:00" Helsinki.tsv | sort -k6,6n | head -n4
Helsinki_Malmi_lentokenttä  2024    1   7   12:00   -23.2   -22.3   -24.7
Helsinki_Kumpula    2024    1   7   12:00   -19.8   -19.1   -20.6
Helsinki_Malmi_lentokenttä  2024    1   4   12:00   -19.3   -19.1   -19.4
Helsinki_Kaisaniemi 2024    1   7   12:00   -19.2   -18.4   -20.2

The command column helps and can, for example, format the data into nice columns:

> grep "12:00" Helsinki.tsv | sort -k6,6n | head -n4 | column -t
Helsinki_Malmi_lentokenttä  2024  1  7  12:00  -23.2  -22.3  -24.7
Helsinki_Kumpula            2024  1  7  12:00  -19.8  -19.1  -20.6
Helsinki_Malmi_lentokenttä  2024  1  4  12:00  -19.3  -19.1  -19.4
Helsinki_Kaisaniemi         2024  1  7  12:00  -19.2  -18.4  -20.2

column -t – the argument tells the command to learn the dimensions and the column widths from the data – doesn’t work well for huge files as it has to read through all of the data first.

Our data files are relatively small and it’s easy to count the number of columns and then access the Xth column with cut -fX. If the table-formatted file were much wider and we knew that we want to access one of the last columns but didn’t know the exact dimensions of the table, there’s a neat trick: reverse each line, select the columns and then reverse them back. Reversing the characters on each line is done with the command rev, and thus the last three columns can be extracted with:

> rev Helsinki.tsv | cut -f1-3 | rev | head -3
-13.9   -13.7   -14
-14.1   -14     -14.3
-14.4   -14.1   -14.6

Check the actual behaviour of rev with:

> rev Helsinki.tsv | less


Exercise: Command ‘cut’

The most widely used arguments for the command cut are

  • -c select these characters (e.g., cut -c1,2,3, cut -c6-10)
  • -f select these fields (e.g., cut -f1, cut -f2-3)
  • -d DELIM use DELIM instead of TAB for field delimiter (e.g., cut -d,)

Arguments -f and -d can be used together, and cut can both process the STDIN (cat infile.txt | cut -f2) and read the data from a file (cut -f2 infile.txt).

Exercise 2 in Moodle.

Exercise: Command ‘sort’

The most widely used arguments for the command sort are

  • -k select the column(s) for sorting (e.g., sort -k2, sort -k2,3, sort -k2,2 -k4,4). Specifying a single number, as in -k2, uses that column and everything after it as the sorting key; lines with identical keys keep their original order only if the -s (stable) option is given.
  • -n, -d, -h use specific sorting rules (numeric, dictionary, human-numeric; e.g., sort -k3 -n, sort -k3n)
  • -r reverse the output order (e.g., sort -k3nr)
  • -t DELIM use DELIM instead of whitespace as the field delimiter (e.g., sort -t, -k3nr)

Many arguments can be written together, and sort can both process STDIN (cat infile.txt | sort -k2) and read the data from a file (sort -k2 infile.txt). The default sorting is lexicographical order according to the locale settings (i.e. it may produce different output e.g. on a Finnish and on a British computer).
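If a locale-independent (and thus fully reproducible) order is needed, the locale can be fixed for a single command by setting the LC_ALL environment variable in front of it; this is a general bash convention rather than anything specific to sort:

> LC_ALL=C sort -k1,1 Helsinki.tsv | head -3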

Exercise 3 in Moodle.


Hints of some more advanced tricks

By doing the sorting first alphabetically (default criterion) on column 1 (-k1,1; we also have to specify that the sorting criterion ends at column 1) and then numerically on column 2 (-k2n), we get the information for different measurement locations in order:

> cat Helsinki.tsv | cut -f1,8 | sort -k1,1 -k2n | less

To get the minimum temperatures of each location, we need some more advanced tricks that we’ll learn more about later. There are many ways to do it but here are two. Using the combined file, we can write an awk condition that checks if the value in the first column differs from the one on the previous line. If the data are sorted first by location (-k1,1) and then numerically by the temperature (-k2n), this gives the lowest temperature for each location:

> cat Helsinki.tsv | cut -f1,8 | sort -k1,1 -k2n | awk '{if($1!=prev){print $0}prev=$1}'

Another approach is to loop over the individual files and then pick the minimum temperature from each. This is easier to understand when written out:

> for file in $(ls [KMV]*tsv); do 
  cat $file | cut -f1,8 | sort -k2n | head -1 
done

Here, the first line creates a list of files (you can test it by typing ls [KMV]*tsv in the terminal) and assigns these, one at a time, to the variable $file; this variable is then used in a very familiar-looking command in the middle; do and done set the boundaries of the loop.

Experienced users would write this as a one-liner:

> for file in $(ls [KMV]*tsv); do cat $file | cut -f1,8 | sort -k2n | head -1; done

File comparison with diff

File comparison is a central part of many analyses and a surprisingly common task is to find differences between two files. One could think that all files are different but in data analysis, one often modifies one part of the processing or analysis pipeline and is then interested in understanding how this affects the final output. One would expect the output to be highly similar and one wants to find the lines that differ.

We can create another data file for Kaisaniemi. The details of the command are not important here: the awk command just forces the number in the 6th column to have one decimal, and grep -v then removes the lines that contain the string 12:00:

> less Kaisaniemi.tsv | awk '{OFS="\t";$6=sprintf("%.1f",$6);print $0}' | grep -v 12:00 > Kaisaniemi_new.tsv

Now, the command diff outputs the differences:

> diff Kaisaniemi.tsv Kaisaniemi_new.tsv | head -9
12,13c12
< Helsinki_Kaisaniemi   2024    1   1   11:00   -16     -15.9   -16.1
< Helsinki_Kaisaniemi   2024    1   1   12:00   -15.9   -15.8   -16.1
---
> Helsinki_Kaisaniemi   2024    1   1   11:00   -16.0   -15.9   -16.1
19c18
< Helsinki_Kaisaniemi   2024    1   1   18:00   -17     -16.9   -17.2
---
> Helsinki_Kaisaniemi   2024    1   1   18:00   -17.0   -16.9   -17.2

The format may look cryptic as it was designed for other computer programs to read. Basically, each difference starts with the location, and 12,13c12 means that below are shown lines 12–13 from the first file and line 12 from the second file; then come the lines from the left file (<), a separator (---) and the lines from the right file (>). We can indeed see that the differences are the additional digit in the 6th column and the observation for time 12:00 being completely missing.

There are arguments for diff that make the comparison more readable. diff -c1 adds one line of context and marks the differing lines with exclamation marks:

> diff -c1 Kaisaniemi.tsv Kaisaniemi_new.tsv | head -12
*** Kaisaniemi.tsv  2024-02-08 12:16:36.683998118 +0200
--- Kaisaniemi_new.tsv  2024-02-08 15:04:48.387346228 +0200
***************
*** 11,14 ****
  Helsinki_Kaisaniemi   2024    1   1   10:00   -15.9   -15.7   -16
! Helsinki_Kaisaniemi   2024    1   1   11:00   -16     -15.9   -16.1
! Helsinki_Kaisaniemi   2024    1   1   12:00   -15.9   -15.8   -16.1
  Helsinki_Kaisaniemi   2024    1   1   13:00   -15.7   -15.7   -15.8
--- 11,13 ----
  Helsinki_Kaisaniemi   2024    1   1   10:00   -15.9   -15.7   -16
! Helsinki_Kaisaniemi   2024    1   1   11:00   -16.0   -15.9   -16.1
  Helsinki_Kaisaniemi   2024    1   1   13:00   -15.7   -15.7   -15.8

diff -u2 adds two lines of context and marks the differences (from the left file to the right file) with minus and plus signs:

> diff -u2 Kaisaniemi.tsv Kaisaniemi_new.tsv | head -10
--- Kaisaniemi.tsv  2024-02-08 12:16:36.683998118 +0200
+++ Kaisaniemi_new.tsv  2024-02-08 15:04:48.387346228 +0200
@@ -10,6 +10,5 @@
 Helsinki_Kaisaniemi    2024    1   1   09:00   -15.6   -15.4   -15.8
 Helsinki_Kaisaniemi    2024    1   1   10:00   -15.9   -15.7   -16
-Helsinki_Kaisaniemi    2024    1   1   11:00   -16     -15.9   -16.1
-Helsinki_Kaisaniemi    2024    1   1   12:00   -15.9   -15.8   -16.1
+Helsinki_Kaisaniemi    2024    1   1   11:00   -16.0   -15.9   -16.1
 Helsinki_Kaisaniemi    2024    1   1   13:00   -15.7   -15.7   -15.8
 Helsinki_Kaisaniemi    2024    1   1   14:00   -15.7   -15.7   -15.8

Sometimes command diff -y file1 file2 can be useful but we’ll return to that below in the context of files and streams.

Stream redirection

STDERR

Above, we mentioned STDERR but never really explained it. The standard error is a separate stream where the programs output information when something goes wrong. We’ve seen complex chains of commands and it’s easy to see situations where the error notification by a program would be lost in the pipeline of multiple pipes and steps; in the worst-case scenario, the pipeline would have failed badly but, by only seeing the final output, the user would not be aware of the errors and would believe that everything is fine.

STDERR allows programs to write error notifications to a different place from the actual processing output. A challenge is that the default output of both is the terminal. We can look at that with an example. This simple loop counts the number of hours when the mean temperature was exactly 0°C in different locations in Helsinki. To do that, it places the names of five locations, one at a time, into the variable $place, reads the file $place.tsv, cuts the 1st and 6th columns (the location and the mean temperature for the hour), greps the lines that match "\t0$" (TAB followed by 0 followed by end-of-line) and counts the number of such observations:

> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
   cat $place.tsv | cut -f1,6  | grep -P "\t0$"| uniq -c;
done

Because we do not have temperature measurements for Viikki, the output includes one error message:

      2 Helsinki_Kaisaniemi 0
      8 Helsinki_Kumpula    0
      8 Helsinki_Malmi_lentokenttä  0
cat: Viikki.tsv: No such file or directory
      7 Helsinki_Vuosaari_satama    0

If we were to write the output to a file:

> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
   cat $place.tsv | cut -f1,6  | grep -P "\t0$"| uniq -c;
done > measurement_counts.out

and then read that file:

> cat measurement_counts.out 
      2 Helsinki_Kaisaniemi 0
      8 Helsinki_Kumpula    0
      8 Helsinki_Malmi_lentokenttä  0
      7 Helsinki_Vuosaari_satama    0

we wouldn’t notice anything suspicious (unless we counted the lines and realised that one location is missing).
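Counting the lines would make the problem visible: the loop goes through five locations but the output file has only four lines.

> wc -l measurement_counts.out
4 measurement_counts.out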

A safer option is to redirect STDERR to a separate file with 2> filename:

> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
   cat $place.tsv | cut -f1,6  | grep -P "\t0$"| uniq -c;
done > measurement_counts.out 2> measurement_counts.err

Now, the error file contains the error information:

> cat measurement_counts.err 
cat: Viikki.tsv: No such file or directory

One can redirect STDERR to STDOUT. One way to write both streams to the same output file is to use &> outfile as in:

> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
   cat $place.tsv | cut -f1,6  | grep -P "\t0$"| uniq -c;
done &> measurement_counts.out
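The &> shortcut is specific to bash. A more traditional (and more portable) way to achieve the same is to first redirect STDOUT to the file and then redirect STDERR to wherever STDOUT points with 2>&1; note that the 2>&1 has to come after the > redirection:

> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
   cat $place.tsv | cut -f1,6  | grep -P "\t0$"| uniq -c;
done > measurement_counts.out 2>&1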

Sometimes a poorly-designed program writes tons of stuff either to STDOUT or STDERR. These can be directed to a file which is then deleted, but there’s a way to discard the output immediately by directing it to a file called /dev/null. This is the Unix systems’ black hole where things go and never come back:

> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
   cat $place.tsv | cut -f1,6  | grep -P "\t0$"| uniq -c;
done > measurement_counts.out 2> /dev/null

One should be careful with that approach, however, as it may hide serious errors in the analysis pipeline.

Appending files and Heredoc

The arrow symbol > creates a file and writes to it, or deletes the contents of an existing file and then writes to it. Appending to a file, i.e. writing new text at the end of an existing file, can be done with double arrows >> file.out and 2>> file.err. We can modify the previous commands and write to the output file inside the loop; as we do so multiple times we have to append to the existing file:

> > measurement_counts.out
> for place in Kaisaniemi Kumpula Malmi Vuosaari; do
   cat $place.tsv | cut -f1,6  | grep -P "\t0$"| uniq -c >> measurement_counts.out
done

Note that >> file can create a file if it doesn’t exist. Here, the command > measurement_counts.out in the beginning is used to empty the file. If it were not there, the commands could append the results at the end of some erroneous text.

As > measurement_counts.out empties a file, one can also do cat > file without any STDIN to direct there: in that case, the keyboard becomes the STDIN until Ctrl-d is pressed. Thus I could do:

> cat > notes.txt
remember to remove viikki from the script

and end this by pressing Ctrl and d simultaneously. The text is now in that file:

> cat notes.txt
remember to remove viikki from the script

A variant of the same approach is the so-called Heredoc. In scientific analyses, Heredoc allows showing what is written into a file and thus allows others to replicate the same steps. The special feature of Heredoc is the double arrow << that specifies the end marker, often EOF (end of file), as in << EOF. Then, we could write in our documentation:

> cat > locations.txt << EOF
Kaisaniemi
Kumpula
Malmi
Vuosaari
EOF

After doing that, the information is now in the file:

> cat locations.txt
Kaisaniemi
Kumpula
Malmi
Vuosaari

The power of this approach is that we can use variables that are then translated to their values. A hypothetical argument file for a program could be created like this:

> place=Kumpula
> cat > args.txt << EOF
input:  $place.tsv
output: $place.out
error:  $place.err
EOF
> cat args.txt
input:  Kumpula.tsv
output: Kumpula.out
error:  Kumpula.err

Above, we could write $place.tsv as the dot separates the variable name from the rest of the text. However, this would not work as intended:

> echo $place_2024.tsv
.tsv

The reason for this is that now the whole place_2024 is considered the variable name and, as it is not defined, nothing is outputted. We can get around that by placing the variable name in curly brackets:

> echo ${place}_2024.tsv
Kumpula_2024.tsv

This is the beginning of the human BRCA2 gene coding sequence.

> cat > BRCA2-201_start.txt << EOF
ATGCCTATTGGATCCAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAAC
AAAGCAGATTTAGGACCAATAAGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCA
CCCTATAATTCTGAACCTGCAGAAGAATCTGAACATAAAAACAACAATTACGAACCAAAC
CTATTTAAAACTCCACAAAGGAAACCATCTTATAATCAGCTGGCTTCAACTCCAATAATA
TTCAAAGAGCAAGGGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAA
TTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGTAGACATAAAAGTCTTCGCACAGTG
AAAACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGT
GAAAGTCCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCAGTGGTA
TGTGGGAGTTTGTTTCATACACCAAAGTTTGTGAAGGGTCGTCAGACACCAAAACATATT
EOF

There are many command-line tools for the manipulation of sequence files, including reverse-complementing the sequence. However, the same can be done with plain bash tools.

Reverse-complementing requires that the file is read backwards, starting from the last line and from the last character of each row. Then each base is converted to its Watson–Crick pair, A->T, T->A etc. cat reads the text from top to bottom, so what could be the command to do the opposite? Surprisingly, it is called tac. The command for reversing the characters on each line is more boring and simply called rev. Finally, we need tr to translate the characters into something else. Put altogether, the bash command could be this:

> cat BRCA2-201_start.txt | tac | rev | tr ACGT TGCA
AATATGTTTTGGTGTCTGACGACCCTTCACAAACTTTGGTGTATGAAACAAACTCCCACA
TACCACTGACTTATCTCTTTGTGGTGTTACATGTGTACATTGTAGAACAACAGGACTTTC
ACTAAGACAAGAATTTAGAAGTGGACAGGAAACATCATCTGCTTGATCCATTTTAGTTTT
CACTGTGCGAAGACTTTTATGTCTACTATTGGGAACATTCCTTCCTAAGTCTAATTTGAA
TTTATCTAATTCTTTTACAGGAGATTGGTACAGCGGCAGAGTCAGCCCTTGCTCTTTGAA
TATTATTGGAGTTGAAGCCAGCTGATTATAAGATGGTTTCCTTTGTGGAGTTTTAAATAG
GTTTGGTTCGTAATTGTTGTTTTTATGTTCAGATTCTTCTGCAGGTTCAGAATTATAGGG
TGGAGCTTCTGAAGAAAGTTCTTCAAACCAATTAAGACTTATTGGTCCTAAATCTGCTTT
GTTGCAGCGTGTCTTAAAAATTTCAAAAAATGTTGGCCTCTCTTTGGATCCAATAGGCAT

To check that the conversion went correctly, we can convert the sequence back to the forward strand:

> cat BRCA2-201_start.txt | tac | rev | tr ACGT TGCA > BRCA2-201_start.rev
> cat BRCA2-201_start.rev | tac | rev | tr ACGT TGCA
ATGCCTATTGGATCCAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAAC
AAAGCAGATTTAGGACCAATAAGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCA
CCCTATAATTCTGAACCTGCAGAAGAATCTGAACATAAAAACAACAATTACGAACCAAAC
CTATTTAAAACTCCACAAAGGAAACCATCTTATAATCAGCTGGCTTCAACTCCAATAATA
TTCAAAGAGCAAGGGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAA
TTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGTAGACATAAAAGTCTTCGCACAGTG
AAAACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGT
GAAAGTCCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCAGTGGTA
TGTGGGAGTTTGTTTCATACACCAAAGTTTGTGAAGGGTCGTCAGACACCAAAACATATT

Although many things can be done with bash, it’s not always meaningful to do so. There are many sophisticated command-line tools to manipulate DNA data and, if working in the field, one should learn to use them efficiently.


Exercise: Heredoc

“Heredoc” is useful in the documentation of analysis steps as it allows creating small text files with specific contents. One could write instructions “Open a text editor and write there the words ‘cat’, ‘dog’ and ‘cow’. Save the file as ‘animals.txt’.” Sounds straightforward but there’s always a risk that the reader makes an error and something in the pipeline then breaks.

Write the Heredoc code to do the same, i.e. create a file called ‘animals.txt’ that contains the words ‘cat’, ‘dog’ and ‘cow’ in that order, one word per line.

Exercise 4 in Moodle.


File vs. stream

Command diff has an option to show the two files next to each other with the differences annotated. For wide files that is clumsy, however. In our earlier example of two Kaisaniemi data files, we could have picked the important columns into new files and compared those next to each other, e.g. like this:

> cut -f1,4,5,6 Kaisaniemi.tsv > old.txt
> cut -f1,4,5,6 Kaisaniemi_new.tsv > new.txt
> diff -y old.txt new.txt | head -24
Helsinki_Kaisaniemi 1   00:00   -13.9           Helsinki_Kaisaniemi 1   00:00   -13.9
Helsinki_Kaisaniemi 1   01:00   -14.1           Helsinki_Kaisaniemi 1   01:00   -14.1
Helsinki_Kaisaniemi 1   02:00   -14.4           Helsinki_Kaisaniemi 1   02:00   -14.4
Helsinki_Kaisaniemi 1   03:00   -14.7           Helsinki_Kaisaniemi 1   03:00   -14.7
Helsinki_Kaisaniemi 1   04:00   -14.6           Helsinki_Kaisaniemi 1   04:00   -14.6
Helsinki_Kaisaniemi 1   05:00   -14.3           Helsinki_Kaisaniemi 1   05:00   -14.3
Helsinki_Kaisaniemi 1   06:00   -14.2           Helsinki_Kaisaniemi 1   06:00   -14.2
Helsinki_Kaisaniemi 1   07:00   -14.5           Helsinki_Kaisaniemi 1   07:00   -14.5
Helsinki_Kaisaniemi 1   08:00   -15.2           Helsinki_Kaisaniemi 1   08:00   -15.2
Helsinki_Kaisaniemi 1   09:00   -15.6           Helsinki_Kaisaniemi 1   09:00   -15.6
Helsinki_Kaisaniemi 1   10:00   -15.9           Helsinki_Kaisaniemi 1   10:00   -15.9
Helsinki_Kaisaniemi 1   11:00   -16           | Helsinki_Kaisaniemi 1   11:00   -16.0
Helsinki_Kaisaniemi 1   12:00   -15.9         <
Helsinki_Kaisaniemi 1   13:00   -15.7           Helsinki_Kaisaniemi 1   13:00   -15.7
Helsinki_Kaisaniemi 1   14:00   -15.7           Helsinki_Kaisaniemi 1   14:00   -15.7
Helsinki_Kaisaniemi 1   15:00   -15.9           Helsinki_Kaisaniemi 1   15:00   -15.9
Helsinki_Kaisaniemi 1   16:00   -16.3           Helsinki_Kaisaniemi 1   16:00   -16.3
Helsinki_Kaisaniemi 1   17:00   -16.9           Helsinki_Kaisaniemi 1   17:00   -16.9
Helsinki_Kaisaniemi 1   18:00   -17           | Helsinki_Kaisaniemi 1   18:00   -17.0
Helsinki_Kaisaniemi 1   19:00   -17.3           Helsinki_Kaisaniemi 1   19:00   -17.3
Helsinki_Kaisaniemi 1   20:00   -17           | Helsinki_Kaisaniemi 1   20:00   -17.0
Helsinki_Kaisaniemi 1   21:00   -17           | Helsinki_Kaisaniemi 1   21:00   -17.0
Helsinki_Kaisaniemi 1   22:00   -16.8           Helsinki_Kaisaniemi 1   22:00   -16.8
Helsinki_Kaisaniemi 1   23:00   -16.5           Helsinki_Kaisaniemi 1   23:00   -16.5

The downside of this is that it creates temporary files that we have to recreate every time we want to make the diff and that slowly clutter the analysis directory. Wouldn’t it be great to have temporary files that would be created on the fly and that would disappear immediately after the command has been executed? Such “files” do exist and are created when the command is wrapped inside <( ). diff compares two files and expects the format diff [args] file1 file2. We can do that with the command:

> diff -y <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kaisaniemi_new.tsv) | less -S

In bash, the <( ) construct above is known as process substitution and it is closely related to the FIFO, the name coming from First In, First Out, a type of data structure on computers. According to the Linux documentation (man fifo), a FIFO is a named pipe that behaves like a file: “A FIFO special file (a named pipe) is similar to a pipe, except that it is accessed as part of the filesystem.”
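One can see that <( ) really behaves like a file name by printing it with echo (a small demonstration; on a typical Linux system the path points into /dev/fd, but the exact number may vary):

> echo <(ls)
/dev/fd/63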

Unfortunately, FIFOs don’t work with all programs and sometimes temporary files are needed.

File comparison with paste

With the same trick, we can select parts of large tables and show them next to each other with paste. This simple command reads two files and displays them next to each other, line by line:

> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | head
Helsinki_Kaisaniemi 1   00:00   -13.9   Helsinki_Kumpula    1   00:00   -15
Helsinki_Kaisaniemi 1   01:00   -14.1   Helsinki_Kumpula    1   01:00   -14.9
Helsinki_Kaisaniemi 1   02:00   -14.4   Helsinki_Kumpula    1   02:00   -15.4
Helsinki_Kaisaniemi 1   03:00   -14.7   Helsinki_Kumpula    1   03:00   -15.8
Helsinki_Kaisaniemi 1   04:00   -14.6   Helsinki_Kumpula    1   04:00   -16.1
Helsinki_Kaisaniemi 1   05:00   -14.3   Helsinki_Kumpula    1   05:00   -15.8
Helsinki_Kaisaniemi 1   06:00   -14.2   Helsinki_Kumpula    1   06:00   -15.4
Helsinki_Kaisaniemi 1   07:00   -14.5   Helsinki_Kumpula    1   07:00   -15.6
Helsinki_Kaisaniemi 1   08:00   -15.2   Helsinki_Kumpula    1   08:00   -16.1
Helsinki_Kaisaniemi 1   09:00   -15.6   Helsinki_Kumpula    1   09:00   -16.2

This becomes useful when we combine it with other commands, especially awk:

> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | awk '$4<$8' | wc -l
35
> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | awk '$4==$8' | wc -l
30
> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | awk '$4>$8' | wc -l
679

Here, we use awk to print out the lines where the condition is true, i.e. column 4 (temperature in Kaisaniemi) is smaller (awk '$4<$8'), equal (awk '$4==$8') or greater (awk '$4>$8') than column 8 (temperature in Kumpula). According to this quick analysis, Kumpula tends to be a colder place than Kaisaniemi.
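The same paste-plus-awk pattern can also summarise the data further. For example, the mean temperature difference between the two locations could be computed like this (an extra sketch: the END block runs after the last line has been read, and a positive result would mean that Kaisaniemi is on average warmer):

> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | awk '{sum+=$4-$8; n++} END{print sum/n}'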

The Unix pipes are such a good invention that other programming languages have adopted them. One of them is R, whose latest versions provide a native forward pipe syntax |>. These are the “classical” and the “Unix” way of doing the same computation:

> vals <- c(1,2)
> mean(vals)
[1] 1.5

> vals |> mean()
[1] 1.5

The R library dplyr provides its own version of pipes that uses %>% as the pipe symbol (but |> is also accepted):

> library(dplyr)
> vals %>% mean()
[1] 1.5

More information about the dplyr package (which is highly recommended for anyone using R) and its pipes can be found in the package documentation.

Note that the pipe symbol in R must be different from bash as the | character is already reserved (confusingly) for the OR operator.


Exercise: Command ‘paste’ (and FIFOs)

The command paste does what it says, pastes two files together, one line at a time. The files that are pasted can either be real files or FIFOs, that is temporary files created by other commands. For example, <(cut -f1 table.tsv) creates a temporary file that contains one column of a TAB-separated file; once the command has finished, this file disappears.

Use paste to merge columns of a table contained in a file. You can write the columns into actual files (cut -f1 table.tsv > col1.tsv) and paste these, or use FIFOs.

Exercise 5 in Moodle.

Take-home message

Streams and pipes are the core of the Unix philosophy and one should spend time learning the necessary skills. For a novice who doesn't yet know the bash commands and tools, it may be difficult to construct the required functionality. A good principle is that nearly anything can be done on the command line, and for most things useful examples are available on the net.