Revisiting loops and conditions
After this chapter, the student can construct loops that iterate a specific task multiple times, possibly for different files or sets of data. They can build conditions that change the behaviour of the command depending on specific rules.
The point of automation is to easily replicate something several times. This replication can be over a list of things (e.g. multiple data files) or for a certain number of times (e.g. 50 replicates of a heuristic search). We’ve seen these used previously but revisit the concept more thoroughly here.
For-loop
The for-loop is familiar to everyone who has done any computer programming, and a variant is also available in bash. The basic structure of the for-loop is:
for item in list; do
command [ $item ]
done
This is nearly human-readable: “For each item in a list, do the command, possibly providing the item as the argument; once ready, say ‘done’”.
We’ll first have a look at the “list”. That can literally be a list of words or numbers:
> for num in 1 2 3 4 5; do
echo value: $num
done
value: 1
value: 2
value: 3
value: 4
value: 5
However, writing out a list of numbers is exactly the type of task that should be given to the computer, and there's a bash command for that: seq. We can specify the start, end and interval but, at its simplest, we can just write
> seq 5
1
2
3
4
5
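With three arguments, seq takes the start, the interval and the end; for example, the numbers from 2 to 10 in steps of 2:

```shell
# seq <start> <interval> <end>
seq 2 2 10
```

This prints 2, 4, 6, 8 and 10, each on its own line.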
We can then use command substitution and write a command inside the command to generate the list. The format for that is $(command), such as:
> for num in $(seq -w 5 5 25); do
echo value: $num
done
value: 05
value: 10
value: 15
value: 20
value: 25
For the command seq, the argument -w is useful if the generated numbers have to be sorted later. See man seq for details.
By default, many bash commands (e.g. ls, sort) use lexicographic order when listing files. Lexicographically, ‘1’ comes before ‘5’ and thus ‘10’ is sorted before ‘5’; ‘0’ comes before ‘1’, so writing ‘05’ sorts ‘five’ before ‘ten’.
Moreover, it is often easier and tidier if the numbers are equally long so that file names with numbers align nicely. Most commands producing lists of numbers or patterns have an option to make the running pattern a fixed length and thus sort correctly in lexicographic order; seq -w is one of these. Similarly, many commands manipulating lists of files have an option to specify an alternative sorting rule; however, the lexicographic order is the default and the easiest to use.
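The difference is easy to try out with sort, which compares lexicographically by default and numerically with the option -n (a small illustration independent of the course data):

```shell
# lexicographic (default): '10' comes before '5'
printf '5\n10\n1\n' | sort
# numeric order with -n: 1, 5, 10
printf '5\n10\n1\n' | sort -n
```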
We can similarly generate lists of files with the ls command:
> cd ~/IntSciCom/Helsinki/
> ls H*_*_*.csv
Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Helsinki_Kumpula_1.1.2024-31.1.2024.csv Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv
and use command substitution for other things:
> for file in $(ls H*_*_*.csv); do
echo "$(echo $file | cut -d_ -f2) has $(cat $file | wc -l) observations"
done
Kaisaniemi has 745 observations
Kumpula has 745 observations
Malmi has 745 observations
Vuosaari has 745 observations
If you do not understand the function of the loop above, try to break it into pieces. First, test the plain loop:
> for file in $(ls H*_*_*.csv); do
echo $file
done
Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
Helsinki_Kumpula_1.1.2024-31.1.2024.csv
Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv
and then the internal commands:
> file=Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
> echo $file | cut -d_ -f2
Kaisaniemi
and
> cat $file | wc -l
745
While-loop
An alternative to the for-loop is the while-loop. In principle, the while-loop iterates as long as a condition is true:
while [ condition ]; do
command
done
This allows replicating the previous functionality:
> num=5
> while [ $num -le 25 ]; do
echo value: $num
num=$(($num +5))
done
value: 5
value: 10
value: 15
value: 20
value: 25
However, it makes little sense to do the counting with while as for does it so much better. On the other hand, while can be made to behave like a for-loop by doing the counting outside the loop and reading the values from STDIN:
> seq -w 5 5 25 | while read num; do
echo value: $num
done
value: 05
value: 10
value: 15
value: 20
value: 25
This works equally well with ls and files:
> ls H*_*_*.csv | while read file ; do
echo "$(echo $file | cut -d_ -f2) has $(cat $file | wc -l) observations"
done
Kaisaniemi has 745 observations
Kumpula has 745 observations
Malmi has 745 observations
Vuosaari has 745 observations
The advantage of the while read pair is that the input list can be piped in. I find it easier to first construct the command generating the list (possibly a combination of ls and grep commands) and then add the while loop after that (separated by a pipe). The alternative would be to embed the complex command at the beginning of the for-loop.
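As a sketch of this workflow, one can first test the list-generating command on its own and only then append the loop; here printf stands in for the ls-and-grep combination:

```shell
# step 1: construct and test the command generating the list
printf '%s\n' Kaisaniemi.csv Kumpula.csv Malmi.csv | grep -v Malmi
# step 2: add the while-loop after a pipe
printf '%s\n' Kaisaniemi.csv Kumpula.csv Malmi.csv | grep -v Malmi | while read file; do
  echo processing $file
done
```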
read with multiple variables
Note that read can read multiple variables at a time.
Finnish is an inflected language and the word order within sentences is pretty flexible. For example, in the sentence ‘the dog bit the man’, the three words can be in any order (though the emphasis of the sentence changes a bit). We can demonstrate this with a bash command that outputs the three words in a random order, all orderings being valid Finnish sentences:
> words=("koira" "puri" "miestä")
> shuf -i 0-2 | xargs -n3 echo | while read a b c; do echo ${words[$a]} ${words[$b]} ${words[$c]} ; done
koira puri miestä
If one executes the command again – pressing the arrow up key and enter – it changes the output randomly.
> shuf -i 0-2 | xargs -n3 echo | while read a b c; do echo ${words[$a]} ${words[$b]} ${words[$c]} ; done
puri miestä koira
This is an unnecessarily complex command for demonstrating the ability of the read command to take multiple variables, and it may need some clarification. words=("koira" "puri" "miestä") defines an array (or a vector) of three words, called words. The new command shuf -i shuffles the input, here the numbers 0-2, and prints them one at a time to STDOUT. xargs -n3 collects three lines together and outputs them all with echo. The loop construct while read a b c; do ...; done reads three variables per row, called $a, $b and $c, and does something with them. Finally, the command echo ${words[$a]} ${words[$b]} ${words[$c]} prints the words in a random order depending on the values of $a, $b and $c.
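The array syntax in isolation, for reference (indexing starts from zero):

```shell
words=("koira" "puri" "miestä")
echo ${words[0]}     # first element: koira
echo ${#words[@]}    # number of elements: 3
echo ${words[@]}     # all elements: koira puri miestä
```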
We can add an outer loop to execute the commands multiple times and see that the outputs really are random:
> for i in $(seq 1 10); do
shuf -i 0-2 | xargs -n3 echo | while read a b c; do
echo ${words[$a]} ${words[$b]} ${words[$c]}
done
done
miestä puri koira
puri miestä koira
miestä puri koira
koira puri miestä
miestä koira puri
miestä puri koira
puri koira miestä
koira miestä puri
miestä koira puri
miestä puri koira
Iterating with find
In the age of Google search, it would seem obvious that every computer system has a powerful search facility. However, searching (especially within file contents) requires building and keeping up-to-date complex indexes or databases, and such systems aren't widely used on the command line. Nevertheless, bash has an efficient program for searching directory structures, and it can be extended to search also the contents of files.
The find command may look clumsy and complicated but it is also very powerful. The basic format of the command is find <dir> [arguments]. Giving only the directory name goes recursively through the target and prints out everything found:
> find ~/IntSciCom/village/ | head -4
/users/username/IntSciCom/village/
/users/username/IntSciCom/village/house2
/users/username/IntSciCom/village/house2/bedroom
/users/username/IntSciCom/village/house2/bedroom/.hidden
Files whose names start with a dot are not shown by many programs; the leading dot is used to hide unnecessary details such as configuration files. They can be seen with ls by adding the argument -a:
> ls ~/IntSciCom/village/house2/bedroom
> ls -a ~/IntSciCom/village/house2/bedroom
. .. .hidden
Above, the single dot is the target directory and the double dot is the parent directory (that is why cd .. goes one level up); the file .hidden is empty and created just for git to include the directory in the repository from which the course material is copied.
Typical arguments for find specify the name (as a pattern), type (e.g. f for file, d for directory) or modification/access times. Focusing on ~/IntSciCom/village/, we can find the directories whose name starts with ‘b’ with the command:
> find ~/IntSciCom/village/ -name "b*" -type d
/users/username/IntSciCom/village/house2/bedroom
/users/username/IntSciCom/village/house2/office/bookshelf
/users/username/IntSciCom/village/house1/bedroom
/users/username/IntSciCom/village/house1/office/bookshelf
/users/username/IntSciCom/village/house3/bedroom
/users/username/IntSciCom/village/house3/office/bookshelf
/users/username/IntSciCom/village/shoppingcentre/bookstore
Alternatively, we can find all non-empty files (the sign ! negates the condition) with the command:
> find ~/IntSciCom/village/ -type f ! -empty
/users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Hamlet.txt
/users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Macbeth.txt
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt
We could of course pipe the output to a while-loop to do something with the files (and often do so). However, find allows executing commands for the hits found within the command itself. That is done with -exec <command> {} \; where {} is the position where the filename is put and \; ends the command. Then, we could count the words in the non-empty files with the command:
> find ~/IntSciCom/village/ -type f ! -empty -exec wc -w {} \;
12 /users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Hamlet.txt
3 /users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Macbeth.txt
34988 /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt
21427 /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt
find only finds files and directories, but we can extend the search to file contents with grep. Focusing on the same non-empty files, we can search for the words “To be” within them:
> find ~/IntSciCom/village/ -type f ! -empty -exec grep -H "To be" {} \;
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To bear our hearts in grief, and our whole kingdom
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be contracted in one brow of woe;
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be commanded.
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be a preparation ’gainst the Polack;
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be, or not to be, that is the question:
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be forestalled ere we come to fall,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:But never the offence. To bear all smooth and even,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt: To be your Valentine.
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:May read strange matters. To beguile the time,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To be his purveyor: but he rides well;
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To be the same in thine own act and valour
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To be invested.
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To be thus is nothing,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To bed, to bed. There’s knocking at the gate. Come, come, come, come,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:give me your hand. What’s done cannot be undone. To bed, to bed, to
find is often used to locate files that are then deleted, so often that the program has the argument -delete for that. It is generally safer to separate the search for files and their deletion, and do the latter using the commands constructed by the former:
> find ~/IntSciCom/village/ -empty -type f -exec echo "rm -i "{} \; > delete_empty.sh
> # bash delete_empty.sh
Here, the find command prints rm -i commands that delete the empty files found. The collection of these rm -i commands (a very simple “script”) can then be executed with bash delete_empty.sh. The argument -i makes the deletion interactive and asks the user for confirmation at each step.
Iterating with xargs
The command xargs is far more complex (and powerful) than the previous loop structures, and one can do well without it. When your computational tasks require parallel processing or the traditional loop structures feel too verbose, it is worth having a closer look at this command. Those baffled by the contents can jump straight to Conditional commands.
A common alternative to find -exec is to feed the output of find to the program xargs. xargs allows building and executing commands or command combinations using the information provided through STDIN.
We can count the words of the non-empty files by feeding the file names to xargs and then providing the bash command wc -w; xargs places the file names as the arguments of that command and produces the output:
> find ~/IntSciCom/village/ -type f ! -empty | xargs wc -w
12 /users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Hamlet.txt
3 /users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Macbeth.txt
34988 /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt
21427 /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt
56430 total
The last line of the output reveals that all four files were given simultaneously as arguments (as in wc -w file1 file2 file3 file4) and the program therefore also outputs the total count. With the argument -n1, xargs takes the file names one at a time and produces output identical to that of find -exec:
> find ~/IntSciCom/village/ -type f ! -empty | xargs -n1 wc -w | less
By default, xargs places the filenames read from STDIN after the command provided – which is the behaviour we want for wc -w. However, the filename (or word) can also be “named” and used explicitly, possibly multiple times. The argument -I% names the filename variable as % and we can then give that as the argument, as in wc -w %:
> find ~/IntSciCom/village/ -type f ! -empty | xargs -n1 -I% wc -w % | less
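The named placeholder becomes genuinely useful when it is needed more than once in the command. A sketch in a throwaway directory, copying each file to a backup (the .bak suffix is just an illustration):

```shell
# set up a temporary directory with two example files
cd $(mktemp -d)
touch a.csv b.csv
# % appears twice: once as the source and once in the target name
ls *.csv | xargs -I% cp % %.bak
ls
```

After the xargs command, the directory also contains a.csv.bak and b.csv.bak.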
Parallelising with xargs
One big reason for learning the use of xargs is its ability to parallelise jobs. Let’s assume that one has a Linux workstation with 16 CPUs and one hundred long analyses to run over the weekend. It would be inefficient (and possibly impossible) to start all 100 at the same time on Friday afternoon, and it would be cumbersome to keep checking over the weekend whether any of the jobs has finished and a new one should be started. xargs can take the long list of jobs as the input and run exactly 16 of them in parallel, starting a new one whenever a previous one finishes. (The program parallel is even better for this task but also more complex.)
To test that, we can create a small program and run copies of it in parallel. First, let’s make the program, called program.sh:
> cat > program.sh << 'EOF'
#!/usr/bin/bash
echo $1 starts
sleep $1
echo $1 ends
EOF
This expects a number as the argument, prints the number at the beginning and the end, and sleeps that number of seconds in between. We can test it:
> bash program.sh 2
2 starts
2 ends
and see that there’s a 2-second wait in the middle.
To see that xargs runs a specific number of instances in parallel, we can start five copies of the program, each taking one second longer to run than the previous one. We do that by providing the numbers 1, 2, 3, 4, 5 through STDIN; xargs reads these one at a time (-n1) and then runs two copies (-P2) of bash program.sh in parallel, providing the number as the argument (-I% and %):
> seq 5 | xargs -n1 -P2 -I% bash program.sh %
1 starts
2 starts
1 ends
3 starts
2 ends
4 starts
3 ends
5 starts
4 ends
5 ends
From the output we see that this indeed happened: 1 and 2 were started at the same time but as 1 finished earlier, 3 was started before 2 finished, and so on.
In fact, we don’t necessarily need to write the commands in a program file but can execute them within the xargs command using bash -c:
> seq 5 | xargs -n1 -P2 -I% bash -c "echo % starts; sleep %; echo % done"
1 starts
2 starts
1 done
3 starts
2 done
4 starts
3 done
5 starts
4 done
5 done
Here, the part within the double quotes could be the commands for the time-consuming analysis.
Note that on large computing clusters, much of the parallelisation is done with the scheduling system. On the other hand, if one has lots of small analyses (hundreds or thousands, each running in minutes), they should not be sent as separate jobs to the queue. Sending them to the queue would massively complicate the scheduler’s task and quickly ruin the user’s “Priority” score, slowing down the progress of the jobs in the queue.
In such cases, it is better to e.g. collect the many analyses in a bash script file, reserve a job with e.g. 20 CPUs and then run the commands with xargs -P20. Details can be found in the CSC documentation or requested from the CSC support staff.
Multiple xargs
It may be useful to have multiple xargs commands piped together. One useful trick is to use basename to get rid of the suffix. The bash commands dirname and basename take a file path as the argument and output either the directory path (everything but the last part) or the base part (the last part) of it. basename can additionally remove a fixed suffix (for example “.csv” at the end) and thus give the base of the name to which we can add different suffixes. We could do that and apply the conversion script to all “.csv” files, writing the output with the “.tsv” suffix:
> cd ~/IntSciCom/Helsinki/
> ls H*_*_*.csv | xargs basename -s .csv | xargs -n1 -I% bash -c "bash convert.sh %.csv > %.tsv"
We now have all the raw data nicely formatted as tab-separated files:
> ls H*_*_*
Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Helsinki_Kaisaniemi_1.1.2024-31.1.2024.tsv Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.tsv
Helsinki_Kumpula_1.1.2024-31.1.2024.csv Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv
Helsinki_Kumpula_1.1.2024-31.1.2024.tsv Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.tsv
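For reference, the three commands behave as follows in isolation (the path is just an illustration):

```shell
dirname /home/user/data/file.csv             # /home/user/data
basename /home/user/data/file.csv            # file.csv
basename -s .csv /home/user/data/file.csv    # file
```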
Conditional commands
A script that always runs the same commands on fixed file names can be controlled by copying different files to the expected input name and renaming the output files. A hypothetical setup could be like this:
> cp file1.tsv input.tsv
> bash script.sh
> mv output.txt file1.out
> cp file2.tsv input.tsv
> bash script.sh
> mv output.txt file2.out
Here, the script always reads the data from the file input.tsv and writes the results to the file output.txt.
A more advanced solution is to use input arguments for the script and thus apply the commands to different files. A hypothetical setup could be like this:
> bash script.sh file1.tsv file1.out
> bash script.sh file2.tsv file2.out
Here, the script takes two arguments and uses the first as the input data and the second as the output file.
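Inside the script, the two arguments are available as the positional parameters $1 and $2. A minimal sketch of what such a script.sh could contain (the tr command is only a placeholder for the real analysis):

```shell
#!/usr/bin/bash
# $1: input data file, $2: output file
infile=$1
outfile=$2
# placeholder analysis: convert the input to upper case
tr a-z A-Z < $infile > $outfile
```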
Often it is necessary to be able to make alternative decisions within the bash scripts and adjust the behaviour depending on the circumstances. This is done with tests and conditions. Most of the tests work on string or numbers, or on files and directories.
Some of the commonly needed string and integer tests are:
Test | Function
---|---
-z string | True if the length of string is zero
string1 = string2 | True if the strings are equal
string1 != string2 | True if the strings are not equal
int1 -eq int2 | True if int1 is equal to int2
int1 -ne int2 | True if int1 is not equal to int2
int1 -lt int2 | True if int1 is less than int2
int1 -gt int2 | True if int1 is greater than int2
Some of the commonly needed file tests are:
Test | Function
---|---
-e file | True if file exists
-f file | True if file exists and is a regular file
-d file | True if file exists and is a directory
The full list of bash tests can be found at https://www.gnu.org/software/bash/manual/html_node/Bash-Conditional-Expressions.html.
bash has the command test to test different things. The result of the test is stored in the special variable $?. As an example, we could test whether the variable $val equals a certain value:
> val=5
> test $val -eq 4
> echo $?
1
Here, the test produces FAIL/FALSE, which in bash is 1. Another try is more successful:
> test $val -eq 5
> echo $?
0
In bash, SUCCESS/TRUE is coded as 0.
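The same mechanism works for the file tests listed earlier; trying them out in a throwaway directory:

```shell
cd $(mktemp -d)
touch datafile
test -f datafile; echo $?   # 0: exists and is a regular file
test -d datafile; echo $?   # 1: not a directory
test -e missing; echo $?    # 1: does not exist
```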
Using the command test and the variable $? is often cumbersome, and it is much more straightforward to utilise the if-else structure common to most programming languages:
if [ condition ]; then
[commands for SUCCESS/TRUE]
else
[commands for FAIL/FALSE]
fi
> val=5
> for i in {3..7}; do
if [ $val -eq $i ]; then
echo value is $i
else
echo value is NOT $i
fi
done
value is NOT 3
value is NOT 4
value is 5
value is NOT 6
value is NOT 7
In an if-else condition, only the if part is compulsory; even then, the condition has to be ended with fi:
> for i in {3..7}; do
if [ $val -eq $i ]; then
echo value is $i
fi
done
value is 5
Sometimes it may be easier to have an empty TRUE case and do the actual work in the FALSE case. However, the TRUE case has to contain something, and the command : means “do nothing”:
> for i in {3..7}; do
if [ $val -eq $i ]; then
:
else
echo value is NOT $i
fi
done
value is NOT 3
value is NOT 4
value is NOT 6
value is NOT 7
On the other hand, the conditions are always boolean TRUE/FALSE cases and these can be converted to the opposite value with !:
> for i in {3..7}; do
if [ ! $val -eq $i ]; then
echo value is NOT $i
fi
done
value is NOT 3
value is NOT 4
value is NOT 6
value is NOT 7
Finally, an if-else condition can have any number of if cases, the subsequent cases given as elif; the final else (if present) is executed if none of the previous cases is TRUE:
> for i in {3..7}; do
if [ $val -lt $i ]; then
echo value is less than $i
elif [ $val -gt $i ]; then
echo value is greater than $i
else
echo value must be $i
fi
done
value is greater than 3
value is greater than 4
value must be 5
value is less than 6
value is less than 7
Conditions within a script
To demonstrate the integration of tests in scripts, we can do a highly simplistic text analysis of the two books found in the directory ‘bookshelf’:
> cd ~/IntSciCom/village/house1/office/bookshelf/
> cat > text_analysis.sh << 'EOF'
books=($(ls *.txt))
len0=$(cat ${books[0]} | wc -w)
len1=$(cat ${books[1]} | wc -w)
echo -n "File ${books[0]} has $len0 words and is "
if [ $len0 -gt $len1 ]; then
echo -n "longer than"
elif [ $len0 -lt $len1 ]; then
echo -n "shorter than"
else
echo -n "as long as"
fi
echo " file ${books[1]} that has $len1 words."
EOF
> bash text_analysis.sh
File Shakespeare_Hamlet.txt has 34988 words and is longer than file Shakespeare_Macbeth.txt that has 21427 words.
Above, we wrote the script using the Heredoc. In it, on the second row, the filenames are stored in a vector called books. The contents of vectors are zero-indexed and we can access the first item as ${books[0]}. On the third row, we do that and store the word count (the output of wc -w) in the variable $len0. The filename and the word count are used within the echo on the fifth row. The two word counts are compared in the tests on the sixth and eighth rows: the first condition tests whether $len0 is greater than $len1 and, if it is, the command on row 7 is performed; if it is not greater, we do another test and evaluate whether $len0 is less than $len1, possibly performing the command on row 9; if neither of the tests is true, the integers must be equal and we go for the default behaviour (else on row 10) and perform the command on row 11. The if-else condition has to be closed with fi, as on row 12.
Conditional commands within a script
As an example of a file test and conditional execution of commands, we revisit the Helsinki temperature data and embed the test condition within another command. When doing automated tasks, it is useful to check that existing files are not mistakenly overwritten. We incorporate this in the earlier csv-file conversion and add a check for the existence of the target file, [ -e ${name}.tsv ]. The csv-file is converted only if a similarly named target tsv-file doesn’t exist:
> cd ~/IntSciCom/Helsinki/
> rm Helsinki_K*_*.tsv
> ls H*_*_*.csv | while read csv; do
name=$(basename -s .csv $csv)
if [ ! -e ${name}.tsv ]; then
echo converting $csv
bash convert.sh $csv > ${name}.tsv
fi
done
converting Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
converting Helsinki_Kumpula_1.1.2024-31.1.2024.csv
As we deleted the tsv-files matching the pattern Helsinki_K*_*.tsv, only the Kaisaniemi and Kumpula files are regenerated; the Malmi and Vuosaari files exist and were not overwritten.
The unconditional script was introduced using the command xargs (see above). Conditional execution can be integrated with that, too:
> cd ~/IntSciCom/Helsinki/
> rm Helsinki_K*_*.tsv
> ls H*_*_*.csv | xargs basename -s .csv | \
xargs -I% bash -c "if [ ! -e %.tsv ]; then echo converting %; bash convert.sh %.csv > %.tsv; fi"
converting Helsinki_Kaisaniemi_1.1.2024-31.1.2024
converting Helsinki_Kumpula_1.1.2024-31.1.2024
Tests of program return value
Above we used tests on variables and files. One can easily extend the same approach to the outputs of programs. For example, we could go through all files with the suffix “.csv” and look for the word “Malmi”, reporting the name of each file with a hit:
> ls *csv | while read file; do
if [ $(grep Malmi $file | wc -l) -gt 0 ]; then
echo $file
fi
done
Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Here, the if condition computes the value of $(grep Malmi $file | wc -l), that is, the number of lines with the word “Malmi” in each file, and tests whether that is greater than zero; if it is, the file name is printed.
One would think that programs should always output something. That is not the case, and many programs have an option to run quietly and only produce the “return value”, stored in the variable $?. The command grep is one of these and runs quietly with the option -q. We can thus simplify the command above and write it as:
> ls *csv | while read file; do
if grep -q Malmi $file; then
echo $file
fi
done
Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Note that we don’t even need the square brackets [ ] for the test as grep directly returns either true or false.
This can be simplified further. In the section about conditions in awk, we learned that && means “AND” and || means “OR”. On the other hand, in the section about scripts and jobs, we learned that lists of commands combined with double ampersands && stop at the first failure. Using that information, we can further simplify the structure:
> ls *csv | while read file; do
grep -q Malmi $file && echo Found in $file || echo Not in $file
done
Not in Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
Not in Helsinki_Kumpula_1.1.2024-31.1.2024.csv
Found in Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Not in Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv
Here, grep -q Malmi $file returns either true or false (0 or 1, stored in $?); if it is true, the following command echo Found in $file is executed and the name of the file with the hits is printed; if it is false, the alternative command echo Not in $file is executed.
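One caution about the cmd && a || b shorthand: it is not a complete if-else, because b is also executed if a itself fails. A small demonstration, with false standing in for a command that fails:

```shell
# real if-else: only one branch can ever run, so nothing is printed here
if true; then false; else echo else-branch; fi
# && || shorthand: 'false' (the && branch) fails, so the || part also runs
true && false || echo or-branch
```

The shorthand is therefore safest when the && branch cannot fail, e.g. a plain echo as above.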
A key part of programming is automation, the repetition of specific tasks multiple times or for multiple different inputs. bash has many ways of creating loops: some are useful for simple iterations while others apply commands to files found using specific criteria. The commands may change their function depending on specific conditions: these can be e.g. arithmetic or test the properties of files.