Strings and numbers

Learning outcome

After this chapter, students can perform basic mathematical operations on integers using bash commands, understand why computer systems, built on binary numbers, struggle with decimal numbers, and know how to avoid the problems this can cause. They can also perform basic operations on strings, laying the foundations for later material.

This section introduces two fundamental concepts of bash, namely strings and numbers, and we also touch on the concept of variables and the command substitution that will become more familiar later. The main focus of this course is on manipulation of data files and development of automated pipelines for their analysis. In those, the manipulation of strings and calculation on numbers aren’t central, though it is good to be aware of the concepts if such a need appears.

The main point of this section comes in the latter part about numbers: computers aren’t always very good at calculating on them!

Strings

In computer programming, a “string” is traditionally a sequence of characters. The string can be constant and not change, or stored in a variable such that it can be altered. As an example of this, here, “Matti” is a constant string and printed on the terminal screen with the command echo:

> echo "Matti"

Matti

(echo does what its name says: it “echoes” all its arguments.)

On the other hand, here, “Matti” is the value of the variable $name and the string can be altered:

> name="Matti"
> echo $name

Matti

> name=${name/tti/ija}
> echo $name

Maija

The bash programming language provides some basic string operations. Often the same result can be achieved with a combination of (multiple) bash commands and there’s no single correct way of doing things.

Concatenating two strings

Strings can be concatenated by joining them together. Here, we combine the values of variables $forename and $surname and assign the result to variable $name:

> forename="Matti"
> surname="Meikäläinen"
> name=$forename$surname
> echo $name

MattiMeikäläinen

White space is also a string and can be joined, and variables are evaluated (their value considered) when printed within double quotes. Thus, the first two lines are equivalent:

> name=$forename" "$surname
> name="$forename $surname"
> echo $name

Matti Meikäläinen

Sometimes it is necessary to wrap the variable name in curly brackets so that the variable name stays separate from the surrounding text:

> name="Matti"
> echo "${name}la"

Mattila

One could always write ${name} instead of $name but not necessarily $name instead of ${name} (one such case is shown above). However, writing the non-compulsory curly brackets is time-consuming and can make the commands more difficult to read and, in this course material, they are often left out.
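In data analysis, concatenation is typically used for building file names from parts. The following is a minimal sketch; the variable names and the file name are hypothetical and only serve as an illustration:

```shell
sample="sample01"
read_pair="R1"

# Join variables and constant text inside double quotes.
# The curly brackets keep "_" and "." from being read as part of the names.
file="${sample}_${read_pair}.fastq"

# The += operator appends to an existing string.
path="results/"
path+="$file"

echo "$file"
echo "$path"
```

This prints sample01_R1.fastq and results/sample01_R1.fastq. Without the curly brackets, bash would look for a variable called $sample_ and the result would not be what was intended.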

Brackets vs. parentheses

The text may be a mixture of British and American spelling but in punctuation marks, the aim is to use British names.

( ) = brackets
[ ] = square brackets
{ } = curly brackets

Extracting substrings

Many programming languages have a specific command for extracting a part of a string. In bash, a substring is obtained by adding the start position and possibly the length as numbers: either ${<variable name>:<start pos>:<length>} or ${<variable name>:<start pos>}. (The latter gets everything till the end of the string.) Note that bash counts characters from zero and that the second number is the number of characters to extract, not the end position!

> echo ${surname:0:5}" "${forename}${surname:8}

Meikä Mattinen

If the command above looks incomprehensible, one can split it into parts and do first echo ${surname:0:5} and then echo ${surname:8}.
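The same syntax can be used to pick fields out of a file name when the positions are known. This is only a sketch with a hypothetical file name; a negative start position counts from the end of the string and has to be separated from the colon by a space:

```shell
file="sample01_R1.fastq"   # hypothetical file name

sample=${file:0:8}      # 8 characters starting at position 0
read_pair=${file:9:2}   # 2 characters starting at position 9
ext=${file: -5}         # the last 5 characters (note the space before the minus)

echo "$sample $read_pair $ext"
```

This prints "sample01 R1 fastq". Fixed positions work only when all file names have the same layout; otherwise pattern-based tools are a safer choice.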

The length of a string is obtained with the hash sign:

> length=${#surname}
> echo $length

11

Replacing and deleting substrings

Above, the colon was used to specify the substring positions. In replacement and deletion, the separator is the forward slash. Two slashes separate the target and the replacement as in ${<variable name>/<target>/<replacement>}:

> surname="Meikäläinen"
> echo ${surname/läinen/mies}

Meikämies

If the target is to be deleted, the second slash can be left out (though it does no harm to write it either):

> echo ${surname/läi}

Meikänen

The replacement or deletion is performed globally (as many times as it can be done) by writing the first slash twice:

> echo ${surname//i/}

Mekälänen
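A common use for replacement is deriving an output file name from an input file name. The sketch below uses a hypothetical file name; note that the target is a shell pattern, so a character class such as [0-9] can be used in it:

```shell
file="sample_01_R1.fastq"   # hypothetical file name

bam=${file/.fastq/.bam}     # replace the first match
dashed=${file//_/-}         # replace every underscore
no_digits=${file//[0-9]/}   # delete every digit

echo "$bam"
echo "$dashed"
echo "$no_digits"
```

This prints sample_01_R1.bam, sample-01-R1.fastq and sample__R.fastq.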

For more complex pattern search and replacement, there is a whole “editor language”, known as sed. We’ll have a whole section about sed and awk later.

Upper and lower-case

The first character of a string can be converted to upper case with ^ and to lower case with a comma (,). If the operator is given twice (^^ or ,,), the conversion is applied to all characters. So to convert to upper case, ^ is used:

> name="matti"
> echo ${name^}

Matti

> echo ${name^^}

MATTI

and to convert to lower case, a comma is used:

> name="MATTI"
> echo ${name,,}

matti

The target characters can be specified by listing them after the operator in square brackets:

> name="matti"
> echo ${name^^[at]}

mATTi

There are numerous other ways to do the conversion to upper or lower case and we’ll learn a few of those later.
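Case conversion is useful when text has to be compared regardless of how it was typed. A minimal sketch, assuming bash version 4 or newer (the operators are not available in older versions) and using a hypothetical variable $answer:

```shell
answer="Yes"   # hypothetical user input

# Convert to lower case before comparing so that YES, Yes and yes all match.
if [ "${answer,,}" = "yes" ]; then
    result="accepted"
else
    result="rejected"
fi

echo "$result"
```

This prints "accepted".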

Exercise: Strings

The value of a variable is evaluated when it is used in a command. One of the simplest ways to utilise a variable is to print out its value with the command ‘echo’.

Exercise 1 in Moodle.

Numbers

Most computer systems are based on binary numbers that consist of only two types of digits, ones and zeros. Although this fact is hidden deep inside the system and one can use computers without ever seeing binary numbers, it is good to be aware of the limitations set by this technical detail.

ASCII

The early computers used binary numbers consisting of eight bits, capable of representing the numbers 0-255. As everything inside the computer had to be stored as binary numbers, a standard was defined to encode the printable characters (letters, numbers, punctuation etc.) and the control characters as specific numbers. This long-standing standard is known as ASCII (see Wikipedia) and contains only 128 characters (requiring seven bits). This set of characters does not include umlauts and other non-English letters. Although ASCII has been replaced by standards (especially Unicode and its UTF-8 encoding, see Wikipedia) capable of representing over a million different characters, some computer systems and programs still expect the ASCII character set (e.g. the UH usernames only contain ASCII characters). Although the ASCII set is not about numbers per se, its constraints are defined by the underlying number system, and the ASCII system explains many of the limitations of command-line programs.
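The numbers behind the characters can be inspected in bash. As a small sketch (not needed elsewhere in the course), printf prints the numeric code of a character when the argument starts with a single quote, and an octal escape does the reverse:

```shell
# A leading single quote makes printf print the code of the next character.
code_A=$(printf '%d' "'A")
code_a=$(printf '%d' "'a")

# The reverse: convert the code 66 to octal (102) and print it as an escape.
char=$(printf "\\$(printf '%03o' 66)")

echo "A=$code_A a=$code_a 66=$char"
```

This prints "A=65 a=97 66=B", showing that upper-case and lower-case letters are stored as different numbers.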

Integers and floating point numbers

Modern computers use many more bits to code numbers and do perfectly fine with very large positive and negative integers (i.e. “whole” numbers). However, the binary system is not ideally suited for representing decimal numbers and computers can make trivial-looking errors in calculations involving decimals. This can be seen e.g. when summing 0.1 and 0.2 using Python:

> python3 -c "print(0.1+0.2)"

0.30000000000000004

Such errors have no practical significance if the result is rounded to a precision of a few digits, but they may cause errors if not considered, e.g. in comparisons. An inexperienced programmer could write Python code like this:

x = 0.1
y = 0.2

if x + y == 0.3:
    print("True: x+y is 0.3")
else:
    print("False: x+y is not 0.3")

and would then be surprised that the computer keeps making errors. There are ways to get around this problem, but the main point is that computers are prone to make errors with decimal numbers – or floating point numbers as they are typically called.
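One way around the problem, sketched here as one possibility among several, is to avoid floating point numbers altogether and count in a smaller unit using integers. Below, the values 0.1 and 0.2 are stored as tenths (1 and 2); this choice of unit is an assumption made for the example:

```shell
# Store 0.1 and 0.2 as integer tenths; integer arithmetic is exact.
x=1
y=2
sum=$((x + y))

if (( sum == 3 )); then
    verdict="True: x+y is 0.3"
else
    verdict="False: x+y is not 0.3"
fi
echo "$verdict"

# Convert back to a decimal string only for display.
printf -v shown '%d.%d' $((sum / 10)) $((sum % 10))
echo "$shown"
```

This prints "True: x+y is 0.3" and "0.3". Another common approach is to test whether the difference between two floating point numbers is smaller than a chosen tolerance instead of testing for exact equality.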

In bash, the calculation is done inside double brackets preceded by a dollar sign, $(( )). The bash language understands the basic operators (+, -, *, /, **, %) but it can only handle integers: the result of a division is truncated, and a decimal number given as input causes an error:

> echo $((3*2))

6

> echo $((3/2))

1
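The operators and the truncation can be explored with a few more cases. This is a minimal sketch with arbitrary example values; inside $(( )) the variable names can be written without the dollar sign:

```shell
a=7
b=2

quotient=$((a / b))              # 3: the decimals are cut off
remainder=$((a % b))             # 1: what is left over after the division
power=$((b ** 10))               # 1024
rounded=$(( (a + b / 2) / b ))   # 4: adding half the divisor first rounds to nearest
negative=$(( -a / b ))           # -3: truncation is towards zero

echo "$quotient $remainder $power $rounded $negative"
```

This prints "3 1 1024 4 -3". The rounding trick shown works as such only for positive numbers.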

Sometimes one can get around this limitation by using percentages, i.e. multiplying first by 100 and only then dividing the number:

> for i in {1..10}; do 
    echo "$(($i*100/10))% of analysis done"
done

10% of analysis done
20% of analysis done
30% of analysis done
40% of analysis done
50% of analysis done
60% of analysis done
70% of analysis done
80% of analysis done
90% of analysis done
100% of analysis done

As the decimals are cut out, the alternative using fractions is not very informative:

> for i in {1..10}; do 
    echo "$(($i/10)) of analysis done"
done

0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
1 of analysis done
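The order of the operations matters: the multiplication has to come before the division, otherwise the truncation has already destroyed the information. A small sketch with two hypothetical helper functions, percent and percent_naive, makes the difference visible:

```shell
percent()       { echo $(( $1 * 100 / $2 )); }   # multiply first, then divide
percent_naive() { echo $(( $1 / $2 * 100 )); }   # divide first: truncated too early

total=7   # hypothetical number of files to analyse
for i in 1 4 7; do
    echo "$i of $total: $(percent $i $total)% (dividing first: $(percent_naive $i $total)%)"
done
```

This prints 14%, 57% and 100% for the first function but 0%, 0% and 100% for the second.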

The calculation can be done with constant numbers or numbers stored in variables:

> for i in {1..10}; do
    echo "$i**$i is $(($i**$i))"; 
done

1**1 is 1
2**2 is 4
3**3 is 27
4**4 is 256
5**5 is 3125
6**6 is 46656
7**7 is 823543
8**8 is 16777216
9**9 is 387420489
10**10 is 10000000000
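Even “very large” has a limit. The sketch below assumes that bash stores integers as 64-bit signed numbers, which is the case on typical current systems. bash does not check for overflow, so a result that is too large silently wraps around to a negative number:

```shell
big=$((2**62))               # still fits
max=$((2**62 - 1 + 2**62))   # the largest value that fits in 64 bits
wrapped=$((max + 1))         # one step too far: wraps around without a warning

echo "$big"
echo "$max"
echo "$wrapped"
```

Under the stated assumption, this prints 4611686018427387904, 9223372036854775807 and -9223372036854775808. In the loop above, the powers would start to overflow at 16**16.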

Decimal numbers

When working on the command line, one of the easiest ways to do an “on the spot” calculation is to start the Python shell (or R shell) and do it there. The Python shell is started with the command python or python3 and closed with Ctrl+d. Another option is the calculator language bc but that, by default, has the precision set at zero decimal places and needs more typing than Python for the same task. Nevertheless, bc is available on most Linux systems and is often used to provide floating point calculation in bash scripts.

The trick of using bc is to create a text (written to STDOUT; read by bc from STDIN) that resembles the commands that one would type manually in the bc interface. In bc, “scale=1” sets the results of divisions to have the precision of one decimal place; the remaining decimals are cut off, not rounded. If we have the command:

> echo "scale=1; 26/3" | bc

8.6

we can get the same output by starting bc and then typing the commands scale=1 and 26/3 there. One can quit by typing “quit” or by pressing Ctrl+d.

With that, one can make meaningful calculations with decimal numbers within bash scripts:

> C=26
> F=$(echo "scale=2; $C * (9/5) + 32" | bc -l)
> echo "$C degrees Celsius is equal to $F degrees Fahrenheit."

26 degrees Celsius is equal to 78.80 degrees Fahrenheit.
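As bc cuts off each intermediate result at the precision given by scale, the order of the operations matters here too. The sketch below does the reverse conversion with two hypothetical helper functions; the check with command -v only skips the example on systems where bc is not installed:

```shell
# Dividing first: 5/9 is cut to .55 before the multiplication.
to_celsius_naive() { echo "scale=2; ($1 - 32) * (5/9)" | bc; }

# Multiplying first: the division is the last step and loses nothing here.
to_celsius()       { echo "scale=2; ($1 - 32) * 5 / 9" | bc; }

if command -v bc >/dev/null 2>&1; then
    echo "dividing first:    $(to_celsius_naive 78.8)"
    echo "multiplying first: $(to_celsius 78.8)"
fi
```

With 78.8 degrees Fahrenheit, the first function gives 25.74 and the second 26.00. A safe habit is to do the division last or to use a larger scale during the calculation and round only the final result.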

Using the command scale=x, one can set the precision of bc output. Most programming languages, bash included, have a printf command that allows for formatting the values that are printed. The details of the command can be easily found elsewhere (starting from Wikipedia) and only a simple example for rounding a float is considered here. Below, %.3f indicates that the value is a “float” and should be printed with three (3) decimal places after the dot (.):

> num=0.99999996779
> val=$(printf "%.3f" $num)
> echo $val

1.000

One can have multiple variables of different types as arguments, and the external content (coming from the arguments) can be mixed with constant text and control characters. Note that printf doesn’t write a newline unless specified (\n below):

> printf "The values are: %.3f and %.1f \n" 0.12476 0.12476

The values are: 0.125 and 0.1

Unlike the bash operations that truncate the numbers, printf does the rounding to integers correctly:

> printf "%.0f \n" 1.6

2
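Rounding with printf is a convenient way to bring a decimal number into the integer world of $(( )). A minimal sketch with a hypothetical value stored in $measured:

```shell
measured=7.6   # hypothetical decimal value, e.g. read from a results file

# Round to the nearest integer so that the value can be used inside $(( )).
rounded=$(printf '%.0f' "$measured")
doubled=$((rounded * 2))

echo "$measured rounds to $rounded; doubled: $doubled"
```

This prints "7.6 rounds to 8; doubled: 16". Giving 7.6 directly to $(( )) would cause an error.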

Exercise: Numbers

The value of a variable is evaluated when it is used in a command. One of the simplest ways to utilise a variable is to print out its value with the command ‘echo’. bash can only do calculations on integers but even these can be highly useful in controlling the flow of pipelines. Mathematical equations are evaluated with $(( )).

Exercise 2 in Moodle.

Take-home message

If needed, bash provides some tools for the manipulation of strings. Bash can do basic calculations on integers and bc can be used to calculate on decimal numbers. Computers can’t represent all decimal numbers accurately and are prone to make errors in certain calculations.