Files and directories

Learning outcome

After this chapter, the students can move around in a computer’s directory system using command-line commands and understand the relation between its visualisations in the graphical interface and on the command line. They can also explain why moving a file/directory from one place to another is computationally different from copying a file/directory from one place to another, and can rationalise when it is better to create hard or symbolic links to a file instead of multiple copies of the same file.

Computers show the data in a structured file system consisting of directories and files. This is not how the data are actually stored in the computer’s storage systems, and the file system is just a layer that makes locating the data easier for the operating system and the human using it. For efficient use of computers, it is useful to understand the basics of how the computers store the data.

Moving around and listing the contents

Directories can be seen as boxes inside other boxes, each able to contain either files or other directories. Moving between directories may be easier to understand if one thinks of them as actual objects with addresses, like houses in a village. Let’s assume that we have a village with three houses, each having four rooms, and a shopping centre with two shops:

village/
├── house1
│   ├── bedroom
│   ├── kitchen
│   ├── livingroom
│   └── office
├── house2
│   ├── bedroom
│   ├── kitchen
│   ├── livingroom
│   └── office
├── house3
│   ├── bedroom
│   ├── kitchen
│   ├── livingroom
│   └── office
└── shoppingcentre
    ├── bookstore
    └── grocerystore

A village with buildings and the rooms of one house.

On Unix, some characters (such as $, {, } etc.) have special meanings and should not be used in file or directory names. Secondly, Unix commands consist of the program name (such as ls) and its arguments (e.g., -l house1 house2), which are separated by spaces (the full command would thus be ls -l house1 house2). As the space is a separator, it is bad practice to use spaces in file and directory names: one can use them but that makes life unnecessarily hard.

Above, I have simply left the spaces out and written e.g. shoppingcentre. A common practice is to replace space with the underscore and write shopping_centre or use the camel-case and write shoppingCentre or ShoppingCentre. It is a good practice to use only alphabet letters (a-z, A-Z), numbers (0-9), underscores (_) and dots (.) in the file and directory names. The operating system doesn’t care about the file endings but for us humans, it is good to use informative endings that follow common naming practices: .txt for text, .csv for comma-separated values etc.
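
The trouble caused by spaces can be seen in a small experiment. This is only a sketch: it works in a throwaway directory created with mktemp (an assumption, not part of the course material) and the directory names are made up for illustration.

```shell
# Work in a throwaway directory so that nothing real is touched.
demo=$(mktemp -d)
cd "$demo"

# Create one directory with a space in its name and one without.
mkdir "shopping centre" shopping_centre

# Unquoted, the shell splits the name into two arguments, "shopping" and
# "centre"; neither exists, so ls fails and the message after || is printed.
ls shopping centre 2>/dev/null || echo "ls could not find 'shopping' or 'centre'"

# The name with a space works only when quoted (or escaped as shopping\ centre).
ls -d "shopping centre"

# The underscore version needs no special care.
ls -d shopping_centre
```

The quoting has to be remembered in every single command, which is why leaving the spaces out is usually the easier choice.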

We can get “in front of” this village with the command

> cd ~/IntSciCom/village

Here, cd means change to directory and the text after it gives the path (or address) of the directory starting from the home directory, short-handed with ~. We can list the contents of the directory with the command ls:

> ls
house1  house2  house3  shoppingcentre

We see the three houses and the shopping centre. We don’t need to enter the directory to see its contents but can list it by giving its name as an argument to the ls command:

> ls house1
bedroom  kitchen  livingroom  office

The forward slashes (/) separate the different levels (directories) in the file path. If the path ends at a directory (and the path could thus be extended with more directories or a file), one can either write a trailing slash or leave it out. This command is equivalent to one above:

> ls house1/
bedroom  kitchen  livingroom  office

In fact, we can list the contents of all three directories at once either by giving the name of each directory in the command:

> ls house1 house2 house3
house1:
bedroom  kitchen  livingroom  office

house2:
bedroom  kitchen  livingroom  office

house3:
bedroom  kitchen  livingroom  office

or using wildcards that match multiple different characters:

> ls house?
house1:
bedroom  kitchen  livingroom  office

house2:
bedroom  kitchen  livingroom  office

house3:
bedroom  kitchen  livingroom  office

Here, ? means any single character. Another often used wildcard is * and it matches either ‘nothing’, ‘any character’ or ‘any combination of multiple characters’. We can see that * matches everything with the command:

> ls *
house1:
bedroom  kitchen  livingroom  office

house2:
bedroom  kitchen  livingroom  office

house3:
bedroom  kitchen  livingroom  office

shoppingcentre:
bookstore  grocerystore

Efficient use of wildcards is a core skill in command-line working but one has to be aware of their dangerous sides as well. Let’s assume that we would like to remove the directory house1/ and all its content. One way to do that would be:

> rm -r house1/*

Strictly speaking, this command removes everything inside house1/ but leaves the empty directory itself in place; rm -r house1 would remove the directory as well, so the asterisk is unnecessary. However, if we make a typo and insert a space before the asterisk:

> rm -r house1/ *

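we get a warning rm: cannot remove 'house1': No such file or directory and will find out that all directories have been removed! Here, the shell expanded the asterisk to the names of all directories, so rm was asked to remove house1/ and then house1, house2, house3 and shoppingcentre. The directory house1/ was removed first, and the second attempt to remove house1 gave the error that no such directory exists any more.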

For a beginner (and even for more experienced users), it’s a good practice to test the result of wildcard expansion with a safer command first. One could first give the command

> ls -R house1/*

and check that these are indeed the files and directories that one wants to remove. If they are, one can then replace ls with rm -r in the command.
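
Another way to preview a wildcard, sketched here under the assumption of a small throwaway copy of the village created with mktemp, is to put echo in front of the whole command. The shell expands the wildcard as usual, but echo only prints the resulting command instead of running it.

```shell
# Build a small sandbox version of the village.
demo=$(mktemp -d)
mkdir -p "$demo/village/house1/kitchen" "$demo/village/house2" "$demo/village/house3"
cd "$demo/village"

# echo prints the command as rm would receive it, without removing anything.
echo rm -r house1/*      # prints: rm -r house1/kitchen
echo rm -r house1/ *     # prints: rm -r house1/ house1 house2 house3

# Store the two expansions so that they can be compared.
intended=$(echo house1/*)
typo=$(echo house1/ *)
```

The second line of output makes the effect of the stray space visible before any damage is done.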

The Linux command-line environment has a built-in documentation system that can be accessed with the command man <prog_name>. For example, the command:

> man rm

reveals that rm is a program to “remove files or directories” and that its argument -r removes “directories and their contents recursively”. Press “q” to quit and then find out the meaning of the argument in ls -R.


Exercise: Command ‘ls’

The arguments for ls depend very much on the use case. Some widely used ones are

  • -l use a long listing format
  • -h with -l, print file sizes in human-readable format
  • -t sort by time, newest first

More important for ls is the use of “wildcards” that allow selecting the right combination of files. ? matches any single character. * matches any number of any characters, including “none”. Square brackets list all matching characters such that [ABC] matches “A”, “B” and “C”. All these can be combined, e.g. ls [AB]*day_??.txt matches Birthday_v3.txt.
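
The combined pattern can be tried out safely on empty files. In this sketch, all file names except Birthday_v3.txt are hypothetical and chosen to show what the pattern accepts and rejects; the files are created in a temporary directory.

```shell
demo=$(mktemp -d)
cd "$demo"

# touch creates empty files with the given names.
touch Birthday_v3.txt Anyday_01.txt Monday_01.txt Birthday_v10.txt Birthday_v3.csv

# [AB] accepts A or B as the first letter, * any characters before "day_",
# and ?? exactly two characters before ".txt".
matches=$(echo [AB]*day_??.txt)
echo "$matches"          # prints: Anyday_01.txt Birthday_v3.txt
```

Monday_01.txt fails on the first letter, Birthday_v10.txt has three characters where ?? allows two, and Birthday_v3.csv has the wrong ending.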

Exercise 1 in Moodle.


Absolute and relative file paths

We can move directly into a directory inside another directory by giving its full path:

> cd house1/office/

When moving between directories, one may get confused about the current location. Depending on the system used and its settings, the shell program may tell the name of the directory in the command prompt. One can always print the absolute path to the current directory with the command pwd (print working directory):

> pwd
/users/username/IntSciCom/village/house1/office

Given that, one could move around by always giving the full path to the target directory. We could move from office to bedroom with the command:

> cd /users/$USER/IntSciCom/village/house1/bedroom/
> pwd
/users/username/IntSciCom/village/house1/bedroom

Above, we have used ‘username’ as a part of the file path. That is just a placeholder and is actually replaced by one’s own username. Elsewhere, we have used ‘$USER’. That is a variable that holds the user’s username and, when executed, is replaced by that. Because of that, the command containing the variable should work for every user although the real file path is different for each of us.

This looks very complicated and there’s a much easier way to refer to the parent directory – or one step backwards on the path. One step backwards is .. and these can be combined:

> cd ../kitchen/
> pwd
/users/username/IntSciCom/village/house1/kitchen
> cd ../../house2/office/../livingroom/
> pwd
/users/username/IntSciCom/village/house2/livingroom

Here, the second cd command moves out of kitchen (../); then out of house1 (../); then into house2 (house2/) and into office (office/); then out of office (../); and finally into livingroom (livingroom). It is of course unnecessary to go first to office, come out of there (..), and then go to livingroom, and one would normally go directly to the correct destination. This detour is shown just to demonstrate that it can be done.

As .. means one step backwards on the path, . means this directory:

> cd ..
> pwd
/users/username/IntSciCom/village/house2
> ls .
bedroom  kitchen  livingroom  office
> ls ..
house1  house2  house3  shoppingcentre
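
The same steps can be replayed in a sandbox. This is a minimal sketch assuming a reduced copy of the village under a temporary directory; basename and dirname are used only to shorten the printed paths.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/village/house1/kitchen" \
         "$demo/village/house2/office" \
         "$demo/village/house2/livingroom"

# Start in the kitchen of house1 and take the detour described above.
cd "$demo/village/house1/kitchen"
cd ../../house2/office/../livingroom/
here=$(basename "$PWD")
parent=$(basename "$(dirname "$PWD")")
echo "now in $here, inside $parent"    # prints: now in livingroom, inside house2

cd .      # . is the current directory, so nothing changes
cd ..     # one step backwards
echo "$(basename "$PWD")"              # prints: house2
```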

I have a UH Linux computer (running Cubbli, a variant of Ubuntu Linux) and on my system, the root of the file system looks like this:

> ls /
bin   cdrom  cubbli22-gold  etc   lib    lib64   lost+found  mnt  proc  run   snap  sys  usr
boot  cs     dev            home  lib32  libx32  media       opt  root  sbin  srv   tmp  var

One doesn’t need to care about this except for the path symbol / that represents the root of the file structure tree.

The work and life of a regular Linux user typically happens in the branch starting with /home or /users. (On my computer it is called /home but on the CSC computers it is called /users and we stick to that now.) In that directory, each user then has a directory of their own, known as the home directory, and only they can see and manipulate the files in that directory. On the UH computers, this personal directory is named after the account name, used for email and other things in the UH systems. For the user called “fakename”, the home directory would be /users/fakename (or /home/fakename on UH Cubbli).

As the home directory is so important, there are shorthands that help to use it. The command:

> cd

(with no additional arguments) changes to the home directory. The symbol ~ (called tilde) is a shorthand for the home directory:

> cd ~

is equivalent to the previous command.

Similarly,

> ls ~/

is equivalent to

> ls /users/$USER/

Here, the variable $USER holds the username of the current user, and that is substituted differently in the commands of different users, e.g., as /users/fakename/.
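
The substitutions can be made visible with echo. The sketch below also uses the variable $HOME, which is not introduced above: on typical systems it holds the path of the home directory, and that is the value the shell puts in place of ~.

```shell
# The tilde is replaced by the shell before the command is run.
tilde_path=$(echo ~)
echo "$tilde_path"

# $HOME normally holds the same path.
echo "$HOME"

# Inside quotes the tilde is not expanded and stays a plain character.
quoted=$(echo "~")
echo "$quoted"           # prints: ~
```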

On macOS, the root of the file system is / and the home directory is /Users/fakename. Windows has no single root but a separate one for each drive (such as C:\); the home directory is typically C:\Users\fakename.

Reading files

Let’s move back to house1:

> cd ~/IntSciCom/village/house1

We learn that bedroom contains a directory called notebook and that contains two files:

> ls bedroom/
notebook
> ls bedroom/notebook/
Shakespeare_Hamlet.txt  Shakespeare_Macbeth.txt

The ending .txt suggests (but doesn’t guarantee) that they are text files. We get some information e.g. with the commands ls -l (where -l indicates the long format), file and wc (meaning word count):

> ls -l bedroom/notebook/
total 8
-rw-rw---- 1 username pepr_username 56 Mar  6 15:48 Shakespeare_Hamlet.txt
-rw-rw---- 1 username pepr_username 18 Mar  6 15:48 Shakespeare_Macbeth.txt
> file bedroom/notebook/Shakespeare_Hamlet.txt 
bedroom/notebook/Shakespeare_Hamlet.txt: ASCII text
> wc bedroom/notebook/Shakespeare_Hamlet.txt 
 1 12 56 bedroom/notebook/Shakespeare_Hamlet.txt

In the output of ls -l, the first character - means that it is a regular file; the line would start with d if the target were a directory. The next three characters indicate what the owner of the file can do with it: rw- means that one can read the file and write to the file but not execute it (thus, it is not a program that runs and does something); the following three characters tell the permissions of the group members (read and write, no execute) and then of all other users (here ---; we’ll revisit these later). Then come the owner of the file and their group, and finally the date when the file was last modified; the numbers 56 and 18 in between give the sizes of the files, 56 and 18 bytes (in this case 56 and 18 characters).

The command file determines the file type and here says that it is “ASCII text” (consisting of standard characters) that we can easily read.

The command wc tells that the file consists of one row (in fact, it has one newline character ending the row), 12 words and 56 characters. If one would like to get just one or two of the counts, we could specify that with additional arguments:

> wc -w  bedroom/notebook/Shakespeare_Hamlet.txt 
12 bedroom/notebook/Shakespeare_Hamlet.txt

See man wc to find the other arguments. (Press q to quit reading the manual.)
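
As a sketch of the individual counts, the short Hamlet note can be recreated in a temporary directory (the directory is an assumption for the example) and each count asked for separately. Reading the file through < makes wc print only the number, without the file name.

```shell
demo=$(mktemp -d)
printf 'A story about Danish bloke (and his dad who is a ghost)\n' > "$demo/Shakespeare_Hamlet.txt"

lines=$(wc -l < "$demo/Shakespeare_Hamlet.txt")   # number of newline characters
words=$(wc -w < "$demo/Shakespeare_Hamlet.txt")   # number of words
bytes=$(wc -c < "$demo/Shakespeare_Hamlet.txt")   # number of bytes

# The counts are 1, 12 and 56, as in the wc output above.
echo "lines: $lines, words: $words, bytes: $bytes"
```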

As we learned that the files are small text files, we can safely look at them more closely. cat (meaning catenate) is the most basic command to read and print the contents of files. The name of the command comes from its usage for concatenating the contents of files into new files:

> cd bedroom/notebook
> cat Shakespeare_Hamlet.txt Shakespeare_Macbeth.txt > my_notes.txt

Here, > my_notes.txt means that the output of concatenation is written to a new file called my_notes.txt. The properties of this new file are quite predictable given the input:

> ls -l my_notes.txt 
-rw-rw---- 1 username pepr_username 74 Mar  6 15:55 my_notes.txt
> file my_notes.txt 
my_notes.txt: ASCII text
> wc my_notes.txt 
 2 15 74 my_notes.txt

If we do not direct the contents of the concatenation command into a new file, it is printed on the screen:

> cat Shakespeare_Hamlet.txt 
A story about Danish bloke (and his dad who is a ghost)
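
That the concatenated file is simply the inputs placed one after another can be checked from the sizes. The sketch below uses two made-up note files in a temporary directory.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'first note\n'  > a.txt
printf 'second note\n' > b.txt

# Join the two files into a third one, then print the result on the screen.
cat a.txt b.txt > both.txt
cat both.txt

# The size of the result equals the sum of the sizes of the inputs.
size_a=$(wc -c < a.txt)
size_b=$(wc -c < b.txt)
size_both=$(wc -c < both.txt)
echo "$((size_a + size_b)) bytes expected, $((size_both)) bytes found"
```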

cat is an important command but it is not practical for reading all text files. To see that, we can go to the directory bookshelf inside the directory office:

> cd ../../office/bookshelf/

If that fails for some reason, you can get there also through the full path:

> cd ~/IntSciCom/village/house1/office/bookshelf/

Now, we can see that the files are much larger:

> ls -l
total 332
-rw-rw---- 1 username pepr_username 206763 Mar  6 15:48 Shakespeare_Hamlet.txt
-rw-rw---- 1 username pepr_username 130397 Mar  6 15:48 Shakespeare_Macbeth.txt
> file *
Shakespeare_Hamlet.txt:  Unicode text, UTF-8 (with BOM) text, with CRLF line terminators
Shakespeare_Macbeth.txt: Unicode text, UTF-8 (with BOM) text, with CRLF line terminators
> wc *
  7079  34988 206763 Shakespeare_Hamlet.txt
  4544  21427 130397 Shakespeare_Macbeth.txt
 11623  56415 337160 total

Starting from the bottom, we see that the files have 7079 and 4544 lines and altogether 337,160 bytes. They are text files using the Unicode characters and CRLF as the end-of-line mark, suggesting that they were created on a Windows system.

We could print the contents of the file with cat but it is impossible to read the text as fast as it scrolls on the screen:

> cat Shakespeare_Hamlet.txt
(too much to show...)

To make the text readable, we pipe the output of cat to the program less:

> cat Shakespeare_Hamlet.txt | less

You can now scroll up and down with arrow keys, press space for the next screenful of text and finally quit reading the text with “q”.

Important

If you missed it above and seem to have got stuck on less, press the key q on your keyboard to quit the program.

In fact, we don’t need two programs to read a file but can do it directly with less:

> less Shakespeare_Hamlet.txt

less can do much more than paginate the text and has, for example, a built-in text search function. If one opens the Hamlet file (the command above), one can then start the search with the character / and write the text after that, e.g. /To be. This finds the first occurrence of those words; one can keep moving to the next occurrence with n, browse around using arrows etc., or quit with q.

The cat command is unsuitable for reading long texts and very early on, programs were developed to show one screenful of text at a time. One of these programs was more, the name coming from the text such as --More--(1%) appearing at the bottom of the screen. The text here means that one has seen 1% of the contents and one can see more by pressing the spacebar. The early more was very simple and could only go forward in the text. Someone developed a better program for the same task and, instead of calling it something like better-more, named it less.

Nowadays, less is a massive program and, in addition to plain text, may read compressed text files, PDF documents and many other formats. As shown below, the less manual has 1510 lines of text and it takes time to learn all its features:

> man less | wc
  1510   12603   87719

However, the standard behaviour is sufficient for most tasks.

Why can’t we always use less for reading the files?

The two programs have their own strengths and it’s the Unix-way to combine different tools to get the intended outcome. For example, above we learned that the files have CRLF as the end-of-line mark. Linux typically reads them fine (as here), but specific programs may be more picky and wrong end-of-line characters may cause frustrating problems. cat has argument -A to show all characters, also those not typically printed as visible characters (these include e.g., space, backspace, tab and newline). Combining cat and less, we can do:

> cat -A Shakespeare_Hamlet.txt | less

and see that every line ends with ^M$, meaning CR and LF. In comparison, the file that we created above:

> cat -A ../../bedroom/notebook/my_notes.txt | less

has lines ending with $, meaning LF only, as is the standard on Unix systems. The combination cat -A filename | less can be useful when things do not work as intended and one suspects that the text file could have something wrong. These commands reveal e.g. the difference between a TAB character (shown as ^I) and multiple spaces.
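
The different line endings are easy to reproduce. The following sketch writes tiny files with printf into a temporary directory; the file names are hypothetical, the -A argument assumes the GNU version of cat found on Linux, and the tr command used for the conversion is an addition that is not covered above.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'To be\r\nor not\r\n' > windows.txt   # CR+LF line endings
printf 'To be\nor not\n'     > unix.txt      # LF line endings
printf 'a\tb\n'              > tab.txt       # a TAB between a and b

cat -A windows.txt   # each line ends with ^M$
cat -A unix.txt      # each line ends with $
cat -A tab.txt       # the TAB is shown as ^I

# Deleting the CR characters with tr gives a file identical to the Unix one.
tr -d '\r' < windows.txt > converted.txt
cmp converted.txt unix.txt && echo "identical after removing CR"
```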

And what do CR and LF stand for? In brief, computer systems evolved from mechanical devices and writing text evolved from mechanical typewriters. On those, starting a new line required returning the type element to the beginning of the line (carriage return, or CR) and then moving the paper up to the next line (line feed, or LF). Different operating systems then adopted CR, LF or CR+LF to indicate a newline in text files. Unix (LF) and Windows (CR+LF) use different end-of-line characters.

Yes, CR and LF are characters although we don’t typically see anything visible on the screen. Computer languages typically use \n to mark the newline (depending on the operating system, either LF or CRLF) and the same applies also for bash. The command:

> echo -e "First line.\nSecond line."
First line.
Second line.

has one continuous string (within double quotes) as an argument but then prints the text on two lines. The control character \n is executed and, as a result, the writing moves to the beginning of the next line. Similarly, a multi-line text file would be stored in memory as a long sequence of characters and its true appearance could only be seen when it’s printed out and all the control characters are converted to their true form.

Moving and copying files

In this example, the directories have descriptive names (such as village, house1 and bedroom) but they are actually all technically similar and the whole house2 could be moved into notebook. Directories can hold many sub-directories and files (on a typical Linux system, a directory can hold approximately 4 billion files), but it is rarely practical and efficient to have lots of files and sub-directories inside one directory. It is easier for both humans and the computer operating system to find information when it is placed in structured directory systems consisting of several layers of sub-directories.

The directories can naturally hold files of different sizes, on a typical Linux system from 0 to \(2^{44}-1\) bytes (that is, up to about 16 terabytes or 16,000 gigabytes). One of the reasons directories can hold such large files is that the files are not actually held inside the directory: the directory only contains a link to the actual data held elsewhere in the storage device.
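
The size limit can be worked out with the shell’s own integer arithmetic. In this small sketch, 1 << 44 means 2 to the power of 44, and the units are the binary ones (1 TiB is 2 to the power of 40 bytes), which is why the gigabyte figure comes out slightly above 16,000.

```shell
max_bytes=$(( (1 << 44) - 1 ))          # largest file size in bytes
tib=$(( (1 << 44) / (1 << 40) ))        # 2^44 bytes expressed in TiB
gib=$(( (1 << 44) / (1 << 30) ))        # 2^44 bytes expressed in GiB

echo "largest size: $max_bytes bytes"   # prints: largest size: 17592186044415 bytes
echo "2^44 bytes = $tib TiB = $gib GiB" # prints: 2^44 bytes = 16 TiB = 16384 GiB
```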

We can clarify this with an example. Here we have our village, the buildings, the rooms, the bookshelf and finally the book files:

Schematic view of subdirectories of “village” and the filenames being pointers to memory locations on the storage device.

The directory names (and their contents) and the file names are stored in the storage device (SSD, hard drive, USB disk etc.) but they take little space; the file names (here, the books) are pointers to locations in the storage device where the actual data are stored, shown as solid blue and red blocks.

Given that, moving a file from one place to another is an easy operation and only needs to move the pointer. So, starting from this:

Start point for the move command.

> tree house[12]/office/
house1/office/
└── bookshelf
    ├── Shakespeare_Hamlet.txt
    └── Shakespeare_Macbeth.txt
house2/office/
└── bookshelf

the command:

> cd /users/$USER/IntSciCom/village 
> mv house1/office/bookshelf/Shakespeare_Macbeth.txt house2/office/bookshelf/Shakespeare_Macbeth.txt

moves the book file to the bookshelf in the office of house2:

Result of the move command where only the pointer to the memory location, not the actual data, was moved.

> tree house[12]/office/
house1/office/
└── bookshelf
    └── Shakespeare_Hamlet.txt
house2/office/
└── bookshelf
    └── Shakespeare_Macbeth.txt

Even if the book file is huge in size, the operation would be easy as the actual data are not moved anywhere, just the pointer to the data. The situation is very different if one copies the file:

> cp house2/office/bookshelf/Shakespeare_Macbeth.txt house1/office/bookshelf/Shakespeare_Macbeth.txt

as now the same data are stored twice and, depending on the size of the file, writing the copy can take a long time:

Result of the copy command where the data in the storage device has been duplicated and both memory locations are now pointed to.

> tree house[12]/office/
house1/office/
└── bookshelf
    ├── Shakespeare_Hamlet.txt
    └── Shakespeare_Macbeth.txt
house2/office/
└── bookshelf
    └── Shakespeare_Macbeth.txt
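
The difference between moving and copying can be observed through the inode number, which is how a Linux file system identifies the stored data that a file name points to (the term is not used above; it corresponds to the target of the pointer in the figures). This is a sketch with a hypothetical file book.txt in a temporary directory, using ls -i to print the inode number in front of the name.

```shell
demo=$(mktemp -d)
mkdir "$demo/house1" "$demo/house2"
cd "$demo"
echo "some text" > house1/book.txt

# Inode number of the file before anything is done to it.
before=$(ls -i house1/book.txt | awk '{print $1}')

# Moving within one storage device only changes the name: same inode.
mv house1/book.txt house2/book.txt
after_mv=$(ls -i house2/book.txt | awk '{print $1}')

# Copying writes the data again: the copy gets an inode of its own.
cp house2/book.txt house1/book.txt
copy=$(ls -i house1/book.txt | awk '{print $1}')

echo "original: $before, after mv: $after_mv, copy: $copy"
```

The first two numbers are the same and the third one differs, whatever the actual values are on a given system.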

Note that the behaviour is different when the directories are located on different storage devices (or on different partitions, to be precise), such as on the computer’s main storage device and a USB drive. Then, the data is first copied to the new location and then deleted from the old location (shown here with pale colours and dotted lines):

Result of the move command between two storage devices is identical to copying the file and then removing the original one.

Often, it is practical to have the same data available in different places and this can be achieved without making multiple copies of it: one can make links to files and directories elsewhere in the same storage device. These links can be either hard or soft, also known as symbolic. A hard link creates a new pointer to the same location on the storage device:

> rm house2/office/bookshelf/Shakespeare_Macbeth.txt
> ln house1/office/bookshelf/Shakespeare_Macbeth.txt house2/office/bookshelf/Shakespeare_Macbeth.txt

Result of the hard link command where the same memory location in the storage device is pointed to by two pointers.

> tree house[12]/office/
house1/office/
└── bookshelf
    ├── Shakespeare_Hamlet.txt
    └── Shakespeare_Macbeth.txt
house2/office/
└── bookshelf
    └── Shakespeare_Macbeth.txt
> ls -l house[12]/office/bookshelf
house1/office/bookshelf:
total 332
-rw-rw---- 1 username pepr_username 206763 Mar  6 15:48 Shakespeare_Hamlet.txt
-rw-rw---- 2 username pepr_username 130397 Mar  6 15:59 Shakespeare_Macbeth.txt

house2/office/bookshelf:
total 128
-rw-rw---- 2 username pepr_username 130397 Mar  6 15:59 Shakespeare_Macbeth.txt

It would seem natural to prefer this option of hard-linking but it has some downsides: typically we want to be able to remove files (and directories) and free the space in the storage device for new data. In this situation, removing the Macbeth book in house1 wouldn’t free the space on the storage device because the Macbeth book in house2 still points to that data. The space would only be freed when both pointers are removed.

How do we know how many hard links a particular piece of data has? Those with sharp eyes may have spotted that the ls -l command above prints -rw-rw---- 1 for Hamlet but -rw-rw---- 2 for Macbeth. The numbers 1 and 2 tell the count of pointers to that particular data.
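
The link count can be followed step by step in a sandbox. In this sketch the file names are hypothetical, the work happens in a temporary directory, and awk is used only to pick the second column (the link count) from the output of ls -l.

```shell
demo=$(mktemp -d)
cd "$demo"
echo "Macbeth" > original.txt

# A new file has exactly one name pointing to its data.
links_before=$(ls -l original.txt | awk '{print $2}')

# A hard link adds a second name for the same data.
ln original.txt second_name.txt
links_after=$(ls -l original.txt | awk '{print $2}')
echo "links before: $links_before, after: $links_after"   # prints: links before: 1, after: 2

# Removing one name leaves the data reachable through the other.
rm original.txt
cat second_name.txt                                       # prints: Macbeth
links_final=$(ls -l second_name.txt | awk '{print $2}')
```

Only when the last remaining name is removed does the count drop to zero and the space become free.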

Instead of hard links, it is often practical to create symbolic links. Creating them is often easiest from the place where one wants to see the linked file; one can then give the target of the link using the relative path, that is, with .. for each step backwards in the path (here, omitting the link name and thus using the target name):

> rm house2/office/bookshelf/Shakespeare_Macbeth.txt
> cd house2/office/bookshelf/
> ln -s ../../../house1/office/bookshelf/Shakespeare_Macbeth.txt 
> ls 
Shakespeare_Macbeth.txt

Result of the symbolic link command where the new file points to original file name.

> cd ../../../
> tree house[12]/office/
house1/office/
└── bookshelf
    ├── Shakespeare_Hamlet.txt
    └── Shakespeare_Macbeth.txt
house2/office/
└── bookshelf
    └── Shakespeare_Macbeth.txt -> ../../../house1/office/bookshelf/Shakespeare_Macbeth.txt

Alternatively, symbolic links can be created using the absolute path (here, renaming the link as Another_Macbeth.txt):

> cd house2/office/bookshelf/
> ln -s /users/$USER/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt Another_Macbeth.txt
> ls
Another_Macbeth.txt  Shakespeare_Macbeth.txt

They can also be created outside the target directory:

> cd ../../../
> ln -s /users/$USER/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt house2/office/bookshelf/Third_Macbeth.txt

Symbolic links are clearly indicated in the output of ls -l command, but otherwise, they work (nearly) as any regular files:

> cd /users/$USER/IntSciCom/village
> ls -l house[12]/office/bookshelf/
house1/office/bookshelf/:
total 332
-rw-rw---- 1 username pepr_username 206763 Mar  6 15:48 Shakespeare_Hamlet.txt
-rw-rw---- 1 username pepr_username 130397 Mar  6 15:59 Shakespeare_Macbeth.txt

house2/office/bookshelf/:
total 8
lrwxrwxrwx 1 username pepr_username 90 Mar  6 16:12 Another_Macbeth.txt -> /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt
lrwxrwxrwx 1 username pepr_username 56 Mar  6 16:03 Shakespeare_Macbeth.txt -> ../../../house1/office/bookshelf/Shakespeare_Macbeth.txt
lrwxrwxrwx 1 username pepr_username 90 Mar  6 16:14 Third_Macbeth.txt -> /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt
> wc -l house[12]/office/bookshelf/*
  7079 house1/office/bookshelf/Shakespeare_Hamlet.txt
  4544 house1/office/bookshelf/Shakespeare_Macbeth.txt
  4544 house2/office/bookshelf/Another_Macbeth.txt
  4544 house2/office/bookshelf/Shakespeare_Macbeth.txt
  4544 house2/office/bookshelf/Third_Macbeth.txt
 25255 total

So, why would one want to use symbolic links instead of hard links?

One good reason is that symbolic links break when the target file is removed. This is useful if one, e.g., has links to a particular dataset in different places and then needs to update this dataset with a new one (e.g., because of errors in the previous version or having additional observations in the new version). When this is done with symbolic links, the removal of the original file is noticed and the erroneous reference to an outdated data file cannot create errors in the analysis in other directories.
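
The breaking of a symbolic link, and the contrasting behaviour of a hard link, can be demonstrated with a made-up dataset file. This is a sketch in a temporary directory; the tests -L (is a symbolic link) and -e (the target exists) are standard shell tests that are not introduced above.

```shell
demo=$(mktemp -d)
cd "$demo"
echo "version 1" > dataset.csv

# One symbolic link and one hard link to the same dataset.
ln -s dataset.csv link_to_data.csv
ln dataset.csv hard_to_data.csv

# Remove the original name, e.g. before putting a corrected version in place.
rm dataset.csv

# The symbolic link still exists as a link but its target is gone.
if [ -L link_to_data.csv ] && [ ! -e link_to_data.csv ]; then
    sym_state="broken"
else
    sym_state="working"
fi

# The hard link silently keeps the old data alive.
hard_content=$(cat hard_to_data.csv)
echo "symbolic link: $sym_state; hard link still reads: $hard_content"
```

With the hard link, an analysis elsewhere would keep reading the outdated data without any warning, whereas the broken symbolic link produces an error as soon as someone tries to use it.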