Nelle Nemo, a marine biologist, has just returned from a six-month survey of the North Pacific Gyre, where she has been sampling gelatinous marine life in the Great Pacific Garbage Patch. She has 1520 samples that she’s run through an assay machine to measure the relative abundance of 300 proteins. She needs to run these 1520 files through an imaginary program called `goostats` she inherited. On top of this huge task, she has to write up results by the end of the month so her paper can appear in a special issue of Aquatic Goo Letters.
The bad news is that if she has to run `goostats` by hand using a GUI, she’ll have to select and open a file 1520 times. If `goostats` takes 30 seconds to run each file, the whole process will take more than 12 hours of Nelle’s attention. With the shell, Nelle can instead assign her computer this mundane task while she focuses her attention on writing her paper.
The next few lessons will explore the ways Nelle can achieve this. More specifically, they explain how she can use a command shell to run the `goostats` program, using loops to automate the repetitive steps of entering file names, so that her computer can work while she writes her paper.
What does the command `ls` do when used with the `-l` option?
What about if you use both the `-l` and the `-h` option?
Solution
`-l` - long listing format, showing not only the file/directory names but also additional information such as the file size and the time of its last modification. Some of its output is about properties that we do not cover in this lesson (such as file permissions and ownership), but the rest should be useful nevertheless.
`-h` + `-l` - makes the file size ‘human readable’, for example `5.3K` instead of `5369`.
By default `ls` lists the contents of a directory in alphabetical order by name. The command `ls -t` lists items by time of last change instead of alphabetically. The command `ls -r` lists the contents of a directory in reverse order. What happens when you combine the `-t` and `-r` flags? Hint: You may need to use the `-l` flag to see the last changed dates.
Solution
`-t` - most recently changed file first.
`-rt` - most recently changed file last.
This can be very useful for finding your most recent edits or checking to see if a new output file was written.
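The ordering described above can be checked with a small sketch. It assumes a scratch directory and two hypothetical files, `old.txt` and `new.txt`, whose modification times are set explicitly with `touch -t` so the result does not depend on when the example is run.

```shell
# Hypothetical scratch directory so the example does not touch real files
workdir=$(mktemp -d)
cd "$workdir"

# touch -t sets the modification time (format: CCYYMMDDhhmm)
touch -t 202001010000 old.txt
touch -t 202101010000 new.txt

newest_first=$(ls -t)    # most recently changed file first
newest_last=$(ls -rt)    # same ordering, reversed

echo "ls -t:"
echo "$newest_first"
echo "ls -rt:"
echo "$newest_last"
```

With `-rt` the newest file lands on the last line, which is the line closest to the prompt in a long listing.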
Starting from `/Users/amanda/data`, which of the following commands could Amanda use to navigate to her home directory, which is `/Users/amanda`?
1. `cd .`
2. `cd /`
3. `cd /home/amanda`
4. `cd ../..`
5. `cd ~`
6. `cd home`
7. `cd ~/data/..`
8. `cd`
9. `cd ..`
Solution
1. No: `.` stands for the current directory.
2. No: `/` stands for the root directory.
3. No: Amanda's home directory is `/Users/amanda`.
4. No: this goes up two levels, i.e. ends in `/Users`.
5. Yes: `~` stands for the user's home directory, in this case `/Users/amanda`.
6. No: this would navigate into a directory `home` in the current directory if it exists.
7. Yes: unnecessarily complicated, but correct.
8. Yes: shortcut to go back to the user's home directory.
9. Yes: goes up one level.
If `pwd` displays `/Users/thing`, what will `ls -F ../backup` display?
1. `../backup: No such file or directory`
2. `2012-12-01 2013-01-08 2013-01-27`
3. `2012-12-01/ 2013-01-08/ 2013-01-27/`
4. `original/ pnas_final/ pnas_sub/`
Solution
1. No: there *is* a directory `backup` in `/Users`.
2. No: this is the content of `/Users/thing/backup`, but with `..` we asked for one level further up.
3. No: see previous explanation.
4. Yes: `../backup/` refers to `/Users/backup/`.
If `pwd` displays `/Users/backup`, and `-r` tells `ls` to display things in reverse order, what command(s) will result in the following output:
pnas_sub/ pnas_final/ original/
1. `ls pwd`
2. `ls -r -F`
3. `ls -r -F /Users/backup`
Solution
1. No: `pwd` is not the name of a directory.
2. Yes: `ls` without directory argument lists files and directories in the current directory.
3. Yes: uses the absolute path explicitly.
Jamie realizes that she put the files `sucrose.dat` and `maltose.dat` into the wrong folder. The files should have been placed in the `raw` folder. She runs these commands to explore the file system.
$ ls -F
analyzed/ raw/
$ ls -F analyzed
fructose.dat glucose.dat maltose.dat sucrose.dat
$ cd analyzed
Fill in the blanks to move these files to the `raw/` folder to correct her mistake:
$ mv sucrose.dat maltose.dat ____/____
Solution
$ mv sucrose.dat maltose.dat ../raw
Suppose you created a text file called `statstics.txt`. After creating and saving this file you realize you misspelled the filename! You want to correct the mistake. Which of the following commands could you use to do so?
1. `cp statstics.txt statistics.txt`
2. `mv statstics.txt statistics.txt`
3. `mv statstics.txt .`
4. `cp statstics.txt .`
Solution
1. No. While this would create a file with the correct name, the incorrectly named file still exists in the directory and would need to be deleted.
2. Yes.
3. No, the period (`.`) indicates where to move the file, but does not provide a new file name; identical file names cannot be created.
4. No, the period (`.`) indicates where to copy the file, but does not provide a new file name; identical file names cannot be created.
What is the output of the closing `ls` command in the sequence shown below?
$ pwd
/Users/jamie/data
$ ls
proteins.dat
$ mkdir recombined
$ mv proteins.dat recombined/
$ cp recombined/proteins.dat ../proteins-saved.dat
$ ls
1. `proteins-saved.dat recombined`
2. `recombined`
3. `proteins.dat recombined`
4. `proteins-saved.dat`
Solution
2. Starting in the `/Users/jamie/data` directory:
- `mkdir recombined` creates a new folder.
- `mv proteins.dat recombined/` moves `proteins.dat` to the new folder.
- `cp recombined/proteins.dat ../proteins-saved.dat` copies this file to the parent directory of our current location, `/Users/jamie`.
The copy therefore does not appear in `/Users/jamie/data`, and the closing `ls` shows only `recombined`.
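The sequence can be replayed safely in a throwaway location. The sketch below assumes a hypothetical scratch directory standing in for `/Users/jamie`, with an empty `proteins.dat` created by `touch`.

```shell
# Hypothetical stand-in for /Users/jamie/data inside a scratch directory
workdir=$(mktemp -d)
mkdir -p "$workdir/jamie/data"
cd "$workdir/jamie/data"
touch proteins.dat

mkdir recombined
mv proteins.dat recombined/
cp recombined/proteins.dat ../proteins-saved.dat   # .. is relative to the current directory

here=$(ls)       # contents of the data directory
parent=$(ls ..)  # contents of its parent

echo "data directory:"
echo "$here"
echo "parent directory:"
echo "$parent"
```

Note that `..` is resolved relative to the current working directory, not relative to the location of the file being copied.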
(Examples from the `data-shell/molecules` directory)
`*` matches zero or more characters. `*.pdb` matches `ethane.pdb`, `propane.pdb`, and every file that ends with `.pdb`. `p*.pdb` only matches `pentane.pdb` and `propane.pdb`.
`?` matches exactly one character. `?ethane.pdb` would match `methane.pdb`. `*ethane.pdb` matches both `ethane.pdb` and `methane.pdb`. `???ane.pdb` matches three characters followed by `ane.pdb`, giving `cubane.pdb ethane.pdb octane.pdb`.
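These patterns can be tried without the lesson data. The sketch below assumes a hypothetical scratch directory holding six empty files with the same names as those in `data-shell/molecules`, and uses `echo` to show what the shell expands each pattern to.

```shell
# Hypothetical scratch directory that mimics the file names in data-shell/molecules
workdir=$(mktemp -d)
cd "$workdir"
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

# echo prints whatever the shell expands each pattern to
star_match=$(echo p*.pdb)
one_char_match=$(echo ?ethane.pdb)
three_char_match=$(echo ???ane.pdb)

echo "p*.pdb      -> $star_match"
echo "?ethane.pdb -> $one_char_match"
echo "???ane.pdb  -> $three_char_match"
```

Using `echo` rather than `ls` makes it clear that the expansion is done by the shell before any command runs.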
In the `molecules` directory, which `ls` command(s) will produce this output?
ethane.pdb methane.pdb
1. `ls *t*ane.pdb`
2. `ls *t?ne.*`
3. `ls *t??ne.pdb`
4. `ls ethane.*`
Solution
3.
Jamie is working on a project and she sees that her files aren't very well organized:
$ ls -F
analyzed/ fructose.dat raw/ sucrose.dat
The `fructose.dat` and `sucrose.dat` files contain output from her data analysis. How could you use wildcards with the `mv` command to move both files to the `analyzed` directory at the same time?
Solution
mv *.dat analyzed
If we run `sort` on a file containing the following lines:
10
2
19
22
6
the output is:
10
19
2
22
6
If we run `sort -n` on the same input, we get this instead:
2
6
10
19
22
Why?
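Both orderings can be reproduced with the sketch below, which feeds the same five values to `sort` with and without `-n`. Without `-n`, `sort` compares lines as text, character by character, so `10` comes before `2` because `1` comes before `2`. With `-n`, lines are compared by numeric value. The variable names are illustrative only.

```shell
# The five sample values from the exercise, one per line
numbers=$(printf '10\n2\n19\n22\n6\n')

# Default sort compares the lines as text, character by character
text_sorted=$(echo "$numbers" | sort | paste -s -d ' ' -)
# -n compares the lines by numeric value
numeric_sorted=$(echo "$numbers" | sort -n | paste -s -d ' ' -)

echo "sort:    $text_sorted"
echo "sort -n: $numeric_sorted"
```

Here `paste -s` only joins the sorted lines onto one line for display; it plays no part in the sorting.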
The `head` command prints lines from the start of a file and the `tail` command prints lines from the end of a file instead.
If we were to run these two commands:
$ head -n 3 animals.txt > animals-subset.txt
$ tail -n 2 animals.txt >> animals-subset.txt
what would `animals-subset.txt` contain?
1. The first three lines of `animals.txt`
2. The last two lines of `animals.txt`
3. The first three lines and the last two lines of `animals.txt`
4. The second and third lines of `animals.txt`
Solution
3.
In our current directory, we want to find the 3 files which have the least number of lines. Which command would work?
1. `wc -l * > sort -n > head -n 3`
2. `wc -l * | sort -n | head -n 1-3`
3. `wc -l * | head -n 3 | sort -n`
4. `wc -l * | sort -n | head -n 3`
Solution
4. The pipe character `|` is used to connect the output from one command to the input of another. `>` is used to redirect standard output to a file.
A file called `animals.txt` looks like this:
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
If we run this command, what lines will end up in `final.txt`?
$ cat animals.txt | head -n 5 | tail -n 3 | sort -r > final.txt
Solution
2012-11-06,rabbit
2012-11-06,deer
2012-11-05,raccoon
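One way to see why these three lines result is to build the pipeline up one stage at a time. The sketch below assumes a hypothetical scratch directory and recreates `animals.txt` with the eight lines shown in the exercise.

```shell
# Hypothetical scratch directory with a copy of the sample data
workdir=$(mktemp -d)
cd "$workdir"
cat > animals.txt <<'EOF'
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
EOF

echo "--- head -n 5 ---"
head -n 5 animals.txt                 # keep the first five lines
echo "--- then tail -n 3 ---"
head -n 5 animals.txt | tail -n 3     # keep the last three of those five
echo "--- then sort -r ---"
head -n 5 animals.txt | tail -n 3 | sort -r > final.txt   # reverse sort into the file
cat final.txt
```

The `cat animals.txt |` at the start of the original command is replaced here by passing the file name to `head` directly, which gives the same result.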
The general form of a loop:
for thing in list_of_things
do
    operation_using $thing    # Indentation within the loop is not required, but aids legibility
done
This exercise refers to the `data-shell/molecules` directory. `ls` gives the following output:
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
What is the output of the following code?
$ for datafile in *.pdb
> do
> ls *.pdb
> done
Now, what is the output of the following code?
$ for datafile in *.pdb
> do
> ls $datafile
> done
Why do these two loops give different outputs?
What would be the output of running the following loop in the `data-shell/molecules` directory?
$ for filename in c*
> do
> ls $filename
> done
1. No files are listed.
2. All files are listed.
3. Only `cubane.pdb`, `octane.pdb` and `pentane.pdb` are listed.
4. Only `cubane.pdb` is listed.
Solution
4 is the correct answer. `*` matches zero or more characters, so any file name starting with the letter c, followed by zero or more other characters will be matched.
How would the output differ from using this command instead?
$ for filename in *c*
> do
> ls $filename
> done
1. The same files would be listed.
2. All the files are listed this time.
3. No files are listed this time.
4. The files `cubane.pdb` and `octane.pdb` will be listed.
5. Only the file `octane.pdb` will be listed.
Solution
4 is the correct answer. `*` matches zero or more characters, so a file name with zero or more characters before a letter c and zero or more characters after the letter c will be matched.
In the `data-shell/molecules` directory, what is the effect of this loop?
for alkanes in *.pdb
do
echo $alkanes
cat $alkanes > alkanes.pdb
done
1. Prints `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, `pentane.pdb` and `propane.pdb`, and the text from `propane.pdb` will be saved to a file called `alkanes.pdb`.
2. Prints `cubane.pdb`, `ethane.pdb`, and `methane.pdb`, and the text from all three files would be concatenated and saved to a file called `alkanes.pdb`.
3. Prints `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, and `pentane.pdb`, and the text from `propane.pdb` will be saved to a file called `alkanes.pdb`.
4. None of the above.
Solution
1 is the correct answer. The text from each file in turn gets written to the `alkanes.pdb` file. However, the file gets overwritten on each loop iteration, so the final content of `alkanes.pdb` is the text from the `propane.pdb` file.
Also in the `data-shell/molecules` directory, what would be the output of the following loop?
for datafile in *.pdb
do
cat $datafile >> all.pdb
done
1. All of the text from `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, and `pentane.pdb` would be concatenated and saved to a file called `all.pdb`.
2. The text from `ethane.pdb` will be saved to a file called `all.pdb`.
3. All of the text from `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, `pentane.pdb` and `propane.pdb` would be concatenated and saved to a file called `all.pdb`.
4. All of the text from `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, `pentane.pdb` and `propane.pdb` would be printed to the screen and saved to a file called `all.pdb`.
Solution
3 is the correct answer. `>>` appends to a file, rather than overwriting it with the redirected output from a command. Given the output from the `cat` command has been redirected, nothing is printed to the screen.
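The difference between `>` and `>>` inside a loop can be seen side by side. The sketch below assumes a hypothetical scratch directory with three one-line files, and writes to two illustrative output files, `overwritten.out` and `appended.out`, whose names do not match the `*.txt` pattern being looped over.

```shell
# Hypothetical scratch directory with three one-line input files
workdir=$(mktemp -d)
cd "$workdir"
printf 'one\n' > a.txt
printf 'two\n' > b.txt
printf 'three\n' > c.txt

for datafile in *.txt
do
    cat "$datafile" > overwritten.out    # > replaces the file on every iteration
    cat "$datafile" >> appended.out      # >> adds to the end of the file
done

echo "overwritten.out:"
cat overwritten.out
echo "appended.out:"
cat appended.out
```

Only the last file processed survives in `overwritten.out`, while `appended.out` accumulates all three.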
A loop is a way to do many things at once, or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to `echo` the commands it would run instead of actually running them.
Suppose we want to preview the commands the following loop will execute without actually running those commands:
$ for datafile in *.pdb
> do
> cat $datafile >> all.pdb
> done
What is the difference between the two loops below, and which one would we want to run?
# Version 1
$ for datafile in *.pdb
> do
> echo cat $datafile >> all.pdb
> done
# Version 2
$ for datafile in *.pdb
> do
> echo "cat $datafile >> all.pdb"
> done
Solution
The second version is the one we want to run. This prints to screen everything enclosed in the quote marks, expanding the loop variable name because we have prefixed it with a dollar sign.
The first version appends the output from the command `echo cat $datafile` to the file `all.pdb`. This file will just contain the list: `cat cubane.pdb`, `cat ethane.pdb`, `cat methane.pdb`, etc.
Try both versions for yourself to see the output! Be sure to open the `all.pdb` file to view its contents.
Suppose we want to set up a directory structure to organize some experiments measuring reaction rate constants with different compounds and different temperatures. What would be the result of the following code:
$ for species in cubane ethane methane
> do
> for temperature in 25 30 37 40
> do
> mkdir $species-$temperature
> done
> done
Solution
We have a nested loop, i.e. a loop contained within another loop, so for each species in the outer loop, the inner loop (the nested loop) iterates over the list of temperatures, and creates a new directory for each combination.
Try running the code for yourself to see which directories are created!
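To try the nested loop without cluttering a real project, it can be run in a throwaway location. The sketch below assumes a hypothetical scratch directory created with `mktemp -d`; the loop itself is the one from the exercise, with the directory name quoted.

```shell
# Hypothetical scratch directory so the new directories are easy to discard
workdir=$(mktemp -d)
cd "$workdir"

for species in cubane ethane methane
do
    for temperature in 25 30 37 40
    do
        mkdir "$species-$temperature"    # one directory per species/temperature pair
    done
done

created=$(ls | wc -l)
ls
echo "directories created: $created"
```

Three species and four temperatures give twelve combinations, so twelve directories are created.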
Leah has several hundred data files, each of which is formatted like this:
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
An example of this type of file is given in `data-shell/data/animal-counts/animals.txt`.
We can use the command `cut -d , -f 2 animals.txt | sort | uniq` to produce the unique species in `animals.txt`. In order to avoid having to type out this series of commands every time, a scientist may choose to write a shell script instead.
Write a shell script called `species.sh` that takes any number of filenames as command-line arguments, and uses a variation of the above command to print a list of the unique species appearing in each of those files separately.
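One possible shape for such a script is sketched below; it is an illustration rather than the only valid answer. It assumes a hypothetical scratch directory, recreates the sample data shown above as `animals.txt`, writes `species.sh` with a here-document, and runs it on that one file.

```shell
# Hypothetical scratch directory with a copy of the sample data
workdir=$(mktemp -d)
cd "$workdir"
cat > animals.txt <<'EOF'
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
EOF

# One possible species.sh: loop over every file name given on the command line
cat > species.sh <<'EOF'
# Print the unique species (second comma-separated field) in each file given as an argument
for file in "$@"
do
    echo "Unique species in $file:"
    cut -d , -f 2 "$file" | sort | uniq
done
EOF

species_output=$(bash species.sh animals.txt)
echo "$species_output"
```

Quoting `"$@"` and `"$file"` keeps file names that contain spaces intact; the heading line printed per file is a design choice, not a requirement of the exercise.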
$ history | tail -n 5 > recent.sh
If you run the above command, the last command in the file is the `history` command itself, i.e., the shell has added `history` to the command log before actually running it. In fact, the shell always adds commands to the log before running them. Why do you think it does this?
Solution
If a command causes something to crash or hang, it might be useful to know what that command was, in order to investigate the problem. If the command were only recorded after running it, we would not have a record of the last command run in the event of a crash.
In the `molecules` directory, imagine you have a shell script called `script.sh` containing the following commands:
head -n $2 $1
tail -n $3 $1
While you are in the `molecules` directory, you type the following command:
bash script.sh '*.pdb' 1 1
Which of the following outputs would you expect to see?
1. All of the lines between the first and the last lines of each file ending in `.pdb` in the `molecules` directory
2. The first and the last line of each file ending in `.pdb` in the `molecules` directory
3. The first and the last line of each file in the `molecules` directory
4. An error because of the quotes around `*.pdb`
Solution
The correct answer is 2.
The special variables `$1`, `$2` and `$3` represent the command line arguments given to the script, such that the commands run are `head -n 1 *.pdb` and `tail -n 1 *.pdb`.
The shell does not expand `'*.pdb'` because it is enclosed by quote marks. As such, the first argument to the script is the literal text `*.pdb`, which gets expanded within the script when the unquoted `$1` is used in the `head` and `tail` commands.
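The effect of the quotes can be observed by counting the arguments a script receives. The sketch below assumes a hypothetical scratch directory with two `.pdb` files and a hypothetical helper script, `show-args.sh`, that is not part of the lesson data and only reports `$#`, the number of arguments.

```shell
# Hypothetical scratch directory with two .pdb files
workdir=$(mktemp -d)
cd "$workdir"
touch a.pdb b.pdb

# show-args.sh is a hypothetical helper that reports how many arguments it received
cat > show-args.sh <<'EOF'
echo "number of arguments: $#"
EOF

quoted=$(bash show-args.sh '*.pdb')   # the pattern arrives as one literal argument
unquoted=$(bash show-args.sh *.pdb)   # the shell expands the pattern before the script starts

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

In the quoted case the script sees a single argument, so any expansion has to happen later, inside the script.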
Write a shell script called `longest.sh` that takes the name of a directory and a filename extension as its arguments, and prints out the name of the file with the most lines in that directory with that extension. When the script is run as below, it should print the name of the `.pdb` file in `/tmp/data` that has the most lines.
$ bash longest.sh /tmp/data pdb
Solution
The first part of the pipeline, `wc -l $1/*.$2 | sort -n`, counts the lines in each file and sorts them numerically (largest last). When there’s more than one file, `wc` also outputs a final summary line, giving the total number of lines across all files. We use `tail -n 2 | head -n 1` to throw away this last line.
With `wc -l $1/*.$2 | sort -n | tail -n 1` we’ll see the final summary line: we can build our pipeline up in pieces to be sure we understand the output.
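The pipeline described above can be assembled into a script along the following lines. This is a sketch under stated assumptions: it uses a hypothetical scratch directory with three small `.pdb` files of different lengths instead of `/tmp/data`, and the script prints the line count alongside the file name, as `wc -l` reports both.

```shell
# Hypothetical scratch directory with three files of different lengths
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
printf 'a\n' > data/short.pdb
printf 'a\nb\n' > data/medium.pdb
printf 'a\nb\nc\n' > data/long.pdb

# One possible longest.sh, built from the pipeline pieces discussed in the text
cat > longest.sh <<'EOF'
# Usage: bash longest.sh directory extension
# Prints the line count and name of the file with the most lines
wc -l "$1"/*."$2" | sort -n | tail -n 2 | head -n 1
EOF

longest=$(bash longest.sh data pdb)
echo "$longest"
```

If only one file matches, `wc` prints no summary line, and `tail -n 2 | head -n 1` still returns that single file.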
For this question, consider the `data-shell/molecules` directory once again. This contains a number of `.pdb` files in addition to any other files you may have created. Explain what each of the following three scripts would do when run as `bash script1.sh *.pdb`, `bash script2.sh *.pdb`, and `bash script3.sh *.pdb` respectively.
# Script 1
echo *.*
# Script 2
for filename in $1 $2 $3
do
cat $filename
done
# Script 3
echo $@.pdb
Solution
In each case, the shell expands the wildcard in `*.pdb` before passing the resulting list of file names as arguments to the script.
Script 1 would print out a list of all files containing a dot in their name. The arguments passed to the script are not actually used anywhere in the script.
Script 2 would print the contents of the first 3 files with a `.pdb` file extension. `$1`, `$2`, and `$3` refer to the first, second, and third argument respectively.
Script 3 would print all the arguments to the script (i.e. all the `.pdb` files), followed by `.pdb`. `$@` refers to all the arguments given to a shell script.
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb.pdb
Suppose you have saved the following script in a file called `do-errors.sh` in Nelle’s `north-pacific-gyre/2012-07-03` directory:
# Calculate stats for data files.
for datafile in "$@"
do
echo $datfile
bash goostats $datafile stats-$datafile
done
When you run it:
$ bash do-errors.sh NENE*[AB].txt
the output is blank. To figure out why, re-run the script using the `-x` option:
bash -x do-errors.sh NENE*[AB].txt
What is the output showing you? Which line is responsible for the error?
Solution
The `-x` option causes `bash` to run in debug mode. This prints out each command as it is run, which will help you to locate errors. In this example, we can see that `echo` isn’t printing anything. We have made a typo in the loop variable name, and the variable `datfile` doesn’t exist, so it expands to an empty string.
Which command would result in the following output:
and the presence of absence:
1. `grep "of" haiku.txt`
2. `grep -E "of" haiku.txt`
3. `grep -w "of" haiku.txt`
4. `grep -i "of" haiku.txt`
Solution
The correct answer is 3, because the `-w` option looks only for whole-word matches. The other options will also match ‘of’ when part of another word.
Leah has several hundred data files saved in one directory, each of which is formatted like this:
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
She wants to write a shell script that takes a species as the first command-line argument and a directory as the second argument. The script should return one file called `species.txt` containing a list of dates and the number of that species seen on each date. For example, using the data shown above, `rabbit.txt` would contain:
2013-11-05,22
2013-11-06,19
Put these commands and pipes in the right order to achieve this:
cut -d : -f 2
>
|
grep -w $1 -r $2
|
$1.txt
cut -d , -f 1,3
Hint: use `man grep` to look for how to grep text recursively in a directory and `man cut` to select more than one field in a line.
An example of such a file is provided in `data-shell/data/animal-counts/animals.txt`.
Solution
grep -w $1 -r $2 | cut -d : -f 2 | cut -d , -f 1,3 > $1.txt
You would call the script above like this:
$ bash count-species.sh bear .
You and your friend, having just finished reading Little Women by Louisa May Alcott, are in an argument. Of the four sisters in the book, Jo, Meg, Beth, and Amy, your friend thinks that Jo was the most mentioned. You, however, are certain it was Amy. Luckily, you have a file `LittleWomen.txt` containing the full text of the novel (`data-shell/writing/data/LittleWomen.txt`). Using a `for` loop, how would you tabulate the number of times each of the four sisters is mentioned?
Hint: one solution might employ the commands `grep` and `wc` and a `|`, while another might utilize `grep` options. There is often more than one way to solve a programming task, so a particular solution is usually chosen based on a combination of yielding the correct result, elegance, readability, and speed.
Solution
A solution that relies on `grep -c` is inferior because `grep -c` only reports the number of lines matched. The total number of matches reported by this method will be lower if there is more than one match per line.
Perceptive observers may have noticed that character names sometimes appear in all-uppercase in chapter titles (e.g. ‘MEG GOES TO VANITY FAIR’). If you wanted to count these as well, you could add the `-i` option for case-insensitivity (though in this case, it doesn’t affect the answer to which sister is mentioned most frequently).
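The two counting approaches can be compared on a small input. The sketch below assumes a hypothetical scratch directory and a three-line file, `sample.txt`, standing in for `LittleWomen.txt`; it also assumes a `grep` that supports `-o` (print each match on its own line), which GNU and BSD `grep` both provide.

```shell
# Hypothetical scratch directory with a tiny stand-in for LittleWomen.txt
workdir=$(mktemp -d)
cd "$workdir"
cat > sample.txt <<'EOF'
Jo and Amy talked while Jo wrote.
Meg called Beth, and Beth answered.
Amy laughed.
EOF

for sis in Jo Meg Beth Amy
do
    matches=$(grep -ow "$sis" sample.txt | wc -l)   # every whole-word occurrence
    lines=$(grep -cw "$sis" sample.txt)             # lines containing at least one match
    echo "$sis: $matches matches on $lines lines"
done

# Keep Jo's two counts so the difference is easy to inspect afterwards
jo_matches=$(grep -ow Jo sample.txt | wc -l)
jo_lines=$(grep -cw Jo sample.txt)
```

Because Jo appears twice on the first line, the per-line count undercounts her mentions, which is the weakness of `grep -c` noted above.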
![find-file-tree](fig/find-file-tree.svg)
The `-v` option to `grep` inverts pattern matching, so that only lines which do not match the pattern are printed. Given that, which of the following commands will find all files in `/data` whose names end in `s.txt` but whose names also do not contain the string `net`? (For example, `animals.txt` or `amino-acids.txt` but not `planets.txt`.) Once you have thought about your answer, you can test the commands in the `data-shell` directory.
1. `find data -name "*s.txt" | grep -v net`
2. `find data -name *s.txt | grep -v net`
3. `grep -v "net" $(find data -name "*s.txt")`
4. None of the above.
Solution
The correct answer is 1. Putting the match expression in quotes prevents the shell expanding it, so it gets passed to the `find` command.
Option 2 is incorrect because the shell expands `*s.txt` instead of passing the wildcard expression to `find`.
Option 3 is incorrect because it searches the contents of the files for lines which do not match ‘net’, rather than searching the file names.
Write a short explanatory comment for the following shell script:
wc -l $(find . -name "*.dat") | sort -n
Solution
1. Find all files with a `.dat` extension recursively from the current directory
2. Count the number of lines each of these files contains
3. Sort the output from step 2 numerically
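The three steps above can be watched in action on a small directory tree. The sketch below assumes a hypothetical scratch directory containing two `.dat` files of different lengths, one of them in a subdirectory; the file and directory names are illustrative only.

```shell
# Hypothetical scratch directory with .dat files at two levels
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p results/run1
printf '1\n2\n3\n' > results/run1/big.dat
printf '1\n' > small.dat

# $(...) is replaced by the file names that find prints, so wc receives them as arguments
wc -l $(find . -name "*.dat") | sort -n

# The file with the fewest lines is the first line of the sorted output
smallest=$(wc -l $(find . -name "*.dat") | sort -n | head -n 1)
echo "fewest lines: $smallest"
```

Note that this form relies on word splitting of the `find` output, so it is a sketch that assumes file names without spaces.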