First steps towards automation¶
One of the first steps towards automating repetitive tasks is to become familiar with the command line. In this chapter we will accomplish this by using the command line to extract UniProt identifiers for all human proteins in Swiss-Prot.
More precisely we will be using a terminal emulator to run Bash. A terminal emulator is an application that gives you access to another program known as the shell. The shell allows you to interact with the operating system’s services and programs. The most widely used shell and the default on Mac and most Linux based distributions is Bash.
You will get the most benefit from this chapter if you work through the example as you go along. You are therefore encouraged to open up a terminal now (Fig. 2)!
On Macs the command line is available through the Terminal application. There are a number of terminal emulators available for Linux. If you are using the Gnome-based desktop the default is likely to be the Gnome Terminal.
On Windows things are a little bit more complicated. Windows does not come bundled with Bash by default. I would recommend installing VirtualBox and running a Linux distribution such as BioLinux in it. VirtualBox is a so called hypervisor that lets you run a virtual machine. In this scenario you would run BioLinux as a virtual machine. For more information on how to run BioLinux as a virtual machine see the BioLinux installation notes.
The benefit of learning how to use Linux/Mac and Bash is that most bioinformatics software are developed on and designed to run on these platforms. Furthermore, the terminal and Bash provides an excellent way of creating analysis pipelines.
First things first, how to find help¶
Most of the commands on the command line have built in help that can be accessed by
providing either the argument
--help. For example to access help for
gunzip command, which we will use later, enter the command below into the shell
and press enter:
$ gunzip --help
More descriptive documentation can usually be found using the
command. For example to view the manual page of the
ls command you can run
the command below.
$ man ls
To get out of the “man-page” press the “q” key. Just like I encourage you to try out the examples outlined in this chapter, I also encourage you to examine the help and man-page documentation for the commands that we use to get a better understanding of what the commands do.
Creating a new directory for our project¶
First of all let us make sure that we are working in our home directory.
cd command, short for change directory, is used to move between
directories. If called without a path to a directory it will move you into your
We can print out the name of the current working directory using the
command. Furthermore we can list the contents of a directory using the
$ pwd /home/olsson $ ls
Now that we know where we are and what files and directories are present let us
create a new directory for our project. This is achieved using the
command, short for make directory. After having created the directory move
into it using the
$ mkdir first_step_towards_automation $ cd first_step_towards_automation
When using the command line one learns to avoid using white spaces in
file and directory names. This is because white spaces are used to separate
arguments. In the example above we used underscores instead of white spaces.
However, one could just as well have used hyphens. This comes down to personal
preference. It is possible to represent file names with spaces in them on the
command line by using the backlash character (
\) to “escape” the
whitespace, for example
first\ steps\ towards\ automation or by surrounding
the text in quotes
"first steps towards automation".
Downloading the Swiss-Prot knowledge base¶
UniProt (Universal Protein Resource) is a comprehensive resource of protein sequences and annotations. The UniProt Knowledgebase (UniProtKB) consists of Swiss-Prot and TrEMBLE. Both are annotated. However, the procedure in which they are annotated differ. TrEMBLE uses an automatic annotation system, whereas the annotation in SwissProt is manual and includes a review process.
It is time to download the Swiss-Prot knowledge base from UniProt. We will
curl program to do this. The
curl command is a C program that
allows us to stream data from URLs and FTP sites. By default the
program writes the content of the URL to the standard output stream.
To see this in action try running the command:
$ curl www.bbc.com
You should see a whole lot of HTML text appearing in your terminal window.
However, because we are going to download a large file we would like to
write it to disk for future use. Many command line programs allow the user to
specify additional options. In this particular case we can use the
--output option to specify a file name that the output should be
written to. To exemplify this let us download the BBC home page to a file named
$ curl --output bbc.html www.bbc.com
Here we will use a URL shortened using bitly to save on typing. The shortened URL contains a redirect to the relevant Swiss-Prot FASTA file hosted on the UniProt FTP site. To find out where the shortned URL redirects to run the command:
$ curl http://bit.ly/1l6SAKb
To allow the redirection to occur we need to use the
which will redirect the request to the new location.
Let us download the gzipped FASTA file from the UniProt FTP site:
$ curl --location --output uniprot_sprot.fasta.gz http://bit.ly/1l6SAKb
The downloaded file
uniprot_sprot.fasta.gz has been compressed using the
gzip protocol. We can extract it using the
gunzip command. However,
when extracted it more than doubles in size. So we will use the
option to extract the content to the standard output stream whilst leaving the
original file compressed.
Try running the command:
$ gunzip --to-stdout uniprot_sprot.fasta.gz
You should see a lot of FASTA lines printed to your terminal, or more formally the standard output stream.
Options starting with two dashes,
--, are known as long options. Many of
these long options also have abbreviated “short” options. For example, the
-c option of
gunzip is equivalent to the
--to-stdout option. Try
running the command:
$ gunzip -c uniprot_sprot.fasta.gz
From now on the text will use the short
-c option rather than the long
--to-stdout option to save on typing.
Remember that you can use the
-h option to get
information on the meanings of the various options available to you.
Creating a work flow using pipes¶
Now it is time to introduce one of the greatest features of the command line: pipes!
Pipes are a means to redirect the output from one command into another. The character
used to represent a pipe is the vertical bar:
To illustrate the use of pipes we will redirect the output of the previous
gunzip command to the word count program
wc. Try running the command
$ gunzip -c uniprot_sprot.fasta.gz | wc
It should give you three numbers, these are the line, word and character counts. To
only see the line count one could use the
$ gunzip -c uniprot_sprot.fasta.gz | wc -l
Pipes are powerful because they allow a set of simple commands to be combined to perform tasks that are beyond the scope of any of the individual commands. This has led to a central Unix philosophy of having simple programs that do one task well and a rich ecosystem of such programs. The user is then free to combine these programs to create personalised tools to automate repetitive processing tasks.
Another powerful feature of pipes is that the program being piped to gets access to the output stream of data from the program piping data into the pipe as soon as it is available. This means that the processing of data can happen in parallel.
Examining files, without modifying them¶
Unix-based systems make a distinction between programs that are used for examining files, known as pagers, and programs that are used for editing files, known as text editors. The reason for making this distinction is to help prevent accidental changes to files when reading them.
To view the beginning of a file one can use the
head command. Let us examine
the first lines of the
uniprot_sprot.fasta.gz file by pipeing the output of the
gunzip command into
$ gunzip -c uniprot_sprot.fasta.gz | head
You should see something like the output below being written to the terminal window.
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog ... MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD SFRKIYTDLGWKFTPL >sp|Q6GZX3|002L_FRG3G Uncharacterized protein 002L OS=Frog virus ... MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC
The beauty of the
head command is that it allows you to quickly view the
beginning of a file without having to read in the content of the entire file.
The latter can present a real problem if working on “big data” files. In fact,
this is also the beauty of pipes, which allows downstream programs to work on
the stream of data without having to wait for it to be written to or read from
By default the
head command writes out the first ten lines. However, this
can be modified using the
-n option, for example to write out the first 20
$ gunzip -c uniprot_sprot.fasta.gz | head -n 20
Similarly, there is a
tail command for displaying the tail end of a file,
again ten lines by default.
$ gunzip -c uniprot_sprot.fasta.gz | tail
You may have noticed that the workflow above, to view the last ten lines, took a little longer to complete. That is because we needed to decompress the whole file before we could access the last ten lines of it.
To page through an entire file one can use the
$ gunzip -c uniprot_sprot.fasta.gz | less
One can use the “Up” and “Down” arrows to navigate through the file using
less. One can also use the “Space” key to move forward by an entire page,
hence the term pager. To page back one page press the “b” key. When you are
finished examining the file press “q” to quit
Finding FASTA identifier lines corresponding to human proteins¶
Now that we have an idea of what the file looks like it is time to extract the FASTA identifiers that correspond to human proteins.
A powerful command for finding lines of interest in text is the
program, which can be used to search for strings and patterns. Let us use it to
search for the string “Homo”:
$ gunzip -c uniprot_sprot.fasta.gz | grep Homo | less
To make the match more visible we can add the
--color=always option, which
will highlight the matched string as red.
$ gunzip -c uniprot_sprot.fasta.gz | grep --color=always Homo | less
If you scroll through the matches you will notice that we have some false
positives. We can highlight these by performing another
grep command that
finds lines that do not contain the string “sapiens”, using the
--invert-match option or the equivalent
-v short option.
$ gunzip -c uniprot_sprot.fasta.gz | grep Homo | grep -v sapiens
To make the search more specific we can search for the string “OS=Homo sapiens”. To do this we need to surround the search pattern by quotes, which tells the shell that the two parts separated by a white space should be treated as one argument.
$ gunzip -c uniprot_sprot.fasta.gz | grep "OS=Homo sapiens"
To work out how many lines were matched we can pipe the output of
$ gunzip -c uniprot_sprot.fasta.gz | grep "OS=Homo sapiens" | wc -l
Extracting the UniProt identifiers¶
Below are the first three lines identified using the
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens ... >sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=... >sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHA...
Now that we can identify description lines corresponding to human proteins we
want to extract the UniProt identifiers from them. In this instance we will use
cut to chop the line into smaller fragments, based on a
delimiter character, and print out the relevant fragment. The delimiter we are
going to use is the vertical bar (“|”). This has got nothing to do with
pipeing, it is simply the character surrounding the UniProt identifier. By
splitting the line by “|” the UniProt id will be available in the second
The command below makes use of the backslash character (
\) at the end of
the first line. This tells bash that the command continues on the next line.
You can use this syntax in your scripts and in the terminal. Alternatively, you
can simply include the content from both lines below in a single line, omitting
$ gunzip -c uniprot_sprot.fasta.gz | grep "OS=Homo sapiens" \ | cut -d '|' -f 2
In the above the
-d option specifies the delimiter to use to split
the line, in this instance the pipe symbol (
-f 2 option
specifies that we want to extract the second field.
Remember to try out these commands on your computer to see the actual output of the commands.
Using redirection to create an output file¶
By default the output from commands are written to the standard output stream.
Earlier we saw that we could use the pipes to redirect the output to another
command. However, it is also possible to redirect the output to a file, i.e.
save the output to a file. This is achieved using the greater than symbol
>). You can use the idea of an arrow as a mnemonic, the output is going
from the command into the file as indicated by the arrow.
$ gunzip -c uniprot_sprot.fasta.gz | grep "OS=Homo sapiens" \ | cut -d '|' -f 2 > human_uniprot_ids.txt
Now if you run the
ls command you will see the file
human_uniprot_ids.txt in the directory and you can view its contents using
$ less human_uniprot_ids.txt
Well done! You have just extracted the UniProt identifiers for all human proteins in Swiss-Prot. Have a cup of tea and a biscuit.
The remainder of this chapter will go over some more useful commands for working on the command line and reiterate some of the key take home messages.
Viewing the command history¶
Okay, so you have had a relaxing cup of tea and your head is no longer buzzing from information overload. However, you have also forgotten how you managed to extract those UniProt identifiers.
Not to worry. You can view the history of your previous commands using
Note that each command has a history number associated with it. You can use the number in the history to rerun a previous command without having to retype it. For example to rerun command number 597 you would type in:
Note that the exclamation mark (
!) in the above is required.
Clearing the terminal window¶
After having run the
history command the terminal window is full of information.
However, you find it distracting to have all those commands staring at you whilst
you are trying to think.
To clear the screen of output one can use the
Sometimes, for example if you try to view a binary file using a pager, your
shell can start displaying garbage. In these cases it may help to run the
In general it is advisable to use
clear as it only clears the terminal screen
reset reinitialises the terminal.
Copying and renaming files¶
You want to store a copy of your
human_uniprot_id.txt file in a backup
For this exercise let us start by creating a backup directory.
$ mkdir backup
Now we can copy the file into the backup directory using the
$ cp human_uniprot_id.txt backup/
The command above uses the original name of the file. However, we could have given it a different name, for example including the date.
$ cp human_uniprot_id.txt backup/human_uniprot_id_2015-11-10.txt
Finally, suppose that one wanted to rename the original file to use hyphens
rather than under scores. To to this one would use the
mv command, mnemonic
$ mv human_uniprot_id.txt human-uniprot-id.txt
Removing files and directories¶
Having experimented with the command line we want to clean up by removing unwanted files and directories.
One can remove files using the
$ rm backup/human_uniprot_id.txt
Empty directories can be removed using the
$ mkdir empty $ rmdir empty
To remove directories with files in them one can use the
rm command with
the recursive option:
$ rm -r backup
Think twice before deleting files, they will be deleted permanently.
rm there is no such thing as recycle bin from which
the files can be recovered.
- The command line is an excellent tool for automating repetitive tasks
- A terminal application provides access to a shell
- A shell allows you to interact with the operating system’s services and programs
- The most commonly used shell is Bash
- Pipes can be used to combine different programs into more complicated work flows
- In general it is better to create small tools that do one thing well
- Think twice before deleting files
- Use the
-helpoption to understand a command and its options