Creating scientific documents

Communication forms an important part of the scientific process. In Collaborating on projects we added a README.md file describing the purpose of the project and how to perform the analysis. This is a good thing. However it may not be clear how this process can be extended to produce scientific documents with figures, equations and reference management. In this chapter we will learn how to do this.

Starting the manuscript

Here we will build on the work from Data analysis, Data visualisation and Collaborating on projects chapters. Let’s start by going into the S.coelicolor-local-GC-content directory.

$ cd S.coelicolor-local-GC-content

Now use your favorite text editor to add the text below to a file named manuscipt.md.

## Introduction

*Streptomyces coelicolor* is a bacteria that can produce a range of natural
products of pharmaceutical relevance.

In 2002 the genome of *S. coelicolor*  A3(2), the model actinomycete
organism, was sequenced.  In this study we investigate the local GC content
of this organism.


## Methodology

The genome of *S. coelicolor*  A3(2) was downloaded from the Sanger Centre
ftp site.

The local GC content was then calculated using a sliding window of 100 KB
and a 50 KB step size.


## Results

The mean of the local GC content is 72.1% with a standard deviation of 1.1.


## Conclusion

There is little variation in the local GC content of *S. coelicolor* A3(2).

The manuscript above won’t make it into Nature, but it will do for us to illustrate how to build up a manuscript using plain text files.

In terms of the markdown syntax the two hashes ## are used to mark level 2 headings and the words surrounded by asterisk symbols (*) will be emphasized using italic font.

Converting the document to HTML using pandoc

Markdown is a lightweight syntax that can easily be converted to HTML. Here we will use of the tool Pandoc to convert the markdown document to HTML.

First of all you will need to install Pandoc. Generic information on how to install software can be found in Managing your system. On a Mac you can use Homebrew and on Linux based systems it should be available for install using the distributions package manager. For more detail have a look at the Pandoc installation notes.

Now that we have installed Pandoc, let’s use it to convert our manuscript.md file to a standalone HTML file.

$ pandoc -f markdown -t html -s manuscript.md > manuscript.html

In the above the -f markdown option means from markdown and the -t html option means to html. The -s option means standalone, i.e. encapsulate the content with appropriate appropriate headers and footers.

Pandoc writes to the standard output stream so we redirect it (>) to a file named manuscript.html. Have a look at the manuscript.html file using a web browser.

Alternatively, we could have used the -o option to specify the name of an output file. The command below produces the same outcome as the previous command.

$ pandoc -f markdown -t html -s manuscript.md -o manuscript.html

Now use a web browser to view the generated manuscript.html file.

Adding a figure

At this point it would be good to add the figure produced in Data visualisation to the “Results” section of the manuscript.

In markdown images can be added using the syntax below.

![Alternative text](path/to/image.png)

In HTML the intention of the alternative text (the “alt” attribute) is to provide a descriptive text in case the image cannot be displayed for some reason. Pandoc makes use of the alternative text attribute to create a caption for the image.

## Results

The mean of the local GC content is 72.1% with a standard deviation of 1.1.

![**Variation in the local GC content of *S. coelicolor* A3(2).** Using a
window size of 100 KB and a step size of 50 KB the local GC content has a
mean of 72.1% and a standard deviation of 1.1.](local_gc_content.png)

In the above the double asterix (**) is used as markup for bold text. This will serve as a title for the figure caption.

Now we can build the document again.

$ pandoc -f markdown -t html -s manuscript.md -o manuscript.html

Converting the document to PDF

HTML is great for websites. However, scientific documents tend to be read as PDF. Let us use Pandoc to convert our document to PDF.

However, before we can do this we need to install LaTeX. On Mac install MacTeX. On Linux use you package manager to install LaTeX, possibly known as “TeX Live”. See the section on Obtaining LaTeX on the LaTeX project website for more information.

Now that you have installed LaTeX you can convert the manuscript.md markdown file to PDF using the command below.

$ pandoc -f markdown -t latex -s manuscript.md -o manuscript.pdf

In the above we use the -t latex option to specify that the manuscript.pdf output file should be built using LaTeX.

Reference management

Reference management is a particularly prominent feature of scientific writing. Let us therefore look at how we can include references to websites and papers in our document.

Let’s start by creating a bibliography file. Copy and paste the content below into a file named references.bib.

@online{S.coelicolor-genome,
title={{S. coelicolor genome}},
url={ftp://ftp.sanger.ac.uk/pub/project/pathogens/S_coelicolor/whole_genome/},
urldate={2016-07-10}
}

This is a so called BibTex record. In this particular case it is a BibTex record for an online resource, as indicated by the @online type. You would also use the @online type to reference web pages.

The text S.coelicolor-genome is the “key” assigned to this record. The key could have been called anything as long as it is unique. This key will be used within our document when citing this record.

Now append the text below to the bottom of the references.bib file.

@article{Bentley2002,
title={Complete genome sequence of the model actinomycete
       Streptomyces coelicolor A3 (2)},
author={Bentley, Stephen D and Chater, Keith F and Cerdeno-Tarraga, A-M and
        Challis, Greg L and Thomson, NR and James, Keith D and
        Harris, David E and Quail, Michael A and Kieser, H and
        Harper, David and others},
journal={Nature},
volume={417},
number={6885},
pages={141--147},
year={2002},
publisher={Nature Publishing Group}
}

Do not type in BibTex records by hand. The entire Bentley2002 record was copied and pasted from Google Scholar. Note that in the record above the identifier was changed from bentley2002complete (key used by Google Scholar) to Bentley2002.

References managers such as Mendeley and Zotero can also be used to export BibTex records. More suggestions on how to access BitTex records can be found on the Tex StackExchange site.

Now let’s add some references to our manuscript.md file.

## Introduction

*Streptomyces coelicolor* is a bacteria that can produce a range of natural
products of pharmaceutical relevance.

In 2002 the genome of *S. coelicolor*  A3(2), the model actinomycete
organism, was sequenced [@Bentley2002].

In this study we investigate the local GC content of this organism.


## Methodology

The genome of *S. coelicolor*  A3(2) was downloaded from the Sanger Centre
ftp site [@S.coelicolor-genome].

Now we can add referenes using Pandoc’s built in pandoc-citeproc filter.

$ pandoc -f markdown -t latex -s manuscript.md -o manuscript.pdf   \
  --filter pandoc-citeproc --bibliography=references.bib

The --filter pandoc-citeproc argument results in automatically adding citations and a bibliography to the document. However, this requires some knowledge of where the bibliographic information is, this is specified using the --bibliography=references.bib argument.

“CiteProc” is in fact a generic name for a program that can be used to produce citations and bibliographies based on formatting rules using the Citation Style Langauge (CSL) syntax. Zotero provides CSL styles for lots of journals in the Zotero Style Repository.

Let’s download Zotero’s CSL file for Nature, copy and paste this text into a file named nature.csl.

$ curl https://www.zotero.org/styles/nature > nature.csl

We can now produce our document using Nature’s citation style.

$ pandoc -f markdown -t latex -s manuscript.md -o manuscript.pdf   \
  --filter pandoc-citeproc --bibliography=references.bib  \
  --csl=nature.csl

Have a look at the generated PDF file. Pretty neat right?! One thing that is missing is a title for the reference section. Let’s add that to the manuscript.md file.

## Conclusion

There is little variation in the local GC content of *S. coelicolor* A3(2).


## References

Adding meta data

To turn this into a research article we need to add a title, authors, an abstract and a date. In Pandoc this can be achieved by adding meta data to the top of the file, using a YAML syntax (see Useful plain text file formats for information on YAML).

Add the header below to the top of the manuscript.md file.

---
title: "*S. coelicolor* local GC content analysis"
author: Tjelvar S. G. Olsson and My Friend
abstract: |
  In 2002 the genome of *S. coelicolor*  A3(2), the model actinomycete
  organism, was sequenced.

  The local GC content was calculated using a sliding window of
  100 KB and a 50 KB step size.

  The mean of the local GC content was found to be 72.1% with a standard
  deviation of 1.1. We therefore conclude that there is little variation
  in the local GC content of *S. coelicolor* A3(2).
date: 25 July 2016
---
## Introduction

Let’s give some explanation of the meta data above. The YAML meta data is encapsulated using ---. The title string is quoted to avoid the * symbols confusing Pandoc. The pipe symbol at the beginning of the abstract allows for multi-line input with newlines, note that the multi-lines must be indented.

Let’s generate the document again.

$ pandoc -f markdown -t latex -s manuscript.md -o manuscript.pdf   \
  --filter pandoc-citeproc --bibliography=references.bib  \
  --csl=nature.csl

The manuscript.pdf document is now looking pretty good!

Anther useful feature of Pandoc’s meta data section is that we can add information for some of the data that we previously had to specify on the command line. Let’s add items for the --bibliograpy and --csl options (these options are in fact short hand for --metadata bibliograpy=FILE and --metadata csl=FILE).

date: 25 July 2016
bibliography: references.bib
csl: nature.csl
---
## Introduction

Now we can generate the documentation using the command below.

$ pandoc -f markdown -t latex -s manuscript.md -o manuscript.pdf   \
  --filter pandoc-citeproc

This is a good point to commit a snapshot to version control. Let’s look at the status of our repository first.

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        bioinformatics.csl
        manuscript.html
        manuscript.md
        manuscript.pdf
        nature.csl
        references.bib

nothing added to commit but untracked files present (use "git add" to track)

We have created many new files. We want to track all of them except manuscript.pdf and manuscript.html as they can be generated by Pandoc. Let us therefore update the .gitignore file to look like the below.

manuscript.*
!manuscript.md

Sco.dna
local_gc_content.csv
local_gc_content.png

In the above the first two lines are new. Let’s explain what they do. The first line states that all files starting with manuscript. should be ignored. This includes the file we want to track manuscript.md. On the second line we therefore add an exception for this file, the exclamation mark (!) is used to indicate that the manuscript.md should be excluded from the previous rule to ignore it.

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   .gitignore

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        bioinformatics.csl
        manuscript.md
        nature.csl
        references.bib

no changes added to commit (use "git add" and/or "git commit -a")

Now we can add the remaining files and commit the snapshot.

$ git add bioinformatics.csl manuscript.md nature.csl references.bib
$ git commit -m "Added draft manuscript"
[master 7b06d9d] Added draft manuscript
 4 files changed, 332 insertions(+)
 create mode 100644 bioinformatics.csl
 create mode 100644 manuscript.md
 create mode 100644 nature.csl
 create mode 100644 references.bib

Finally, let us also add and commit the updated .gitignore file.

$ git add .gitignore
$ git commit -m "Updated gitignore to ignore generated manuscript files"
[master bea89f4] Updated gitignore to ignore generated manuscript files
 1 file changed, 3 insertions(+)

Benefits of using Pandoc and plain text files

Why go through all this trouble to produce a PDF document? Would it not be easier to simply write it using a word processor and export it as PDF?

There are three main advantages to the methodology outlined in this chapter.

  1. There are lots of awesome tools for working with plain text files. If you decide to create your manuscript using plain text files you can take advantage of them. Worthy of special mention is Git which is one of the most powerful collaboration tools in the world, with the added advantage of giving you unlimited undo functionality and a transparent audit trial.

  2. Automation! We will go into this more in the next chapter, Automation is your friend. However, for now imagine that someone discovered that there was something wrong with the raw data that you had been using. How long would it take you to update your manuscript? Using plain text files it is possible to create automated work flows to build entire documents from raw data.

  3. Ability to convert to any file format. We have already seen how you can covert the document to HTML. What if your collaborator really needs Word? No problem.

    $ pandoc -f markdown -t docx -s manuscript.md -o manuscript.docx  \
      --filter pandoc-citeproc
    

Incidentally, another advantage of learning to use Pandoc is that it is not limited in going from markdown to other formats. It can take almost any file format and convert it to any other file format. For example, one could convert the Word document we just created to TeX.

$ pandoc -f docx -t latex manuscript.docx -o manuscript.tex

Alternative plain text approaches

There are other methods of creating scientific documents using plain text files.

One option is to write them using the LaTeX syntax. This is a less intuitive than using markdown. However, it gives you a bit more control and flexibility.

A good alternative for people that miss the benefits of a graphical user interface (GUI) is LyX. LyX allows you to write and format text using an intuitive GUI. It is different from many other word processing tools in that it places more focus on the structure of the document and less focus on the appearance. The outputs of LyX are also plain text files, a derivative of LaTeX files. However, LyX also has built-in functionality for exporting files in a variety of file formats including PDF.

Using non-plain text file formats

This chapter has outlined how you can work with plain text file formats to produce scientific documents. However, many people really like using Microsoft Word. You might be one of these people. In this case I hope that this chapter has given you some food for thought. The purpose of this chapter was not to try to force you to change your habits, but to outline alternative methods.

There are some disadvantages to using Microsoft Word. The biggest one probably being collaborative editing. You tend to end up with files named along the lines of manuscript_revision3_TO_edits.docx being emailed around. This is not ideal. However, on the other hand if it is the tool that you and/or your collaborators are familiar with it is easy to work with.

If you and/or your collaborators do not want to use plain text files, but want to avoid emailing Word documents around, you may want to consider Google Docs. It is similar to Word and LyX in that it has a GUI. Furthermore, it has built in support for collaborative working and version control.

For an extended discussion on the pros and cons of various ways of creating scientific documents I would recommend reading Good Enough Practices in Scientific Computing. One of the key take home messages from this paper is for everyone involved to agree on the tools before starting the process of writing.

Key concepts

  • Pandoc is a powerful tool that can be used to convert (almost) any document format to any other
  • Markdown can be used to add document structure to plain text files
  • The pandoc-citeproc filter can be used to add citations and a bibliography
  • The Citation Style Language (CSL) can be used to format the citations and the bibliography
  • You can access BibTex records from Google Scholar
  • Mendeley and Zotero are a free reference managers that can be used to export BibTex records (try to avoid having to type out BibTex records by hand)
  • Zotero provides CSL citation style files for lots and lots of journals
  • It is possible to add YAML meta data to markdown files specifying attributes such as title, author and abstract
  • Using plain text files for scientific documents allows you to make use of awesome tools such as Git
  • Speak to your collaborators and agree on the tools that best suit your needs